Why is variance high for high K value in this KNN code?

Question

0 votes

Hello,

Long post, please bear with me

I have a MATLAB dataset (dataset.mat) of size 280*3. The last column holds the labels, and there are 3 classes in total (1, 2 and 3). I am implementing KNN on this dataset. I want to calculate the classification error, and the mean and variance of that error over multiple random but even (50/50) splits. From the plots I want to determine how the value of k affects the mean and the variance of the classification error. I understand the concept of bias and variance, and I know that as k increases, the bias should increase and the variance should decrease. When k = 1 the bias will be 0; however, on new data (in the test set) a prediction has a higher chance of being wrong, which causes high variance. But the variance isn't decreasing in my plot (please see the attachment).

My code looks like this:

%% Loading the dataset
clear all
clc
load('dataset.mat');
%% Calculating the mean, variance and classification error for multiple splits
m = []; % empty list to store the mean of the classification error
variance = []; % empty list to store the variance of the classification error
error = []; % empty list to store the classification error
for k= 1:20 % different k values
    
    error = [];
    
    for j= 1:10 % This for loop is for random split (note: each time it is split evenly i.e. 50% into a training set and rest in a test set). 
        
        
        % dataset is split evenly (i.e. 50%), but randomly in to a training set and a test set all 10 times
        
        N = size(knn_samples,1);
        idx = randperm(N);
        
        train = knn_samples(idx(1:round(N*0.5)),:);
        test = knn_samples(idx(round(N*0.5)+1:end),:);
        X_train = train(:,1:2); % size 140*2
        y_train = train(:,3); % size 140*1
        X_test = test(:,1:2); % size 140*2
        y_test = test(:,3); % size 140*1
       
        Model = fitcknn(X_train,y_train,'NumNeighbors',k,'Standardize',1); % KNN model
        
        rloss = resubLoss(Model); % the classification loss by resubstitution
        
        [label_test,score_test,cost_test] = predict(Model,X_test);
        L = loss(Model,X_test,y_test); %how well the model classifies the data 
        C_test = confusionmat(y_test,label_test); % confusion matrix 
        offDiagMask = ~eye(size(C_test)); % logical mask of the off-diagonal entries of the confusion matrix, i.e. the misclassifications
        off_diag = sum(C_test(offDiagMask)); % total number of misclassified test points
        accuracy = sum(diag(C_test))/sum(C_test(:)); % fraction of correctly classified test points
        
        errorClass = sum(label_test ~= y_test)/length(y_test);
        error = [error, errorClass]; % classification error
        
    end
    
    m = [m, mean(error)]; %mean of the classification error
    variance = [variance, var(error)]; % variance of the classification error
    
end
figure(1)
hold on
colormat1 = y_test;
scatter(X_test(:, 1), X_test(:, 2), [], colormat1); 
l = (label_test ~= y_test); % specify wrong predictions
colormat2 = label_test(l);
mkr = 'x';
scatter(X_test(l, 1), X_test(l, 2), [], colormat2, mkr); % mark the wrong predictions
k = 1:20;
 
figure(2)
plot(k, m, 'b')
xlabel('K values')
ylabel('Mean')
title('Mean of the classification error') % over multiple splits
figure(3)
plot(k, variance, 'k')
xlabel('K values')
ylabel('Variance')
title('Variance of the classification error')

Maybe there is a more compact way of writing this code, but I am a beginner. This could be a very basic question, but I am unable to figure it out. I looked online for a solution and didn't find anything. Almost every site talks about the bias-variance trade-off, but I didn't find any code example or an explanation of why the variance could increase with increasing k. Maybe there is a small glitch in the code that I am unable to spot. I have given up on finding the solution on my own, hence I am asking the MATLAB community. You can also suggest a better way to write this code, or point me to any link that could help.

Note: Please also have a look at the variance values. Are they too small? They are in the 10^-3 range.

Thank you very much
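
As a rough sanity check on the magnitude mentioned in the note: if each of the 140 test predictions is treated as an independent error with probability p, the error rate on one split has variance p(1-p)/140. The sketch below evaluates that expression; the value p = 0.1 is purely illustrative and is not taken from the posted results, and the independence assumption is only an approximation.

```matlab
% Back-of-the-envelope scale for the variance of a test error rate.
p = 0.1;                      % assumed error rate, for illustration only
nTest = 140;                  % test-set size from the 50/50 split of 280 rows
binomVar = p*(1-p)/nTest;     % variance of an error proportion under independence
fprintf('%.2e\n', binomVar);  % prints the value in scientific notation
```

Under these assumptions the result is about 6.4e-4, so values in the 10^-3 range are of the order one would expect rather than suspiciously small.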

2 comments

Ganesh Regoti on 24 Jul 2019

Can you provide a section of dataset to test on the model?

Vanditha Rao on 28 Jul 2019

dataset.mat

@Ganesh Regoti: What do you mean by a section of the dataset? Do you want me to attach the dataset? I have attached it.


Answer 1

llueg on 24 Jul 2019

0 votes

I agree more information on the data would be helpful. Also, since your data set is fairly small, you can probably do more than 10 (maybe a hundred) different splits for each k, just to get a more accurate average. If the current trend is still there, it's probably due to properties specific to your data.
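
A minimal sketch of that suggestion, assuming the Statistics and Machine Learning Toolbox is available. The data here are a synthetic stand-in so the snippet runs on its own; `X`, `y`, `nSplits`, `errRate`, `meanErr` and `varErr` are names chosen for this example and do not appear in the original code. To use the real data, replace `X` and `y` with `knn_samples(:,1:2)` and `knn_samples(:,3)`. A variance computed from only 10 values is itself a noisy estimate, so part of the trend in the plot may be sampling noise; using 100 splits, and reusing the same splits for every k, should make the curves easier to compare.

```matlab
rng(1);                                   % fixed seed so the run is repeatable

% Hypothetical stand-in data: 3 Gaussian classes in 2-D (not the poster's dataset).
n = 90;                                   % points per class (assumed)
X = [randn(n,2); randn(n,2) + 2; randn(n,2) + [4 0]];
y = [ones(n,1); 2*ones(n,1); 3*ones(n,1)];

kValues = 1:20;                           % same range of k as in the question
nSplits = 100;                            % more splits than the original 10
errRate = zeros(nSplits, numel(kValues)); % one row per split, one column per k

for s = 1:nSplits
    c = cvpartition(y, 'HoldOut', 0.5);   % stratified 50/50 split
    Xtr = X(training(c),:);  ytr = y(training(c));
    Xte = X(test(c),:);      yte = y(test(c));
    for ki = 1:numel(kValues)
        mdl = fitcknn(Xtr, ytr, 'NumNeighbors', kValues(ki), 'Standardize', true);
        errRate(s,ki) = mean(predict(mdl, Xte) ~= yte);  % test error for this split and k
    end
end

meanErr = mean(errRate, 1);               % mean test error per k
varErr  = var(errRate, 0, 1);             % variance of the test error across splits, per k
```

Plotting `meanErr` and `varErr` against `kValues` then gives the same two figures as in the question, based on ten times as many splits.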


Answer 2

Ganesh Regoti on 29 Jul 2019

Edited: Ganesh Regoti on 29 Jul 2019

0 votes

In KNN classification, the variance need not decrease as K increases. The curve is usually U-shaped, and the optimal K is found at its minimum.

Some predictors may contribute more to the classification than others. How the variance behaves depends on how those highly contributing predictors vary:

Constant: There will not be much difference in the variance graph over the entire data set.

Values vary and reach an optimum at a certain point: The variance varies accordingly (probably decreasing as K increases), but once the optimal point is reached, it might start increasing.

So I think that in your case the optimum point is reached along the way, and increasing K beyond it leads to an increase in variance.
