BUG (#2)? kmeans is sensitive to row (point) order

micholeodon on 12 Mar 2019
Edited: micholeodon on 12 Mar 2019
Dear All,
I have noticed that kmeans gives different results when the same points are supplied in a different row order!
This does not make any sense in my opinion.
I would expect the row order of the matrix to have no impact on the centroid locations if the random number generator is set to a fixed seed.
Can anybody explain that?
clear; close all; clc;
nPoints = 100;
nDimensions = 2;
nClusters = 3;
data = rand(nPoints, nDimensions); % points from a uniform distribution
scatter(data(:,1), data(:,2), 'b')
rndGenSeed = 1;
%% cluster the data in its original row order
rng(rndGenSeed) % set the random generator's seed
[~, clusters] = kmeans(data, nClusters)
hold on
scatter(clusters(:,1), clusters(:,2), 'rv') % red triangles
hold off
%% cluster the same points in a different row order
dataShuffled = sortrows(data); % same points, rows reordered
rng(rndGenSeed) % set the random generator's seed - same seed
[~, clusters_sh] = kmeans(dataShuffled, nClusters)
hold on
scatter(dataShuffled(:,1), dataShuffled(:,2), 'k*') % control - plot the reordered points - they should be in the same spots
scatter(clusters_sh(:,1), clusters_sh(:,2), 'gv') % these points should cover the red triangles
hold off
grid on
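To compare the two results numerically rather than by eye, one option is to sort both centroid matrices by row and look at the largest coordinate difference. This is only a rough sketch: the setup lines repeat the script above in compact form, the names c1, c2, maxDiff and tol are made up for illustration, and the tolerance is an arbitrary choice. Sorting the rows is a crude way of pairing centroids and can mispair them when two centroids have similar first coordinates.

```matlab
% Compact repeat of the experiment above (same calls, fewer variables).
rng(0);  data = rand(100, 2);
rng(1);  [~, clusters]    = kmeans(data, 3);
rng(1);  [~, clusters_sh] = kmeans(sortrows(data), 3);

% Sort the rows so that the order of the cluster labels does not matter.
c1 = sortrows(clusters);
c2 = sortrows(clusters_sh);

% Largest absolute coordinate difference between the two centroid sets.
maxDiff = max(abs(c1(:) - c2(:)));
tol = 1e-10;   % arbitrary tolerance (assumption)

if maxDiff < tol
    disp('Centroids agree up to the tolerance.')
else
    fprintf('Centroids differ, max abs difference = %g\n', maxDiff)
end
```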
1 Comment
micholeodon on 12 Mar 2019
Edited: micholeodon on 12 Mar 2019
I think I have a clue, but it would be good if somebody from the MathWorks team could verify it.
My hypothesis is this:
  1. kmeans needs to choose initial cluster positions. It can randomly select k INPUT POINTS to start from.
  2. If you set rng(seed) with a constant seed, you will always get the SAME row indices of the data matrix as the starting cluster positions.
  3. If you shuffle the input data (the point locations are the same, only their order in the data structure changes), then even with rng(seed) and a constant seed you will get the SAME row indices, BUT the points at those indices are DIFFERENT!
  4. That means kmeans will converge differently for shuffled input data.
This would also explain my puzzle in another question: https://www.mathworks.com/matlabcentral/answers/448832-bug-evalclusters-is-sensitive-to-rows-points-order
What do you think, MathWorks experts? :) Does k-means select input data points as the starting centroid locations?
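The reasoning above can be checked with a small experiment. The sketch below does not reproduce kmeans internals: it uses randperm as a stand-in for "pick k rows at random" to show that a fixed seed yields the same row indices but, in general, different points once the rows are reordered. It then passes an explicit matrix of starting centroids through the documented 'Start' option, so that the initialization no longer depends on row order. The variable names (dataSort, idx1, idx2, startCentroids, cA, cB) are made up for illustration. As far as I know, the default 'Start' value is 'plus' (k-means++), which also draws its seeds from the rows of the input.

```matlab
rng(0);
data     = rand(100, 2);
dataSort = sortrows(data);   % same points, different row order
k = 3;

% Same seed -> same row indices, but those rows hold different points.
rng(1);  idx1 = randperm(size(data, 1), k);
rng(1);  idx2 = randperm(size(dataSort, 1), k);
disp(isequal(idx1, idx2))                        % are the indices identical?
disp(isequal(data(idx1, :), dataSort(idx2, :)))  % are the selected points identical?

% Fix the initial centroids explicitly so they do not depend on row order.
startCentroids = data(idx1, :);
[~, cA] = kmeans(data,     k, 'Start', startCentroids);
[~, cB] = kmeans(dataSort, k, 'Start', startCentroids);

% Row i of each result corresponds to row i of startCentroids,
% so the two centroid matrices can be compared directly.
fprintf('max abs difference = %g\n', max(abs(cA(:) - cB(:))));
```

With identical starting centroids the two runs should end up at essentially the same centroids; any tiny remaining difference would come from floating-point summation order. If the order must not matter but a random start is still wanted, choosing the start points by value (as above) rather than by row index is one way to achieve it.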


Answers (0)
