How to Cluster Dataset and remove outlier in MATLAB

I understand that you want to cluster the 4-feature dataset and remove the outliers from the dataset. This task can be carried out using the following workflow:

Determine the optimal number of clusters: The elbow method plots the within-cluster sum of squares (WCSS) against the number of clusters and looks for the "elbow" point where the rate of decrease changes sharply. This point is often considered a good choice for the number of clusters.
Perform k-means clustering: Run k-means with the number of clusters chosen in the previous step.
Remove outliers: Outliers can be detected and removed based on their distance from the centroid of their assigned cluster. A common approach is to remove points whose distance to the centroid exceeds a chosen threshold, such as a high percentile of all distances.

Please refer to the below code snippet that illustrates the above workflow:

data = Dataset;
wcss = zeros(1, 10); % Preallocate one WCSS value per candidate k
for k = 1:10 % Test up to 10 clusters
    [idx, C, sumd] = kmeans(data, k, 'Replicates', 10);
    wcss(k) = sum(sumd);
end
plot(1:10, wcss);
xlabel('Number of clusters');
ylabel('WCSS');
title('Elbow Method');
optimalK = 3; % Placeholder value: replace with the number of clusters you read off the elbow plot
[idx, C, sumd] = kmeans(data, optimalK, 'Replicates', 10);
% Calculate distances of each point to its cluster centroid
distances = zeros(size(data, 1), 1);
for i = 1:optimalK
    clusterPoints = data(idx == i, :);
    centroid = C(i, :);
    distances(idx == i) = sqrt(sum((clusterPoints - centroid).^2, 2));
end
threshold = prctile(distances, 95); % Define a threshold for outlier removal, e.g., 95th percentile of distances
outliers = distances > threshold; % Identify outliers
% Remove outliers
dataCleaned = data(~outliers, :);
idxCleaned = idx(~outliers);
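
If reading the elbow off the plot by eye is inconvenient, one rough heuristic is to pick the k where the WCSS curve bends most sharply, i.e. where its second difference is largest. The snippet below is only a sketch of that idea: the `wcss` values are made up for illustration, and the heuristic is an assumption of this example rather than part of the workflow above, so it is worth checking the result against the plot.

```matlab
% Hypothetical WCSS values for k = 1..10, standing in for the curve computed above
wcss = [1000 520 120 100 85 75 68 62 57 53];

% The second difference is largest where the curve bends most sharply.
% diff(wcss, 2) has one entry for each k = 2 .. numel(wcss)-1, hence the +1 offset.
curvature = diff(wcss, 2);
[~, pos] = max(curvature);
optimalK = pos + 1;

fprintf('Suggested number of clusters: %d\n', optimalK);
```

For a smooth curve with no clear bend this heuristic can pick an arbitrary k, in which case a criterion such as the silhouette value may be more informative.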

Hope it helps!


Med Future on 23 Apr 2024

Edited: Walter Roberson on 24 Apr 2024


Question.mat

@Sai Pavan

I have implemented the code you shared in my own code, but I still get the error "Arrays have incompatible sizes for this operation." I have attached the dataset and the code below. Please modify the code to fix this. As far as I know from the ground truth, there should be only 1 cluster; the remaining points are noise, based on the distance calculation.

load Question
dataset1=data(:,[2 4]);
% Step 1: Identify and remove outliers
freq_outliers = isoutlier(dataset1(:, 1));
pw_outliers = isoutlier(dataset1(:, 2));
outliers = freq_outliers | pw_outliers;
% Step 2: Remove rows with outliers from all columns
dataset1_no_outliers = dataset1(~outliers, :);
pdw_no_outliers = data(~outliers, :);
% Now, continue with your existing code using 'dataset1_no_outliers'
eva = evalclusters(dataset1_no_outliers, 'kmeans', 'silhouette', 'KList', [1:8]);
%eva = evalclusters(dataset1, 'kmeans', 'silhouette', 'KList', [1:8]);
K = eva.OptimalK;
[idx,C,sumdist] = kmeans(dataset1,K);
dataset=data;
dataset_idx=zeros(length(dataset),5);
dataset_idx=dataset(:,1:5);
dataset_idx(:,6)=idx;
clusters = cell(K,1);
for i = 1:K
   clusters{i} = dataset_idx(dataset_idx(:,6) == i,:);
end
cluster_assignments=idx;
optimalK=K
optimalK = 4
% Calculate distances of each point to its cluster centroid
distances = zeros(size(data, 1), 1);
for i = 1:optimalK
    clusterPoints = data(idx == i, :);
    centroid = C(i, :);
    distances(idx == i) = sqrt(sum((clusterPoints - centroid).^2, 2));
end
Arrays have incompatible sizes for this operation.
threshold = prctile(distances, 95); % Define a threshold for outlier removal, e.g., 95th percentile of distances
outliers = distances > threshold; % Identify outliers
% Remove outliers
dataCleaned = data(~outliers, :);
idxCleaned = idx(~outliers);

Med Future on 24 Apr 2024

@Image Analyst @Walter Roberson Can you please take a look at how to solve this issue?


Answer 2

Walter Roberson on 24 Apr 2024


https://fr.mathworks.com/matlabcentral/answers/1894680-how-to-cluster-dataset-and-remove-outlier-in-matlab#answer_1447371

Moved: Walter Roberson on 24 Apr 2024


Question.mat

load Question
dataset1=data(:,[2 4]);

dataset1 is created from 2 columns of data

% Step 1: Identify and remove outliers
freq_outliers = isoutlier(dataset1(:, 1));
pw_outliers = isoutlier(dataset1(:, 2));
outliers = freq_outliers | pw_outliers;
% Step 2: Remove rows with outliers from all columns
dataset1_no_outliers = dataset1(~outliers, :);
pdw_no_outliers = data(~outliers, :);
% Now, continue with your existing code using 'dataset1_no_outliers'
eva = evalclusters(dataset1_no_outliers, 'kmeans', 'silhouette', 'KList', [1:8]);
%eva = evalclusters(dataset1, 'kmeans', 'silhouette', 'KList', [1:8]);
K = eva.OptimalK;
[idx,C,sumdist] = kmeans(dataset1,K);

C is created from dataset1 so it has two columns

dataset=data;
dataset_idx=zeros(length(dataset),5);
dataset_idx=dataset(:,1:5);
dataset_idx(:,6)=idx;
clusters = cell(K,1);
for i = 1:K
   clusters{i} = dataset_idx(dataset_idx(:,6) == i,:);
end
cluster_assignments=idx;
optimalK=K
optimalK = 4
% Calculate distances of each point to its cluster centroid
distances = zeros(size(data, 1), 1);
for i = 1:optimalK
    clusterPoints = data(idx == i, :);

data has 6 columns, so clusterPoints has 6 columns

    centroid = C(i, :);

centroid is created from C so it has two columns

    whos clusterPoints centroid
    distances(idx == i) = sqrt(sum((clusterPoints - centroid).^2, 2));

You are trying to subtract something with 2 columns from something with 6 columns, which is an error

end
  Name                 Size            Bytes  Class     Attributes

  centroid             1x2                16  double              
  clusterPoints      177x6              8496  double              
Arrays have incompatible sizes for this operation.
threshold = prctile(distances, 95); % Define a threshold for outlier removal, e.g., 95th percentile of distances
outliers = distances > threshold; % Identify outliers
% Remove outliers
dataCleaned = data(~outliers, :);
idxCleaned = idx(~outliers);
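
Following the annotations above, one way to avoid the size mismatch is to compute the distances in the same two-column feature space that kmeans was run on, and only then use the resulting logical mask to index the full matrix. The sketch below is self-contained and uses synthetic data as a stand-in for Question.mat; the variable `features`, the value of `K`, and the data itself are assumptions for illustration. It relies on implicit expansion (R2016b or later) and on the Statistics and Machine Learning Toolbox.

```matlab
rng(0);                                    % reproducible synthetic data
data = [randn(200, 6); randn(10, 6) + 8];  % stand-in for the 6-column matrix in Question.mat
features = data(:, [2 4]);                 % same two columns used to build dataset1

K = 2;                                     % assumed here; the thread takes K from evalclusters
[idx, C] = kmeans(features, K, 'Replicates', 5);

% Compute distances in the 2-column space the centroids live in
distances = zeros(size(features, 1), 1);
for i = 1:K
    clusterPoints = features(idx == i, :);   % n_i-by-2, matches the 1-by-2 centroid
    distances(idx == i) = sqrt(sum((clusterPoints - C(i, :)).^2, 2));
end

threshold = prctile(distances, 95);        % 95th percentile of all distances
outliers = distances > threshold;

% The mask has one entry per row, so it can index the full 6-column data
dataCleaned = data(~outliers, :);
idxCleaned = idx(~outliers);
```

With the real dataset, `features` would be `dataset1`, so that `clusterPoints` and `centroid` both have two columns.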


Med Future on 25 Apr 2024

@Walter Roberson Thank you for the detailed explanation. Basically, the problem is to reassign the clusters that k-means has already produced; that is, I want to remove the outliers. As you can see, the solution recalculates the distance of each cluster's points from its centroid, and that is where the error occurs. Can you please help me solve this problem?
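
A possible way to remove outliers and then reassign the clusters is sketched below, under stated assumptions. It takes the point-to-centroid distances from the fourth output of kmeans, flags outliers separately within each cluster, and re-runs kmeans on the cleaned points. The synthetic data, the value of `K`, and the per-cluster 95th-percentile threshold are all illustrative choices, not something confirmed in this thread.

```matlab
rng(1);
% Synthetic stand-in: one dense group plus scattered noise (2 features)
X = [randn(300, 2) * 0.5; (rand(30, 2) - 0.5) * 20];

K = 3;   % hypothetical value; in the thread K comes from evalclusters
[idx, C, ~, D] = kmeans(X, K, 'Replicates', 5);

% D(i, j) is the distance from point i to centroid j (squared Euclidean by default),
% so pick each point's own-cluster entry and take the square root.
linearIdx = sub2ind(size(D), (1:size(X, 1))', idx);
distances = sqrt(D(linearIdx));

% Flag outliers cluster by cluster so a wide cluster does not mask a tight one
outliers = false(size(X, 1), 1);
for i = 1:K
    members = (idx == i);
    outliers(members) = distances(members) > prctile(distances(members), 95);
end

XCleaned = X(~outliers, :);

% Re-run k-means on the cleaned points to get updated assignments and centroids
[idxNew, CNew] = kmeans(XCleaned, K, 'Replicates', 5);
```

If the ground truth really is a single cluster with the rest being noise, the same steps can be tried with `K = 1`, in which case the distance is simply measured from the overall centroid.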
