Generate artificial datasets that illustrate the assumptions and characteristics of different methods.

Datasets should ideally be two-dimensional. Among other dataset properties, you can experiment with:
• number of cases
• number of classes
• proportion of classes
• distribution of points within each class (shape of point clouds)
• shape of the border between the class regions, from linear to arbitrarily complex
• level of noise
• level of overlap between the classes
Consider the methods: logistic regression, LDA, QDA, Decision Tree without pruning, Decision Tree with a
maximum depth of 2, SVM linear, SVM RBF
For each of the listed methods, find a dataset where that method's assumptions are met and the assumptions
of the other methods are not met (if possible) — in other words, a dataset where that method is hard to
beat under cross-validation. Explain why the dataset is appropriate for the method. Suggestion:
use datasets with 2 predictors and 2 classes so they can also be visualized. This is not mandatory.

Answers (1)

SOUMNATH PAUL
SOUMNATH PAUL on 7 May 2024
Hi @Gabriel,
Below, I'll outline how to generate datasets for the methods listed (Logistic Regression, LDA, QDA, Decision Trees, and SVMs) and explain why each dataset is particularly suited to its method. In each scenario, the method's assumptions are met in a way specific to that model.
  • Logistic Regression: Best when data is linearly separable.
% Linearly separable dataset: two Gaussian blobs with well-separated means
rng(1); % For reproducibility
X = [randn(100,2)*0.75+ones(100,2); randn(100,2)*0.75-ones(100,2)];
Y = [ones(100,1); zeros(100,1)];
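To check that such a dataset really favors the model, one could fit it and estimate the cross-validated error; a minimal sketch (assuming the Statistics and Machine Learning Toolbox is available):

```matlab
% Fit a logistic model with 10-fold cross-validation on (X, Y) from above.
% For a linearly separable dataset, the k-fold loss should be near zero.
mdl = fitclinear(X, Y, 'Learner', 'logistic', 'KFold', 10);
cvErr = kfoldLoss(mdl);
```

The same pattern (fit, then compare `kfoldLoss`) can be repeated for each method below to verify it is hard to beat on its own dataset.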
  • LDA: Ideal when the classes share an identical covariance matrix but differ in their means
% Dataset with identical covariance matrices
rng(2);
X = [mvnrnd([1 2], [1 0.5; 0.5 1], 100); mvnrnd([-1 -2], [1 0.5; 0.5 1], 100)];
Y = [ones(100,1); zeros(100,1)];
  • QDA: Works best when the classes have distinct covariance matrices
% Dataset with distinct covariance matrices
rng(3);
X = [mvnrnd([1 2], [1 0.5; 0.5 1], 100); mvnrnd([-1 -2], [2 -1; -1 2], 100)];
Y = [ones(100,1); zeros(100,1)];
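On this distinct-covariance dataset, QDA's quadratic boundary should beat LDA's linear one. A hedged sketch of that comparison (again assuming the Statistics and Machine Learning Toolbox):

```matlab
% Compare LDA and QDA by 10-fold cross-validation on (X, Y) from above;
% QDA is expected to achieve the lower error here.
lda = fitcdiscr(X, Y, 'DiscrimType', 'linear');
qda = fitcdiscr(X, Y, 'DiscrimType', 'quadratic');
errLDA = kfoldLoss(crossval(lda, 'KFold', 10));
errQDA = kfoldLoss(crossval(qda, 'KFold', 10));
```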
  • Decision Tree (maximum depth 2): an XOR pattern is perfectly captured by two levels of axis-aligned splits, while linear methods (logistic regression, LDA, linear SVM) cannot separate it at all
% XOR dataset: the class depends on the interaction of the two predictors
[g1, g2] = meshgrid(linspace(-2, 2, 20), linspace(-2, 2, 20));
X = [g1(:), g2(:)];
Y = double(xor(X(:,1) > 0, X(:,2) > 0));
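The question also lists Decision Tree without pruning, SVM linear, and SVM RBF, which are not covered above. The following are hedged sketches of datasets suited to each (assuming the Statistics and Machine Learning Toolbox; parameter choices are illustrative, not definitive):

```matlab
% Decision Tree without pruning: a checkerboard of axis-aligned blocks.
% Many splits are needed, so an unpruned tree fits it while a depth-2
% tree and all linear methods fail.
rng(4);
X = 4*rand(400,2) - 2;                      % uniform on [-2,2]^2
Y = double(mod(floor(X(:,1)) + floor(X(:,2)), 2) == 0);
treeMdl = fitctree(X, Y);                   % default: grown without depth limit

% SVM RBF: concentric rings. The boundary is circular, so neither a linear
% model nor axis-aligned tree splits capture it well.
rng(5);
r = [0.5*rand(100,1); 1.5 + 0.5*rand(100,1)];  % inner disc, outer ring
t = 2*pi*rand(200,1);
X = [r.*cos(t), r.*sin(t)];
Y = [zeros(100,1); ones(100,1)];
rbfMdl = fitcsvm(X, Y, 'KernelFunction', 'rbf');

% SVM linear: two well-separated blobs with a wide margin. The max-margin
% hyperplane depends only on points near the boundary, so it is robust
% where logistic regression can be pulled by distant points.
rng(6);
X = [randn(100,2)*0.5 + [2 2]; randn(100,2)*0.5 - [2 2]];
Y = [ones(100,1); zeros(100,1)];
linMdl = fitcsvm(X, Y, 'KernelFunction', 'linear');
```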
Hope it helps!
Regards, Soumnath
