3 views (last 30 days)

Show older comments

Hi all,

Please help me. Please. This research student will thank you immensely.

Here's the scenario: I have two populations (main and sub). The sub population is 10% of the main population. The mean (mu) and sigma(std) of these two populations can change, but have certain predefined boundaries. Right now, I am only changing the sigma of the main population and keeping the sigma of the sub population, and the mean of both populations constant.

What I want: I want to use fitgmdist/GMModels/Gaussian distribution to be able to identify that there are indeed two components in that set of data. In other words, I am trying to see for what sigma_main values does the model ouput two components (numComponents = 2). Right now, I have a model that iterates through my sigmamain values (0.1:0.05:1) and goes through varying GMModels to find the one with the better AIC (least error) and outputs the numComponents for every sigma_main value, and then sigma_main vs numComponents is plotted. At least, that's what I think.

Here's my code:

%% define known constants

% rng(0,'twister')

n_main = 100000;

n_sub = 10000;

%sigma_main = 0.1;

sigma_sub = 0.1;

mu_main = 1;

mu_sub =3;

%% change sigma_main values and run model

sigmamain_val = 0.1:0.05:1;

outputData = zeros(length(sigmamain_val), 2); % create an output array for (length sigma_sub),numComponents

for i = 1:length(sigmamain_val)

sigmamain_pos = randi(length(sigmamain_val));

sigma_main = sigmamain_val(sigmamain_pos);

y = sigma_main.*randn(n_main,1) + mu_main; %10^6 SKBr3 cells

y2 = sigma_sub.*randn(n_sub,1) + mu_sub; %100,000 MDA MB 231

C = cat(1, y, y2);

AIC = zeros(1,4); % create an ouput array for AIC

GMModels = cell(1,4);

options = statset('MaxIter',00); % add optitons here to better/refine the model

for k = 1:4

GMModels{k} = fitgmdist(C,k);

AIC(k)= GMModels{k}.AIC;

end

[minAIC,numComponents] = min(AIC);

outputData(i,1) = sigma_main;

outputData(i,2) = numComponents;

%BestModel = GMModels{numComponents}

BestModel = GMModels{outputData(i,2)};

end

%% plot sigma main vs numComponents

figure(3)

plot(outputData(:,1),outputData(:,2))

xlabel('sigma main')

ylabel('numComponents')

Btw: the idea for this model with AIC is based off: https://www.mathworks.com/help/stats/fitgmdist.html

Here's the problem:

For starters: The code for y & y2, which are for my main and sub population respectively, only take into account the very last sigma value.

A bigger problem is that sometimes, my code picks a sigma_main value twice (i.e 0.8) and the number of components is sometimes different even though they are the same sigma_main value.

Also, I would like to ouput a "BestModel" or 1x1 gmdistrinution for every sigma_main value instead of just the last one and I don't know how to incorporate that in my for loop.

What I think: I need to go through sigma_main values in order (i.e only write "sigma_main = i" inside the for loop) so that values don't get repeated, but then my y and y2 cannot b concatenated (C = cat(1, y,y2) because my sigma_main value will be 19x1 instead of just one number.

To outut a BestModel for every sigma_main, I need to create an empty/zeroes array before but I don't know ehat type to do that for.

Please, please help me. I have stayed up two nights doing this and I really don't know how to move forward. It might be really simple, but I might be overthinking it.

Jeff Miller
on 29 May 2020

I don't really follow what you are trying to do, but maybe some of these comments will help a bit:

I. Use a cell array to output a BestModel for every sigma_main. So, before the for loop, you need

BestModel = cell(numel(sigma_main),1); % pre-allocate array to hold fitgmdist outputs

and then inside the for loop

BestModel{i} = GMModels{outputData(i,2)}; % save the best model for later

II. I don't understand this line:

sigmamain_pos = randi(length(sigmamain_val));

Why don't you just go through the different sigma values in numerical order checking each one once?

III. It's no surprise that "the number of components is sometimes different even though they are the same sigma_main value", because you are generating data randomly each time through. Some random data sets might be more consistent with 2 components, some with 3, even though they were generated from the same underlying conditions.

IV. "The code for y & y2, which are for my main and sub population respectively, only take into account the very last sigma value." This is because you overwrite the previous y & y2 each time you go through the for loop. I'm not sure why you want to have these random numbers for all iterations of the loop, but you could save them in another cell array if you wanted.

Jeff Miller
on 30 May 2020

It's better to ask a new question. More people look at questions that have not yet been answered

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!