How to take a random sample of each column?

I have data from a file with 25 columns and 9000 rows. I would like to keep everything in .m format rather than .mat, but with this many rows that is not possible: when I try to save the file, I am told that files that are too big will be saved as .mat. How can I take a random sample of each column (around 150-200 rows)? Thank you.

Accepted Answer

Image Analyst on 13 Oct 2015


If you have the Statistics and Machine Learning Toolbox, you can use the randsample() function:
y = randsample(n,k) returns a k-by-1 vector y of values sampled uniformly at random, without replacement, from the integers 1 to n.
y = randsample(population,k) returns a vector of k values sampled uniformly at random, without replacement, from the values in the vector population. The orientation of y (row or column) is the same as population.
y = randsample(n,k,replacement) or y = randsample(population,k,replacement) returns a sample taken with replacement if replacement is true, or without replacement if replacement is false. The default is false.
y = randsample(n,k,true,w) or y = randsample(population,k,true,w) returns a weighted sample taken with replacement, using a vector of positive weights w, whose length is n. The probability that the integer i is selected for an entry of y is w(i)/sum(w). Usually, w is a vector of probabilities. randsample does not support weighted sampling without replacement.
y = randsample(s,...) uses the stream s for random number generation. s is a member of the RandStream class. Default is the MATLAB® default random number stream.

More Answers (2)

Thorsten on 14 Oct 2015
data=importdata('wine.txt');
nRows=150;
randomlySelected=data(randsample(size(data,1), nRows), :);
If you don't have randsample, use
ind = randperm(size(data,1));
ind = ind(1:nRows);
randomlySelected=data(ind, :);
Image Analyst on 15 Oct 2015
If you want to do it all in one line, and if you have the Statistics and Machine Learning Toolbox, use datasample:
randomlySelectedRows = datasample(data, 150);
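This returns a 150-row by 25-column matrix. Note that datasample samples with replacement by default, so a row can be picked more than once; pass 'Replace', false to avoid that.
Otherwise you can use randperm() to make sure you don't select any row twice. Use its second argument to get only 150 of the row numbers: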
data = rand(9000, 25); % Sample data.
nRows=150; % However many rows you want to extract in the subset.
rowsToExtract = randperm(size(data, 1), nRows); % Get list of the rows to use.
randomlySelectedRows = data(rowsToExtract, :); % Do the extraction.

9 Comments

Noam on 29 Nov 2023
Hi, is there a way to do this, but to take random rows from each column?
Thanks : )
Noam on 29 Nov 2023
(as in, the rows selected from col 1 will not be the same rows selected from col 2)
rowsToExtract = randperm(size(data, 1), nRows); % Get list of the rows to use.
ind = sub2ind(size(data), rowToExtract, 1:size(data,2));
selected_values = data(ind);
This would choose a random row for column 1, a different random row for column 2, a different random row for column 3, and so on.
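A minimal sketch of that intent, assuming exactly one value is wanted per column. sub2ind needs one row index for each column index, so the row list must have size(data,2) entries. The names rowPerColumn, nR and nC are illustrative, and randi is swapped in for randperm so the sketch also works when there are more columns than rows (two columns may then land on the same row).

```matlab
data = [1 2 3 4 5; 10 20 30 40 50; 100 200 300 400 500];  % example data
[nR, nC] = size(data);

rowPerColumn = randi(nR, 1, nC);                % one random row index per column
ind = sub2ind(size(data), rowPerColumn, 1:nC);  % linear indices of the (row, column) pairs
selected_values = data(ind);                    % 1-by-nC vector, one value from each column
```

Each element of selected_values comes from the column with the same position, which can be checked with ismember(selected_values(c), data(:, c)).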
Noam on 29 Nov 2023
Hi Walter, thank you for the response! I guess there is a spelling error in your code?
Unrecognized function or variable 'rowToExtract'.
Unless you meant that intentionally, in which case I am not sure where that variable comes from. Fixing the spelling error does not work.
In any case my dataset looks like this.
data = [ 1 2 3 4 5 ; ...
10 20 30 40 50 ;
100 200 300 400 500 ]
And I would like to extract n random rows per column with non repeating rows per column and different indices per column.
So for n=2 my result would be 2x5 and look something like:
results = [ 1 2 30 4 5 ; ...
100 20 300 400 50 ]
Thanks for the help!
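One way to read this request is a plain loop that draws a fresh set of n distinct rows for every column. The following is a minimal sketch under that reading; the names results and rowsForThisColumn are illustrative, and n must not exceed the number of rows.

```matlab
data = [1 2 3 4 5; 10 20 30 40 50; 100 200 300 400 500];
n = 2;                                    % samples per column (n <= number of rows)
[nR, nC] = size(data);

results = zeros(n, nC);                   % preallocate the n-by-nC output
for c = 1:nC
    rowsForThisColumn = randperm(nR, n);  % n distinct rows, drawn afresh for this column
    results(:, c) = data(rowsForThisColumn, c);
end
```

Because randperm is called once per column, no row repeats within a column, while different columns are free to pick the same or different rows.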
@Noam A, try this:
% Create sample data.
data = reshape(1:500, 100, 5); % Make 100x5 matrix.
% Find height of existing matrix
[rows, columns] = size(data)
rows = 100
columns = 5
% Get random rows.
randomRows = randperm(rows);
% Reshape these into a 5 column matrix.
% This matrix will have row numbers for all 5 columns
% and the property that no column will have the
% same row selected as any other column.
randomRows = reshape(randomRows, [], columns)
randomRows = 20×5
    78    83    56    82    16
     6    37    53    17    55
    72    40    50    81    15
    90     2    91    48    44
    68    43    27    22     3
    25    58    54    69    65
    12    97    59    89     7
    92     5    61     1    38
    46    74    26    71    77
    99    13    57    88    10
     ⋮
% If you want only n instead of rows/columns (20),
% then just extract the first n rows of randomRows.
n = 3; % Whatever you want
randomRows = randomRows(1:n, :)
randomRows = 3×5
    78    83    56    82    16
     6    37    53    17    55
    72    40    50    81    15
% Extract those rows from data
extractedData = zeros(n, 5); % Preallocate
for col = 1 : size(randomRows, 2) % For each column
% Extract the rows that are unique to this particular column.
theseRows = randomRows(:, col);
extractedData(:, col) = data(theseRows, col);
end
% Let's see what it looks like:
extractedData
extractedData = 3×5
    78   183   256   382   416
     6   137   253   317   455
    72   140   250   381   415
Noam on 29 Nov 2023
Hi @Image Analyst thank you for the reply!
This is great, but it seems like there is an issue here with picking each row only once in the whole data matrix? I wanted to be able to sample each row of each column with equal chance.
I actually ended up solving this by making a matrix of randperm values from 1 to the data column length, adding an offset to each column of that matrix equal to the data column length times (column index - 1), and using the results as linear indices to randomly select values from the data matrix. I would paste the code in, but having this interactive comment box open in Firefox is interfering with my open MATLAB on Windows and I can't select any code in the MATLAB program (weird). Anyway, that solution did work : ) Thanks for the help!
@Image Analyst matlab is acting better now, here was my solution:
nrows = 4;
ncols = 3;
% sample 2 rows from each column
ncolsamples = 2;
% data that's easy to identify columnwise
data = repmat(1:ncols,nrows,1).*(10.^[1:nrows]')
data = 4×3
          10          20          30
         100         200         300
        1000        2000        3000
       10000       20000       30000
% a little hacky could probably do it better..
randidx = sort(cell2mat(arrayfun(@(k) randperm(nrows,ncolsamples),ones(ncols,1),...
'UniformOutput',false)),2)'
randidx = 2×3
     1     3     2
     4     4     4
coloffset = (0:ncols-1)*nrows %key to add to the indices
coloffset = 1×3
0 4 8
%randidx = randidx(:) % will not work
randidx = randidx + coloffset
randidx = 2×3
     1     7    10
     4     8    12
%randidx = randidx(:) % will work
data(randidx)
ans = 2×3
          10        2000         300
       10000       20000       30000
Image Analyst on 29 Nov 2023
@Noam A Not sure what you mean (despite reading it several times). My code does pick every row once, and doesn't repeat any rows for any column. Every column will use a unique set of row numbers to extract from. Each and every row is not chosen of course unless your n is chosen precisely to make sure that happens. If n is small then only some of the rows are chosen obviously.
Not sure if your code calls randperm again for each column (sounds like it), but that could possibly give two columns having the same row chosen. If you don't call randperm for each column, then of course all columns will have the same set of rows extracted.
Noam on 29 Nov 2023
Edited: Noam on 30 Nov 2023
@Image Analyst yes, I guess I wasn't clear. I wanted a random set of samples from each column. It's fine if rows get repeated across columns, but what I didn't want is to pick n random row indices and then use those same indices for each column. This is what previous answers in this original thread were doing, as far as I could tell. Basically I am treating each column as its own dataset and picking n random samples from that dataset. I found a solution that worked (see the code I posted above), though I am sure it could be optimized. Thanks again!
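The column-offset approach posted above can be written more compactly. This is a sketch, not a definitive implementation: sorting a matrix of uniform random numbers column by column yields an independent random permutation of the row numbers for each column, which replaces the arrayfun/cell2mat step. The names perm, linidx and samples are illustrative, and the offset addition relies on implicit expansion (R2016b or later).

```matlab
nrows = 4;
ncols = 3;
ncolsamples = 2;                                       % rows to sample from each column
data = repmat(1:ncols, nrows, 1) .* (10.^(1:nrows)');  % same test data as above

[~, perm] = sort(rand(nrows, ncols), 1);  % each column of perm is a random permutation of 1:nrows
randidx = perm(1:ncolsamples, :);         % keep the first ncolsamples rows of each permutation
linidx = randidx + (0:ncols-1) * nrows;   % row + (column - 1) * nrows gives the linear index
samples = data(linidx);                   % ncolsamples-by-ncols, each column sampled independently
```

On releases without implicit expansion, bsxfun(@plus, randidx, (0:ncols-1) * nrows) computes the same linidx.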
