Hey guys,
Currently my function is really slow because of the sheer amount of data and because it uses only one thread.
Since I have a multicore processor (Ryzen 5 3600, 6 cores / 12 threads), I want to make use of it by splitting my data, running the same function on each part in parallel, and putting the results back together.
I have found the spmd and parfor commands.
The rough steps I want to take:
  1. split the data (tables) n times
  2. give each worker enough parts of the split data, plus the raw data (which I need for the function)
  3. run a function on each worker that modifies its part of the split data
  4. put all the split data back together
Also, I am limited to the functions available in MATLAB R2015b.
How can I do that? Can you please help me?
This is what I tried:
workers = 12;
divider = ceil(specs.numberOfRows/workers);
split1 = data((data.ID <= divider),:);
split2 = data((data.ID > divider) & (data.ID <= divider*2),:);
split3 = data((data.ID > divider*2) & (data.ID <= divider*3),:);
split4 = data((data.ID > divider*3) & (data.ID <= divider*4),:);
split5 = data((data.ID > divider*4) & (data.ID <= divider*5),:);
split6 = data((data.ID > divider*5) & (data.ID <= divider*6),:);
split7 = data((data.ID > divider*6) & (data.ID <= divider*7),:);
split8 = data((data.ID > divider*7) & (data.ID <= divider*8),:);
split9 = data((data.ID > divider*8) & (data.ID <= divider*9),:);
split10 = data((data.ID > divider*9) & (data.ID <= divider*10),:);
split11 = data((data.ID > divider*10) & (data.ID <= divider*11),:);
split12 = data((data.ID > divider*11) & (data.ID <= specs.numberOfRows),:);
dataset_array = {split1, split2, split3, split4, split5, split6, split7, split8, split9, split10, split11, split12};
parfor i = 1:12
    newDataset_array(i) = myFunction(dataset_array(i), data);
end
for i = 1:12
    newData = [newData; newDataset_array(i)];
end
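For step 4, assuming each cell of newDataset_array ends up holding a table, the pieces can be concatenated in one call instead of growing newData in a loop:

```matlab
% Curly-brace indexing expands the cell array into a comma-separated
% list, so vertcat stacks all the tables in a single operation.
newData = vertcat(newDataset_array{:});
```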
Thanks in advance


Jakob B. Nielsen
Jakob B. Nielsen on 15 Jan 2020
Edited: Jakob B. Nielsen on 15 Jan 2020
I think parfor only runs on parallel cores/workers with the Parallel Computing Toolbox... I assume you have that? Can you give a little more info on what your issue is?
Owner5566
Owner5566 on 15 Jan 2020
Edited: Owner5566 on 15 Jan 2020
My main problem is that I don't quite understand how to use parfor in the optimal way,
and how you would change the above code to make it better/faster.
For example:
  • how do I split the data according to the number of workers and put it back together afterwards (there should be a better/faster/easier way, shouldn't there)?
  • how many workers is best?
  • is parfor the optimal tool for this?
And yes, I have that toolbox.
dpb
dpb on 15 Jan 2020
" i dont quite understand how to use parfor the optimal way"
Read the introductory documentation and study the examples carefully, then.
Guillaume
Guillaume on 15 Jan 2020
Edited: Guillaume on 15 Jan 2020
I can't really comment on the parfor bit as I don't have the parallel toolbox. As far as I know, your parfor code probably works as you want, but it's not clear why you're passing both a portion of data (as dataset_array(i)) and the whole of data.
With regard to your code: numbered variables are always a bad idea, even temporary ones. For a start, they force you to needlessly repeat the same code several times (witness all your splitx = ... lines).
At the very least you should use a loop:
workers = 12;
divider = ceil(specs.numberOfRows/workers);
%so much simpler than numbered variables
dataset_array = cell(1, workers);
for idx = 1:workers
dataset_array{idx} = data((data.ID > divider*(idx-1)) & (data.ID <= divider*idx), :);
end
Probably better:
workers = 12;
destination = discretize(data.ID, workers) ; %split ID into workers bins
dataset_array = cell(1, workers);
for idx = 1:workers
dataset_array{idx} = data(destination == idx, :);
end
or:
workers = 12;
destination = discretize(data.ID, workers) ; %split ID into workers bins
dataset_array = splitapply(@(rows) {data(rows, :)}, (1:height(data))', destination);
15 lines of code down to 3! And if you want to change the number of workers, you just have one line to edit instead of lots of copy/paste or deletions required.
Most likely, your myFunction takes a table as input, not a 1x1 cell array of table, in which case your parfor should be:
newDataset_array = cell(size(dataset_array));
parfor i=1:numel(dataset_array) %don't hardcode values
newDataset_array{i} = myFunction(dataset_array{i}); %Use {} indexing to get the content of the cell
end
Owner5566
Owner5566 on 15 Jan 2020
Thanks, this is what I was looking for.
And yes, up there it was a typo with the brackets.
Owner5566
Owner5566 on 15 Jan 2020
@Guillaume, I get this error:
Error using discretize (line 61)
Second input, edges, must have at least 2 elements.
Guillaume
Guillaume on 15 Jan 2020
It's not in the release notes, but it appears that the number-of-bins option was added in R2016b.
destination = discretize(data.ID, linspace(min(data.ID), max(data.ID), workers))
should work for you.
Owner5566
Owner5566 on 15 Jan 2020
@Guillaume thank you.
Now it works with this:
destination = discretize(data.ID, linspace(min(data.ID), max(data.ID), workers))
dataset_array = splitapply(@(rows) {data(rows, :)}, (1:height(data))', destination);
But it creates only 5 datasets, not 6 as in workers.
Why one less?
Guillaume
Guillaume on 15 Jan 2020
Oh, of course: N edges give N-1 bins. Use
destination = discretize(data.ID, linspace(min(data.ID), max(data.ID), workers + 1));
Owner5566
Owner5566 on 15 Jan 2020
Okay, I already did it that way; I just wanted to know if I missed anything.
But thanks. Works like a charm ;)
Guillaume
Guillaume on 15 Jan 2020
Comment by Owner5566 mistakenly posted as an Answer, moved here:
Now I just need a way to make the big data available to all workers.
The way I do it now, they all get it in the function, which leads to a lot of memory use.
Can't I make it available to all?
I need it for filtering in the functions.


Accepted Answer

Guillaume
Guillaume on 15 Jan 2020


For the record, this is my suggested modification to the original code:
workers = 12;
destination = discretize(data.ID, linspace(min(data.ID), max(data.ID), workers + 1)); %split ID into workers bins
dataset_array = splitapply(@(rows) {data(rows, :)}, (1:height(data))', destination);
which is a good demonstration of why numbered variables are bad. 3 lines instead of 15 and dead easy to change the number of workers.
However that doesn't help at all with your parallel computation. I'm not entirely clear why you'd want to pass the whole dataset to each worker. If all the data is needed by each, then you're sort of losing the benefit of parallelisation. In addition, it may well be that the overhead of passing the data to each worker cancels any speed up in parallelisation.
If you need to pass the whole table to each worker, then there's not much benefit of passing a section of the table at the same time. You're better off just passing the row indices that the worker should work on and let the worker extract these rows. That should result in less overhead:
workers = 12;
destination = discretize(data.ID, linspace(min(data.ID), max(data.ID), workers + 1)); %split ID into workers bins
processeddata = cell(1, workers);
parfor i = 1:numel(workers)
processeddata{i} = dowork(data, destination == i); %pass the whole of data and a logical vector indicating which row the worker should work on
end
with
function result = dowork(data, workingrows)
datatoworkon = data(workingrows, :);
%...
end
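If the whole table really is needed on every worker, a parallel.pool.Constant (introduced in R2015b, so available here) at least limits the transfer to once per pool rather than once per parfor loop. A sketch, reusing the dowork function above:

```matlab
% Wrap the big table once; each worker keeps its own copy for the
% lifetime of the pool instead of receiving it on every parfor run.
C = parallel.pool.Constant(data);
processeddata = cell(1, workers);
parfor i = 1:workers
    processeddata{i} = dowork(C.Value, destination == i);
end
```

Note that each worker process still holds a full copy of the table in memory; the saving is in repeated transfer, not in footprint.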
But, if you can, I would strongly recommend you upgrade to a more recent version of MATLAB. R2016b introduced tall arrays and tall tables, which are basically arrays and tables designed for big data. Operations on these are automatically parallelised if you have the parallel toolbox.
Finally, for processing big data you also have the mapreduce functions, which should be available in your version. Again, mapreduce automatically parallelises the work for you. mapreduce is not suitable for every kind of processing and can be a bit of a learning curve if you've never used it, but it may be useful for what you're doing.
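A minimal mapreduce sketch, computing a per-ID sum (the file name and the Value column are placeholders; note that in R2015b the mapper and reducer must live in their own function files, since scripts cannot contain local functions):

```matlab
% In your main script:
ds = tabularTextDatastore('data.csv', 'SelectedVariableNames', {'ID', 'Value'});
result = mapreduce(ds, @myMapper, @myReducer);
readall(result)

% --- myMapper.m ---
function myMapper(data, ~, intermKV)
% Emit one partial sum per ID found in this chunk of the datastore.
[groups, ids] = findgroups(data.ID);
sums = splitapply(@sum, data.Value, groups);
addmulti(intermKV, num2cell(ids), num2cell(sums));
end

% --- myReducer.m ---
function myReducer(id, valueIter, outKV)
% Combine the partial sums emitted for one ID across all chunks.
total = 0;
while hasnext(valueIter)
    total = total + getnext(valueIter);
end
add(outKV, id, total);
end
```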


Owner5566
Owner5566 on 15 Jan 2020
Edited: Owner5566 on 15 Jan 2020
Yeah, you are right: this way I pass less to the workers.
I will apply it this way.
Thanks again.
But I need the whole dataset available for comparison, because I need it for calculations, adding values according to dependencies, etc.
I am sorry, but I cannot publish what I am working on. I know this would make it easier to understand. But for now you have helped me a lot, and parfor boosted the speed a lot.
Owner5566
Owner5566 on 15 Jan 2020
Edited: Owner5566 on 15 Jan 2020
I still have one problem with your latest version.
I don't know why, but it starts only one parfor iteration.
The other threads are idling.
Owner5566
Owner5566 on 16 Jan 2020
@Guillaume, can you please help me?
Your new way of passing the data is nice, but it starts only one parfor iteration and ends with only one dataset processed.
Guillaume
Guillaume on 16 Jan 2020
As I said, I don't have the parallel toolbox, so I can't really help much with this aspect.
Don't you have to define the number of workers independently of parfor? With parpool, maybe?
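For reference, the pool can be sized explicitly before the parfor runs (6 is just an example; for CPU-bound work, one worker per physical core is often a good starting point):

```matlab
% Reuse an existing pool if one is open, otherwise start one.
pool = gcp('nocreate');     % returns [] instead of creating a pool
if isempty(pool)
    pool = parpool(6);      % request 6 workers explicitly
end
fprintf('Pool has %d workers.\n', pool.NumWorkers);
```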
Owner5566
Owner5566 on 16 Jan 2020
The thing is, your first solution worked; I am using that right now. But your new solution, where I pass the data only once, starts only one parfor iteration, and therefore only one part of the data is processed.
Guillaume
Guillaume on 16 Jan 2020
So, only the first cell of processeddata contains something at the end? If so, sorry, I can't explain it, but as I said, the parallel toolbox is not my area of expertise.
Owner5566
Owner5566 on 16 Jan 2020
Edited: Owner5566 on 16 Jan 2020
Yes, only that one.
The others are not even triggered.
I have an fprintf inside the parfor which prints the number of the iteration that is running. The others never fire, and the pool workers are not doing anything.
Maybe the "numel" does not work:
parfor i = 1:numel(workers)
end
I just put "workers" there in my old version, which is currently working:
parfor i = 1:workers
end
I will try it again with only "workers" and see if that works.
Guillaume
Guillaume on 16 Jan 2020
D'oh! I didn't notice the numel, which is clearly a typo. numel(workers) is always going to be 1. It should indeed have been
parfor i = 1:workers
    %...
end
or
parfor i = 1:numel(processeddata)
    %...
end
Owner5566
Owner5566 on 16 Jan 2020
Okay, then thank you again!
