Big Data Analysis with Linear Regression

Question

Joao Saavedra on 14 Jul 2021

Direct link to this question:

https://fr.mathworks.com/matlabcentral/answers/878713-big-data-analysis-with-linear-regression

Answered: Alvaro on 29 Dec 2022

Hi all,

I am working on a project to predict how many CPUs will be needed to process a huge NetCDF file (.nc) of climate data in less than 2 hours (7,200 s). Run sequentially, it takes more than 100,000 seconds.

I have the entire program done to process the data both sequentially and in parallel, with up to 8 workers (the limit of my CPU). The program takes the data file, which holds an entire day of climate data, and divides it into hours (25) so that it can be processed hour by hour. I used a stopwatch in the code to record the time taken for each number of workers.

To make it easier to process and test the parallel processing, I am using a subset of the data (the entire file has more than 270,000 blocks of data).

How can I use the time taken on a subset of the data to extrapolate the number of CPUs needed for the entire data file? I have been lost in this problem for the entire day...

Thanks in advance!


Answer 1

Alvaro on 29 Dec 2022

Direct link to this answer:

https://fr.mathworks.com/matlabcentral/answers/878713-big-data-analysis-with-linear-regression#answer_1137947

It's not straightforward to calculate the number of workers that you would need to process your data in less than 2 hours.

https://www.mathworks.com/matlabcentral/answers/600148-will-increasing-the-number-of-workers-in-parallel-computing-matlab-harm

Amdahl's law might give you a bit of a formal approach if you are looking to write down some rough calculations.

https://en.wikipedia.org/wiki/Amdahl%27s_law
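To make the rough calculation concrete: Amdahl's law says that if a fraction p of the run time can be parallelized, the speedup on N workers is S(N) = 1 / ((1 - p) + p / N). The sketch below (Python, purely illustrative) solves that for N using the timings from the question, 100,000 s serial and a 7,200 s target. The values of p are assumptions for illustration, not measurements, and the function names are made up for this example.

```python
import math

def amdahl_speedup(p, n):
    """Speedup predicted by Amdahl's law for parallel fraction p on n workers."""
    return 1.0 / ((1.0 - p) + p / n)

def workers_needed(p, target_speedup):
    """Smallest worker count that reaches target_speedup, or None if unreachable."""
    # Rearranging 1 / ((1 - p) + p / n) >= S gives n >= p / (1/S - (1 - p)).
    denom = 1.0 / target_speedup - (1.0 - p)
    if denom <= 0:
        return None  # the serial part alone already uses up the time budget
    n = max(1, math.ceil(p / denom))
    # Guard against floating-point rounding in either direction.
    while amdahl_speedup(p, n) < target_speedup:
        n += 1
    while n > 1 and amdahl_speedup(p, n - 1) >= target_speedup:
        n -= 1
    return n

SERIAL_SECONDS = 100_000.0   # from the question
TARGET_SECONDS = 7_200.0     # from the question (2 hours)
required_speedup = SERIAL_SECONDS / TARGET_SECONDS

# ASSUMED parallel fractions, for illustration only.
for p in (1.0, 0.99, 0.95, 0.90):
    print(f"p = {p:.2f}: workers needed = {workers_needed(p, required_speedup)}")
```

Since the required speedup is 100,000 / 7,200 ≈ 13.9, the target is only reachable if p is above 1 - 1/13.9 ≈ 0.928. With p = 0.99 the formula gives 16 workers, and with p = 0.95 it gives 44, so the answer is very sensitive to how much of the run is truly parallel.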

I would try to determine the number of cores you need by trial and error. Since you are looking for roughly a 14x speedup over the serial computation (100,000 s / 7,200 s ≈ 13.9), a rough first guess would be to start with 14 workers in your cluster and time the run. This assumes that your computation parallelizes almost perfectly, but, as noted above, it's likely not that simple, so in practice you will probably need more. From there, try more or fewer cores until you hit the time you are looking for. If you need to analyze a large number of those data files in less than 2 hours, it could be worth doing a more thorough experiment to determine the optimal number of workers for your process.
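One way such an experiment could look, as a sketch rather than a recipe: time the subset with several worker counts, fit the model T(n) = a + b / n by ordinary least squares (a is the part that does not shrink with more workers, b the part that does), scale to the full file, and look for the smallest n that fits the budget. The timings below are invented placeholders, the function names are hypothetical, and the scaling step assumes run time grows linearly with the amount of data, which you would need to check. Python is used only for illustration.

```python
def fit_serial_parallel(workers, seconds):
    """Least-squares fit of T(n) = a + b / n; returns (a, b)."""
    xs = [1.0 / n for n in workers]          # regress time on 1/n
    mx = sum(xs) / len(xs)
    my = sum(seconds) / len(seconds)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, seconds))
    b = sxy / sxx
    a = my - b * mx
    return a, b

def min_workers(a, b, budget_s, scale=1.0, max_workers=1024):
    """Smallest n with scale * (a + b / n) <= budget_s, or None if none found."""
    for n in range(1, max_workers + 1):
        if scale * (a + b / n) <= budget_s:
            return n
    return None

# HYPOTHETICAL subset timings (seconds); replace with your stopwatch readings.
subset_workers = [1, 2, 4, 8]
subset_seconds = [1000.0, 505.0, 257.5, 133.75]

a, b = fit_serial_parallel(subset_workers, subset_seconds)

# ASSUMPTION: time scales linearly with data size, so the full file costs
# (full serial time) / (subset serial time) times as much as the subset.
scale = 100_000.0 / subset_seconds[0]

print("fitted a, b:", a, b)
print("workers needed for 7200 s:", min_workers(a, b, 7200.0, scale=scale))
```

The fit only describes the range you measured (1 to 8 workers here), so treat anything it predicts beyond that as a starting point for the next timed run, not as a guarantee.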
