Big Data Analysis with Linear Regression
I am doing a project to predict how many CPUs will be needed to process a huge climate-data file (.nc) in less than 2 hours (7,200 s). Sequentially it takes more than 100,000 seconds.
I have the entire program done to process the data sequentially and in parallel, with up to 8 workers (the limit of my CPU). The program takes the data file, which holds an entire day of climate data, and divides it into hours (25) so it can be processed hourly. After the processing is done, I used a stopwatch in the code to record the time taken for each number of workers.
To make it easier to process and to test the parallel processing, I am using a subset of the data (the entire file has more than 270,000 blocks of data).
How can I use the time taken on a subset of the data to extrapolate the number of CPUs needed for the entire data file? I have been lost in this problem for the entire day...
Thanks in advance!
Alvaro on 29 Dec 2022
It's not straightforward to calculate the number of workers you would need to process your data in less than 2 hours.
Amdahl's law might give you a somewhat formal approach if you are looking to write down some rough calculations.
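As a rough sketch of that calculation (in Python purely for illustration): estimate the serial fraction of your program from two of your stopwatch measurements, then solve Amdahl's law for the worker count that hits the 7,200 s target. The ~100,000 s serial time comes from your post; the 8-worker time below is a hypothetical placeholder for your own measurement.

```python
import math

def workers_needed(t1, tp, p, target):
    """Estimate the workers needed to reach `target` seconds, via Amdahl's law.

    t1: measured serial time; tp: measured time on p workers.
    Amdahl's law: T(n) = t1 * (s + (1 - s) / n), where s is the serial fraction.
    """
    # Serial fraction implied by the two measurements:
    # tp / t1 = s + (1 - s) / p  =>  s = (p * tp / t1 - 1) / (p - 1)
    s = (p * tp / t1 - 1) / (p - 1)
    frac = target / t1
    if frac <= s:
        return None  # target unreachable: the serial part alone takes longer
    # Solve s + (1 - s) / n <= frac for n
    return math.ceil((1 - s) / (frac - s))

# Serial time from the post; the 8-worker time (15,000 s) is hypothetical.
print(workers_needed(100_000, 15_000, 8, 7_200))  # 23 with these numbers
```

With those example timings the implied serial fraction is only ~3%, yet it already pushes the estimate from the naive 14 workers up to 23, which is why trial and error around the estimate is still needed.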
I would try to determine the number of workers you need by trial and error. Since you are looking for approximately a 14x speedup over the serial computation, a rough guess would be to start with 14 workers in your cluster and clock the time. This assumes that your computations are highly suited to parallelization, but, as noted above, it's likely not that simple. From there, try more or fewer workers until you can fine-tune it to the time you are looking for. It could be worth doing a more thorough experiment to determine the optimal number of workers for your process if you need to analyze a large number of these data files in less than 2 hours.
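On the subset question: if the processing time is roughly linear in the number of data blocks, you can scale each measured subset time by the block ratio and then check which worker counts beat the 7,200 s target on the full file. A minimal sketch (in Python for illustration; the subset timings and the 2,700-block subset size are hypothetical placeholders for your own measurements):

```python
def extrapolate(t_subset, subset_blocks, total_blocks):
    """Scale a subset runtime up to the full file, assuming the time
    grows linearly with the number of data blocks processed."""
    return t_subset * total_blocks / subset_blocks

# Hypothetical stopwatch times (seconds) per worker count,
# measured on a 2,700-block subset of the 270,000-block file.
subset_times = {1: 1_000, 2: 520, 4: 280, 8: 160}

full_times = {w: extrapolate(t, 2_700, 270_000) for w, t in subset_times.items()}
print(full_times)  # {1: 100000.0, 2: 52000.0, 4: 28000.0, 8: 16000.0}

# Worker counts whose extrapolated full-file time meets the 2-hour target:
fast_enough = [w for w, t in sorted(full_times.items()) if t < 7_200]
print(fast_enough)  # [] -> none of the measured counts suffice; try more workers
```

Keep in mind the linear-scaling assumption can break down (I/O, memory pressure, scheduling overhead), so treat the extrapolated numbers as a starting point for the trial-and-error above, not a guarantee.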