When does distributing data improve computation time over overhead expense?
Afficher commentaires plus anciens
I am trying to understand how to use distributed arrays or spmd code in order to optimize my computation. Specifically, I am trying to implement this in the context of an iterative solver. Essentially, within my algorithm code, I have just converted all of my input and starting vectors to distributed arrays. For example, the right hand side vector b and input matrix A become
A = distributed(A);
b = distributed(b);
The problem is that each iteration of the solver is now taking on the order of 100 times longer per iteration (and thus per solve) compared to when I did not distribute the data.
I am wondering if there is some threshhold at which the distributed memory wins out over not distributing, perhaps if the problem size gets large enough. My initial tests were on relatively small sparse matrices, around 3000x3000. But I have plenty more matrices of larger size to test on, up to around 10E6x10E6.
I also did some rudimentary tests just computing dot products of distributed random vectors. The time to compute a simple dot product where the vector was stored across distributed memory was consistently on the order of 100 times longer to compute compared to not distributing the vector.
rng('default');
v = rand(1,2000000); % Test vector across different sizes
tic
vv_true = v*v' % 1) compute dot product without distributing v
toc
v = distributed(v);
spmd
v % Verify the dot product distributed across workers
end
tic
vv_dist = v*v' % 2) compute dot product after distributing v
toc
This simple test is important to me because the iterative solver I am working on uses a substantially smaller number of dot products compared to traditional iterative methods like BiCGStab. In my tests, I really would like to show the speedup from the algorithm structure when the problem is done in the context of distributed memory, as I expect it would be in practice if the solver were used to solve really large data sets.
So, it is in one sense reassuring to see that computing dot products does take longer in a distributed memory context compared to a non-distributed (is there another term?) context. But, in another sense it is incredibly impractical to solve these intensive and time consuming problems if it is actually going to take 100 times longer. I simply don't even have the compute time on the cluster I work with to turn a 6-hour problem into a 600-hour one.
My question is three-fold:
- Am I using distributed memory correctly and in the most efficient way in the context of an iterative solver?
- At what point does the overhead cost of transferring the results between workers become smaller than the benefit of having multiple workers do the work?
- In the simple dot product code above, should it really take ~100 times longer to compute the dot product on a distributed data framework compared to a non-distributed one?
Thank you!
1 commentaire
Oli Tissot
il y a environ 10 heures
A few things to note:
- In general, if you problem fits in a single node then yes distributed will pretty much always be slower. This is because the communication cost is higher. MATLAB itself is multi-threaded. distributed is not like multi-threading even for a "Processes" pool, it mimics a truly distributed environment so it needs to do memory copy, has extra logic assuming several workers are doing the computation together, etc.
- In your particular case, if you simply rewrite a bit the way to compute the dot product by assuming v is a column vector and you compute v'*v then the computation is much quicker for distributed (it is "only" 18x slower than MATLAB...). The reason is that MATLAB has a way to parse '* as a whole and not transpose the left-hand side explicitly - and distributed does that as well. Also printing the result for distributed induces an extra cost to bring part of the result back to the client.
Réponses (3)
Let’s break your three questions down clearly.
1) Yes, syntactically and conceptually, your usage is correct. However, correctness is not the same as efficiency. The issue is where the work is happening and what communication is required.
2) Distributed memory helps when compute cost >> communication cost. That is the key. For example:
a) Dot product: ca. 2 flops per element has very low compute cost, but global communication is very expensive relative to compute cost. Terrible for distributed.
b) Dense matrix-matrix multiply: High compute intensity. Excellent scaling
Rule of thumb: distributed MATLAB starts to make sense when the data does not fit in memory on one node OR per-worker workload is large enough (seconds, not milliseconds). At your test sizes:
- 3000×3000 sparse matrix is tiny in HPC terms
- Even 2e6 vector is still modest
Communication cost dominates, so distributed is slower.
3) Yes
1 commentaire
Benjamin
il y a environ 2 heures
A dot product is quite low computational burden as compared to memory fetch.
Unless the sparsity is so low as to make the memory requirements for the storage of the elements exceed physical memory, you'll invariably do better single-threaded.
The fastest times will be by using the GPU and gpuArray. If you can stand it, using single instead of double precision cuts down the memory by 2X and will likely also show a speed improvement thereby.
2 commentaires
Benjamin
il y a environ une heure
dpb
il y a 6 minutes
To win with distributed memory the Arithmetic Intensity has to be greater than the hardware limitation of compute throughput over memory bandwidth.
It's pretty simple to illustrate by example (borrowed from an NVIDIA source some time back)--
If you had a hypothetical device with 100 Gb/s of memory bandwidth, then you have at most 25 floats/s of memory bandwidth. If the device had 625 FLOP/s of peak single precision arithmetic throughput, then any code which achieved peak arithmetic and memory throughput would have to be performing 25 FLOP per memory transaction. If the code performs less than that, it will be memory bandwidth bound.
One caveat if went to GPU is whether the data already reside on the GPU itself or not -- the overhead to get it there and retrieve results has to be counted in the overall expense as well -- if the data were already there, then a ratio less than minimum might still outperform the base CPU overall because of the higher processing rate of the GPU plus data transfer is could be less than straight CPU compute without the offload memory.
You might want to see <Linear Systems Direct Methods Distributed> and the corollary <Linear Systems Iterative Methods Distributed>
Are your parallel workers all on a single computer, or are they distributed across a cloud? If all of this is on a single computer, it is unlikely that you will be able to use SPMD or other CPU-based parallelization functions to accelerate basic matrix operations. Remember, standard Matlab execution already uses internal parallelization routines to distribute matrix operations across local CPU cores. Typically, it does so in a much more optimized way than SPMD et al. can achieve. If you want to accelerate things at the level of individual matrix operations, you would probably want to use gpuArrays.
2 commentaires
Benjamin
il y a environ 15 heures
It's not so much about accelerating the matrix operations as it is demonstrating the effectiveness of a method that altogether forgoes dot products compared to other methods
We haven't seen your alternative, but it's hard to imagine a dot product being the computational bottleneck in anything, and certainly not in Matlab. It is among the most optimized computations.
Catégories
En savoir plus sur Parallel Computing Fundamentals dans Centre d'aide et File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!