When does distributing data improve computation time over overhead expense?

Question

0 votes

I am trying to understand how to use distributed arrays or spmd code in order to optimize my computation. Specifically, I am trying to implement this in the context of an iterative solver. Essentially, within my algorithm code, I have just converted all of my input and starting vectors to distributed arrays. For example, the right hand side vector b and input matrix A become

A = distributed(A);
b = distributed(b);

The problem is that each iteration of the solver is now taking on the order of 100 times longer per iteration (and thus per solve) compared to when I did not distribute the data.

I am wondering if there is some threshhold at which the distributed memory wins out over not distributing, perhaps if the problem size gets large enough. My initial tests were on relatively small sparse matrices, around 3000x3000. But I have plenty more matrices of larger size to test on, up to around 10E6x10E6.

I also did some rudimentary tests just computing dot products of distributed random vectors. The time to compute a simple dot product where the vector was stored across distributed memory was consistently on the order of 100 times longer to compute compared to not distributing the vector.

rng('default');
v = rand(1,2000000); % Test vector across different sizes
tic
vv_true = v*v' % 1) compute dot product without distributing v
toc
v = distributed(v);
spmd
        v % Verify the dot product distributed across workers
end
tic
vv_dist = v*v' % 2) compute dot product after distributing v
toc

This simple test is important to me because the iterative solver I am working on uses a substantially smaller number of dot products compared to traditional iterative methods like BiCGStab. In my tests, I really would like to show the speedup from the algorithm structure when the problem is done in the context of distributed memory, as I expect it would be in practice if the solver were used to solve really large data sets.

So, it is in one sense reassuring to see that computing dot products does take longer in a distributed memory context compared to a non-distributed (is there another term?) context. But, in another sense it is incredibly impractical to solve these intensive and time consuming problems if it is actually going to take 100 times longer. I simply don't even have the compute time on the cluster I work with to turn a 6-hour problem into a 600-hour one.

My question is three-fold:

Am I using distributed memory correctly and in the most efficient way in the context of an iterative solver?
At what point does the overhead cost of transferring the results between workers become smaller than the benefit of having multiple workers do the work?
In the simple dot product code above, should it really take ~100 times longer to compute the dot product on a distributed data framework compared to a non-distributed one?

Thank you!

1 commentaire
Afficher -1 commentaires plus anciens Masquer -1 commentaires plus anciens

Oli Tissot le 26 Juin 2026

A few things to note:

In general, if you problem fits in a single node then yes distributed will pretty much always be slower. This is because the communication cost is higher. MATLAB itself is multi-threaded. distributed is not like multi-threading even for a "Processes" pool, it mimics a truly distributed environment so it needs to do memory copy, has extra logic assuming several workers are doing the computation together, etc.
In your particular case, if you simply rewrite a bit the way to compute the dot product by assuming v is a column vector and you compute v'*v then the computation is much quicker for distributed (it is "only" 18x slower than MATLAB...). The reason is that MATLAB has a way to parse '* as a whole and not transpose the left-hand side explicitly - and distributed does that as well. Also printing the result for distributed induces an extra cost to bring part of the result back to the client.

Connectez-vous pour commenter.

Connectez-vous pour répondre à cette question.

Follow Question

Answer 1

Stephen23 le 25 Juin 2026

Déplacé(e) : Matt J le 25 Juin 2026

1 vote

Let’s break your three questions down clearly.

1) Yes, syntactically and conceptually, your usage is correct. However, correctness is not the same as efficiency. The issue is where the work is happening and what communication is required.

2) Distributed memory helps when compute cost >> communication cost. That is the key. For example:

a) Dot product: ca. 2 flops per element has very low compute cost, but global communication is very expensive relative to compute cost. Terrible for distributed.

b) Dense matrix-matrix multiply: High compute intensity. Excellent scaling

Rule of thumb: distributed MATLAB starts to make sense when the data does not fit in memory on one node OR per-worker workload is large enough (seconds, not milliseconds). At your test sizes:

3000×3000 sparse matrix is tiny in HPC terms
Even 2e6 vector is still modest

Communication cost dominates, so distributed is slower.

3) Yes

1 commentaire
Afficher -1 commentaires plus anciens Masquer -1 commentaires plus anciens

Benjamin le 25 Juin 2026

Thanks for your help.

In my case, what I really wish to demonstrate is that my solver algorithm adapts well to very large HPC context because it uses so few dot products compared to traditional solvers like BiCGStab. My understanding is that for very large systems, forgoing dot products will be advantageous because you no longer have to worry about the communication overhead. (As I've discussed, it can take ~100 times longer when distributed over several workers.)

Do you think there is a way to demonstrate this on a smaller "test" matrix (say, 10 million by 10 million) without having to scale to ultra-large ones? Of course, I need it to finish in a reasonable amount of time. Would I have to scale to GPU computing in this case or, for the sake of illustration, can it be demonstrated in a CPU context like I described in my post?

Connectez-vous pour commenter.

Answer 2

dpb le 25 Juin 2026

Déplacé(e) : dpb le 25 Juin 2026

0 votes

A dot product is quite low computational burden as compared to memory fetch.

Unless the sparsity is so low as to make the memory requirements for the storage of the elements exceed physical memory, you'll invariably do better single-threaded.

The fastest times will be by using the GPU and gpuArray. If you can stand it, using single instead of double precision cuts down the memory by 2X and will likely also show a speed improvement thereby.

2 commentaires
Afficher Aucune Masquer Aucune

Benjamin le 25 Juin 2026

Hi dpb,

Thanks for the comments. Does this mean that using distributed memory is always going to be a negative effect when solving large sparse systems of linear equations? I don't see how distributed memory could supply an advantage if all of the linear algebra operations are low computational burdens, unless your statement is true only for dot products but not for matrix-vector products (which are more computationally intensive).

Even so, in some tests I did I found that the matrix vector products slowed by approximately the same factor compared to when they were done on a single process and none of the data were distributed.

--Ben

dpb le 25 Juin 2026

To win with distributed memory the Arithmetic Intensity has to be greater than the hardware limitation of compute throughput over memory bandwidth.

It's pretty simple to illustrate by example (borrowed from an NVIDIA source some time back)--

If you had a hypothetical device with 100 Gb/s of memory bandwidth, then you have at most 25 floats/s of memory bandwidth. If the device had 625 FLOP/s of peak single precision arithmetic throughput, then any code which achieved peak arithmetic and memory throughput would have to be performing 25 FLOP per memory transaction. If the code performs less than that, it will be memory bandwidth bound.

One caveat if went to GPU is whether the data already reside on the GPU itself or not -- the overhead to get it there and retrieve results has to be counted in the overall expense as well -- if the data were already there, then a ratio less than minimum might still outperform the base CPU overall because of the higher processing rate of the GPU plus data transfer is could be less than straight CPU compute without the offload memory.

You might want to see <Linear Systems Direct Methods Distributed> and the corollary <Linear Systems Iterative Methods Distributed>

Connectez-vous pour commenter.

Answer 3

Matt J le 25 Juin 2026

Modifié(e) : Matt J le 25 Juin 2026

0 votes

Are your parallel workers all on a single computer, or are they distributed across a cloud? If all of this is on a single computer, it is unlikely that you will be able to use SPMD or other CPU-based parallelization functions to accelerate basic matrix operations. Remember, standard Matlab execution already uses internal parallelization routines to distribute matrix operations across local CPU cores. Typically, it does so in a much more optimized way than SPMD et al. can achieve. If you want to accelerate things at the level of individual matrix operations, you would probably want to use gpuArrays.

2 commentaires
Afficher Aucune Masquer Aucune

Benjamin le 26 Juin 2026

Hi Matt,

Thanks for your reply. It's not so much about accelerating the matrix operations as it is demonstrating the effectiveness of a method that altogether forgoes dot products compared to other methods (e.g., BiCGStab) that must compute dot products.

To answer your question, currently my parallel workers are all on the same HPC. I have been running CPU computations, usually only requesting a small amount of hardware (e.g., 16 cores on one node). Perhaps I need to scale to gpuArrays (which I have done on other testing scenarios) in order to distribute to multiple workers in a way that will not take an unreasonable amount of time.

As I've said, the desired goal (if possible) is to show that forgoing dot product in my new solver is even more advantageous in distributed compute contexts because you therefore forgo the expensive overhead communication. Is it possible to set up a computing environment that demonstrates this? So far I have only observed that the overhead costs of distributed data is so great that, given the compute resources I've allocated, the job won't even finish in an amount of time I can request from the HPC.

I hope this makes sense. Thank you in advance for your help.

--Ben

Matt J le 26 Juin 2026

Modifié(e) : Matt J le 26 Juin 2026

It's not so much about accelerating the matrix operations as it is demonstrating the effectiveness of a method that altogether forgoes dot products compared to other methods

We haven't seen your alternative, but it's hard to imagine a dot product being the computational bottleneck in anything, and certainly not in Matlab. It is among the most optimized computations.

Connectez-vous pour commenter.

When does distributing data improve computation time over overhead expense?

1 commentaire
Afficher -1 commentaires plus anciens Masquer -1 commentaires plus anciens

Réponse acceptée

1 commentaire
Afficher -1 commentaires plus anciens Masquer -1 commentaires plus anciens

Plus de réponses (2)

2 commentaires
Afficher Aucune Masquer Aucune

2 commentaires
Afficher Aucune Masquer Aucune

Catégories

Produits

Version

Tags

Community Treasure Hunt

When does distributing data improve computation time over overhead expense?

1 commentaire Afficher -1 commentaires plus anciens Masquer -1 commentaires plus anciens

Réponse acceptée

1 commentaire Afficher -1 commentaires plus anciens Masquer -1 commentaires plus anciens

Plus de réponses (2)

2 commentaires Afficher Aucune Masquer Aucune

2 commentaires Afficher Aucune Masquer Aucune

Catégories

Produits

Version

Tags

Voir également

Community Treasure Hunt

1 commentaire
Afficher -1 commentaires plus anciens Masquer -1 commentaires plus anciens

1 commentaire
Afficher -1 commentaires plus anciens Masquer -1 commentaires plus anciens

2 commentaires
Afficher Aucune Masquer Aucune

2 commentaires
Afficher Aucune Masquer Aucune