Memory optimization and code speed-up
Hi! I have a function to compute what is called the Fuzzy Rand Index. It is a way to compare a fuzzy (soft) or probabilistic data partition (a clustering result) with a reference hard partition. The math behind this is basically an algebra with t-norms and t-conorms (s-norms). Among the possible choices, we can use min() as the t-norm and max() as the s-norm, or the algebraic product and probabilistic OR (probor) as the t-norm and s-norm. In the present case I'm using min() and max().
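As a minimal illustration of the two operator pairs mentioned above (the variable names here are my own, not from the original code):

```matlab
% Two common t-norm / s-norm pairs, applied element-wise.
a = [0.2 0.7 0.9];
b = [0.5 0.6 0.1];

tn_min  = min(a, b);        % minimum t-norm          -> [0.2 0.6 0.1]
sn_max  = max(a, b);        % maximum s-norm          -> [0.5 0.7 0.9]

tn_prod = a .* b;           % algebraic product t-norm -> [0.10 0.42 0.09]
sn_prob = a + b - a .* b;   % probabilistic OR (probor) s-norm
```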
The input data are 2 partition matrices (N×d), in which N is the number of data points and d the number of clusters. Each element h_ji of a partition matrix is precisely the probability that data point j belongs to cluster i. In general, N >> d (e.g. N = 20000 to 50000 points, while d is typically 2 to 5 clusters).
The math behind the index is such that I have to compute a sort of "pairwise min" between each data point and all others within the same column (min(h_ji, h_ki)), so I'll have N*(N-1)/2 values for each of the d columns. To these d arrays of size N*(N-1)/2, I must apply the max s-norm element-wise to end up with a quantity V that is an array of size N*(N-1)/2.
I must also compute the same "pairwise min" t-norm between every data point and each other one, but mixing the columns, like min(h_ji, h_kp). So I'll end up with factorial(d)/factorial(d-2) = d*(d-1) arrays of size N*(N-1)/2, and to these I must also apply the max s-norm element-wise to end up with a quantity X that is an array of size N*(N-1)/2.
What I call "pairwise min" works much like pdist does: comparing each point with the following ones, but not with the previous ones.
Later, this V and X from one partition matrix are compared, via the min t-norm, with a V0 and X0 computed from the reference partition matrix.
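For anyone following along, a direct (slow, but unambiguous) reference implementation of V and X as described above might look like this. This is my own sketch of the definitions, not the optimized code from the question; the function name is hypothetical:

```matlab
function [V, X] = fuzzy_rand_vx_ref(U)
% Brute-force O(N^2 * d^2) reference for the quantities V and X.
% U is an N-by-d partition matrix; rows of U are data points.
[N, d] = size(U);
V = zeros(N*(N-1)/2, 1);
X = zeros(N*(N-1)/2, 1);
idx = 0;
for j = 1:N-1
    for k = j+1:N
        idx = idx + 1;
        % V: max s-norm over same-column pairwise mins min(h_ji, h_ki)
        V(idx) = max(min(U(j,:), U(k,:)));
        % X: max s-norm over mixed-column pairwise mins min(h_ji, h_kp), i ~= p
        best = 0;
        for i = 1:d
            for p = 1:d
                if i ~= p
                    best = max(best, min(U(j,i), U(k,p)));
                end
            end
        end
        X(idx) = best;
    end
end
end
```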
I have implemented this in 3 different ways.
The slowest one, which also has the highest memory consumption, is the most compact form:
% you can use as an example U = rand(20000,4);
U = rand(30000,4); % just for test
[N,d] = size(U); % U is a partition matrix
comb = ones(d,2).*(1:d)'; % same-column index pairs: [1 1; 2 2; ...; d d]
V = max(cell2mat(arrayfun(@(k) cell2mat(arrayfun(@(j) pdist2(U((j+1):N,comb(k,1)),U(j,comb(k,2)),@min),(1:(N-1)).','UniformOutput',false)),1:size(comb,1),'UniformOutput',false)).');
comb = nchoosek(1:d,2);
comb = [comb;fliplr(comb)]; % all ordered pairs of distinct columns
X = max(cell2mat(arrayfun(@(k) cell2mat(arrayfun(@(j) pdist2(U((j+1):N,comb(k,1)),U(j,comb(k,2)),@min),(1:(N-1)).','UniformOutput',false)),1:size(comb,1),'UniformOutput',false)).');
As an intermediate version, I tried to use parallel code like this:
% you can use as an example U = rand(20000,4);
U = rand(30000,4); % just for test
[N,d] = size(U); % U is a partition matrix
comb = nchoosek(1:d,2);
ncomb = size(comb,1);
X = cell(N-1,1);
V = cell(N-1,1);
parfor j = 1:N-1
    tmp = zeros(N-j,1);
    for k = 1:ncomb
        tmp = max(tmp, max(min(U(j,comb(k,1)), U((j+1):end,comb(k,2))), min(U(j,comb(k,2)), U((j+1):end,comb(k,1)))));
    end
    X{j} = tmp;
    tmp = zeros(N-j,1);
    for k = 1:d
        tmp = max(tmp, min(U(j,k), U((j+1):end,k)));
    end
    V{j} = tmp;
end
X2 = cell2mat(X);
V2 = cell2mat(V);
Finally, the version that gave me the best results in terms of speed and memory usage is the following, where I did a trick using circshift and used the single type, as double precision is not important to me:
% you can use as an example A = rand(20000,4);
A = rand(30000,4,'single'); % just for test
[N,d] = size(A); % A is a partition matrix
NC = fix(N/2);
numelements = N*(N-1)/2;
V = zeros(numelements,1,'single');
X = zeros(numelements,1,'single');
temp = zeros(N,1,'single');
for nc1 = 1:NC-1
    Ac = circshift(A,-nc1,1);
    V((nc1-1)*N + 1 : nc1*N) = max(min(A,Ac),[],2);
    X((nc1-1)*N + 1 : nc1*N) = func1(A,Ac,temp,d);
    temp(:) = 0;
end
if N/2 - NC == 0 % N is even: the last shift pairs each row with its antipode
    V((NC-1)*N + 1 : end) = max(min(A(1:N/2,:),A(N/2+1:end,:)),[],2);
    X((NC-1)*N + 1 : end) = func2(A,N,d,temp(1:N/2));
else
    Ac = circshift(A,-NC,1);
    V((NC-1)*N + 1 : end) = max(min(A,Ac),[],2);
    X((NC-1)*N + 1 : end) = func1(A,Ac,temp,d);
end

function temp = func1(A,Ac,temp,d)
for nc2 = 1:d-1
    temp = max(temp,max(min(A,circshift(Ac,-nc2,2)),[],2));
end
end

function temp = func2(A,N,d,temp)
for nc2 = 1:d-1
    temp = max(temp,max(min(A(1:N/2,:),circshift(A(N/2+1:end,:),-nc2,2)),[],2));
end
end
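For readers puzzled by why the circshift trick enumerates every unordered pair exactly once, here is a small self-contained illustration (my own sketch, separate from the code above):

```matlab
% Shifting by nc1 pairs row j with row mod(j + nc1 - 1, N) + 1.
% For nc1 = 1..floor(N/2) every unordered pair {j, j+nc1} appears once;
% when N is even, the shift by N/2 pairs each row with its antipode
% twice, so only the first N/2 rows of that shift are kept.
N = 6;
pairs = [];
for nc1 = 1:floor(N/2)
    rows = (1:N).';
    shifted = mod(rows + nc1 - 1, N) + 1;
    if 2*nc1 == N
        rows = rows(1:N/2);            % keep half of the antipodal shift
        shifted = shifted(1:N/2);
    end
    pairs = [pairs; sort([rows shifted], 2)]; %#ok<AGROW>
end
% pairs now lists all N*(N-1)/2 = 15 unordered pairs exactly once
assert(size(unique(pairs, 'rows'), 1) == N*(N-1)/2)
```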
The problem is that this calculation must be made many times, to compare a lot of different clustering results against the reference one, and for datasets with N > 20000 things take a lot of time. Do you think there is room for more optimization, to make this faster and more memory-efficient than the 3rd version with the circshifts?
Also, as this is done multiple times (2300 times for each dataset), this function is called inside a parfor, so each worker executes this function. But min and max are themselves multi-threaded functions. My approach was to set up the parpool with 6 workers (12 threads) and leave 2 workers (4 threads) out of the parpool, hoping that those 4 threads would be used for the min() and max() multi-threading. Is that right? I thought that using all workers within the parpool would disable the multi-threading of min() and max() and make things slower.
10 comments
Mike Croucher
10 Jan 2025
I ran your final script to see how fast it was and got an error
Unrecognized function or variable 'A'.
Error in myScript (line 10)
Ac = circshift(A,-nc1,1);
埃博拉酱
12 Jan 2025
I have to say that your circshift is a genius idea, seemingly faster than pdist in MATLAB. Given this, it's hard for me to imagine any more room for improvement.
Walter Roberson
17 Jan 2025
Well, your latest code runs. Whether it produces the right answer is a different question.
Mike Croucher
17 Jan 2025
@Lorenco Santos Vasconcelos Yes, it runs. Thank you. I spent some time trying to make it faster but failed. I also tried using a GPU and it didn't work well.
I think that your strategy of parallelising at the level above this function is the right approach. Threads on parpools are complicated things, and the behaviour of the number of threads per worker depends on whether you have a process pool or a thread pool.
You said this: "My approach was to set up the parpool with 6 workers (12 threads) and leave 2 workers (4 threads) out of the parpool, hoping that those 4 threads would be used for the min() and max() multi-threading. Is that right? I thought that using all workers within the parpool would disable the multi-threading of min() and max() and make things slower."
This is not correct. I'll write up how it works in a blog post soon.
What I suggest you do is take an empirical approach. You say you have 2300 iterations of this per dataset. I would set up a version that works through 100-200 iterations and simply time it: first using no pool, then a pool of 2 workers, 4 workers, and so on. Whatever is fastest is your sweet spot.
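That empirical sweep could be scripted along these lines (a sketch: computeIndex is a placeholder name for the real function under test, and the pool sizes are examples):

```matlab
% Time a fixed batch of iterations at several pool sizes and pick the best.
poolSizes = [0 2 4 6 8];   % 0 = serial baseline
t = zeros(size(poolSizes));
for s = 1:numel(poolSizes)
    delete(gcp('nocreate'));           % shut down any existing pool
    if poolSizes(s) > 0
        parpool(poolSizes(s));
    end
    tic
    if poolSizes(s) == 0
        for it = 1:150                 % serial baseline, ~100-200 iterations
            computeIndex(it);          % placeholder for the real work
        end
    else
        parfor it = 1:150
            computeIndex(it);
        end
    end
    t(s) = toc;
end
[~, best] = min(t);
fprintf('Fastest with %d workers (%.1f s)\n', poolSizes(best), t(best));
```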
If you work at an institution that has a High Performance Computing cluster, even better.
It's not always the case that using all available cores results in the fastest execution, and there can be many reasons for this. It's often algorithm- and hardware-dependent.
Accepted Answer
More Answers (1)
Jeremy Hughes
21 Jan 2025
Following on from the comments above:
"where I did the trick using circshifts and have used single type" - @Lorenco Santos Vasconcelos
I checked, and using single doesn't appear to improve the performance unless you use single everywhere; since your input is double, mixing in single doesn't actually do what you expect. Likely it's being upcast internally.
But if you use single throughout, then I see a difference (from about ~13 seconds on my machine down to ~9). That might not be what you want if you need to preserve the input precision, but if you really don't care about that level of precision, cast A to single and get a ~25% speedup without changing the code. This is my version, removing temporaries and making it agnostic to the type of A.
function [V,X] = repro_func(A)
[N,~] = size(A); % A is a partition matrix
NC = fix(N/2);
numelements = N*(N-1)/2;
V = zeros(numelements,1,"like",A);
X = zeros(numelements,1,"like",A);
for nc1 = 1:NC-1
    Ac = circshift(A,-nc1,1);
    V((nc1-1)*N + 1 : nc1*N) = max(min(A,Ac),[],2);
    X((nc1-1)*N + 1 : nc1*N) = minmax1(A,Ac);
end
if N/2 - NC == 0
    V((NC-1)*N + 1 : end) = max(min(A(1:N/2,:),A(N/2+1:end,:)),[],2);
    X((NC-1)*N + 1 : end) = minmax2(A);
else
    Ac = circshift(A,-NC,1);
    V((NC-1)*N + 1 : end) = max(min(A,Ac),[],2);
    X((NC-1)*N + 1 : end) = minmax1(A,Ac);
end
end

function res = minmax1(A,Ac)
sz = size(A);
res = zeros(sz(1),1,"like",A);
for nc2 = 1:width(A)-1
    res = max(res,max(min(A,circshift(Ac,-nc2,2)),[],2));
end
end

function res = minmax2(A)
sz = size(A);
res = zeros(sz(1)/2,1,"like",A);
for nc2 = 1:width(A)-1
    res = max(res,max(min(A(1:sz(1)/2,:),circshift(A(sz(1)/2+1:end,:),-nc2,2)),[],2));
end
end
Using circshift lets you take advantage of vectorization in min, at the cost of allocating more data, but I don't think it's the most optimal way to do what you want.
Probably you want to write out the nested loops and vectorize the innermost call to get the most speed.
I don't think this is doing exactly the same thing you want, but it calculates the minimum between each element and each of the following elements, then finds the max of those mins. It's much faster and visits the same set of points:
function M = pairminmax(A)
arguments
    A(:,1) {mustBeNumeric}
end
% The last element has no following elements, so M has numel(A)-1 entries.
M = zeros(numel(A)-1,1,'like',A);
for ii = 1:numel(A) - 1
    M(ii) = max(min(A(ii),A((ii+1):numel(A))));
end
end
And this is the naive version, which just uses singleton expansion (and does twice as many comparisons), and which is only slightly slower than the one above:
max(min(A(:),A(:)'),[],2)
If I had the expressions written out, I could probably craft something specific to this case, but I hope this gives you enough to go on to write a faster loop.
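Taking that advice literally for the V and X defined in the question, a loop over the anchor point j with a fully vectorized inner body might look like the sketch below. The function name is my own and this has not been benchmarked against the circshift version, so treat it as a starting point rather than a drop-in replacement:

```matlab
function [V, X] = pairwise_vx(A)
% One pass per anchor row j; the work over rows j+1..N is vectorized.
[N, d] = size(A);
V = zeros(N*(N-1)/2, 1, 'like', A);
X = zeros(N*(N-1)/2, 1, 'like', A);
pos = 0;
for j = 1:N-1
    B = A(j+1:N, :);                        % the following rows
    m = min(A(j,:), B);                     % implicit expansion: (N-j)-by-d mins
    V(pos+1 : pos+N-j) = max(m, [], 2);     % same-column max of mins
    % mixed columns: max over all ordered column pairs (i, p) with i ~= p
    x = zeros(N-j, 1, 'like', A);
    for i = 1:d
        others = [1:i-1, i+1:d];
        x = max(x, max(min(A(j,i), B(:,others)), [], 2));
    end
    X(pos+1 : pos+N-j) = x;
    pos = pos + N - j;
end
end
```

The pair ordering here matches the pdist convention (each point against the following ones), which differs from the circshift version's ordering, so compare sorted outputs if you check one against the other.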