Efficient access and manipulation of arrays in nested cells

I have nested cells of the form mycell{i}{j,k}, with an array in each of those cells. I have not found working examples of operations like getting a statistic (e.g., the max) of all the arrays without a loop, to return something like cellstat(i,j,k). Another example: I'm performing a fit with each array, and it would be nice to gather one of the goodness-of-fit stats into a single array, or to take stats of a goodness-of-fit value across i so I can see it at each j,k.
I think with an example of each of those, I could figure out anything else that comes up. Thanks!
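For the first request (a stat of every array gathered into cellstat(i,j,k)), here is a minimal sketch with placeholder random data; it assumes every mycell{i} has the same j,k size. cellfun handles the j,k level, though a short loop over i remains:

```matlab
% Hypothetical example data: mycell{i}{j,k} each holds a numeric vector
I = 2; J = 2; K = 3;
mycell = cell(I,1);
for i = 1:I
    % J-by-K cell array of random-length vectors
    mycell{i} = arrayfun(@(~) rand(randi(10),1), zeros(J,K), 'UniformOutput', false);
end

% Gather the max of every array into cellstat(i,j,k)
cellstat = zeros(I,J,K);
for i = 1:I
    cellstat(i,:,:) = cellfun(@max, mycell{i});  % cellfun covers the j,k level
end
```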
**********************
Adding an example:
data = rand(2e5,1); % one data set, I have many
datay = rand(2e5,1); % y-coordinate of the data
dataz = rand(2e5,1); % z-coordinate of the data
The first task with this data is to create a grid of y,z pairs and sort each data set into it. Since rand is [0,1], say the grid is every 0.1. This only has to be done once, but I suppose how the data are stored could affect the speed of future steps.
After that, I'm doing a windowed fit on the points that are sorted into each y,z bin for each dataset. There may be some trial and error here, and, while I can test on subsets, it would be helpful if the data are structured in a way that makes the fitting routine as fast as possible. Would any more information be useful?

8 comments

MATLAB does not support cell range dereferencing. I'd suggest that a very small illustrative sample would help in seeing about clever ideas and/or alternate storage schemes.
Sometimes with things of this nature, just because one can write complex referencing expressions, it still doesn't mean one should. :J)
I was afraid of that...
In this scheme the i's are all different datasets and I'm grouping the data of each set into a j,k grid. Each j,k will be a different size, so I need cells. I guess they could be mycell{i,j,k} if that's easier? Otherwise, I guess I need to do things in a loop, right?
mycell{i,j,k} takes a lot more space and is a lot less efficient than mycell{i}(j,k)
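A rough way to check this storage claim on your own machine is to build both layouts and compare with whos (empty cells shown here; actual byte counts depend on contents and MATLAB release):

```matlab
% Flat scheme: mycell{i,j,k}
flat = cell(10,10,10);

% Nested scheme: mycell{i}{j,k}
nested = cell(10,1);
for i = 1:10
    nested{i} = cell(10,10);
end

w = whos('flat','nested');
fprintf('%s: %d bytes\n', w(1).name, w(1).bytes);
fprintf('%s: %d bytes\n', w(2).name, w(2).bytes);
```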
data = rand(2e5,1); % one data set, I have many
datay = rand(2e5,1); % y-coordinate of the data
dataz = rand(2e5,1); % z-coordinate of the data
What, specifically, does "create a grid of y,z pairs and sort each data set into those" mean when there are only as many points in the data array as in each y,z array?
So each data point has a y and z coordinate. Imagine taking the y and z coordinates and binning them. In this example, with rand being from 0 to 1, you might put them in 0.1 sized bins. That would make 100 bins defined by their lower bounds as: (0,0), (0,0.1), (0,0.2)...(0.1,0), (0.1,0.1)...etc. Make sense?
Yeah, but there are two position vectors and only a single point for each, so there can't be a y-z "grid": points are defined only at each combined location. What is the definition of what to do with a point if the y bin is 28 but the z bin is 90? Or how are they defined jointly?
nsets = 100; % this is how many datasets there are, so data = cell(nsets,1) and each data{m} = rand(2e5,1)
yrange = 0:0.1:1;
zrange = 0:0.1:1;
% assume preallocation of yz_data, but not shown
for m = 1:nsets
for k = 1:length(yrange)-1
for l = 1:length(zrange)-1
yz_data{m}{k,l} = data{m}(datay{m} > yrange(k) & datay{m} < yrange(k+1) & dataz{m} > zrange(l) & dataz{m} < zrange(l+1));
end
end
end
This is what I did. I think what I'm trying to ask (sorry for the confusion) is whether there's a storage scheme that will speed up future access.
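One alternative to testing every point against every bin is to compute each point's bin index once and split with accumarray. This is only a sketch, with placeholder data standing in for one dataset, and note that accumarray does not guarantee the original point order within a bin:

```matlab
% Placeholder setup for one dataset (m = 1)
m = 1;
data  = {rand(2e5,1)};
datay = {rand(2e5,1)};
dataz = {rand(2e5,1)};
yrange = 0:0.1:1;
zrange = 0:0.1:1;
yz_data = cell(size(data));

% Bin index of every point, computed once
[~,~,~,ky,kz] = histcounts2(datay{m}, dataz{m}, yrange, zrange);
ok = ky > 0 & kz > 0;                 % guard against out-of-range points
% Split the data points into a 10x10 cell by bin
yz_data{m} = accumarray([ky(ok) kz(ok)], data{m}(ok), ...
    [numel(yrange)-1, numel(zrange)-1], @(v){v}, {zeros(0,1)});
```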
OK, I let the "grid" and the initial structure stuff confuse me...@Voss got back before I did and answered the basics; as he points out, there's no reason to create excessively complex storage structures; use the data the way it comes. I'd still be looking into how the data are initially created and what the multiple cases for further consolidation are, but if there really are 2e5 points per dataset, it's probably not practical to actually combine anything until it's time to summarize results.
The only other thing, compared to @Voss's approach, is you might see how
N = 10;
edges = linspace(0,1,N+1);
iyz = discretize([datay dataz],edges);
performs compared to histcounts2. It returns the indices by column in one output array and uses the same binning in both directions, so it isn't quite as flexible, but it might be a little faster; although, given the tasks so far, I don't see performance being a big issue if you don't make things more difficult than need be... :J>
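A quick sanity check (hypothetical data) that the two functions agree on the bin indices when the same edges are used in both directions:

```matlab
datay = rand(1000,1);
dataz = rand(1000,1);
N = 10;
edges = linspace(0,1,N+1);

iyz = discretize([datay dataz], edges);                  % both columns at once
[~,~,~,iy,iz] = histcounts2(datay, dataz, edges, edges);
same = isequal(iyz, [iy iz]);                            % should be true for in-range data
```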


Accepted Answer

data = rand(2e5,1); % one data set, I have many
datay = rand(2e5,1); % y-coordinate of the data
dataz = rand(2e5,1); % z-coordinate of the data
"The first task with this data, is to create a grid of y,z pairs and sort each data set into those. Since rand is [0,1], say the grid is every 0.1.... how the data are stored could affect the speed of future steps"
Store the bin index of each data point, so you know what bin each data point belongs to. (It's not necessary to make a new copy of the data with a different structure.)
NY = 10;
NZ = 10;
yedges = linspace(0,1,NY+1);
zedges = linspace(0,1,NZ+1);
[~,~,~,yidx,zidx] = histcounts2(datay,dataz,yedges,zedges);
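With the bin indices stored, pulling out the points of any one bin is a single logical index; for instance (bin numbers and placeholder data here are arbitrary):

```matlab
% Continuing the idea above, with placeholder data
data = rand(2e5,1); datay = rand(2e5,1); dataz = rand(2e5,1);
[~,~,~,yidx,zidx] = histcounts2(datay, dataz, linspace(0,1,11), linspace(0,1,11));

j = 3; k = 5;                          % an example y,z bin
sel = (yidx == j) & (zidx == k);
d_bin = data(sel);                     % the data points in that bin
y_bin = datay(sel);
z_bin = dataz(sel);
```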
"After that, I'm doing a windowed fit on the points that are sorted into each y,z bin for each dataset."
Maybe something like the following. groupsummary uses the bin indices found in the previous step:
function out = your_fit_function(d,y,z)
[f,gof] = fit([y,z],d,'poly11');
out = {{f,gof}};
end
[C,BG] = groupsummary({data,datay,dataz},[zidx,yidx],@your_fit_function);
Now you have an sfit object and goodness-of-fit struct, returned from fit, for each grid cell:
C{1}
ans = 1x2 cell array
{1x1 sfit} {1x1 struct}
C{1}{:}
ans =
     Linear model Poly11:
     ans(x,y) = p00 + p10*x + p01*y
     Coefficients (with 95% confidence bounds):
       p00 =      0.5103  (0.4767, 0.5439)
       p10 =    -0.09779  (-0.5436, 0.348)
       p01 =     -0.1282  (-0.559, 0.3026)
ans =
  struct with fields:
           sse: 170.9652
       rsquare: 2.5376e-04
           dfe: 2035
    adjrsquare: -7.2879e-04
          rmse: 0.2898
And you can do what you want with those:
for ii = 1:3%numel(C)
fprintf(1,'region %0.1f<y<%0.1f, %0.1f<z<%0.1f:\n\n', ...
yedges(BG{2}(ii)),yedges(BG{2}(ii)+1),zedges(BG{1}(ii)),zedges(BG{1}(ii)+1));
fprintf(1,' fit object:\n');
disp(C{ii}{1})
fprintf(1,' goodness:\n');
disp(C{ii}{2})
fprintf(1,' \n');
end
region 0.0<y<0.1, 0.0<z<0.1:
 fit object:
     Linear model Poly11:
     (x,y) = p00 + p10*x + p01*y
     Coefficients (with 95% confidence bounds):
       p00 =      0.5103  (0.4767, 0.5439)
       p10 =    -0.09779  (-0.5436, 0.348)
       p01 =     -0.1282  (-0.559, 0.3026)
 goodness:
           sse: 170.9652
       rsquare: 2.5376e-04
           dfe: 2035
    adjrsquare: -7.2879e-04
          rmse: 0.2898
region 0.1<y<0.2, 0.0<z<0.1:
 fit object:
     Linear model Poly11:
     (x,y) = p00 + p10*x + p01*y
     Coefficients (with 95% confidence bounds):
       p00 =       0.505  (0.434, 0.576)
       p10 =     -0.1254  (-0.5669, 0.316)
       p01 =     0.04957  (-0.3817, 0.4809)
 goodness:
           sse: 162.3938
       rsquare: 1.8595e-04
           dfe: 1961
    adjrsquare: -8.3374e-04
          rmse: 0.2878
region 0.2<y<0.3, 0.0<z<0.1:
 fit object:
     Linear model Poly11:
     (x,y) = p00 + p10*x + p01*y
     Coefficients (with 95% confidence bounds):
       p00 =      0.5457  (0.4333, 0.6581)
       p10 =     -0.2367  (-0.6725, 0.1991)
       p01 =     0.09185  (-0.3504, 0.5341)
 goodness:
           sse: 164.6248
       rsquare: 6.5738e-04
           dfe: 1993
    adjrsquare: -3.4548e-04
          rmse: 0.2874

More Answers (3)

Example:
function gof = getgof(PAGE)
[~, gof] = fit(PAGE somehow);
end
gof_stats = cellfun(@getgof, mycell, 'uniform', 0);
gof_stats = vertcat(gof_stats{:});
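Once the per-cell results are collected, a single goodness-of-fit field can be gathered into a plain numeric array. This assumes gof_stats ends up as a struct array of goodness-of-fit structs (as the vertcat above would produce); the values below are made-up stand-ins:

```matlab
% Hypothetical struct array standing in for collected fit results
gof_stats = struct('rmse', {0.29, 0.31, 0.27}, 'sse', {171, 162, 165});

all_rmse = [gof_stats.rmse];           % 1-by-N numeric array
[best, idx] = min(all_rmse);           % e.g., locate the best-fitting cell
```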
Matt J on 1 Apr 2025
Edited: Matt J on 1 Apr 2025
There is no way to iterate over cells (nested or otherwise) without a loop, or something equivalent in performance to a loop (cellfun, arrayfun, cell2mat, etc...).

4 comments

Can you give an example without a loop, e.g., cellfun?
@Walter Roberson gave the generic outline previously. Again, attaching a small representative dataset would undoubtedly elicit more specific code; when folks have to create data to work on besides, it's just more work for volunteers, and there's no guarantee it will actually match the actual use case.
I'm still big on the idea that the other arrangement of the data will be much simpler to process and would avoid these hassles.
Matt J on 1 Apr 2025
Edited: Matt J on 1 Apr 2025
Can you give an example without a loop, e.g., cellfun?
How would an example of cellfun help you? You said you are looking for something more efficient than a loop, and as I have said, nothing is more efficient than a loop when dealing with cell arrays.
dpb on 1 Apr 2025
Edited: dpb on 1 Apr 2025
To amplify on @Matt J's comment: at their heart, all the cell-, array-, and struct-functions are looping constructs internally, "syntactic sugar" that replaces the for ... end loop with a single source-code line. But the performance of these cannot exceed that of JIT-compiled looping code, and given that they have not been subject to all the optimizations MathWorks has made to for loops over the years, including multi-threading, they will all be at least somewhat slower than a "deadahead" for loop.
Functionally, cellfun is a wrapper for arrayfun -- it passes the dereferenced cell contents to the function instead; you could construct the same with arrayfun if you did the dereferencing in its argument list. See this <recent post> for a general discussion and some pertinent remarks from TMW staff members on the differences.
MORAL: Do NOT assume that fewer lines of source code equate to faster execution speed.
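The relative speeds are easy to check on one's own data with timeit; a sketch, to be saved as a script (local functions in scripts need R2016b or later):

```matlab
% 1000 cells of 1000 random values each
c = arrayfun(@(~) rand(1000,1), 1:1000, 'UniformOutput', false);

t_cellfun = timeit(@() cellfun(@max, c));
t_loop    = timeit(@() loopmax(c));
fprintf('cellfun: %.4g s, loop: %.4g s\n', t_cellfun, t_loop);

function out = loopmax(c)
% Plain for loop doing the same per-cell max
out = zeros(size(c));
for ii = 1:numel(c)
    out(ii) = max(c{ii});
end
end
```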


dpb on 1 Apr 2025
Edited: dpb on 1 Apr 2025
The other alternative to investigate is to turn the metadata you're segregating/tracking by cell indices into real data in a flat table or array. Ideally, those would be recognizable things like test number, date, whatever..., but for starters they could just be the indices. Then the power of <grouping variables> and/or grpstats and/or varfun could be brought to bear on the problem. Large datasets can be dealt with via tall arrays and/or memory mapping. See also findgroups.
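A minimal sketch of that flat arrangement (all names and sizes here are made up): every point lives in one tall vector, and dataset/bin membership becomes ordinary grouping variables:

```matlab
npts = 1e4;
v    = rand(npts,1);            % all points from all datasets, stacked
set_ = randi(5,  npts, 1);      % dataset index, now an ordinary variable
ybin = randi(10, npts, 1);      % precomputed y bin index
zbin = randi(10, npts, 1);      % precomputed z bin index

g = findgroups(set_, ybin, zbin);
binmax = splitapply(@max, v, g);   % one statistic per (dataset, y, z) group
```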

4 comments

I believe I could reorganize the data into a table. For example, the j,k above could be rows and columns, and each entry could be a cell with a number of arrays equal to the number of datasets (all datasets are organized into the same j,k). Not sure if that's the best way. I need to iterate on the operations I'll be performing, so organizing the data in a way that makes those operations faster is what's most important. Do you know which option would allow the fastest computation? Thanks!
dpb on 1 Apr 2025
Edited: dpb on 1 Apr 2025
"...j,k above could be rows and columns and each entry could be a cell with a number of arrays equal to the number of datasets"
That sounds like yet more nightmares and not at all what I would envision. Again, give us an actual representative dataset we can poke at, rather than just trying to describe it.
The point would be that "dataset" would become a variable as would all other metadata that can then be used for selection/grouping for calculations without having to dereference a bunch of cells and then try to put the results back together.
"Do you know what option would allow for the fastest computation?"
Not a priori, without a more specific example of what you actually are working with and what iterations you're talking about, no.
Unless the datasets are truly huge or the iterations are deep in loops, whether it's a few more msec or not is probably immaterial, particularly if one takes development time into consideration at all.
"I believe I could reorganize the data into a table"
Accessing a range of table rows is notably less efficient than accessing a range of rows of a numeric array.
dpb on 1 Apr 2025
Edited: dpb on 1 Apr 2025
"... turn the metadata you're segregating/tracking by cell indices into real data in a flat table or array." (emphasis added...dpb)
The table is awfully convenient for display and is generally "fast enough" ...but, agreed, findgroups and splitapply to do the calculations will be faster on an array than will be varfun or grpstats on a table.
I was interpreting the Q? about speed as including the existing cell array structure as well, not just the comparison of an array to a table. Dereferencing a cell itself is generally quick, but by the time one calls cellfun() a number of times and then has to reconstruct/collect the results, who knows how it might compare?
But, it's pretty tough to attack @Dan Houck's real problem without an example to poke at...others may be able to write air code that might be applicable to his actual situation, but I'm not that clairvoyant and, as @John D'Errico was complaining the other day, the Crystal Ball TB is notably dark these days.


Version: R2024b