version 1.0.0.0 (10.6 KB) by
John D'Errico

Consolidates common elements in x (may be n-dimensional), aggregating corresponding y.

Consolidator has many uses. It was designed to solve an interpolation problem and a Delaunay problem, but I've added other uses too. It can serve as a tool that counts the number of replicates of each point, or simply as an implementation of unique(x,'rows'), but with a tolerance on that uniqueness.

Interpolation fails when there are replicate x values. Often it is recommended to form the mean of y for the replicate x values, eliminating the reps. Consolidator does this, and allows a tolerance on how close two values of x need be to be considered replicates. x may have multiple columns, i.e., it works on multi-dimensional data. x may even be a character array.

This same problem is seen both in interp1 and in griddata. Delaunay and delaunayn are also not robust when called with data that has replicates or near replicates.
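The replicate-averaging step described above can be sketched with built-in functions only. This is a minimal illustration, not consolidator's own implementation: it matches x values exactly (no tolerance), and the data and variable names are made up.

```matlab
% Sample data with a replicated x value (x = 2 appears twice).
x = [1; 2; 2; 3; 4];
y = [10; 19; 21; 30; 40];

% unique returns the sorted distinct x values and, in idx, the group
% number of each original point.
[xu, ~, idx] = unique(x);

% Average y within each group of identical x values.
yu = accumarray(idx, y, [], @mean);

% xu now has no repeats, so it can be passed to interp1.
yq = interp1(xu, yu, 2.5);
```

Here the two points at x = 2 are replaced by their mean before interpolation.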

Example usages:

% counting replicates

x = round(rand(100000,1)*2);

[xc,yc] = consolidator(x,[],'count');

[xc,yc]

ans =

0 25160

1 49844

2 24996

% aggregate y for the unique elements in x

% y = x(:,1) + x(:,2) + error

x = round(rand(100000,2)*2);

y = sum(x,2)+randn(size(x,1),1);

[xc,yc] = consolidator(x,y,'mean');

[xc,yc]

ans =

0 0 0.0054

0 1.0000 0.9905

0 2.0000 1.9895

1.0000 0 0.9957

1.0000 1.0000 1.9970

1.0000 2.0000 2.9988

2.0000 0 2.0136

2.0000 1.0000 2.9985

2.0000 2.0000 3.9891

Alternate usage using a function handle:

[xc,yc] = consolidator(x,y,@mean);

The aggregation can be of many types: min, max, mean, sum, std, var, median, prod, as well as geometric and harmonic means, plus the simple count option. Use of a function handle allows for any aggregation the user may desire.

Consolidator is very different from accumarray. Note that accumarray builds a potentially huge array, filled with zeros. This array cannot be sparse in higher than 2 dimensions. Also, accumarray does not allow a tolerance. Its first argument MUST be an index. Finally, consolidator works on strings too.
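To illustrate the point that accumarray needs an index as its first argument, here is a small sketch using only built-in functions. Arbitrary rows of x must first be mapped to group numbers, here with unique(x,'rows'), and the match is exact, with no tolerance. The data are made up for illustration.

```matlab
% Rows of x are the keys; y holds the values to aggregate.
x = [0 1; 2 0; 0 1; 2 0; 1 1];
y = [1; 2; 3; 4; 5];

% accumarray cannot take the rows of x directly. Map each row to a
% group number first: idx(k) is the row of xc that matches x(k,:).
[xc, ~, idx] = unique(x, 'rows');

% Sum y within each group.
yc = accumarray(idx, y, [], @sum);
```

Because the subscripts are group numbers rather than the coordinates themselves, the result has one entry per distinct row instead of one entry per possible coordinate combination.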

John D'Errico (2021). Consolidator (https://www.mathworks.com/matlabcentral/fileexchange/8354-consolidator), MATLAB Central File Exchange. Retrieved .

Created with
R14SP1

Compatible with any release

**Inspired:**
Experimental (Semi-) Variogram, Patch Slim (patchslim.m), Co-Blade: Software for Analysis and Design of Composite Blades


aldburg

jasongrig

Hi,

can I use it to derive percentiles? say the 95th percentile?

Thanks,

Iason

Alessandro Masullo

Perfect. Thank you for this work.

Sergei P.

Matthias

Hi John,

great submission!

I have one minor adjustment that would allow for individual tolerances, even if it's a bit ugly:

% consolidate elements of x.

% first shift, scale, and then ceil.

if numel(tol) < size(x,2)

tol = repmat(tol,1,size(x,2));

end

bgZ = tol>0;

xhat = x;

if any(bgZ)

xhat(:,bgZ) = x(:,bgZ) - repmat(min(x(:,bgZ),[],1)+tol(bgZ)*eps,n,1);

xhat(:,bgZ) = ceil(bsxfun(@rdivide,xhat(:,bgZ),tol(bgZ)));

end

Hope it helps someone.

Reza Farrahi Moghaddam

Iris Hinrichs

This function is exactly what I was looking for. Thanks for providing it, John!

I just discovered a minor bug:

It happened that I applied consolidator to x = 0.2 and y = [11 6.8].

[xc, yc] = consolidator(x,y, '@nanmean')

xc = 0.2

yc = 11

The last value of y is gone; the consolidator somehow "swallowed" it.

Although it does not make sense to consolidate an array that only has one row, the application of this function in this way can happen, especially when processing a lot of different arrays automatically.

Vagner

It just works! Many thanks.

Faraz Oloumi

Will

Sorry John,

After I restarted everything started working great. Not sure what the problem was but doesn't seem to be related to the consolidator function.

Thanks for your attention

John D'Errico

Will - Sorry, but you need to be more clear about your problem. I can't guess at the issue. Simplest is to send me the data that has a problem, as consulting in the comments is not my choice.

Will

I need some help with this function. It seems to be working except for one column of data I'm working with. I tried using 'mean' and @nanmean, and both result in a column filled with only NaNs. There is numeric data present (I can see it in the y variable), and it appears to show up in ycon as a 0 until line 258, where:

ycon(count==1,:) = y(ec==1,:)

ycon becomes nothing but NaNs

Tung

It works, but it changes the order of rows. How can I merge duplicates but still keep the same order?

Thanks
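One possible way to keep first-occurrence order, sketched with built-in functions rather than consolidator itself: the 'stable' option of unique (available in MATLAB R2012a and later) returns groups in order of first appearance. This matches values exactly, with no tolerance, and the data are made up for illustration.

```matlab
% Keys in their original (unsorted) order, with duplicates.
x = [3; 1; 3; 2; 1];
y = [10; 20; 30; 40; 50];

% 'stable' keeps the distinct keys in order of first appearance
% instead of sorting them.
[xc, ~, idx] = unique(x, 'stable');

% Average y within each group; row k of yc belongs to xc(k).
yc = accumarray(idx, y, [], @mean);
```

For multi-column x, unique(x,'rows','stable') plays the same role.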

Suti

Yavor Kamer

Dear John,

Regarding my previous comment, I found out that for that specific test the function performs relatively better if I change line 204

iu = [true;any(diff(xhat),2)];

to

iu = [true;any(abs(diff(xhat))>1,2)];

I also have a hunch that the sortrows (based on the 1st dimension column) on line 199 could be improved to take into account all possible column order permutations. I tried to do it but got into some complications and gave up.

Yavor Kamer

Dear John,

Your consolidator function proved to be really indispensable for my Delaunay triangulations. However when I tried to test it with a set of points perturbed around 5 centers within an uncertainty radius I couldn't retrieve the initial centers.

unc=0.2;

mat_i=[1 0 0; 1 2 0; 0 3 0; 1 1 0; 2 1 2];

mat_all=mat_i;

for i=1:100

mat_all=[mat_all; mat_i+(rand(size(mat_i))-0.5)*unc;];

end

mat_c = consolidator(mat_all,[],[],unc);

For one realization the last two rows of mat_c end up to be:

1.917 0.900 2.067

2.006 1.001 2.000

which is inconsistent with the tolerance (0.2). Is this an expected result or is there something wrong with my test?

Thank you

Ralph Spitzer

Awesome function. Helped to solve my SQL-like "group by" problem. Consolidated my 2 million records in next to no time. Thank you!

ade77

Beautiful function. More beautiful when you use it in conjunction with cellfun. Exactly what I was looking for.

Mathworks, please be humble and include this function in MATLAB and pay appropriate fee for the creator.

Thanks John

Richard Crozier

Amazing, yet another great code from John D'Errico, it seems like half the code I use will end up being written by him.

Brennan Smith

Thank you very much! I've been looking all over for a way to identify unique rows and tally the number of repeats, and this is by far the easiest solution - it worked on my first attempt and the outputs were very easy to plot. Great job!

Gerry

I just didn't realize "consolidator" can use other functions as its aggregation mode, in my case nanmedian etc. I have used "consolidator13" and couldn't get around the NaN data with it. It looks like the plain "consolidator" is the only one handling these other functions, and I am sure it will do the trick for me. Thanks.

John D'Errico

Well, to some extent, tools like nanmean can help. For example...

x = ceil(5*rand(10,1));

y = rand(10,1);

y(2) = nan;

[xc,yc] = consolidator(x,y,@nanmean)

xc =

1

2

3

4

5

yc =

0.66434

0.36668

0.42507

0.16971

0.54419

If x has nans in it though, things get sticky. Consolidator does not survive nans there. While I could repair this to work for 1-d data, it would still fail for higher dimensions.

Gerry

Please Help ...

I've been using consolidator with no problems and loving it. But I came across a data set with NaN values and it didn't work. I am getting a bunch of NaN even for the rows with real data. Is there any way around this? Thanks.

John D'Errico

More digging shows that the behavior Christophe finds is a function of rounding, and of floating point arithmetic in general. But it is not something that I can make consolidator robust to, since variations at the least significant bit level will always cause problems in such a code.

This choice of a tolerance made by Christophe forces matlab/consolidator to perform a comparison between floating point numbers. With the tolerance set to exactly the difference between consecutive terms in the set provided, in some cases there MUST be a failure. PLEASE read this document:

http://docs.sun.com/source/806-3568/ncg_goldberg.html

The use of floating point arithmetic in MATLAB causes this to fail. Here, using a version of consolidator with a subtly different internal test, I get the result that Christophe did:

consolidator([1,2,3,3.01,6]',[],[],1)

ans =

1.5

3

3.01

6

Yet now change the tolerance by only an infinitesimal amount, and we can get yet a different set of rounding results.

consolidator([1,2,3,3.01,6]',[],[],1-10*eps)

ans =

1

2.5

3.01

6

consolidator([1,2,3,3.01,6]',[],[],.9999999999999)

ans =

1

2

3.005

6

Again, these differences arise because of floating point arithmetic and the use of a tolerance that is so close to the stride between members of the set. This is not something that I can change, fix, repair, or code in a better way, because if I did make a change then some other set of data would cause the same problems.

I will argue that this is what I call the transitivity problem. When you specify a tolerance of 1, how is consolidator to resolve the set [1 2 3]? Are 1 and 2 to be lumped together? Or 2 and 3? Clearly, each of those pairs is the same to within a tolerance of 1. Yet we cannot lump them all into a single group, because 1 and 3 are not within the specified tolerance. Or should we? We might very logically argue to aggregate them down to any of these sets:

[1, 2, 3]

[1.5, 3]

[1, 2.5]

[1.5, 2.5]

[2]

The point is, beware of tests that compare floating point numbers. And beware of forcing code to make those tests. You can (and will) see virtually random results from doing so.
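A small sketch of why a tolerance equal to the stride is fragile, using plain floating point arithmetic and no call to consolidator. The values 0.1, 0.2, 0.3 are an illustrative choice, not taken from the thread above.

```matlab
% Three values that are nominally 0.1 apart.
x = [0.1 0.2 0.3];

% The computed strides are not both equal to 0.1 in floating point:
% 0.2 - 0.1 is exactly 0.1, but 0.3 - 0.2 falls just below it.
stride = diff(x);
isExactlyTol = (stride == 0.1);

% Signed distance of each stride from the nominal value; the second
% entry is a tiny negative number.
gap = stride - 0.1;

% Padding the tolerance by a few eps makes both comparisons agree.
tol = 0.1;
isWithinPadded = stride <= tol + 10*eps;
```

Any test that hinges on whether a stride is above or below the tolerance can therefore flip on the last bit, which is the kind of behavior described above.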

Finally, avoid use of a tolerance that is so close to the stride between elements of the set to be resolved. Consolidator is not designed to be a clustering tool, but to be a tool that will combine replicate values together and to survive small amounts of noise in the data. The tolerance allows minor variations in the numbers to be thus combined. If you try to use consolidator to cluster numbers together, it might succeed, but you can trip it up. And no matter what, the transitivity problem is important, and is not capable of resolution in an unambiguous manner, for ALL sets of data.

John

John D'Errico

Christophe: My guess is your test used a variable where some of the numbers were not exact integers, so there was some floating point trash involved. This caused the results to be slightly different from what you expect, not the programming of consolidator.

I claim that to be true because when I try the specific example shown, pasted directly into MATLAB, I DO get the expected result. (I don't know what MATLAB release your test was done in, as there can sometimes be release issues too. A different CPU can also sometimes cause subtle differences, although I think that neither release nor CPU is the problem here.)

consolidator([1,2,3,3.01,6]',[],[],1)

ans =

1

2

3.005

6

In general, consolidator uses a simple scheme to do the aggregation. This is necessary for speed, and so that it will work efficiently in higher dimensions. Note that there will always be what I'll call the "transitivity" problem. Thus, suppose you wish to perform consolidation on the set [1 1.5 2], with a tolerance of 0.75.

Clearly 1 and 1.5 are within the desired tolerance, so they should be grouped together. But so are 1.5 and 2, so they too should be grouped. Yet 1 and 2 cannot be grouped together.

The point is, there is no scheme which will resolve any possible set of data, aggregating the points into an unambiguously reduced set that all will agree is correct.
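The [1 1.5 2] example can be checked directly. This sketch builds the pairwise "within tolerance" relation and tests whether it is transitive; it only illustrates the argument and is not how consolidator groups points.

```matlab
x = [1; 1.5; 2];
tol = 0.75;

% within(i,j) is true when x(i) and x(j) differ by at most tol.
within = abs(bsxfun(@minus, x, x.')) <= tol;

% reach(i,j) is true when x(i) and x(j) are linked through some
% intermediate point k (within(i,k) and within(k,j)).
reach = (double(within) * double(within)) > 0;

% A transitive relation would satisfy reach == within. Here 1 and 2
% are linked through 1.5 but are not within tol of each other.
isTransitive = isequal(reach, within);
```

Since the relation is not transitive, any grouping has to break at least one "within tolerance" link or merge points that are farther apart than tol.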

Christophe Lauwerys

Thanks for this great contribution.

However, unless I misunderstood the functionality, I would expect

consolidator([1,2,3,3.01,6]',[],[],1)

to return

1

2

3.005

6

However, it returns

1.5000

3.0000

3.0100

6.0000

Is this desired behavior? Wouldn't it make sense to aggregate 3 and 3.01 instead of 1 and 2?

Michael Krause

Oliver Woodford

This isn't entirely an ACCUMARRAYN (which I agree there definitely needs to be) because the aggregator function must (I believe) return a single value per column of the input matrix. However, ACCUMARRAY has the wonderful property of being able to return a cell array:

C = accumarray(A, B, [], @(x) {x});

I have had cause to use this functionality many times. Any chance you might add it to CONSOLIDATOR, John?

Andres T.

Fortunately Loren's blog on accumarray links to here (as 'derivative work')! It's great the author took the time to publish pre-accumarray-versions, too. Thank you!

w s

Great and fast tool that I often use. The only thing I miss is the ability to apply different tolerances to different columns of x. That would be great.

chen li

The following is what I use to consolidate two lists and, at the same time, remove outliers from ylist. However, it calls consolidator three times.

Does anyone have a better idea?

[xg, meany, Ind] = consolidator(xlist, ylist, 'mean');

[xg, stdy, Ind] = consolidator(xlist,ylist,'std');

notoutlier = find(abs(ylist-meany(Ind)) < 3*stdy(Ind));

xlist = xlist(notoutlier);

ylist = ylist(notoutlier);

[xg, yg, Ind] = consolidator(xlist,ylist);
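One possible alternative, sketched with built-in functions only: compute the group index once with unique and reuse it for the mean, the standard deviation, and the final aggregation. This assumes exact matching on xlist (no tolerance), and the test data are made up. Note that a 3-sigma rule can only flag a point when its group has at least 11 members, and that a group with zero spread would be dropped entirely by the strict "<" comparison.

```matlab
% Two groups of 20 points each; the first contains one gross outlier.
xlist = [ones(20, 1); 2*ones(20, 1)];
y1 = repmat([0.9; 1.1], 10, 1);
y1(end) = 100;
y2 = repmat([1.9; 2.1], 10, 1);
ylist = [y1; y2];

% Group index, computed once and reused below.
[xg, ~, idx] = unique(xlist);

% Per-group mean and standard deviation.
meany = accumarray(idx, ylist, [], @mean);
stdy  = accumarray(idx, ylist, [], @std);

% Keep points within 3 standard deviations of their group mean.
keep = abs(ylist - meany(idx)) < 3*stdy(idx);

% Aggregate again using only the retained points.
yg = accumarray(idx(keep), ylist(keep), [numel(xg) 1], @mean);
```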

Ronald Clinton

Great and fast tool I've been using for a while. But as for "2007-09-08 Provided count information as a 4th output", the changed version seems not to be uploaded (18 October 2007).

Sergei Koulayev

It would be nice if the program would report how many elements fall into each cluster...

Lai Mun Woo

I've found this enormously handy to use. Excellent quick fix routine. Thank you for making it available.

gabriel asaftei

John D'Errico

A.L. - I've uploaded a new release of consolidator, fixing several other minor problems too as noted in the change history. When Matlab Central recognizes the new release in a few hours, please verify that consolidator13 now runs properly, as I cannot test it below R14. Thank you for identifying the problem. I'm sorry about the inconvenience.

A. L.

Possible fix for previous comment (limited testing):

Replace line 201:

count=accumarray(eb,1).';

with:

count = diff(find([iu; true])).';
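A small sketch of why the replacement line yields group counts. It assumes, as the line quoted in another comment (iu = [true;any(diff(xhat),2)]) suggests, that iu is a logical column marking the first row of each run of identical values in the sorted data. The sample data are made up.

```matlab
% Sorted data with runs of identical values: one 0, three 1s, two 2s.
xs = sort([2; 0; 1; 1; 2; 1]);

% iu is true at the first element of each run.
iu = [true; diff(xs) ~= 0];

% Appending a final true marks one position past the end, so the
% differences between consecutive run starts are the run lengths.
count = diff(find([iu; true])).';
```

Here find([iu; true]) returns the run start positions 1, 2, 5 plus the sentinel 7, and their differences are the counts 1, 3 and 2, with no call to accumarray.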

A. L.

The R13 version uses accumarray, which I don't think was available until R14 (I may be wrong), which is rather disappointing if you wanted to use consolidator to add accumarray functionality to an older release.

Iram Weinstein

This is a really useful function. However, when the aggregation option is 'count', I find that Duane Hanselmann's mmrepeat is much faster.

Robert Halter

How is this different than accumarray?

Liang Jin

This is exactly what I am looking for!

The hist() in MATLAB is too limited in functionality.

Michael Ebstyne

Much needed addition to MATLAB functionality! For those coming from the SQL world, used to doing massive aggregations and wildly complex rolling of data sets in simple SQL statements, you've probably been looking for this. One suggestion... it would be killer to tackle multiple aggregate types across multiple columns.

Evan Weller

I agree with Urs. It would be an excellent inclusion in future releases of MATLAB.

Exactly what I needed for my work.

urs (us) schwarz

wow, what an (almost) flawlessly coded snippet of long-awaited code! it's too bad, however, that there are two minuscule issues with it:

- the help section is TOO wordy (almost a novel by itself) and MUST be streamlined to the very essential, bare bone

- the name CONSOLIDATOR is distracting (and to most people rather obfuscating) and (really!) should be changed to ACCUMARRAYN, which is what it really does: extend the functionality of this otherwise great addition to the ML family of pre-packaged functions (just consider how easily it preprocesses data for the statistics tbx's family of ANOVAs!)

altogether, this code is so essential one might even ask the dear people at TMW to include it (maybe even in mexed form) in one of the future releases

us