version 2.2 (1.42 MB) by
Stephen Meehan

An algorithm for manifold learning and dimension reduction.

Given a set of high-dimensional data, run_umap.m produces a lower-dimensional representation of the data for purposes of data visualization and exploration. See the comments at the top of the file run_umap.m for documentation and many examples of how to use this code.
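As a rough illustration of the call pattern, the sketch below builds a random matrix and reduces it to two dimensions. The argument names ('n_neighbors', 'n_components', 'min_dist', 'verbose') are the ones that appear in user examples elsewhere on this page; the random data and the path check are assumptions added for illustration, and the comments in run_umap.m remain the authoritative reference.

```matlab
% Illustrative data: 200 observations in 10 dimensions (made up for this sketch).
rng(1);
X = rand(200, 10);

% Only call run_umap if the package is on the MATLAB path.
if exist('run_umap', 'file') == 2
    [reduction, umap] = run_umap(X, 'n_neighbors', 20, ...
        'n_components', 2, 'min_dist', 0.1, 'verbose', 'none');
    disp(size(reduction));   % expected: one row per observation in X
else
    disp('run_umap.m not found on the path; add the package folder first.');
end
```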

The UMAP algorithm is the invention of Leland McInnes, John Healy, and James Melville. See their original paper for a long-form description (https://arxiv.org/pdf/1802.03426.pdf). Also see the documentation for the original Python implementation (https://umap-learn.readthedocs.io/en/latest/index.html).

This MATLAB implementation follows a very similar structure to the Python implementation, and many of the function descriptions are nearly identical.

Here are some major differences in this MATLAB implementation:

1) The MATLAB function eigs.m does not appear to be as fast as the function "eigsh" in the Python package SciPy. For large data sets, we initialize a low-dimensional transform by binning the data using an algorithm known as probability binning. If the user downloads and installs the function lobpcg.m, made available here (https://www.mathworks.com/matlabcentral/fileexchange/48-locally-optimal-block-preconditioned-conjugate-gradient) by Andrew Knyazev, this can be used to find exact eigenvectors for medium-sized data sets. We also give you the option of downloading our slightly altered version of lobpcg.m, which gives equivalent results.

2) We have built in the optional ability to detect clusters in the low-dimensional output of UMAP. The clustering method we invoke is either DBM (described at https://www.hindawi.com/journals/abi/2009/686759/ ) for 2D reductions or DBSCAN (built in to MATLAB R2019a and later) for reductions of any dimension. This produces cluster ID output and visualizations as explained in the code examples.

3) We also have built in tools to quantify and visualize the difference between data groups. Data groups can be defined either by the clusters on UMAP’s reduction (described above) or by the classification labels which UMAP uses for supervised reductions or supervised template reductions. We use a change quantification metric which detects similarity in both mass & distance (described at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5818510/) as well as a score for measuring overlap when the groups are different classifications for the same data (described at https://en.wikipedia.org/wiki/F-score). For visualizing data groups we provide a dendrogram (described as QF tree at https://www.nature.com/articles/s42003-019-0467-6) and sortable tables which show each data group’s similarity, overlap, false positive % and false negative %. In version 2.2 we added “UMAP dimension explorer” (UDE). UDE is a sortable table that shows characteristics of a data group’s unreduced data in each input dimension. These characteristics include the Kullback-Leibler divergence (KLD); the distribution as a density bar (colored using MATLAB’s jet colormap); and median, mean, SD and MAD. UDE supports data groups drawn by a MATLAB ROI tool (region of interest) on the UMAP output plot.
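Items 2 and 3 above can be sketched together on synthetic data. The example clusters 2-D points with MATLAB's built-in dbscan (Statistics and Machine Learning Toolbox, R2019a or later) and then scores the overlap between one cluster and one class label using the standard F1 definition from the linked Wikipedia page. The points, labels, epsilon and minpts are all made up for illustration, and the package's own overlap computation may differ in detail.

```matlab
rng(0);
% Two well-separated synthetic blobs standing in for a 2-D UMAP reduction.
Y      = [0.1*randn(50, 2); 0.1*randn(50, 2) + 5];
labels = [ones(50, 1); 2*ones(50, 1)];   % hypothetical class labels

% DBSCAN (built in since R2019a); epsilon and minpts are arbitrary choices here.
epsilon = 0.5;
minpts  = 5;
idx = dbscan(Y, epsilon, minpts);        % -1 marks noise points

% Compare class 1 with the cluster that holds most of its points.
inClass   = labels == 1;
clusterId = mode(idx(inClass));
inCluster = idx == clusterId;

tp = sum(inCluster & inClass);           % in both groups
fp = sum(inCluster & ~inClass);          % in the cluster but not the class
fn = sum(~inCluster & inClass);          % in the class but not the cluster

f1 = 2*tp / (2*tp + fp + fn);            % harmonic mean of precision and recall
fprintf('F1 = %.3f (fp = %d, fn = %d)\n', f1, fp, fn);
```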

Overall, this MATLAB UMAP implementation tends to be faster than the current Python implementation (version 0.5.1 of umap-learn). All UMAP reductions are made faster with C++ MEX implementations. Due to File Exchange requirements, we only supply the C++ source code for the MEX modules. Users must download or build the .MEX binary files themselves separately (the option to download or build the files is provided upon calling "run_umap"). As examples 13 to 15 show, you can test the speed difference between the implementations for yourself on your computer by setting the 'python' argument to true.

Additionally, users of supervised templates may request the post reduction services of supervisor matching, QF tree, and QF dissimilarity. The function run_umap.m returns the results of these services via the new 4th output argument: extras. The properties of extras are documented in the file umap/UMAP_extra_results.m.

Optional toolbox dependencies:

-The Bioinformatics Toolbox is required to change the 'qf_tree' argument.

-The Curve Fitting Toolbox is required to change the 'min_dist' argument.

This implementation is a work in progress. It has been looked over by Leland McInnes, who considers it "a fairly faithful direct translation of the original Python code". We hope to continue improving it in the future.

Provided by the Herzenberg Lab at Stanford University.

We appreciate any and all help in finding bugs. Our priority has been determining the suitability of our concepts for research publications in flow cytometry for the use of UMAP supervised templates.

Connor Meehan, Jonathan Ebrahimian, Wayne Moore, and Stephen Meehan (2021). Uniform Manifold Approximation and Projection (UMAP) (https://www.mathworks.com/matlabcentral/fileexchange/71902), MATLAB Central File Exchange.

Created with
R2021a

Compatible with R2017a to R2021a

**Inspired:**
CytoMAP


Gen Ohtsuki

Ryuichi Nakahara

Stephen Meehan

My last comment was too quick ... if you are not using a metric that supports dist_args, then UMAP should not call pdist() with the dist_args argument. We will have that fixed today, and you can email me for a link if you need it before the next release to File Exchange (hopefully next week). Thanks for helping us see this bug, Nadou.

Nadou

Hello,

I tried to use run_umap with a user-defined distance function. I followed the recommendations for defining the distance function described in lines 96-108 of run_umap.m.

But I got errors when running the code. I found the same errors with the example given by KnnFind.ExampleDistFunc.

This is my line of code:

[reduction]=run_umap(X,'n_neighbors',20, 'Distance',@KnnFind.ExampleDistFunc,'n_components',2,'min_dist',0.1);

These are the errors :

Error using pdist (line 371)

Error evaluating distance function 'KnnFind.ExampleDistFunc'.

Error in UMAP/fit (line 560)

dmat = squareform(pdist(X, U.metric, U.dist_args));

Error in UMAP/fit_transform (line 674)

U = fit(U, X, y);

Error in run_umap (line 1442)

reduction = umap.fit_transform(inData);

Caused by:

Error using KnnFind.ExampleDistFunc

Too many input arguments.

I would appreciate your help.

Best regards,

Nadia
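The trace above is consistent with how pdist appears to treat a function handle: it calls the handle with one row and a block of rows, and forwards any extra distance arguments it was given, so a two-argument function fails with "Too many input arguments" when an extra argument is passed along. A minimal sketch of a two-argument distance function used directly with pdist follows. It requires the Statistics and Machine Learning Toolbox and R2016b or later (for implicit expansion); the handle myDist is hypothetical and not part of this package.

```matlab
rng(2);
X = rand(6, 4);

% pdist calls the handle as distfun(zi, ZJ): zi is 1-by-n, ZJ is m-by-n,
% and the result must be m-by-1 (distance from zi to each row of ZJ).
myDist = @(zi, ZJ) sqrt(sum((ZJ - zi).^2, 2));   % hypothetical Euclidean distance

dCustom  = pdist(X, myDist);        % no extra distance arguments passed
dBuiltin = pdist(X, 'euclidean');   % built-in metric for comparison

fprintf('max difference: %g\n', max(abs(dCustom - dBuiltin)));
```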

Stephen Meehan

Hi John Smith, concerning further discussion about parametric UMAP and the adjacency graph issues ... please email me and Connor Meehan using the addresses in run_umap.m. Thanks ... this is fun.

John Smith

Two small bugs when using seuclidean distance:

1. In run_umap, line 1351 should be 'euclidean', not 'euclidian'.

2. In UMAP, line 495: you cannot pass an empty U.dist_args to pdist (it fails).

You need to check whether it is empty and, if so, not pass it:

    if ~isempty(U.dist_args)
        dmat = squareform(pdist(X, U.metric, U.dist_args));
    else
        dmat = squareform(pdist(X, U.metric));
    end

Alternatively, you should initialize it to the default yourself.

Stephen Meehan

Very helpful 3 comments, John ... will work on these this week.

John Smith

With my data I'm getting the following warning:

Warning: The adjacency graph is not connected!

> In spectral_layout (line 35)

In simplicial_set_embedding (line 122)

In UMAP/fit (line 586)

In UMAP/fit_transform (line 606)

In run_umap (line 1422)

and the embedding has a strange magnitude (e.g., 1e15).

I hacked the code to use the init=random option of UMAP and the problem went away.

Two questions:

1. Why is this happening with the init=spectral option?

2. Can the 'init' option be made available to run_umap?

Thx

John Smith

OK, the problem is that the size of data points is set to 1 in DrawLabeled in ProbabilityDensity2.m, which makes them almost invisible. Unfortunately, this is NOT a parameter that is passed in, so it is not possible to change it without hacking the code.

John Smith

Hi Stephen,

MATLAB now has the Deep Learning Toolbox (I have been using it for the past two years), with much of the same functionality that TensorFlow has. For straightforward nets, which is what UMAP employs IMHO, it is fairly straightforward to switch to it or, better yet, to have hooks/callbacks/helpers/wrappers for users to provide their own nets. One thing to note is that as of version R2021a, DLT still does NOT support 2nd-order automatic differentiation, so you cannot use AD to differentiate an arbitrary function and then use THAT function in your loss function (there is no equivalent to tf.gradients or autograd.grad). Also, for custom training loops one needs the dlarray machinery, which is much more flexible but has fewer prebuilt tools.

Stephen Meehan

Hi John. We have not ported the other variants yet, nor do I know of ports ... but I was looking into Parametric yesterday. It would need TensorFlow-style functionality for the neural net ... I assume MATLAB has this somewhere ...

... will discuss with Connor next week.

John Smith

Hi,

Have you also ported (or do you know of ports of) the other UMAP variants (Parametric, dense)?

Thx

Stephen Meehan

Thanks, John Smith. Fixes to this small bug and others reported here are part of our v2.1 release on Jan 23, 2021.

John Smith

Terrific tool!

There's a small bug wherein a crash occurs when a single point is given for transform, because find doesn't return a column vector in its third output. It's easy to fix: just add

U.weights = U.weights(:); % ensure weights is a column vector

after line 729 in UMAP.m.
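The behaviour behind this report can be reproduced in base MATLAB: find returns row vectors when its input is a row vector, which is presumably what happens when only one point is transformed. The variable names below are illustrative and are not taken from UMAP.m.

```matlab
% A 1-by-4 sparse row, as might arise when only one point is transformed.
graphRow = sparse([0 0.4 0 0.6]);

[~, ~, weights] = find(graphRow);
fprintf('before: %d-by-%d\n', size(weights, 1), size(weights, 2));  % row vector

weights = weights(:);   % force a column vector regardless of input shape
fprintf('after:  %d-by-%d\n', size(weights, 1), size(weights, 2));
```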

Shixuan Liu

James Cai

Stephen Meehan

Thanks Michal. I SHOULD NOT have been so brief in my answer in the 2nd comment below this comment. A more complete answer to the (only) question I saw from you ("Which one is right?") would have been "NEITHER one is more right than the other one. If one were more right then we would NOT have mentioned both download options. Both are fine. Please note however that we only regression test UMAP with our version ... BUT I suspect Andrew's version has been run and tested many many many more times around the world".

And your new question in the comment below is helpful, because our package description might have avoided this confusion had we added that the creator's GitHub version of lobpcg.m and ours are BOTH fine to use with our UMAP. Our version's modifications were attempts at improvement that proved inconsequential to our UMAP. Thus the only way for users to get any future improvements to lobpcg itself is to know about the GitHub version from its inventor (Andrew Knyazev). We do plan ultimately to replace both lobpcg and MATLAB's eigs function with a C++ translation of eigsh from Python's SciPy. But until then, keep your eyes on Andrew Knyazev's GitHub.

Thanks again for getting me to be more clear.

Michal Kvasnicka

The lobpcg.m available here (https://www.mathworks.com/matlabcentral/fileexchange/48-locally-optimal-block-preconditioned-conjugate-gradient) is not identical to the lobpcg.m available on your Google Drive. I verified this once again.

So I ask again: Why do you mention in your package description the possibility to download lobpcg.m (original version), which is available here (https://www.mathworks.com/matlabcentral/fileexchange/48-locally-optimal-block-preconditioned-conjugate-gradient) when you actually use another version of this function?

Stephen Meehan

With regard to Pedro's query below, the reductions.mat file logic is for generating a unique integer ID for each UMAP reduction.

A quick and dirty HACK to avoid mat file I/O is to change line #745 in our latest uploaded run_umap.m. Currently this line reads

reduction=Map

Change this to

reductions=struct('load', @(s)disp(''), 'newId', floor(etime(clock, datevec('December 15 2020','mmmm dd yyyy'))), 'save', true);

Our own code that runs UMAP on a parallel cloud server does not use the run_umap wrapper but works directly with the UMAP class.

Stephen Meehan

The version of lobpcg.m on my Google Drive is identical to the one we tested run_umap with. If you type run_umap with no arguments, you are given the opportunity to download this and the MEX accelerants directly.

Michal Kvasnicka

There are two versions of lobpcg.m. The first is available on BLOPEX (GitHub) and the second is available from the run_umap Google Drive by Stephen Meehan. Which one is the right one?

Pedro Javier Gómez Gálvez

Great code!! Unfortunately, I have to run the UMAP algorithm thousands of times for my experiments and I can't use parallel computing. One of the problems is related to the repeated loading of "reductions.mat". Is there any way to solve this and speed up my computation? Thank you in advance.

Error using Map/load (line 250)

Cannot read file C:\Users\PedroPC\.run_umap\reductions.mat.

Error in run_umap (line 522)

reductions.load(fullfile(homeFolder, 'reductions.mat'));

Error in calculateUMAPValues (line 4)

[reduction{1,nIteration},umap]=run_umap(matrixChosenCcs,'verbose','none');

Error in UMAP_NDICIA (line 68)

[Proyections,eigenvectors,Ratio_UMAP]=calculateUMAPValues(matrixChosenCcs,nIteration,nImgType1,nImgType2,Proyections,eigenvectors,Ratio_UMAP,[cc1,cc2,cc3]);

Error in pipeline (line 48)

parfor nRand = 1 : nRandomizations

Corrado Ameli

Very nice.

Be aware that in the documentation the "cluster_2D_method" argument should be called "cluster_method_2D".

Dan O'Shea

Very nice work! Had some issues getting this running on Linux that are fairly easily resolved:

in run_umap L815 or so, instead of using `if ismac`, just use `mexext` to get the mex extension, since you need "mexa64" on linux.

```

exe=fullfile(curPath, sprintf('mexStochasticGradientDescent.%s', mexext));

```

Similarly in InstallMexAndExe.m L10:

```

mexFileName= sprintf('mexStochasticGradientDescent.%s', mexext);

```

Everything worked from there on the examples!

Thanks,

Dan

Stephen Meehan

run_umap can now access all resources and examples on the Herzenberg lab servers.

Stephen Meehan

Correction to my comment below: run_umap can not access examples on the Herzenberg servers.

Hayley Song

Hi, thank you for sharing this library. It's very useful for my project. I noticed that the 'run_umap' function currently errors out with 'metric' set to 'spearman', because 'spearman' is absent as a (key, value) pair in `U.METRIC_DICT` in the 'UMap' class definition. Adding the (key, value) pair ('spearman', 'spearman') in `Umap.m` (lines 153-161) fixed the error.

It'd be great if you could incorporate this change in the next version. Thanks again for sharing the code!

Hayley Song

Following up on my question below:

It seems there was a minor mistake: 'spearman' is missing from the 'UMap' class definition. I was able to make it work with the 'spearman' metric by adding the (key, value) pair ('spearman', 'spearman') in `Umap.m` (lines 153-161).

It'd be great if you could incorporate this change in the next version. Thanks again for sharing the code!

Ziwei Liu

Hi, I just downloaded this implementation and tried to run the example file (run_umap), but I got the following error:

Error using websave (line 98)

Could not access server. http://cgworkspace.cytogenie.org/GetDown2/demo/samples.zip.

Error in run_umap/downloadCsv (line 902)

websave(zipFile, ...

Error in run_umap (line 352)

csv_file_or_data=downloadCsv;

Could anyone help out with this? Thank you.

Mlfan

Frants Jensen

Hi Stephen, I have the same issue as Cortexlab: I would love to use a pre-computed (non-Euclidean) distance matrix. Do the changes he suggested (adding a 'precomputed' option to METRIC_DICT, and inserting dmat later instead of calculating it) make sense? Thanks! -Frants

Yuchun Ding

Hi, I'm trying to reduce from 100D to 2D. The size of the dataset is around 100k. I was wondering how long this should realistically take. With the default settings, the performance seems really, really slow.

Michal Kvasnicka

Update 1.5.0 with default MEX files works very well ... thanks for your effort!!!

Richard Gardner

Thanks for sharing this implementation – it's working great for me so far. The only problem I've experienced is in reproducing results with non-default parameter sets, even when I set the 'randomize' argument to false. I wonder if this issue might originate from the curve-fitting process in find_ab_params.m. The fit() function appears to use a random initial state (I see a warning about this every time I execute run_umap.m), and this occurs before UMAP.m sets the random seed. Could this be a bug, or am I getting something wrong?

Cortexlab

Hi, thanks so much for porting UMAP from Python to MATLAB. I have a question about running UMAP with precomputed distance matrices (which are supported in the Python version). I believe these are almost supported by your code, but one needs to make two modifications in the file UMAP.m: (1) change METRIC_DICT so that it includes the option 'precomputed'; (2) prevent the code from calculating dmat in that case (dmat is what the user passes). I *think* that by making these two changes I got things to work, but would you give me your opinion on whether this makes sense? Many thanks

-Matteo

Binxu

Biaobin Jiang

Tristan Wießalla

Stephen Meehan

Both exceptions that Bryan Bates found were reproduced and fixed this week, on Feb 21, in update 1.4.1.

Stephen Meehan

Thanks, Bryan Bates. I suspect run_umap does need more testing of combinations of input parameters. Please send details to me at swmeehan@stanford.edu. I need the input files plus an exact copy of the command you type. I look forward to getting a fix to you quickly. Thanks again.

Bryan Bates

Hi there! So far this function is awesome and has helped my project loads! However, there are a few more bugs that keep appearing that I'm having some trouble squashing. When adding a 'label_column' input argument to the run_umap() function (and ensuring that the last column of my input data has the labels), I get the following error:

"

Matrix index is out of range for deletion.

Error in run_umap (line 611)

parameter_names(args.label_column)=[];

611 parameter_names(args.label_column)=[];

"

I thought that simply commenting this out was enough of a fix, but then after UMAP runs, right before the last plotting execution, I get the next error:

"

Dot indexing is not supported for variables of this type.

Error in run_umap/updatePlot (line 938)

umap.supervisors.prepareForTemplate;

Error in run_umap (line 822)

updatePlot(reduction, true)

938 umap.supervisors.prepareForTemplate;

"

Could you guys help out with this? Thanks!

Stephen Meehan

Hi Michal. Thanks for your comment. Our overview indicates that the run_umap.m file is the starting place for effectively using this package. If you type "doc run_umap" on the command line AFTER downloading, you see a similar amount of textual information to what you see when you type "doc tsne". Can you (or anyone) send us a "how to" link that documents comprehensively how to add additional tabs like "Examples" to File Exchange so users can see the comprehensive documentation BEFORE they download? And is there a similar link explaining how to enrich documentation in m files to include pictures and web formatting? Sorry for not knowing this. Thanks again for your interest in improving our submission.

Michal Kvasnicka

I think it is really important to create some comprehensive documentation and/or a tutorial with examples. The upgrade from UMAP 1.3.4 to 1.4.0 significantly changed the whole UMAP concept (Python code). I am really not sure how to use this package effectively. I am just guessing ...

Stephen Meehan

Hi Mohammed,

We have updated the accepted metrics for UMAP in the latest update, 1.4.0. You can try running the new version and seeing if it fixes your problem.

If you are still receiving an error, would you mind sending us exactly what commands you are calling to receive this error so that we can try reproducing the error on our computers? You can e-mail it to us at swmeehan@stanford.edu or connor.gw.meehan@gmail.com.

Camden MacDowell

Really appreciate this contribution. Thank you. It is also easy to modify (logical flow). One occasional inconvenience is the restriction that the template file must be a saved-off mat file with parameter names, etc. There is an easy workaround, though: I removed lines 405-422 (i.e., the two if/then checks for template_file parameters) and replaced them with

    if ischar(template_file)
        [umap, ~, canLoad, reOrgData]=Template.Get(inData, parameter_names, ...
            template_file, 3);
    else
        umap = template_file;
        canLoad = [];
        reOrgData = [];
    end

Messy but a quick fix. Now template_file can just be the umap variable when calling run_umap.


Caleb Stoltzfus

Mohammed Mostafizur Rahman

Works great, but I ran into an issue. I was running the algorithm when it terminated midway. Now, whenever I run it, I get this error:

"Error using containers.Map/subsref

The specified key is not present in this container.

Error in UMAP/fit (line 340)

U.metric = U.METRIC_DICT(U.metric);

Error in UMAP/fit_transform (line 496)

U = fit(U, X, y);

Error in run_umap (line 542)

reduction = umap.fit_transform(inData);"

How to fix this? Thanks!

Rasmus Bro

Thanks a lot. With the Curve Fitting Toolbox installed it works perfectly.

Adam Sciambi

This code is fantastic. Thanks for putting it together. I use it daily.

One error that I've encountered though is in function "smooth_knn_dist" around line 81, reproduced below.

rho = aug_dists(idx) + interpolation*(aug_dists(idx) - aug_dists(idx+height));

Sometimes "idx+height" is out of bounds of "aug_dists". Since "idx" itself is defined to go up to numel(aug_dists), this makes sense that it could go over when added to. I just put in a corrective factor shown below and it seems to work. At the edge case, it interpolates one column inward, rather than outward.

correction = zeros(size(idx));

correction(idx+height>numel(aug_dists)) = -height;

rho = aug_dists(idx) + interpolation*(aug_dists(idx+correction) - aug_dists(idx+height+correction));

jiaxin

Error using -

Matrix dimensions must agree.

Error in smooth_knn_dist (line 84)

d = distances(:,2:end) - rho;

Error in fuzzy_simplicial_set (line 108)

[sigmas, rhos] = smooth_knn_dist(knn_dists, n_neighbors, local_connectivity);

Error in UMAP/fit (line 420)

U.graph = fuzzy_simplicial_set(X, U.n_neighbors, randomState, U.metric,

'metric_kwds', U.metric_kwds,...

Error in UMAP/fit_transform (line 486)

U = fit(U, X, y);

Error in run_umap (line 495)

reduction = umap.fit_transform(inData);

jiaxin

Beatriz Moya

Thanks for the code, it's been very useful! However, I have tried to reduce the model to a 3-dimensional system, and I get this error:

Error using UMAP/validate_parameters (line 303)

The Java and C methods currently only support reducing to 2 dimensions

Error in UMAP/fit (line 358)

validate_parameters(U);

Error in UMAP/fit_transform (line 470)

U = fit(U, X, y);

Error in run_umap (line 441)

reduction = umap.fit_transform(inData);

When is this option going to be available?

John

Stephen Meehan

Hi ageorge and Rasmus,

We've looked into the error that you are both receiving. We realized that one of the MATLAB functions that we call, fit.m, actually requires the MATLAB Curve Fitting Toolbox (https://www.mathworks.com/products/curvefitting.html) and we mistakenly did not list this requirement on the download page. If you do not have the Curve Fitting Toolbox installed, this would explain the errors that you are receiving. We have now listed this requirement on the download page.

As a workaround for MATLAB users who do not have the Curve Fitting Toolbox, we have now hard-coded in values for the outputs of find_ab_params.m when the inputs have particular default inputs. In particular, all the examples in the documentation of run_umap.m should now run in the current version 1.2.1 without any problems for users without the Curve Fitting Toolbox.
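For readers curious about what this fit computes, the sketch below follows the curve used in the Python reference implementation: a and b are chosen so that 1/(1 + a*x^(2b)) approximates a curve that equals 1 below min_dist and decays exponentially above it. It swaps the Curve Fitting Toolbox's fit for base MATLAB's fminsearch, which is an assumption made here to keep the example toolbox-free; it is not the code in find_ab_params.m, and the starting point [1, 1] is an arbitrary choice.

```matlab
spread   = 1.0;   % scale of the embedded points
min_dist = 0.1;   % minimum distance between embedded points

% Target curve: 1 below min_dist, exponential decay above it.
xv  = linspace(0, 3*spread, 300);
yv  = ones(size(xv));
far = xv >= min_dist;
yv(far) = exp(-(xv(far) - min_dist) / spread);

% Model curve and its sum of squared errors against the target.
curve = @(p, x) 1 ./ (1 + p(1) .* x.^(2*p(2)));
sse   = @(p) sum((curve(p, xv) - yv).^2);

% Derivative-free least-squares fit (swapped in for fit.m in this sketch).
p = fminsearch(sse, [1, 1]);
a = p(1);
b = p(2);
fprintf('a = %.4f, b = %.4f\n', a, b);
```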

Rasmus Bro

Hi there

I am really interested in trying this, but I am also running into problems. I tried your updated version here and the one on your homepage. I get the following error (on MATLAB R2019a):

[reduction,umap] = run_umap(rand(10,100));

ans =

20

ans =

20

java.awt.Point[x=793,y=53] java.awt.Dimension[width=1146,height=1006]

DUDE [UMAP for 10x100

n\_neighbors=\color{blue}30\color{black}, min\_dist=\color{blue}0.3\color{black}, metric=\color{blue}euclidean\color{black},randomize=\color{blue}0\color{black}, labels=\color{blue}0

Undefined function 'fit' for input arguments of type 'function_handle'.

Error in find_ab_params (line 43)

f = fit(xv', yv', curve);

Error in UMAP/fit (line 352)

[U.a, U.b] = find_ab_params(U.spread, U.min_dist);

Error in UMAP/fit_transform (line 470)

U = fit(U, X, y);

Error in run_umap (line 441)

reduction = umap.fit_transform(inData);

Seng Bum Yoo

Thank you so much for the code. I wonder whether your code supports embedding new data into an old embedding without modifying the old embedding. Is init_transform relevant to that purpose?

One question: unless you change the input parsing, it seems that 'n_epochs' is quite inflexible (I changed it myself). It would be great to have it as a free parameter, like n_neighbors, for example.

ageorge

I'm getting the following error when I run it:

Undefined function 'fit' for input arguments of type 'function_handle'.

Error in find_ab_params (line 43)

f = fit(xv', yv', curve);

Error in UMAP/fit (line 352)

[U.a, U.b] = find_ab_params(U.spread, U.min_dist);

Error in UMAP/fit_transform (line 470)

U = fit(U, X, y);

Error in run_umap (line 441)

reduction = umap.fit_transform(inData);

Lucy Davis

Stephen Meehan

Hi Joanna,

Sorry for the delayed response to your issue. We have just uploaded a major update (version 1.2.0) that may resolve the issue, so try downloading the latest version and seeing if it is fixed! What was previously line 273 in version 1.1.0 should now be line 391 in 1.2.0.

If you are still receiving an error, would you mind sending us the full text of the exception so that we can investigate it further? We are having trouble reproducing the error. You can e-mail it to us at swmeehan@stanford.edu or connor.meehan@shaw.ca.

If you require a temporary workaround, we recommend downloading our UMAP distribution directly from our Web site at http://cgworkspace.cytogenie.org/GetDown2/demo/umapDistribution.zip. We are able to include some additional code in this distribution that does not meet File Exchange criteria. If an exception occurs with this version, it will switch to running the algorithm in C instead.

Joanna Polanska

I got the same error as Damon. Line 273: nTh=edu.stanford.facs.swing.Umap.EPOCH_REPORTS+3;

How should I deal with it?

Thanks, Joanna

Henri Johansson

How much slower is it than the Python implementation?

Damon Clark

Thanks for putting this together! Line 273 in run_umap throws an error for me -- it looks like it may be calling some local variable. I believe I had everything in the path correctly.

nTh=edu.stanford.facs.swing.Umap.EPOCH_REPORTS+3;