Given a set of high-dimensional data, run_umap.m produces a lower-dimensional representation of the data for purposes of data visualization and exploration. See the comments at the top of the file run_umap.m for documentation and many examples of how to use this code.
The UMAP algorithm is the invention of Leland McInnes, John Healy, and James Melville. See their original paper for a long-form description (https://arxiv.org/pdf/1802.03426.pdf). Also see the documentation for the original Python implementation (https://umap-learn.readthedocs.io/en/latest/index.html).
This MATLAB implementation follows a very similar structure to the Python implementation, and many of the function descriptions are nearly identical.
Here are some major differences in this MATLAB implementation:
1) The MATLAB function eigs.m does not appear to be as fast as the function "eigsh" in the Python package SciPy. For large data sets, we therefore initialize the low-dimensional transform by binning the data with an algorithm known as probability binning. If you download and install the function lobpcg.m, made available here (https://www.mathworks.com/matlabcentral/fileexchange/48-locally-optimal-block-preconditioned-conjugate-gradient) by Andrew Knyazev, it can be used to find exact eigenvectors for medium-sized data sets. We also give you the option of downloading our slightly altered version of lobpcg.m, which produces equivalent results.
2) We have built in the optional ability to detect clusters in the low-dimensional output of UMAP. The clustering method we invoke is either DBM (described at https://www.hindawi.com/journals/abi/2009/686759/ ) for 2D reductions or DBSCAN (built into MATLAB R2019a and later) for reductions of any dimension. This produces cluster ID output and visualizations, as explained in the code examples.
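For readers unfamiliar with density-based clustering, the idea behind DBSCAN can be sketched in a few lines of Python. This is a minimal textbook version written for illustration only: it is neither MATLAB's built-in function nor the code this package runs, and the names dbscan, eps, min_pts and embedding, as well as the use of -1 for unclustered points, are assumptions of the sketch.

```python
from math import dist  # Euclidean distance, Python 3.8+


def dbscan(points, eps, min_pts):
    """Return one label per point: 1, 2, ... for clusters, -1 for noise."""
    UNVISITED = None
    NOISE = -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            # Not a core point; may still be claimed later as a border point.
            labels[i] = NOISE
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                # Noise reachable from a core point becomes a border point.
                labels[j] = cluster_id
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                # j is itself a core point, so the cluster grows through it.
                queue.extend(j_neighbors)
    return labels


# Hypothetical 2D embedding: two tight groups and one isolated point.
embedding = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
             (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
             (10.0, 0.0)]
labels = dbscan(embedding, eps=0.5, min_pts=3)
```

Because the sketch compares every pair of points, it is only practical for small inputs; production implementations use spatial indexing to find neighbors.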
3) We have also built in tools to quantify and visualize the difference between data groups. Data groups can be defined either by the clusters on UMAP’s reduction (described above) or by the classification labels which UMAP uses for supervised reductions or supervised template reductions. We use a change quantification metric which detects similarity in both mass and distance (described at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5818510/) as well as a score for measuring overlap when the groups are different classifications for the same data (described at https://en.wikipedia.org/wiki/F-score). For visualizing data groups we provide a dendrogram (described as QF tree at https://www.nature.com/articles/s42003-019-0467-6) and sortable tables which show each data group’s similarity, overlap, false positive % and false negative %. In version 2.2 we added “UMAP dimension explorer” (UDE). UDE is a sortable table that shows characteristics of a data group’s unreduced data in each input dimension. These characteristics include the Kullback-Leibler divergence (KLD); the distribution as a density bar (colored using MATLAB’s jet colormap); and the median, mean, SD and MAD. UDE supports data groups drawn by a MATLAB ROI tool (region of interest) on the UMAP output plot.
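To make the overlap score concrete, here is a small Python sketch of an F-score computed for two groups that classify the same data. It is an illustration of the standard F1 definition, not the package's code. The function name overlap_scores, the choice of which group counts as the reference, and the denominators used for the two percentages are all assumptions of this sketch and may differ from what the sortable tables report.

```python
def overlap_scores(reference, predicted):
    """Compare two groups given as collections of row indices of one data set.

    Assumption of this sketch: `reference` is the trusted classification and
    `predicted` is the group being evaluated against it.
    """
    ref, pred = set(reference), set(predicted)
    true_pos = len(ref & pred)
    false_pos = len(pred - ref)  # in the predicted group but not the reference
    false_neg = len(ref - pred)  # in the reference group but missed

    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(ref) if ref else 0.0
    if precision + recall == 0:
        f_score = 0.0
    else:
        # Standard F1: harmonic mean of precision and recall.
        f_score = 2 * precision * recall / (precision + recall)

    return {
        "f_score": f_score,
        # Assumed denominators: predicted size for false positives,
        # reference size for false negatives.
        "false_pos_pct": 100.0 * false_pos / len(pred) if pred else 0.0,
        "false_neg_pct": 100.0 * false_neg / len(ref) if ref else 0.0,
    }


# Hypothetical example: rows 0-9 carry the label, rows 5-12 fall in the cluster.
scores = overlap_scores(range(0, 10), range(5, 13))
```

In this example 5 rows are shared, 3 rows are in the cluster only and 5 rows are in the labeled group only, so the F-score is 2*5 / (2*5 + 3 + 5), roughly 0.56.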
Overall, this MATLAB UMAP implementation tends to be faster than the current Python implementation (version 0.5.1 of umap-learn). All UMAP reductions are made faster with C++ MEX implementations. Due to File Exchange requirements, we supply only the C++ source code for the MEX modules, so users must download or build the MEX binary files themselves (the option to download or build the files is provided upon calling "run_umap"). As examples 13 to 15 show, you can test the speed difference between the implementations on your own computer by setting the 'python' argument to true.
Additionally, users of supervised templates may request the post-reduction services of supervisor matching, QF tree, and QF dissimilarity. The function run_umap.m returns the results of these services via the new 4th output argument, extras. The properties of extras are documented in the file umap/UMAP_extra_results.m.
Optional toolbox dependencies:
-The Bioinformatics Toolbox is required to change the 'qf_tree' argument.
-The Curve Fitting Toolbox is required to change the 'min_dist' argument.
This implementation is a work in progress. It has been looked over by Leland McInnes, who considers it "a fairly faithful direct translation of the original Python code". We hope to continue improving it in the future.
Provided by the Herzenberg Lab at Stanford University.
We appreciate any and all help in finding bugs. Our priority has been determining whether our concepts for using UMAP supervised templates are suitable for research publications in flow cytometry.
Connor Meehan, Jonathan Ebrahimian, Wayne Moore, and Stephen Meehan (2021). Uniform Manifold Approximation and Projection (UMAP) (https://www.mathworks.com/matlabcentral/fileexchange/71902), MATLAB Central File Exchange.