Run Standalone MATLAB MapReduce Application
Supported Platform: Linux® only.
This example shows you how to create a standalone MATLAB® MapReduce application using the mcc command and
run it against a Hadoop® cluster.
Goal: Calculate the maximum arrival delay of an airline from the given dataset.
| Dataset: | airlinesmall.csv |
| Description: | Airline departure and arrival information from 1987-2008. |
| Location: | To download the airlinesmall.csv file into the current folder, run setupExample("matlab/AddKeysValuesExample", pwd). The AddKeysValuesExample.mlx live script file is automatically downloaded along with the airlinesmall.csv file. |
Prerequisites
Start this example by creating a new work folder that is visible to the MATLAB search path.
Before starting MATLAB, at a terminal, set the environment variable HADOOP_PREFIX to point to the Hadoop installation folder. For example:
| Shell | Command |
| csh / tcsh | % setenv HADOOP_PREFIX /usr/lib/hadoop |
| bash | $ export HADOOP_PREFIX=/usr/lib/hadoop |
Note
This example uses /usr/lib/hadoop as the directory where Hadoop is installed. Your Hadoop installation directory may be different.
If you forget to set the HADOOP_PREFIX environment variable before starting MATLAB, set it using the MATLAB function setenv at the MATLAB command prompt as soon as you start MATLAB. For example:
setenv('HADOOP_PREFIX','/usr/lib/hadoop')
Install the MATLAB Runtime in a folder that is accessible by every worker node in the Hadoop cluster. This example uses /usr/local/MATLAB/MATLAB_Runtime/R2025b as the location of the MATLAB Runtime folder.
If you don’t have the MATLAB Runtime, you can download it from the website at: https://www.mathworks.com/products/compiler/mcr.
Note
For information about MATLAB Runtime version numbers corresponding to MATLAB releases, see this list.
Copy the map function maxArrivalDelayMapper.m from the /usr/local/MATLAB/R2025b/toolbox/matlab/demos folder to the work folder.
For more information, see Write a Map Function.
Copy the reduce function maxArrivalDelayReducer.m from the matlabroot/toolbox/matlab/demos folder to the work folder.
For more information, see Write a Reduce Function.
Create the directory /user/<username>/datasets on HDFS™ and copy the file airlinesmall.csv to that directory. Here <username> refers to your user name in HDFS.
$ ./hadoop fs -copyFromLocal airlinesmall.csv hdfs://host:54310/user/<username>/datasets
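Before involving Hadoop, it can help to check the map and reduce logic locally. The following is a minimal sketch, not the shipped code: sketchMapper and sketchReducer are hypothetical stand-ins written to follow the usual map and reduce function signatures, and the shipped maxArrivalDelayMapper.m and maxArrivalDelayReducer.m may differ in detail. It assumes airlinesmall.csv is in the current folder or on the MATLAB path, and it runs serially in the MATLAB client using mapreducer(0).

```matlab
% Local dry run of the max-arrival-delay logic (no Hadoop involved).
% Assumption: airlinesmall.csv is in the current folder or on the MATLAB path.
mrLocal = mapreducer(0);   % serial execution in the MATLAB client
dsLocal = datastore('airlinesmall.csv', ...
    'TreatAsMissing','NA', ...
    'SelectedVariableNames',{'UniqueCarrier','ArrDelay'});
localResult = mapreduce(dsLocal, @sketchMapper, @sketchReducer, mrLocal);
localTable = readall(localResult)

function sketchMapper(data, ~, intermKVStore)
    % Hypothetical mapper: largest arrival delay in this chunk of the table.
    % max ignores NaN values (the 'NA' entries) by default.
    partMax = max(data.ArrDelay);
    add(intermKVStore, 'PartialMaxArrivalDelay', partMax);
end

function sketchReducer(~, intermValIter, outKVStore)
    % Hypothetical reducer: largest of the per-chunk maxima.
    maxVal = -Inf;
    while hasnext(intermValIter)
        maxVal = max(getnext(intermValIter), maxVal);
    end
    add(outKVStore, 'MaxArrivalDelay', maxVal);
end
```

If the sketch matches the shipped functions, the single key in localTable should be 'MaxArrivalDelay', which is the key that appears in the output at the end of the Procedure.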
Procedure
Start MATLAB and verify that the HADOOP_PREFIX environment variable has been set. At the command prompt, type:
>> getenv('HADOOP_PREFIX')
If ans is empty, review the Prerequisites section above to see how you can set the HADOOP_PREFIX environment variable.
Create a new MATLAB script with the name depMapRedStandAlone.m. You will add the code from the steps below to this script file.
Create a datastore that points to the airline data in Hadoop Distributed File System (HDFS).
ds = datastore('hdfs:///user/<username>/datasets/airlinesmall.csv',...
    'TreatAsMissing','NA',...
    'SelectedVariableNames',{'UniqueCarrier','ArrDelay'});
For more information, see Work with Remote Data.
Configure the application for deployment against Hadoop with default settings.
config = matlab.mapreduce.DeployHadoopMapReducer;
The class matlab.mapreduce.DeployHadoopMapReducer can be used to configure a standalone application based on the Hadoop environment where it is going to be deployed.
For example, if you want to specify the location of the MATLAB Runtime on each of the worker nodes on the cluster, include a line of code similar to this:
config = matlab.mapreduce.DeployHadoopMapReducer('MCRRoot','/opt/MATLAB/MATLAB_Runtime/R2025b');
In this scenario, we assume that the MATLAB Runtime is installed in a non-default location such as /opt/MATLAB/MATLAB_Runtime on the worker nodes.
For information on specifying additional cluster-specific properties, see matlab.mapreduce.DeployHadoopMapReducer.
Note
Specifying a MATLAB Runtime location as part of the class matlab.mapreduce.DeployHadoopMapReducer will override any MATLAB Runtime location specified during the execution of the standalone application.
Define the execution environment using the mapreducer.
mr = mapreducer(config);
Apply the mapreduce function.
result = mapreduce(...
    ds,...
    @maxArrivalDelayMapper,@maxArrivalDelayReducer,...
    mr,...
    'OutputType','Binary', ...
    'OutputFolder','hdfs:///user/<username>/results/myresults');
Note
An HDFS directory such as .../myresults can be written to only once. If you plan on running your standalone application multiple times against the Hadoop cluster, make sure you delete the .../myresults directory on HDFS prior to each execution. Another option is to change the name of the .../myresults directory in the MATLAB code and recompile the application.
Read the result from the resulting datastore.
myAppResult = readall(result)
Use the mcc command with the -m flag to create a standalone application.
mcc -m depMapRedStandAlone.m
The -m flag creates a standard executable that can be run from a command line. However, the mcc command cannot package the results in an installer.
Run the standalone application from a Linux shell using the following command:
$ ./run_depMapRedStandAlone.sh /usr/local/MATLAB/MATLAB_Runtime/R2025b
/usr/local/MATLAB/MATLAB_Runtime/R2025b is an argument indicating the location of the MATLAB Runtime.
Prior to executing the above command, verify that the HADOOP_PREFIX environment variable is set in the Terminal by typing:
$ echo $HADOOP_PREFIX
If echo comes up empty, see the Prerequisites section above to see how you can set the HADOOP_PREFIX environment variable.
Your application will fail to execute if the HADOOP_PREFIX environment variable is not set.
You will see the following output:
myAppResult =
           Key             Value
    _________________    ______
    'MaxArrivalDelay'    [1014]
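The note on write-once output directories offers two options: delete the results directory before each run, or rename it and recompile. A third possibility, sketched here as an assumption and not as part of the original example, is to build the output folder name at run time so that each execution writes to a fresh directory. The myresults_ prefix and the timestamp format are illustrative choices, and <username> remains a placeholder for your HDFS user name.

```matlab
% Sketch: build a unique HDFS output folder name for each run, so the
% write-once restriction on the results directory is not hit on reruns.
% '<username>' is a placeholder; the timestamp suffix is an illustrative choice.
baseFolder = 'hdfs:///user/<username>/results';
stamp = char(datetime('now','Format','yyyyMMdd_HHmmss'));
outFolder = [baseFolder '/myresults_' stamp];
disp(outFolder)

% In depMapRedStandAlone.m, outFolder would replace the fixed path, e.g.:
%   result = mapreduce(ds, @maxArrivalDelayMapper, @maxArrivalDelayReducer, ...
%       mr, 'OutputType','Binary', 'OutputFolder',outFolder);
```

Two runs started within the same second would still collide, and old result directories accumulate on HDFS, so periodic cleanup is still needed with this approach.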
To learn more about using the map and
reduce functions, see Getting Started with MapReduce.
Complete code for the standalone application
depMapRedStandAlone can be found here:
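Assembled from the steps in the Procedure, the script would look roughly like the sketch below. It uses the default deployment configuration. <username> is a placeholder for your HDFS user name, and the script assumes maxArrivalDelayMapper.m and maxArrivalDelayReducer.m are in the work folder as described in the Prerequisites.

```matlab
% depMapRedStandAlone.m, assembled from the Procedure steps.
% Replace <username> with your HDFS user name before compiling.

% Datastore that points to the airline data on HDFS.
ds = datastore('hdfs:///user/<username>/datasets/airlinesmall.csv',...
    'TreatAsMissing','NA',...
    'SelectedVariableNames',{'UniqueCarrier','ArrDelay'});

% Deployment configuration for Hadoop with default settings.
config = matlab.mapreduce.DeployHadoopMapReducer;

% Execution environment based on that configuration.
mr = mapreducer(config);

% Run the job. The output folder must not already exist on HDFS.
result = mapreduce(...
    ds,...
    @maxArrivalDelayMapper,@maxArrivalDelayReducer,...
    mr,...
    'OutputType','Binary', ...
    'OutputFolder','hdfs:///user/<username>/results/myresults');

% Read and display the result.
myAppResult = readall(result)
```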
See Also
datastore | TabularTextDatastore | KeyValueDatastore | matlab.mapreduce.DeployHadoopMapReducer | mcc
