Contenu principal

Analyze Large Data in Database Using MapReduce

This example determines the mean arrival delay of a large set of flight data that is stored in a database. You can access large data sets using a databaseDatastore object with Database Toolbox™. After creating a DatabaseDatastore object, you can write a MapReduce algorithm that defines the chunking and reduction of the data. Alternatively, you can use a tall array to run algorithms on large data sets.

The DatabaseDatastore object does not support using a parallel pool with Parallel Computing Toolbox™ installed. To analyze data using tall arrays or run MapReduce algorithms, set the global execution environment to be the local MATLAB® session.

This example uses a preconfigured JDBC data source to create the database connection. For more information, see the configureJDBCDataSource function.

Create DatabaseDatastore Object

Set the global execution environment to be the local MATLAB® session.

mapreducer(0);

The file airlinesmall.csv contains a large set of flight data. Load this file into the Microsoft® SQL Server® database table airlinesmall. This table contains 123,523 records.

Create a database connection to the JDBC data source MSSQLServerJDBCAuth. This data source configures a JDBC driver to a Microsoft® SQL Server® database with Windows® authentication. Specify a blank user name and password.

datasource = "MSSQLServerJDBCAuth";
username = "";
password = "";
conn = database(datasource,username,password);

Create a DatabaseDatastore object using the database connection and an SQL query. This SQL query retrieves arrival-delay data from the airlinesmall table.

sqlquery = 'select ArrDelay from airlinesmall';

dbds = databaseDatastore(conn,sqlquery);

Define Mapper and Reducer Functions

To process large data sets in chunks, you can write your own mapper function. For each chunk in this example, use meanArrivalDelayMapper.m to:

  • Read arrival-delay data from the DatabaseDatastore object.

  • Determine the number of delays and the total delay in the chunk.

  • Store both values in KeyValueDatastore.

The meanArrivalDelayMapper.m file contains this code.

function meanArrivalDelayMapper (data, info, intermKVStore)
% Mapper function for the MeanMapReduceExample.

% Copyright 2014 The MathWorks, Inc.

% Data is an n-by-1 table of the ArrDelay. Remove missing value first:
data(isnan(data.ArrDelay),:) = [];

% Record the partial counts and sums and the reducer will accumulate them.
partCountSum = [length(data.ArrDelay), sum(data.ArrDelay)];
add(intermKVStore, 'PartialCountSumDelay',partCountSum);

You also can write your own reducer function. In this example, use meanArrivalDelayReducer.m to read intermediate values for the number of delays and the total arrival delay. Then, determine the overall mean arrival delay. mapreduce calls this reducer function only once because the mapper function adds just one key to KeyValueStore. The meanArrivalDelayReducer.m file contains this code.

function meanArrivalDelayReducer(intermKey, intermValIter, outKVStore)
% Reducer function for the MeanMapReduceExample.

% Copyright 2014 The MathWorks, Inc.

% intermKey is 'PartialCountSumDelay'
count = 0;
sum = 0;
while hasnext(intermValIter)
   countSum = getnext(intermValIter);
   count = count + countSum(1);
   sum = sum + countSum(2);
end

meanDelay = sum/count;

% The key-value pair added to outKVStore will become the output of mapreduce 
add(outKVStore,'MeanArrivalDelay',meanDelay);

Run MapReduce Using Mapper and Reducer Functions

To determine the mean arrival delay in the flight data, run MapReduce with the DatabaseDatastore object, mapper function, and reducer function.

outds = mapreduce(dbds,@meanArrivalDelayMapper,@meanArrivalDelayReducer);
********************************
*      MAPREDUCE PROGRESS      *
********************************
Map   0% Reduce   0%
Map  15% Reduce   0%
Map  30% Reduce   0%
Map  46% Reduce   0%
Map  61% Reduce   0%
Map  76% Reduce   0%
Map  92% Reduce   0%
Map 100% Reduce   0%
Map 100% Reduce 100%

Display Output from MapReduce

Read the table from the output datastore using readall.

outtab = readall(outds)
outtab =

  1×2 table

           Key             Value  
    __________________    ________

    'MeanArrivalDelay'    [7.1201]

The table has only one row containing one key-value pair.

Display the mean arrival delay from the table.

meanArrDelay = outtab.Value{1}
meanArrDelay =

    7.1201

Close DatabaseDatastore Object and Database Connection

close(dbds)

See Also

| | | | | |

Topics

External Websites