This example shows how to perform statistical analysis and machine learning on out-of-memory data with MATLAB® and Statistics and Machine Learning Toolbox™.
Tall arrays and tables are designed for working with out-of-memory data. This type of data consists of a very large number of rows (observations) compared to a smaller number of columns (variables). Instead of writing specialized code that takes into account the huge size of the data, such as with MapReduce, tall arrays let you work with large data sets in a manner similar to in-memory MATLAB arrays. The fundamental difference is that tall arrays typically remain unevaluated until you request that the calculations be performed.
When you execute calculations on tall arrays, the default execution environment uses either the local MATLAB session or a local parallel pool (if you have Parallel Computing Toolbox™). You can use the mapreducer function to change the execution environment.
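For example, a minimal sketch of switching the execution environment; the second call assumes Parallel Computing Toolbox and an open parallel pool:

mapreducer(0);      % run tall calculations serially in the local MATLAB session
mapreducer(gcp);    % run tall calculations on the current parallel pool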
This example works with a subset of data on a single computer to develop a linear regression model, and then it scales up to analyze the entire data set. You can scale up this analysis even further to:
Work with data that cannot be read into memory
Work with data distributed across clusters using MATLAB Distributed Computing Server™
Integrate with big data systems like Hadoop® and Spark®
Several unsupervised and supervised learning algorithms in Statistics and Machine Learning Toolbox work with tall arrays, so you can perform data mining and predictive modeling on out-of-memory data. These algorithms can include slight variations from their in-memory counterparts. Capabilities include:
k-Means clustering
Linear regression
Generalized linear regression
Logistic regression
Discriminant analysis
The machine learning workflow for out-of-memory data in MATLAB is similar to the workflow for in-memory data:
Preprocess
Explore
Develop model
Validate model
Scale up to larger data
This example follows a similar structure in developing a predictive model for airline delays. The data is a large file of airline flight information from 1987 through 2008. The goal is to predict the departure delay based on a number of variables.
Details on the fundamental aspects of tall arrays are included in the example Analyze Big Data in MATLAB Using Tall Arrays (MATLAB). This example extends the analysis to include machine learning with tall arrays.
A datastore is a repository for collections of data that are too large to fit in memory. Creating a datastore from one of a number of supported file formats is the first step in creating a tall array from an external data source.
Create a datastore for the sample file airlinesmall.csv. Select the variables of interest, treat 'NA' values as missing data, and generate a preview table of the data.
ds = datastore(fullfile(matlabroot,'toolbox','matlab','demos','airlinesmall.csv'));
ds.SelectedVariableNames = {'Year','Month','DayofMonth','DayOfWeek',...
    'DepTime','ArrDelay','DepDelay','Distance'};
ds.TreatAsMissing = 'NA';
pre = preview(ds)
pre =

  8×8 table

    Year    Month    DayofMonth    DayOfWeek    DepTime    ArrDelay    DepDelay    Distance
    ____    _____    __________    _________    _______    ________    ________    ________

    1987     10          21            3           642         8          12         308
    1987     10          26            1          1021         8           1         296
    1987     10          23            5          2055        21          20         480
    1987     10          23            5          1332        13          12         296
    1987     10          22            4           629         4          -1         373
    1987     10          28            3          1446        59          63         308
    1987     10           8            4           928         3          -2         447
    1987     10          10            6           859        11          -1         954
Create a tall table backed by the datastore to facilitate working with the data. The underlying data type of a tall array depends on the type of datastore. In this case, the datastore is tabular text and returns a tall table. The display includes a preview of the data, with an indication that the number of rows is unknown.
tt = tall(ds)
Starting parallel pool (parpool) using the 'local' profile ... connected to 6 workers.

tt =

  M×8 tall table

    Year    Month    DayofMonth    DayOfWeek    DepTime    ArrDelay    DepDelay    Distance
    ____    _____    __________    _________    _______    ________    ________    ________

    1987     10          21            3           642         8          12         308
    1987     10          26            1          1021         8           1         296
    1987     10          23            5          2055        21          20         480
    1987     10          23            5          1332        13          12         296
    1987     10          22            4           629         4          -1         373
    1987     10          28            3          1446        59          63         308
    1987     10           8            4           928         3          -2         447
    1987     10          10            6           859        11          -1         954
    :        :           :             :           :          :           :          :
    :        :           :             :           :          :           :          :
This example explores the time of day and day of week in more detail. Convert the day of week to categorical data with labels, and determine the hour of day from the numeric departure time variable.
tt.DayOfWeek = categorical(tt.DayOfWeek,1:7,{'Sun','Mon','Tues',...
    'Wed','Thu','Fri','Sat'});
tt.Hr = discretize(tt.DepTime,0:100:2400,0:23)
tt =

  M×9 tall table

    Year    Month    DayofMonth    DayOfWeek    DepTime    ArrDelay    DepDelay    Distance    Hr
    ____    _____    __________    _________    _______    ________    ________    ________    __

    1987     10          21          Tues          642         8          12         308        6
    1987     10          26          Sun          1021         8           1         296       10
    1987     10          23          Thu          2055        21          20         480       20
    1987     10          23          Thu          1332        13          12         296       13
    1987     10          22          Wed           629         4          -1         373        6
    1987     10          28          Tues         1446        59          63         308       14
    1987     10           8          Wed           928         3          -2         447        9
    1987     10          10          Fri           859        11          -1         954        8
    :        :           :           :             :           :           :          :         :
    :        :           :           :             :           :           :          :         :
Include only data from the year 2000 onward and ignore rows with missing values. Use a logical condition to identify the rows of interest.
idx = tt.Year >= 2000 & ...
~any(ismissing(tt),2);
tt = tt(idx,:);
A number of exploratory functions are available for tall arrays. For a list of supported statistics functions, see Tall Array Support, Usage Notes, and Limitations.
The grpstats function calculates grouped statistics of tall arrays. Explore the data by determining the centrality and spread of the data with summary statistics grouped by day of week. Also, explore the correlation between the departure delay and arrival delay.
g = grpstats(tt(:,{'ArrDelay','DepDelay','DayOfWeek'}),'DayOfWeek',...
    {'mean','std','skewness','kurtosis'})
g =

  M×11 tall table

    GroupLabel    DayOfWeek    GroupCount    mean_ArrDelay    std_ArrDelay    skewness_ArrDelay    kurtosis_ArrDelay    mean_DepDelay    std_DepDelay    skewness_DepDelay    kurtosis_DepDelay
    __________    _________    __________    _____________    ____________    _________________    _________________    _____________    ____________    _________________    _________________

        ?             ?            ?               ?               ?                  ?                    ?                  ?               ?                  ?                    ?
        ?             ?            ?               ?               ?                  ?                    ?                  ?               ?                  ?                    ?
        ?             ?            ?               ?               ?                  ?                    ?                  ?               ?                  ?                    ?
        :             :            :               :               :                  :                    :                  :               :                  :                    :
        :             :            :               :               :                  :                    :                  :               :                  :                    :
C = corr(tt.DepDelay,tt.ArrDelay)
C =

  M×N×... tall array

    ?    ?    ?    ...
    ?    ?    ?    ...
    ?    ?    ?    ...
    :    :    :
    :    :    :
These commands produce more tall arrays. The commands are not executed until you explicitly gather the results into the workspace. The gather command triggers execution and attempts to minimize the number of passes through the data required to perform the calculations. gather requires that the resulting variables fit into memory.
[statsByDay,C] = gather(g,C)
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 1: Completed in 14 sec
Evaluation completed in 24 sec

statsByDay =

  7×11 table

    GroupLabel    DayOfWeek    GroupCount    mean_ArrDelay    std_ArrDelay    skewness_ArrDelay    kurtosis_ArrDelay    mean_DepDelay    std_DepDelay    skewness_DepDelay    kurtosis_DepDelay
    __________    _________    __________    _____________    ____________    _________________    _________________    _____________    ____________    _________________    _________________

     'Wed'          Wed           8489          9.3324           37.406            5.1638               57.479             10               33.426            6.4336               85.426
     'Fri'          Fri           7339          4.1512           32.1               7.082              120.53               7.0857           29.339            8.9387              168.37
     'Sat'          Sat           8045          7.132            33.108             3.6457              22.991               9.1557           29.731            4.5135               31.228
     'Tues'         Tues          8381          6.4786           32.322             4.374               38.694               7.6083           28.394            5.2012               46.249
     'Mon'          Mon           8443          5.2487           32.453             4.5811              37.175               6.8319           28.573            5.6468               50.271
     'Sun'          Sun           8570          7.7515           36.003             5.7943              80.91                9.3324           32.516            7.2146              118.25
     'Thu'          Thu           8601         10.053            36.18              4.1381              37.051              10.923            34.708            1.1414              138.38

C =

    0.8966
The variables containing the results are now in-memory variables in the Workspace. Based on these calculations, the delays vary across days of the week, and the departure and arrival delays are correlated, which you can investigate further.
Explore the effect of day of week and hour of day and gain additional statistical information such as the standard error of the mean and the 95% confidence interval for the mean. You can pass the entire tall table and specify which variables to perform calculations on.
byDayHr = grpstats(tt,{'Hr','DayOfWeek'},...
    {'mean','sem','meanci'},'DataVar','DepDelay');
byDayHr = gather(byDayHr);
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 1: Completed in 2.5 sec
Evaluation completed in 5.7 sec
Due to the data partitioning of the tall array, the output might be unordered. Rearrange the data in memory for further exploration.
x = unstack(byDayHr(:,{'Hr','DayOfWeek','mean_DepDelay'}),...
    'mean_DepDelay','DayOfWeek');
x = sortrows(x)
x =

  24×8 table

    Hr      Sun        Mon         Tues        Wed        Thu        Fri        Sat
    __    _______    ________    ________    _______    _______    _______    _______

     0     38.519      71.914      39.656     34.667         90     25.536     65.579
     1     45.846      27.875        93.6     125.23     52.765     38.091     29.182
     2        NaN          39         102        NaN      78.25       -1.5        NaN
     3        NaN         NaN         NaN        NaN     -377.5       53.5        NaN
     4         -7     -6.2857          -7    -7.3333      -10.5         -5        NaN
     5    -2.2409     -3.7099     -4.0146    -3.9565    -3.5897    -3.5766    -4.1474
     6        0.4     -1.8909     -1.9802    -1.8304    -1.3578    0.84161    -2.2537
     7     3.4173    -0.47222    -0.18893    0.71546       0.08      1.069    -1.3221
     8     2.3759      1.4054      1.6745     2.2345     2.9668     1.6727    0.88213
     9     2.5325      1.6805      2.7656      2.683     5.6138     3.4838     2.5011
    10       6.37      5.2868      3.6822     7.5773     5.3372     6.9391     4.9979
    11     6.9946      4.9165      5.5639     5.5936     7.0435     4.8989     5.2839
    12      5.673      5.1193      5.7081     7.9178     7.5269     8.0625     7.4686
    13     8.0879      7.1017      5.0857     8.8082     8.2878     8.0675     6.2107
    14     9.5164      5.8343       7.416     9.5954     8.6667     6.0677      8.444
    15     8.1257      4.8802      7.4726     9.8674     10.235      7.167     8.6219
    16     12.302      7.4968      11.406     12.413     12.874     10.962     12.908
    17      11.47      8.9495      10.658     12.961     13.487     7.9034     8.9327
    18     15.148      13.849      11.266     15.406     16.706     11.022     13.042
    19      14.77      11.618      15.053     17.561     21.032     12.644     16.404
    20     17.711      13.942      17.105     22.382     25.945     11.223     22.152
    21     23.727      17.276      23.092     25.794     28.828     14.011     22.682
    22     29.383      24.949      28.265     30.649      37.38     24.328     36.272
    23     38.296      33.966      34.904     47.592     49.523       29.5     44.122
Currently, you can visualize tall array data using histogram, histogram2, binScatterPlot, and ksdensity. These visualizations all trigger execution, similar to calling the gather function.
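For example, a minimal sketch of a histogram computed directly on a tall variable; the axis labels are illustrative choices for this sketch:

figure
histogram(tt.DepDelay)          % triggers evaluation of the tall array
xlabel('Departure Delay (Minutes)')
ylabel('Count')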
Use binScatterPlot to examine the relationship between the Hr and DepDelay variables.
binScatterPlot(tt.Hr,tt.DepDelay,'Gamma',0.25)
ylim([0 500])
xlabel('Time of Day')
ylabel('Delay (Minutes)')
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 1: Completed in 1.8 sec
Evaluation completed in 2.2 sec
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 1: Completed in 1.2 sec
Evaluation completed in 1.5 sec
As noted in the output display, the visualizations often take two passes through the data: one to perform the binning, and one to perform the binned calculation and produce the visualization.
To develop a machine learning model, it is useful to reserve part of the data to train and develop the model and another part of the data to test the model. There are several ways to split the data into training and validation sets.
Use datasample to obtain a random sampling of the data. Then use cvpartition to partition the data into test and training sets. To obtain nonstratified partitions, set a uniform grouping variable by multiplying the data samples by zero.
rng(1234)
data = datasample(tt,25000,'Replace',false);
groups = 0*data.DepDelay;
y = cvpartition(groups,'HoldOut',1/3);
dataTrain = data(training(y),:);
dataTest = data(test(y),:);
Build a model to predict the departure delay based on several variables. The linear regression model function fitlm behaves similarly to the in-memory function. However, calculations with tall arrays result in a CompactLinearModel object, which is more efficient for large data sets. Model fitting triggers execution because it is an iterative process.
model = fitlm(dataTrain,'ResponseVar','DepDelay')
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 2: Completed in 1.1 sec
- Pass 2 of 2: Completed in 3.1 sec
Evaluation completed in 5 sec

model =

Compact linear regression model:
    DepDelay ~ [Linear formula with 9 terms in 8 predictors]

Estimated Coefficients:
                        Estimate         SE           tStat         pValue
                       __________    __________    __________    __________

    (Intercept)           -16.821        99.942      -0.16831       0.86634
    Year                0.0073777      0.049861       0.14797       0.88237
    Month                0.011477      0.036855       0.31142       0.75549
    DayofMonth           0.014358      0.014252        1.0074       0.31375
    DayOfWeek_Mon        -0.30433       0.46577      -0.65339       0.51351
    DayOfWeek_Tues       -0.35532       0.46616      -0.76222       0.44594
    DayOfWeek_Wed        -0.30464       0.46359      -0.65712       0.51111
    DayOfWeek_Thu        -0.39379       0.46549      -0.84597       0.39758
    DayOfWeek_Fri         0.96788       0.47853        2.0226       0.04313
    DayOfWeek_Sat      -0.0045176        0.4705    -0.0096017       0.99234
    DepTime              0.010517     0.0070886        1.4837       0.13791
    ArrDelay              0.78984     0.0037274         211.9             0
    Distance           0.00096613    0.00021912        4.4092    1.0439e-05
    Hr                   -0.76579       0.70807       -1.0815       0.27948

Number of observations: 16668, Error degrees of freedom: 16654
Root Mean Squared Error: 16.3
R-squared: 0.739,  Adjusted R-Squared 0.739
F-statistic vs. constant model: 3.63e+03, p-value = 0
The display indicates fit information, as well as coefficients and associated coefficient statistics.
The model variable contains information about the fitted model as properties, which you can access using dot notation. Alternatively, double-click the variable in the Workspace to explore the properties interactively.
model.Rsquared
ans =

  struct with fields:

    Ordinary: 0.7394
    Adjusted: 0.7392
Predict new values based on the model, calculate the residuals, and visualize them using a histogram. The predict function predicts new values for both tall and in-memory data.
pred = predict(model,dataTest);
err = pred - dataTest.DepDelay;
figure
histogram(err,'BinLimits',[-100 100],'Normalization','pdf')
title('Histogram of Residuals')
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 2: Completed in 3 sec
- Pass 2 of 2: Completed in 1.8 sec
Evaluation completed in 5.5 sec
Looking at the output p-values in the display, some variables might be unnecessary in the model. You can reduce the complexity of the model by removing these variables.
Examine the significance of the variables in the model more closely using anova.
a = anova(model)
a =

  9×5 table

                    SumSq        DF       MeanSq         F         pValue
                  __________    _____    __________    ________    __________

    Year              5.7861        1        5.7861    0.021894       0.88237
    Month              25.63        1         25.63    0.096981       0.75549
    DayofMonth        268.21        1        268.21      1.0149       0.31375
    DayOfWeek         3082.5        6        513.74       1.944      0.069969
    DepTime           581.76        1        581.76      2.2013       0.13791
    ArrDelay      1.1867e+07        1    1.1867e+07       44902             0
    Distance          5137.8        1        5137.8      19.441    1.0439e-05
    Hr                309.12        1        309.12      1.1697       0.27948
    Error         4.4012e+06    16654        264.28
Based on the p-values, the variables Year, Month, and DayofMonth are not significant to this model, so you can remove them without negatively affecting the model quality.
To explore these model parameters further, use interactive visualizations such as plotSlice, plotInteraction, and plotEffects. For example, use plotEffects to examine the estimated effect that each predictor variable has on the departure delay.
plotEffects(model)
Based on these calculations, ArrDelay is the main effect in the model (it is highly correlated with DepDelay). The other effects are observable, but have much less impact. In addition, Hr was determined from DepTime, so only one of these variables is necessary for the model.
Reduce the number of variables to exclude all date components, and then fit a new model.
model2 = fitlm(dataTrain,'DepDelay ~ DepTime + ArrDelay + Distance')
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 2: Completed in 0.8 sec
- Pass 2 of 2: Completed in 1.6 sec
Evaluation completed in 3 sec

model2 =

Compact linear regression model:
    DepDelay ~ 1 + DepTime + ArrDelay + Distance

Estimated Coefficients:
                    Estimate          SE          tStat        pValue
                   __________    __________    _______    __________

    (Intercept)       -1.5597       0.41143    -3.7909    0.00015064
    DepTime         0.0028395    0.00026665     10.649    2.1499e-26
    ArrDelay           0.7896     0.0037216     212.17             0
    Distance       0.00097827    0.00021896     4.4678    7.9559e-06

Number of observations: 16668, Error degrees of freedom: 16664
Root Mean Squared Error: 16.3
R-squared: 0.739,  Adjusted R-Squared 0.739
F-statistic vs. constant model: 1.57e+04, p-value = 0
Even with the simplified model, it can be useful to further adjust the relationships between the variables and include specific interactions. To experiment further, repeat this workflow with smaller tall arrays. For better performance while tuning the model, consider working with a small, in-memory extraction of the data before scaling up to the entire tall array.
In this example, you can use functionality like stepwise regression, which is suited for iterative, in-memory model development. After tuning the model, you can scale up to use tall arrays.
Gather a subset of the data into the workspace and use stepwiselm to iteratively develop the model in memory.
subset = gather(dataTest);
sModel = stepwiselm(subset,'ResponseVar','DepDelay')
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 1: Completed in 1.6 sec
Evaluation completed in 1.6 sec
1. Adding ArrDelay, FStat = 44489.1181, pValue = 0
2. Adding DepTime, FStat = 64.5107, pValue = 1.09134e-15
3. Adding DepTime:ArrDelay, FStat = 79.4635, pValue = 5.95829e-19
4. Adding Distance, FStat = 22.2148, pValue = 2.4775e-06
5. Adding ArrDelay:Distance, FStat = 17.923, pValue = 2.325e-05
6. Adding DayOfWeek, FStat = 3.1389, pValue = 0.0044851
7. Adding DayOfWeek:ArrDelay, FStat = 29.9733, pValue = 9.17715e-36
8. Adding DayOfWeek:DepTime, FStat = 3.0534, pValue = 0.0055137

sModel =

Linear regression model:
    DepDelay ~ [Linear formula with 9 terms in 4 predictors]

Estimated Coefficients:
                                  Estimate          SE          tStat       pValue
                                 ___________    __________    ________    __________

    (Intercept)                      0.15008          1.13     0.13281       0.89435
    DayOfWeek_Mon                    0.72667        1.5427     0.47102       0.63764
    DayOfWeek_Tues                   -3.3907        1.5663     -2.1648      0.030428
    DayOfWeek_Wed                   -0.31631         1.541    -0.20526       0.83737
    DayOfWeek_Thu                    -4.0158        1.5493      -2.592     0.0095596
    DayOfWeek_Fri                   -0.81587        1.6465    -0.49552       0.62025
    DayOfWeek_Sat                   -0.16604        1.6284    -0.10196       0.91879
    DepTime                       0.00057715    0.00077477     0.74493       0.45633
    ArrDelay                         0.79802      0.013036      61.214             0
    Distance                       0.0013747    0.00025517      5.3875     7.339e-08
    DayOfWeek_Mon:DepTime         0.00016891     0.0010783     0.15664       0.87553
    DayOfWeek_Tues:DepTime         0.0026993     0.0011036      2.4459      0.014468
    DayOfWeek_Wed:DepTime         0.00046082     0.0010937     0.42135       0.67351
    DayOfWeek_Thu:DepTime          0.0034279     0.0010822      3.1674     0.0015436
    DayOfWeek_Fri:DepTime          0.0021192     0.0011695      1.8121      0.070006
    DayOfWeek_Sat:DepTime          0.0009044     0.0011414     0.79233       0.42819
    DayOfWeek_Mon:ArrDelay          -0.15255      0.013693     -11.141    1.2584e-28
    DayOfWeek_Tues:ArrDelay         -0.12582       0.01547     -8.1334    4.7767e-16
    DayOfWeek_Wed:ArrDelay          -0.12278      0.012798      -9.594    1.0971e-21
    DayOfWeek_Thu:ArrDelay          -0.11435      0.012723     -8.9876     3.085e-19
    DayOfWeek_Fri:ArrDelay         -0.097063      0.015136     -6.4128    1.5071e-10
    DayOfWeek_Sat:ArrDelay         -0.095672      0.014019     -6.8242    9.4604e-12
    DepTime:ArrDelay              8.3134e-05    7.2183e-06      11.517    1.8403e-30
    ArrDelay:Distance            -1.7599e-05    6.3038e-06     -2.7918     0.0052537

Number of observations: 8332, Error degrees of freedom: 8308
Root Mean Squared Error: 12.8
R-squared: 0.85,  Adjusted R-Squared 0.849
F-statistic vs. constant model: 2.04e+03, p-value = 0
The model that results from the stepwise fit includes interaction terms.
Now try to fit a model for the tall data by using fitlm with the formula returned by stepwiselm.
model3 = fitlm(dataTrain,sModel.Formula)
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 2: Completed in 1.3 sec
- Pass 2 of 2: Completed in 1.7 sec
Evaluation completed in 3.7 sec

model3 =

Compact linear regression model:
    DepDelay ~ [Linear formula with 9 terms in 4 predictors]

Estimated Coefficients:
                                   Estimate           SE          tStat        pValue
                                 ____________    __________    ________    ___________

    (Intercept)                       0.28541       0.96249     0.29653        0.76683
    DayOfWeek_Mon                     -0.1619        1.3294    -0.12178        0.90307
    DayOfWeek_Tues                   -0.55196        1.3384    -0.41239        0.68006
    DayOfWeek_Wed                     -0.9566        1.3351    -0.71652        0.47368
    DayOfWeek_Thu                     -5.3787        1.3403     -4.0132     6.0176e-05
    DayOfWeek_Fri                      1.3064        1.3843     0.94376        0.34531
    DayOfWeek_Sat                     0.28392        1.3768     0.20621        0.83663
    DepTime                         0.0010716    0.00067246      1.5935        0.11107
    ArrDelay                          0.57822      0.015753      36.706    1.0449e-283
    Distance                        0.0015624    0.00021365      7.3129     2.7336e-13
    DayOfWeek_Mon:DepTime         -0.00025604    0.00094276    -0.27158        0.78595
    DayOfWeek_Tues:DepTime         0.00012529    0.00094869     0.13206        0.89494
    DayOfWeek_Wed:DepTime           0.0001429    0.00094618     0.15102        0.87996
    DayOfWeek_Thu:DepTime           0.0044608    0.00094871      4.7019     2.5978e-06
    DayOfWeek_Fri:DepTime         -0.00077639    0.00098963    -0.78452        0.43275
    DayOfWeek_Sat:DepTime         -0.00029618    0.00096815    -0.30593        0.75966
    DayOfWeek_Mon:ArrDelay           0.029361      0.014033      2.0922       0.036432
    DayOfWeek_Tues:ArrDelay          0.022535      0.014722      1.5307        0.12585
    DayOfWeek_Wed:ArrDelay           0.041203      0.013489      3.0547      0.0022566
    DayOfWeek_Thu:ArrDelay           -0.10837      0.013343     -8.1224      4.889e-16
    DayOfWeek_Fri:ArrDelay            0.16854      0.013793      12.219     3.4454e-34
    DayOfWeek_Sat:ArrDelay           -0.02333      0.014401       -1.62        0.10524
    DepTime:ArrDelay               0.00017568     6.569e-06      26.743    2.6449e-154
    ArrDelay:Distance             -9.2387e-05    6.3825e-06     -14.475     3.3714e-47

Number of observations: 16668, Error degrees of freedom: 16644
Root Mean Squared Error: 15.6
R-squared: 0.759,  Adjusted R-Squared 0.759
F-statistic vs. constant model: 2.28e+03, p-value = 0
You can repeat this process to continue to adjust the linear model. However, in this case, you should explore different types of regression that might be more appropriate for this data. For example, if you do not want to include the arrival delay, then this type of linear model is no longer appropriate. See Logistic Regression with Tall Arrays for more information.
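As a hedged illustration of that direction, the following sketch fits a logistic regression model for whether a flight departs more than 10 minutes late. The threshold, the choice of predictors, and the variable name Late are assumptions made for this sketch, not part of the analysis above.

tt.Late = tt.DepDelay > 10;        % assumed binary response: departed more than 10 minutes late
lateModel = fitglm(tt,'Late ~ DepTime + Distance + DayOfWeek', ...
    'Distribution','binomial','Link','logit')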
A key capability of tall arrays in MATLAB and Statistics and Machine Learning Toolbox is the connectivity to platforms such as Hadoop and Spark. You can even compile the code and run it on Spark using MATLAB Compiler™. See Extend Tall Arrays with Other Products (MATLAB) for more information about using these products:
Database Toolbox™
Parallel Computing Toolbox™
MATLAB® Distributed Computing Server™
MATLAB Compiler™
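As a hedged sketch of that kind of scale-up, the following shows how the same tall workflow might point at data stored on a Hadoop cluster. The install folder and HDFS path are placeholders, and this assumes Parallel Computing Toolbox and access to a configured cluster.

setenv('HADOOP_HOME','/path/to/hadoop/install');   % placeholder install location
cluster = parallel.cluster.Hadoop;                 % uses HADOOP_HOME to locate the cluster
mapreducer(cluster);                               % execute tall calculations on the cluster
ds = datastore('hdfs:///airline/*.csv');           % placeholder HDFS location of the data
tt = tall(ds);                                     % same tall workflow as on the desktop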