Read and Analyze Large Tabular Text File
This example shows how to create a datastore for a large text file containing tabular data, and then read and process the data one block at a time or one file at a time.
Create a Datastore
Create a datastore from the sample file airlinesmall.csv using the tabularTextDatastore function. When you create the datastore, you can specify that the text NA in the data is treated as missing data.
ds = tabularTextDatastore('airlinesmall.csv','TreatAsMissing','NA');
After you create the datastore, you can change its behavior by setting its properties. Set the MissingValue property to specify that missing values are treated as 0.
ds.MissingValue = 0;
In this example, select the variable for the arrival delay, ArrDelay, as the variable of interest.
ds.SelectedVariableNames = 'ArrDelay';
Preview the data using the preview function. This function does not affect the state of the datastore.
data = preview(ds)
data=8×1 table
ArrDelay
________
8
8
21
13
4
59
3
11
Read Subsets of Data
By default, read returns 20000 rows at a time from a TabularTextDatastore. To read a different number of rows in each call to read, modify the ReadSize property of ds.
ds.ReadSize = 15000;
Read subsets of the data from ds using the read function in a while loop. The loop executes until hasdata(ds) returns false.
sums = [];
counts = [];
while hasdata(ds)
    T = read(ds);
    sums(end+1) = sum(T.ArrDelay);
    counts(end+1) = length(T.ArrDelay);
end
Compute the average arrival delay.
avgArrivalDelay = sum(sums)/sum(counts)
avgArrivalDelay = 6.9670
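Because MissingValue is set to 0, rows where ArrDelay was NA count as a delay of 0 in the average above. The following is a minimal sketch of one alternative, assuming you would rather exclude those rows. It builds a separate datastore that keeps the default MissingValue of NaN and skips the NaN entries while accumulating. The names dsNaN, totalDelay, numValid, and avgExcludingMissing are illustrative and not part of the original example.

```matlab
% Sketch: average arrival delay with missing values excluded rather than
% counted as 0. All variable names here are illustrative assumptions.
dsNaN = tabularTextDatastore('airlinesmall.csv','TreatAsMissing','NA');
dsNaN.SelectedVariableNames = 'ArrDelay';   % MissingValue keeps its default, NaN

totalDelay = 0;
numValid = 0;
while hasdata(dsNaN)
    T = read(dsNaN);
    valid = ~isnan(T.ArrDelay);             % logical mask of non-missing rows
    totalDelay = totalDelay + sum(T.ArrDelay(valid));
    numValid = numValid + nnz(valid);       % count only the non-missing rows
end
avgExcludingMissing = totalDelay/numValid
```

Using a second datastore leaves the read position and properties of ds untouched, so the remaining steps of this example behave the same.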
Reset the datastore to allow rereading of the data.
reset(ds)
Read One File at a Time
A datastore can contain multiple files, each with a different number of rows. You can read from the datastore one complete file at a time by setting the ReadSize property to 'file'.
ds.ReadSize = 'file';
When you change the value of ReadSize from a number to 'file' or vice versa, MATLAB® resets the datastore.
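The reset behavior can be observed with a short check. This is a sketch under the assumption that a throwaway datastore, here called dsDemo (an illustrative name), is acceptable for the experiment. It consumes one block, switches ReadSize to 'file', and then compares the start of the next read with the block already consumed.

```matlab
% Sketch: observe that changing ReadSize from a number to 'file' resets
% the datastore. dsDemo, firstBlock, and wholeFile are illustrative names.
dsDemo = tabularTextDatastore('airlinesmall.csv','TreatAsMissing','NA');
dsDemo.SelectedVariableNames = 'ArrDelay';

dsDemo.ReadSize = 15000;
firstBlock = read(dsDemo);       % consumes the first 15000 rows

dsDemo.ReadSize = 'file';        % number -> 'file' resets the read position
wholeFile = read(dsDemo);        % reads the complete file from the start

% isequaln treats NaN values (the default MissingValue) as equal.
startsOver = isequaln(firstBlock.ArrDelay, wholeFile.ArrDelay(1:15000))
```

If the datastore had not been reset, the second read would begin at row 15001 and the comparison would be expected to fail.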
Read from ds using the read function in a while loop, as before, and compute the average arrival delay.
sums = [];
counts = [];
while hasdata(ds)
    T = read(ds);
    sums(end+1) = sum(T.ArrDelay);
    counts(end+1) = length(T.ArrDelay);
end
avgArrivalDelay = sum(sums)/sum(counts)
avgArrivalDelay = 6.9670
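When the selected data fits in memory, the readall function offers a simple cross-check of the looped computation. The sketch below assumes that is the case for airlinesmall.csv and uses a separate datastore, dsAll (an illustrative name), configured the same way as ds. Its result should agree with avgArrivalDelay up to floating-point rounding.

```matlab
% Sketch: cross-check the average by reading everything at once.
% Assumes the selected data fits in memory; dsAll and avgAll are
% illustrative names.
dsAll = tabularTextDatastore('airlinesmall.csv','TreatAsMissing','NA');
dsAll.MissingValue = 0;                     % same treatment of NA as ds
dsAll.SelectedVariableNames = 'ArrDelay';

Tall = readall(dsAll);                      % one table with every row
avgAll = mean(Tall.ArrDelay)
```

For files too large for readall, the block-wise loop shown above remains the appropriate approach.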
See Also
tabularTextDatastore | tall | mapreduce