Read and Analyze Hadoop Sequence File
This example shows how to create a datastore for a sequence file containing
key-value data, and then read and process the data one block at a time. Sequence files
are outputs of mapreduce operations that use Hadoop®.
Set the appropriate environment variable to the location
where Hadoop is installed. In this case, set the MATLAB_HADOOP_INSTALL environment
variable.
setenv('MATLAB_HADOOP_INSTALL','/mypath/hadoop-folder')
hadoop-folder is the folder where Hadoop is
installed and mypath is the path to that
folder.
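Before creating the datastore, it can help to confirm that the variable is visible to the current MATLAB session. The following is a minimal sketch, not part of the original example; the variable name hadoopHome and the warning messages are illustrative assumptions.

```matlab
% Sketch: check that MATLAB_HADOOP_INSTALL is set and points to a folder.
hadoopHome = getenv('MATLAB_HADOOP_INSTALL');   % returns '' if the variable is unset

if isempty(hadoopHome)
    warning('MATLAB_HADOOP_INSTALL is not set.');
elseif exist(hadoopHome,'dir') ~= 7             % 7 means "is a folder"
    warning('MATLAB_HADOOP_INSTALL points to a missing folder: %s', hadoopHome);
else
    fprintf('Using Hadoop installation at %s\n', hadoopHome);
end
```

A value set with setenv applies only to the current MATLAB session, so a check like this would need to run after the setenv call.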
Create a datastore from the sample file, mapredout.seq,
using the datastore function. The sample file
contains unique keys representing airline carrier codes and corresponding
values that represent the number of flights operated by that carrier.
ds = datastore('mapredout.seq')
ds =
  KeyValueDatastore with properties:
       Files: {
              ' ...\matlab\toolbox\matlab\demos\mapredout.seq'
              }
    ReadSize: 1 key-value pairs
    FileType: 'seq'
datastore returns a KeyValueDatastore.
The datastore function automatically determines
the appropriate type of datastore to create.
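If you prefer not to rely on automatic detection, the datastore type can be requested explicitly. The sketch below assumes the 'Type' name-value argument with the value 'keyvalue' and the class name matlab.io.datastore.KeyValueDatastore, neither of which appears in the original example; isKV is a hypothetical variable name.

```matlab
% Sketch: request a key-value datastore explicitly instead of relying
% on automatic type detection (assumes the 'Type','keyvalue' option).
ds = datastore('mapredout.seq','Type','keyvalue');

% Confirm the class of the returned object.
isKV = isa(ds,'matlab.io.datastore.KeyValueDatastore');
```

Stating the type up front can make a script fail early if the file is not what you expect, rather than returning a different kind of datastore.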
Set the ReadSize property to six so
that each call to read reads at most six key-value
pairs.
ds.ReadSize = 6;
Read subsets of the data from ds using
the read function in a while loop.
For each subset of data, compute the sum of the values. Store the
sum for each subset in an array named sums. The while loop
executes until hasdata(ds) returns false.
sums = [];
while hasdata(ds)
    T = read(ds);
    T.Value = cell2mat(T.Value);
    sums(end+1) = sum(T.Value);
end
View the last subset of key-value pairs read.
T
T =
      Key       Value
    ________    _____
    'WN'        15931
    'XE'         2357
    'YV'          849
    'ML (1)'       69
    'PA (1)'      318
Compute the total number of flights operated by all carriers.
numflights = sum(sums)
numflights =
123523
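As a cross-check on the block-wise loop, the same total can be computed by reading every key-value pair in a single call. This is a sketch only, and it assumes the whole file fits in memory, which holds for this small sample but may not for real mapreduce output; the variable name total is illustrative.

```matlab
% Sketch: read all key-value pairs at once and sum the values.
ds = datastore('mapredout.seq');    % sample file used in this example
reset(ds);                          % make sure reading starts at the first pair
T = readall(ds);                    % table with Key and Value variables
total = sum(cell2mat(T.Value));     % should agree with numflights from the loop
```

For data that does not fit in memory, the while loop with read remains the appropriate pattern, since it holds only ReadSize pairs at a time.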
See Also
datastore | KeyValueDatastore | mapreduce | tall