How to save a tall array / table to a text or csv file?

37 vues (au cours des 30 derniers jours)
Benjamin Imbach
Benjamin Imbach le 5 Sep 2017
Modifié(e) : Adam Filion le 1 Oct 2018
I'm trying to save the contents of a tall table that doesn't fit into memory to a .txt file.
MATLAB provides the function write. However, it can only write the table contents to .mat files. So far I haven't found an option or another function that could write the data to a text file.
A workaround I'm trying to do is to continously gather a part of the tall table, save it to a text file, gather the next part etc. until the table end is reached. For that to work in tall syntax I suppose I need a vector containing the row numbers of the tall table. That way I could find the index of the rows I want to gather and write with:
idx = (RowNumbers >= lowerLimit) & (RowNumbers <= upperLimit);
With the index vector it is then possible to gather the rows I want to save to the text file:
TableToSave = gather(Table(idx,:));
once the data is gathered the table could be saved with the writetable function. After that, the lowerLimit and upperLimit could be adjusted and a new chunk of the table could be saved.
The point where I'm failing is the construction of this vector containing the row numbers. In theory it's simply
RowNumbers = 1:1:size(Table,1); // Or: RowNumbers = 1:1:gather(size(Table,1));
The first one doesn't work because the 1:X syntax doesn't support 'tall doubles' and the second approach doesn't work I suppose because the resulting RowNumbers and index vector are completely in-memory while the table is not.
So if I try to
idx = tall((RowNumbers >= lowerLimit) & (RowNumbers <= upperLimit));
and use
gather(head(DB(idx,:)));
The following error appears:
Incompatible tall array arguments. The first dimension in each tall array must have the same size, and each tall array must be based on the same datastore.
To sum up:
1. Is there another way to save tall arrays / tables to text files?
2. How to create an "unevaluated" row number array that then could be used for the described workaround?
Thanks a lot!

Réponse acceptée

Edric Ellis
Edric Ellis le 7 Sep 2017
I think the least inefficient method is probably to combine use of tall/write with writetable. Calling gather repeatedly is going to be inefficient - the approach below takes 2 passes over the data (one of which is in the optimised .mat form, so should be quicker). Here's the sort of thing I mean.
%%Step 1 - create a tall table
varnames = {'ArrDelay', 'DepDelay', 'Origin', 'Dest'};
ds1 = datastore('airlinesmall.csv', 'TreatAsMissing', 'NA', ...
'SelectedVariableNames', varnames);
tt = tall(ds1);
%%Step 2 - operate on tall table
tt.TotalDelay = tt.ArrDelay + tt.DepDelay;
%%Step 3 - use tall/write to emit .mat files
writeDir = tempname
mkdir(writeDir);
write(writeDir, tt);
%%Step 4 - iteratively convert the tall/write output to CSV
ds = datastore(writeDir);
csvDir = tempname
mkdir(csvDir);
idx = 0;
while hasdata(ds)
idx = 1 + idx;
fname = fullfile(csvDir, sprintf('out_%06d.csv', idx));
writetable(read(ds), fname);
end
A refinement of this would be to partition the datastore and operate on it in parallel, using the techniques described here in the documentation.
  2 commentaires
Edric Ellis
Edric Ellis le 7 Sep 2017
Here's the parfor version of Step 4.
%%Step 5 - use parfor to parallelise the writetable loop
ds = datastore(writeDir);
N = numpartitions(ds, gcp);
csvDir2 = tempname
mkdir(csvDir2);
parfor idx1 = 1 : N
idx2 = 0;
subds = partition(ds, idx1, N);
while hasdata(subds)
idx2 = 1 + idx2;
fname = fullfile(csvDir2, sprintf('out_%06d_%06d.csv', idx1, idx2));
writetable(read(subds), fname);
end
end
Benjamin Imbach
Benjamin Imbach le 8 Sep 2017
Modifié(e) : Benjamin Imbach le 8 Sep 2017
Thanks! This is what I was looking for. I still wish there was a direct way to write the tall table to a csv but until then this will do!
A small correction:
subds = partition(ds, idx1, N);
should be
subds = partition(ds, N, idx1);
In your case it worked since the datastore is probably so small that N = 1.
Thanks again!

Connectez-vous pour commenter.

Plus de réponses (1)

Adam Filion
Adam Filion le 1 Oct 2018
Modifié(e) : Adam Filion le 1 Oct 2018
As of R2018b the tall write command now directly supports writing tall arrays to .txt files (also .csv, .xls* and custom formats) in addition to .mat:

Catégories

En savoir plus sur Live Scripts and Functions dans Help Center et File Exchange

Produits

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by