Error reading data from combined datastore
Dear all,
I'm working on a bioinformatics script and am trying to read in two large sequence (.fastq) text files and process them in parallel. Each entry in File 1 has a corresponding entry in File 2 and they need to be processed together. See attached for an example set of files (renamed .txt so they can be uploaded here).
To do this, I first created two separate datastores, one with each file, as follows:
ds_R1 = tabularTextDatastore(reads_folder_R1,"FileExtensions",[".fastq",".fastq.gz"]);
ds_R2 = tabularTextDatastore(reads_folder_R2,"FileExtensions",[".fastq",".fastq.gz"]);
I also set ReadSize on both datastores to pull up to 100,000 rows at a time:
ds_R1.ReadSize = 100000;
ds_R2.ReadSize = 100000;
To ensure concurrent handling of each file pair, I combined the two datastores (each containing one file in the pair as above):
% Combine the two datastores so that each read returns paired rows
ds_R1_R2 = combine(ds_R1, ds_R2);
I created a while loop to pull data from the combined datastore ds_R1_R2 into a cell array 'reads', do operations on that cell array, and write the output to file.
while hasdata(ds_R1_R2)
    [reads, info] = read(ds_R1_R2);
    % Convert reads table to cell array
    reads = table2cell(reads);
    reads_R1 = reads(:,1);
    reads_R2 = reads(:,2);
    % do stuff to reads_R1 and reads_R2
end
I tested this code out and it works fine for a number of iterations of the while loop. However, it always fails with the following error message after the same number of iterations for a given pair of files (the exact iteration depends on which file pair it is processing).
Error using matlab.io.datastore.CombinedDatastore/read (line 144)
All tables being horizontally concatenated must have the same number of rows.
I've checked and confirmed that the number of lines in each file is exactly the same. The error is also thrown pretty early on and there is plenty of data remaining, so it's not because the end of the files is approached.
I'm quite puzzled and would greatly appreciate any input.
Thanks!
Kartik
2 comments
Walter Roberson
on 12 Sep 2022
If you set dbstop if caught error, run until the error, save() the values to a file, then dbquit, restore the file contents, and try the [] operation manually, does it succeed?
I suspect that there is a try/catch, and I wonder if a different error is being caught but reported as if it were a problem with a different number of rows.
Answers (1)
Walter Roberson
on 12 Sep 2022
ReadSize defines a maximum number of rows to read at one time; the datastore is permitted to return fewer rows. In particular, it has some kind of internal buffer and avoids overfilling that buffer. If two datastores have substantially different numbers of columns (or different widths for each column), the buffer could fill up after fewer rows for the datastore that has more (or wider) columns.
You could reduce the ReadSize to the point where each chunk of the wider datastore fits within the buffer.
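One way to check whether the two datastores really return different chunk sizes is to read them separately, in lockstep, and compare table heights. The following is only a sketch: the two temporary CSV files, the names dsA and dsB, and the ReadSize of 100 are hypothetical stand-ins for the real read files and settings.

```matlab
% Hypothetical demo data: two small CSV files with the same number of rows
% but very different row widths (stand-ins for the paired read files).
fileA = [tempname '.csv'];
fileB = [tempname '.csv'];
nRows = 1000;
writetable(table((1:nRows)', 'VariableNames', {'id'}), fileA);
writetable(array2table(rand(nRows, 20)), fileB);

dsA = tabularTextDatastore(fileA);
dsB = tabularTextDatastore(fileB);

% Use the same, deliberately small ReadSize on both datastores.
dsA.ReadSize = 100;
dsB.ReadSize = 100;

% Read both datastores in lockstep and stop at the first chunk whose
% row counts differ.
chunk = 0;
mismatchAt = 0;   % stays 0 if every chunk pair has matching heights
while hasdata(dsA) && hasdata(dsB)
    chunk = chunk + 1;
    tA = read(dsA);
    tB = read(dsB);
    if height(tA) ~= height(tB)
        mismatchAt = chunk;
        fprintf('Chunk %d: %d rows vs %d rows\n', chunk, height(tA), height(tB));
        break
    end
end

delete(fileA);
delete(fileB);
```

If mismatchAt is nonzero for the real files, lowering ReadSize and rerunning the check shows whether the smaller chunks stay aligned.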
The size of the buffer does not appear to be documented.
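Because the buffer size is not documented, an alternative that does not rely on it is to skip combine() and read the two datastores separately, carrying any unpaired rows over to the next iteration. This is a different technique from the ReadSize reduction suggested above, shown only as a sketch; the demo files and the names dsA, dsB, pendingA, and pendingB are hypothetical.

```matlab
% Hypothetical demo data standing in for the two paired read files.
fileA = [tempname '.csv'];
fileB = [tempname '.csv'];
nRows = 1000;
writetable(table((1:nRows)', 'VariableNames', {'id'}), fileA);
writetable(array2table(rand(nRows, 20)), fileB);

dsA = tabularTextDatastore(fileA);
dsB = tabularTextDatastore(fileB);
dsA.ReadSize = 100;
dsB.ReadSize = 100;

% Rows that have been read but not yet paired with a partner row.
pendingA = read(dsA);
pendingB = read(dsB);
nPaired = 0;

while true
    % Pair up only as many rows as both sides currently have.
    n = min(height(pendingA), height(pendingB));
    pairA = pendingA(1:n, :);
    pairB = pendingB(1:n, :);

    % ... process pairA and pairB together here ...
    nPaired = nPaired + n;

    % Keep the unpaired remainder for the next iteration.
    pendingA = pendingA(n+1:end, :);
    pendingB = pendingB(n+1:end, :);

    if ~hasdata(dsA) && ~hasdata(dsB)
        break
    end
    if hasdata(dsA)
        pendingA = [pendingA; read(dsA)];
    end
    if hasdata(dsB)
        pendingB = [pendingB; read(dsB)];
    end
end

delete(fileA);
delete(fileB);
```

With this pattern, a short read from one datastore only delays pairing of the surplus rows from the other; it no longer raises a concatenation error. Any rows left in pendingA or pendingB after the loop indicate that the files did not contain the same number of rows.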