Importing large datasheet while ignoring randomly placed non-numeric data

I am trying to import a large set of data usually around 10000 rows by 200 columns. The majority of the entries are numeric data but there will be random non-numeric entries in the set. Is there a way to just exclude all non-numeric entries when importing data with textscan?
I've tried using the pattern matching exclude option for textscan but it gives me a "Undefined function or variable 'formatSpec'" error.
filename = data.us3
formatfpec = '%16f%16f%16f....%16f%f%[^\*]';
fileID = fopen(filename,'r');
dataArray = textscan(fileID, formatSpec, 'Delimiter', '', 'WhiteSpace', '', 'TextType', 'string', 'ReturnOnError', false);
fclose(fileID);
data = cell2mat(dataArray);
I'm trying to make this robust as well so non-numeric text will be removed regardless of its position in the datasheet
Any help is appreciated

Réponses (1)

dpb
dpb le 14 Sep 2018
Modifié(e) : dpb le 14 Sep 2018
dataArray=cell2mat(textscan(fileID,'','TreatAsEmpty', repmat('*',1,15),'CollectOutput',1));
Make sure my old eyes counted the correct number of asterisks...
This will return a double array instead of cell array w/ NaN for the overflow fields.
While not the issue in reading the file, this is a terrible format; some data are not zero but have as few as one significant digit while large values are missing entirely owing to field overflow.
Do you have any control at all over the process that generates these files? Having them switch to a G16.9 format would eliminate all those problems including the overflow field to have to parse.

5 commentaires

Alex Szerszen
Alex Szerszen le 14 Sep 2018
Modifié(e) : Alex Szerszen le 14 Sep 2018
Using your code as is gave me a "too many input arguments" for cell2mat. I think you misplaced a ")" after the repmat command. Moving that to after the 'collectOutput',1) does run but it truncates the data and only gives me a single column of data.
dpb
dpb le 14 Sep 2018
Modifié(e) : dpb le 14 Sep 2018
I just typed "air code" in the edit window so it's quite possible did get a misplaced punctuation.(+)
A small section of the actual file would be the only way to really debug what might be going on...the above should(*) work just as desired for a "plain vanilla" file. Can't do anything useful from an image w/o having to create a test file and then don't know that there wasn't something unseen in the file that was important.
(*) Albeit, I guess I'm not sure I've tested the empty '' format option with 'TreatAsEmpty' to see if there's an interaction there, maybe. The magic incantation of an empty format string for an all-numeric file has the benefit of textscan automagically returning the data in the proper records-by-elementsperrecord array w/o having to count fields and writing a specific, counted format string that is a real pit(he proverbial)a(ppendage).
ADDENDUM (+) Oh, I see what I did...I came back and edited the original and when did added the 'CollectOutput' named pair at the end, thus putting the closing parenthesis in the wrong place. Not that it matters why, specifically, ... :)
Oh...is the repmat count correct? That couldn't possibly be a cause of the failure maybe?
Is there any other text in the file besides the overflow indicator shown in the sample image? That would definitely cause a problem.
I'm not sure if there are any other text characters in the file. It is 10000X200 set of data and I only by chance found the overflow indicators.
I am however able to alter the format of the output file and increased the character limit to hopefully prevent the overflow indicators. A brief test has shown that this does get rid of the overflow indicators.
There must be something else in the file, then, I did a simple test case and it seems to work as expected--
>> type alex.dat
0.949370243 0.647537226 0.038262460 0.123355203 0.293973863 0.680998914 0.644282796 0.187649735 0.416803602 0.782348041
0.112291675 0.413806386 0.776160595 **** 0.388473403 0.049865826 0.088877801 0.087513561 0.175390865 0.615052207
0.657602296 0.139910171 0.710691529 0.975162104 0.152392633 0.092873134 0.227578005 0.088578619 0.707417721 0.971135525
>> fid=fopen('alex.dat','r');
>> z=cell2mat(textscan(fid,'','TreatAsEmpty',{'****'},'CollectOutput',1,'ReturnOnError',0));
>> whos z
Name Size Bytes Class Attributes
z 3x10 240 double
>> z
z =
0.9494 0.6475 0.0383 0.1234 0.2940 0.6810 0.6443 0.1876 0.4168 0.7823
0.1123 0.4138 0.7762 NaN 0.3885 0.0499 0.0889 0.0875 0.1754 0.6151
0.6576 0.1399 0.7107 0.9752 0.1524 0.0929 0.2276 0.0886 0.7074 0.9711
>> fid=fclose(fid);
>>

Connectez-vous pour commenter.

Catégories

En savoir plus sur Large Files and Big Data dans Centre d'aide et File Exchange

Commenté :

dpb
le 14 Sep 2018

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by