Importing large datasheet while ignoring randomly placed non-numeric data

Question

0 votes

image.png

I am trying to import a large set of data usually around 10000 rows by 200 columns. The majority of the entries are numeric data but there will be random non-numeric entries in the set. Is there a way to just exclude all non-numeric entries when importing data with textscan?

I've tried using the pattern matching exclude option for textscan but it gives me a "Undefined function or variable 'formatSpec'" error.

filename = data.us3
formatfpec = '%16f%16f%16f....%16f%f%[^\*]';
fileID = fopen(filename,'r');
dataArray = textscan(fileID, formatSpec, 'Delimiter', '', 'WhiteSpace', '', 'TextType', 'string',  'ReturnOnError', false);
fclose(fileID);
data = cell2mat(dataArray);

I'm trying to make this robust as well so non-numeric text will be removed regardless of its position in the datasheet

Any help is appreciated

0 commentaires
Afficher -2 commentaires plus anciens Masquer -2 commentaires plus anciens

Connectez-vous pour commenter.

Connectez-vous pour répondre à cette question.

Follow Question

Answer 1

dpb le 14 Sep 2018

Modifié(e) : dpb le 14 Sep 2018

Ouvrir dans MATLAB Online

0 votes

dataArray=cell2mat(textscan(fileID,'','TreatAsEmpty', repmat('*',1,15),'CollectOutput',1));

Make sure my old eyes counted the correct number of asterisks...

This will return a double array instead of cell array w/ NaN for the overflow fields.

While not the issue in reading the file, this is a terrible format; some data are not zero but have as few as one significant digit while large values are missing entirely owing to field overflow.

Do you have any control at all over the process that generates these files? Having them switch to a G16.9 format would eliminate all those problems including the overflow field to have to parse.

5 commentaires
Afficher 3 commentaires plus anciens Masquer 3 commentaires plus anciens

dpb le 14 Sep 2018

Modifié(e) : dpb le 14 Sep 2018

I just typed "air code" in the edit window so it's quite possible did get a misplaced punctuation.(+)

A small section of the actual file would be the only way to really debug what might be going on...the above should(*) work just as desired for a "plain vanilla" file. Can't do anything useful from an image w/o having to create a test file and then don't know that there wasn't something unseen in the file that was important.

(*) Albeit, I guess I'm not sure I've tested the empty '' format option with 'TreatAsEmpty' to see if there's an interaction there, maybe. The magic incantation of an empty format string for an all-numeric file has the benefit of textscan automagically returning the data in the proper records-by-elementsperrecord array w/o having to count fields and writing a specific, counted format string that is a real pit(he proverbial)a(ppendage).

ADDENDUM (+) Oh, I see what I did...I came back and edited the original and when did added the 'CollectOutput' named pair at the end, thus putting the closing parenthesis in the wrong place. Not that it matters why, specifically, ... :)

Alex Szerszen le 14 Sep 2018

I'm not sure if there are any other text characters in the file. It is 10000X200 set of data and I only by chance found the overflow indicators.

I am however able to alter the format of the output file and increased the character limit to hopefully prevent the overflow indicators. A brief test has shown that this does get rid of the overflow indicators.

dpb le 14 Sep 2018

Ouvrir dans MATLAB Online

There must be something else in the file, then, I did a simple test case and it seems to work as expected--

>> type alex.dat
       0.949370243     0.647537226     0.038262460     0.123355203     0.293973863     0.680998914     0.644282796     0.187649735     0.416803602     0.782348041
       0.112291675     0.413806386     0.776160595            ****     0.388473403     0.049865826     0.088877801     0.087513561     0.175390865     0.615052207
       0.657602296     0.139910171     0.710691529     0.975162104     0.152392633     0.092873134     0.227578005     0.088578619     0.707417721     0.971135525
  >> fid=fopen('alex.dat','r');
  >> z=cell2mat(textscan(fid,'','TreatAsEmpty',{'****'},'CollectOutput',1,'ReturnOnError',0));
  >> whos z
    Name      Size            Bytes  Class     Attributes
    z         3x10              240  double              
>> z
z =
    0.9494    0.6475    0.0383    0.1234    0.2940    0.6810    0.6443    0.1876    0.4168    0.7823
    0.1123    0.4138    0.7762       NaN    0.3885    0.0499    0.0889    0.0875    0.1754    0.6151
    0.6576    0.1399    0.7107    0.9752    0.1524    0.0929    0.2276    0.0886    0.7074    0.9711
>> fid=fclose(fid);
>>

Connectez-vous pour commenter.

Importing large datasheet while ignoring randomly placed non-numeric data

0 commentaires
Afficher -2 commentaires plus anciens Masquer -2 commentaires plus anciens

Réponses (1)

5 commentaires
Afficher 3 commentaires plus anciens Masquer 3 commentaires plus anciens

Catégories

Tags

Community Treasure Hunt

Importing large datasheet while ignoring randomly placed non-numeric data

0 commentaires Afficher -2 commentaires plus anciens Masquer -2 commentaires plus anciens

Réponses (1)

5 commentaires Afficher 3 commentaires plus anciens Masquer 3 commentaires plus anciens

Catégories

Tags

Voir également

Community Treasure Hunt

0 commentaires
Afficher -2 commentaires plus anciens Masquer -2 commentaires plus anciens

5 commentaires
Afficher 3 commentaires plus anciens Masquer 3 commentaires plus anciens