Reading columns from a poorly-formatted text file

Question

dormant le 29 Août 2024

0
Lien

Utiliser le lien direct vers cette question

https://fr.mathworks.com/matlabcentral/answers/2148769-reading-columns-from-a-poorly-formatted-text-file

Commenté : Voss le 29 Août 2024

COUNT.txt

I need to read data from an old text file which has fixed columns for the data. I have tried using readtable, which was partly succesfull but seems to have been fooled by the data in places.

A shortened version of the file is attached. It has the following 'features' :

All columns are not always there
Sometimes there is no space between the columns
It uses -998 or -998.0 for NaN;
There are blank lines.

Is there an easy way to read this file, or do I need to be old-fashioned and read it line-by-line and then parse for the values?

0 commentaires
Afficher -2 commentaires plus anciensMasquer -2 commentaires plus anciens

Connectez-vous pour commenter.

Connectez-vous pour répondre à cette question.

Answer 1

Voss le 29 Août 2024

0
Lien

Utiliser le lien direct vers cette réponse

https://fr.mathworks.com/matlabcentral/answers/2148769-reading-columns-from-a-poorly-formatted-text-file#answer_1507324

Modifié(e) : Voss le 29 Août 2024

Ouvrir dans MATLAB Online

COUNT.txt

This file looks like a fixed-width file, but it's not quite, because lines 68 and 69 differ in format from the other lines.

dbtype COUNT.txt 65:70
  29 JUN 97 00:00   0   01109  40
  30 JUN 97 00:00   0   0  49  17
  04 SEP 97 00:00   4   1   0   8
  05 SEP 97 00:00   0   16  5  68 
  06 SEP 97 00:00   0   15  1  15
  07 SEP 97 00:00

For example, if you use fixed field sizes to parse line 65 like this

% 29 JUN 97 00:00 0 01109 40

% ^^^^^^^^^^^^^^^ 29 JUN 97 00:00

% ^^^^ 0

% ^^^^ 1109

% ^^^^ 40

Then the same parsing for line 68 gives this:

% 05 SEP 97 00:00 0 16 5 68

% ^^^^^^^^^^^^^^^ 05 SEP 97 00:00

% ^^^^ 0

% ^^^^ 1

% ^^^^ 6 5

% ^^^^ 68

Notice the 16 gets split apart: that field becomes 1 and the 6 is included in the next field, which becomes '6 5' (which will be interpreted as NaN).

On the off chance that lines 68 and 69 were inadverdently modified during the shortening process and the original file doesn't have this problem, you can use some code like this to read it:

opts = fixedWidthImportOptions( ...
    'VariableNamesLine',3, ...
    'DataLines',5, ...
    'NumVariables',7, ...
    'VariableWidths',[15 4 4 4 4 4 6], ...
    'VariableTypes',{'datetime','double','double','double','double','double','double'});
opts = setvaropts(opts,1,'InputFormat','dd MMM yy HH:mm','DatetimeFormat','preserveinput');
T = readtable('COUNT.txt',opts);
T = standardizeMissing(T,-998);
disp(T)
         Var1          VT     LP     HYB     RF     PLK    Var7
    _______________    ___    ___    ____    ___    ___    ____

Jan 96 00:00      4     12       0      0      0    NaN 
Jan 96 00:00      1      9       1      0      0    NaN 
Jan 96 00:00      1      6       0      0      0    NaN 
Jan 96 00:00      0      2       0      0      0    NaN 
Jan 96 00:00      1      0       0      0      0    NaN 
Jan 96 00:00      1      0       0      0      0    NaN 
Jan 96 00:00      0      2       0      0      0    NaN 
Jan 96 00:00      1      3       0      0      0    NaN 
Mar 96 00:00      2     37       1     10      0    NaN 
Mar 96 00:00      1     20       0     11      0    NaN 
Mar 96 00:00      2      5       3     17      0    NaN 
Mar 96 00:00      5      1       1     14      2    NaN 
Mar 96 00:00      0      5      10     20      0    NaN 
Mar 96 00:00      0      6      25     18      0    NaN 
Mar 96 00:00      2      0       3     10      0    NaN 
Mar 96 00:00      0      3       9     13      0    NaN 
Mar 96 00:00      0     15       1     28      0    NaN 
Mar 96 00:00      2     22       0     30      0    NaN 
Mar 96 00:00      0      7      27     32      0    NaN 
Mar 96 00:00      1     13     152     39      0    NaN 
Mar 96 00:00      3      1      40     33      0    NaN 
Mar 96 00:00      0      1      21     27      0    NaN 
Mar 96 00:00      1      2      23     24      0    NaN 
Mar 96 00:00      0      0      13     46      0    NaN 
Mar 96 00:00      2      3       0     45      0    NaN 
Apr 96 00:00      2      0       6     31      0    NaN 
Apr 96 00:00      3      5       8     38      0    NaN 
Apr 96 00:00    NaN    NaN     NaN    NaN      0    NaN 
Apr 96 00:00    NaN    NaN     NaN    NaN      0    NaN 
Apr 96 00:00      0      1     138     37      0      1 
Apr 96 00:00      0      4      21    110      0      1 
Apr 96 00:00      0      2      56     53      0      1 
Apr 96 00:00      4      4     132     58      0      2 
Apr 96 00:00      1      0      59     61      0      3 
Apr 96 00:00      0      0     720     25      0      1 
Apr 96 00:00      0      0     943      0      0      1 
Apr 96 00:00      0      0    1123      0      0      1 
Jul 96 00:00      5      3       9     47     48    NaN 
Jul 96 00:00     13      4       8     85     68    NaN 
Jul 96 00:00     23      3      34    115     19    NaN 
Jul 96 00:00     11      0     104     58     32    NaN 
Nov 96 00:00     28      2       0      0      0    NaN 
Nov 96 00:00    117      2       0      0      0    NaN 
Nov 96 00:00     64      1       0      3      0    NaN 
Nov 96 00:00      5      0       0      5      0    NaN 
Nov 96 00:00      0      0       0      5      0    NaN 
Nov 96 00:00     41      2       0      4      0    NaN 
Nov 96 00:00      1      4       0      2      0    NaN 
Nov 96 00:00      1      1       0      1    NaN    NaN 
Nov 96 00:00      1      1       0      1    NaN    NaN 
Mar 97 00:00     29      0      88     24    NaN    NaN 
Mar 97 00:00      0      1      12     15    NaN    NaN 
Mar 97 00:00      8      3      24     19    NaN    NaN 
Mar 97 00:00     37      4      50     22    NaN    NaN 
Jun 97 00:00      9      4    1373     22    NaN    NaN 
Jun 97 00:00      5      0     378     10    NaN    NaN 
Jun 97 00:00      4      0     102     17    NaN    NaN 
Jun 97 00:00      8      0      22     18    NaN    NaN 
Jun 97 00:00     17      0      70     23    NaN    NaN 
Jun 97 00:00     12      0     106     54    NaN    NaN 
Jun 97 00:00      0      0    1109     40    NaN    NaN 
Jun 97 00:00      0      0      49     17    NaN    NaN 
Sep 97 00:00      4      1       0      8    NaN    NaN 
Sep 97 00:00      0      1     NaN     68    NaN    NaN 
Sep 97 00:00      0      1     NaN     15    NaN    NaN 
Sep 97 00:00    NaN    NaN     NaN    NaN    NaN    NaN 
                NaT    NaN    NaN     NaN    NaN    NaN    NaN 

But again, for this file, the data from lines 68 and 69 (05 and 06 SEP 97) is incorrect because of the misalignment of the 16 and the 15 on those respective lines:

tail(T)
         Var1          VT     LP     HYB     RF     PLK    Var7
    _______________    ___    ___    ____    ___    ___    ____

    28 Jun 97 00:00     12      0     106     54    NaN    NaN 
    29 Jun 97 00:00      0      0    1109     40    NaN    NaN 
    30 Jun 97 00:00      0      0      49     17    NaN    NaN 
    04 Sep 97 00:00      4      1       0      8    NaN    NaN 
    05 Sep 97 00:00      0      1     NaN     68    NaN    NaN 
    06 Sep 97 00:00      0      1     NaN     15    NaN    NaN 
    07 Sep 97 00:00    NaN    NaN     NaN    NaN    NaN    NaN 
                NaT    NaN    NaN     NaN    NaN    NaN    NaN 

2 commentaires
Afficher AucuneMasquer Aucune

dormant le 29 Août 2024

Many thanks.

Voss le 29 Août 2024

You're welcome! Be sure to verify that this method works on the actual file.

Connectez-vous pour commenter.

Answer 2

Steven Lord le 29 Août 2024

0
Lien

Utiliser le lien direct vers cette réponse

https://fr.mathworks.com/matlabcentral/answers/2148769-reading-columns-from-a-poorly-formatted-text-file#answer_1507314

I would experiment with the settings controllable using the Import Tool to try to import the data. You could try running Import Tool with your smaller data set, import the data, and check that it imported as you expected. Then generate a script and run it on the larger file.

The conditions you identified that could make it difficult to read are:

All columns are not always there
Sometimes there is no space between the columns
It uses -998 or -998.0 for NaN;
There are blank lines.

If you specify that the file is fixed-width I think the second condition is okay. Unimportable cell handling may take care of the first condition. I think the blank lines will be imported as a row of unimportable cells, which you could repair after the fact with the missing-handling functions like rmmissing. Once you've imported the data, you could use standardizeMissing to replace -998 with NaN.

1 commentaire
Afficher -1 commentaires plus anciensMasquer -1 commentaires plus anciens

dormant le 29 Août 2024

Thanks very much for the suggestion to use Import Tool.

Connectez-vous pour commenter.

Reading columns from a poorly-formatted text file

0 commentaires
Afficher -2 commentaires plus anciensMasquer -2 commentaires plus anciens

Réponse acceptée

2 commentaires
Afficher AucuneMasquer Aucune

Plus de réponses (1)

1 commentaire
Afficher -1 commentaires plus anciensMasquer -1 commentaires plus anciens

Voir également

Catégories

Tags

Produits

Version

Community Treasure Hunt

Reading columns from a poorly-formatted text file

0 commentaires Afficher -2 commentaires plus anciensMasquer -2 commentaires plus anciens

Réponse acceptée

2 commentaires Afficher AucuneMasquer Aucune

Plus de réponses (1)

1 commentaire Afficher -1 commentaires plus anciensMasquer -1 commentaires plus anciens

Voir également

Catégories

Tags

Produits

Version

Community Treasure Hunt

0 commentaires
Afficher -2 commentaires plus anciensMasquer -2 commentaires plus anciens

2 commentaires
Afficher AucuneMasquer Aucune

1 commentaire
Afficher -1 commentaires plus anciensMasquer -1 commentaires plus anciens