How can I estimate the time required by textscan and the size of the output?

Question

Alexandru on 30 Jul 2014

Commented: per isakson on 31 Jul 2014

Accepted Answer: per isakson

Hello,

I am running Matlab 2013b on Windows 7. I have 8 GB of RAM and I set the swap file to 20 GB.

I am trying to read a relatively large txt file that is tab separated. The size of the file on the hard disk is a little over 2 GB. There are 6 columns and approx. 64 million rows in the file. The entries are mixed (strings and numbers with some missing values).

At this point I am using:

textscan(fid,repmat('%s',1,6),'delimiter','\t');

It has been running for about 4 hours now, using about 6.5 GB of RAM.

1. I would like to know how I can estimate the time it takes to read the file and the size of the output.

2. After it is done I would like to extract the numerical values from the resulting cell matrix and save that to a .mat file. Any idea how long that would take?

3. Is there any better way of doing this? If I could extract from the file a matrix with the numerical values only (setting everything else to NaN) it would be great.

Thanks!

3 comments

Alexandru on 31 Jul 2014

Just from task manager in Windows...

per isakson on 31 Jul 2014

Edited: per isakson on 31 Jul 2014

Strange! Could there be something using memory that the Task Manager doesn't report on? Is your system 64-bit?

Answer 1

per isakson on 31 Jul 2014

Edited: per isakson on 31 Jul 2014

I ran an experiment with R2013a (64-bit) on Windows 7 with 8 GB of RAM, a 9 GB page file and a mechanical hard disk:

Created a file with "6 columns and approx. 64 million rows"
Read a piece of the file with textscan after restarting Matlab
Monitored the memory usage with the Windows Task Manager

    Elapsed time is 192.597879 seconds.
    >> cac{1}(1:3)
    ans = 
        'Col1'
        'Col1'
        'Col1'
    >> cac{2}(1:3)
    ans =
         1
         1
         1
    >> cac{6}(1:3)
    ans =
         3
         3
         3
    >> whos cac
      Name      Size                 Bytes  Class    Attributes
      cac       1x6             1920000672  cell

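where the code is

    % Write a test file with 6 tab-separated columns and 64 million rows
    fid = fopen('c:\tmp\test.txt', 'w');
    M   = cumsum(ones( 3, 64e6 ), 1 );   % every column of M is [1; 2; 3]
    fprintf( fid, 'Col1\t%4.1f\tCol2\t%4.1f\tCol3\t%4.1f\n', M );
    fclose( fid );
    % Time reading and parsing the first 5 million rows
    tic
    fid = fopen('c:\tmp\test.txt', 'r');
    cac = textscan( fid, '%s%f%s%f%s%f', 5e6, 'Delimiter', '\t' );
    fclose( fid );
    toc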

start of Matlab
running of experiment

Results

Reading and parsing 5 million rows took a little over three minutes (192.6 s) and peaked at 4.8 GB of RAM usage.
5 million rows produced a variable, cac, of about 1.9 GB in Matlab.
An experiment to read the entire file showed that the speed dropped drastically once there was no free physical RAM left. (I killed the process.) 8 GB of RAM would allow efficient reading of nearly ten million rows.
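
As a rough way to approach the size question: the whos output above reports 1920000672 bytes for 5 million rows, i.e. about 384 bytes per row, so 64 million rows in the same mixed format would need roughly 24.6 GB if memory use scales linearly. The sketch below repeats that kind of estimate on a small in-memory sample instead of the real file. The names sample, nSample and nRows are only illustrative, and the per-cell overhead that whos reports can differ between Matlab releases, so the result is an order-of-magnitude guide at best.

```matlab
% Build a small sample shaped like the textscan output above
% (three cellstr columns and three numeric columns).
nSample = 1e4;                            % illustrative sample size
strCol  = repmat({'Col1'}, nSample, 1);   % stands in for a %s column
numCol  = ones(nSample, 1);               % stands in for a %f column
sample  = {strCol, numCol, strCol, numCol, strCol, numCol};

% Ask whos for the in-memory size and scale it linearly.
info        = whos('sample');
bytesPerRow = info.bytes / nSample;
nRows       = 64e6;                       % row count from the question
estimateGB  = bytesPerRow * nRows / 1e9;

fprintf('about %.0f bytes per row, about %.1f GB for %g rows\n', ...
        bytesPerRow, estimateGB, nRows);
```

Most of the per-row cost comes from the cell arrays of strings, which is one reason reading every column as '%s' is so much more expensive than reading numeric columns as '%f'.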

4 comments

Alexandru on 31 Jul 2014

Edited: Alexandru on 31 Jul 2014

Thanks!

Since I don't know what characters show up in the text, I cannot skip them and I cannot collect the output. However, breaking the file into blocks and reading them as strings works great.

I never realized before that the time to read the file grows faster than linearly with the number of rows. I tested my file and the results were:

1e3 rows - 0.035627 seconds.

1e4 rows - 0.097302 seconds.

1e5 rows - 1.159013 seconds.

1e6 rows - 30.041312 seconds.
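
The block-wise approach mentioned in this comment (read a limited number of rows as strings, convert, repeat) might look roughly like the sketch below. It is not the poster's actual code: the small temporary file, the names blockRows, nCols and num, and the use of str2double for the conversion are all assumptions made for illustration. str2double returns NaN for anything that is not a number, including empty fields, which matches the "numeric values only, everything else NaN" goal from the question.

```matlab
% Write a small tab-separated stand-in for the real 2 GB file.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'abc\t23.1\t\t7\n');     % text, number, missing, number
fprintf(fid, '16\txyz\t3.5\t8\n');
fprintf(fid, '1\t2\t3\tfoo\n');
fclose(fid);

nCols     = 4;                        % 6 in the original question
blockRows = 2;                        % e.g. 1e5 or 1e6 for a real file
fmt       = repmat('%s', 1, nCols);
num       = zeros(0, nCols);          % grows block by block

fid = fopen(fname, 'r');
while ~feof(fid)
    % Read at most blockRows rows, every field as a string.
    blk = textscan(fid, fmt, blockRows, 'Delimiter', '\t');
    if isempty(blk{1})
        break                         % nothing left to read
    end
    % Join the columns into an N-by-nCols cellstr and convert;
    % non-numeric and empty fields become NaN.
    num = [num; str2double([blk{:}])]; %#ok<AGROW>
end
fclose(fid);
delete(fname);
disp(num)
```

For a file of the size in the question, each converted block could be appended to a preallocated array or saved to disk instead of growing num, so that only one block of strings is in memory at a time.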

per isakson on 31 Jul 2014

Those are surprising results! They don't make sense to me. Here is another set that doesn't make sense either.

    %s%f%s%f%s%f
    Elapsed time is 1.670506 seconds.
    Elapsed time is 0.001105 seconds.
    Elapsed time is 0.007156 seconds.
    Elapsed time is 0.066814 seconds.
    Elapsed time is 0.747540 seconds.
    Elapsed time is 13.118503 seconds.
    %s%s%s%s%s%s
    Elapsed time is 0.754963 seconds.
    Elapsed time is 0.001459 seconds.
    Elapsed time is 0.009222 seconds.
    Elapsed time is 0.090024 seconds.
    Elapsed time is 1.193596 seconds.
    Elapsed time is 36.568043 seconds.
    >>

where

    % Write the test file (6 tab-separated columns, 64 million rows)
    fid = fopen('c:\tmp\test.txt', 'w');
    M   = cumsum(ones( 3, 64e6 ), 1 );
    fprintf( fid, 'Col1\t%4.1f\tCol2\t%4.1f\tCol3\t%4.1f\n', M );
    fclose( fid );
    disp( '%s%f%s%f%s%f' )
    for jj = 1 : 6
        tic
        fid = fopen('c:\tmp\test.txt');
        cac = textscan( fid, '%s%f%s%f%s%f', 10^jj, 'Delimiter', '\t' );
        fclose( fid );
        toc
    end
    disp( '%s%s%s%s%s%s' )
    for jj = 1 : 6
        tic
        fid = fopen('c:\tmp\test.txt');
        cac = textscan( fid, '%s%s%s%s%s%s', 10^jj, 'Delimiter', '\t' );
        fclose( fid );
        toc
    end

Answer 2

dpb on 30 Jul 2014

Don't know that there is any metric to predict run time other than testing as it's so dependent upon the machine characteristics, not just size.

Two things I can think of to try --

a) Use the specific format for the data file -- strings for string, numeric for numbers. Skip ('%*s' for example to skip a string field) any fields that aren't mandatory. Use "'collectoutput',true" to gather the various types together. This will bypass a subsequent conversion step.

b) Use the feature of textscan to process the file in pieces -- say 1 to a few MB roughly per pass.
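
A minimal sketch of suggestion a), assuming the layout of the test file from the accepted answer (string and numeric columns alternating): '%*s' skips each string field and 'CollectOutput' gathers the three numeric columns into one matrix. The inline sample text txt is only for illustration; with a real file the first argument would be a file identifier from fopen.

```matlab
% Two sample rows in the same layout as the test file above.
txt = sprintf('Col1\t1.0\tCol2\t2.0\tCol3\t3.0\nCol1\t1.0\tCol2\t2.0\tCol3\t3.0');

% Skip the string fields, read the numbers, and collect them.
cac = textscan(txt, '%*s%f%*s%f%*s%f', ...
               'Delimiter', '\t', 'CollectOutput', true);
num = cac{1};      % one double matrix with one row per line
disp(num)
```

This only works when it is known in advance which columns hold text; for suggestion b), textscan takes the number of rows to read as a third argument and continues from the current file position on the next call.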

4 comments

Alexandru on 31 Jul 2014

Yes, but what I am saying is that I want the numeric part of each column.

If a column is ['char'; 23.1; 16; ] I need to extract [NaN; 23.1; 16; NaN]. Hope it's clear now...

per isakson on 31 Jul 2014

Edited: per isakson on 31 Jul 2014

Yes, if it is a reasonable number of different string constants and you know them beforehand.

    >> cac = textscan( 'char; 23.1; 16', '%f', 'Delimiter', ';' ...
                    ,  'treatAsEmpty', {'char'} );
    >> cac{:}
    ans =
           NaN
       23.1000
       16.0000

The string nan, regardless of case, is converted to NaN.
