How to read multiple huge text files, the fastest way?
    Anand Uthaman
 on 18 Mar 2011
    Answered: bim
 on 25 Dec 2022
            Hi All,
I am quite new to MATLAB, so apologies if this is a naive question. I would appreciate your help with the problem described below.
I have around 10,000 input text files to read and process in MATLAB. Each file contains only numerical data and is around 12-15 MB, so the total input size is around 125-150 GB.
First, I tried using fgetl() to read each file line by line, but that took very long. So I changed the input file format to a set of numbers separated by white space and used fscanf() to read each file into a matrix of size [1 inf]. It still takes a couple of hours to read all 10,000 files.
I have also tried a parfor loop, running the code in a matlabpool of size 8 (the system is a Linux server with 4 processors, each dual core). Even then, it takes more than 2 hours to read all the files.
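For readers following along, the approach described above can be sketched roughly as follows. This is only an illustration under assumptions: the folder, the file names, and the per-file "processing" (a simple sum) are hypothetical stand-ins, and the demo writes a few tiny files so that it runs on its own. Note that matlabpool has since been replaced by parpool; without an open pool or the Parallel Computing Toolbox, parfor simply runs as an ordinary loop.

```matlab
% Sketch: read whitespace-separated numeric text files inside a parfor loop.
% Demo data: a few small files in a temporary folder (hypothetical names).
dataDir = tempname;
mkdir(dataDir);
nFiles = 4;
for k = 1:nFiles
    fid = fopen(fullfile(dataDir, sprintf('input_%03d.txt', k)), 'wt');
    fprintf(fid, '%g ', (1:5) * k);
    fclose(fid);
end

files = dir(fullfile(dataDir, '*.txt'));
totals = zeros(1, numel(files));
parfor k = 1:numel(files)
    fid = fopen(fullfile(dataDir, files(k).name), 'rt');
    values = fscanf(fid, '%f', [1 Inf]);   % one row vector per file
    fclose(fid);
    totals(k) = sum(values);               % placeholder for the real processing
end
```

If all workers read from the same disk, the loop may be limited by disk throughput rather than by CPU, in which case adding workers would not be expected to help much.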
Could anyone let me know the fastest way to read this much data in MATLAB? My requirement is to read all of it (125-150 GB) in a couple of minutes.
Note: I can change the format of the input text files to achieve the fastest possible read. However, I would like to read the inputs as numbers only (not strings), because str2double() takes a long time during processing.
Thanks a million in advance. I look forward to your expert advice.
Warm Regards
Anand Uthaman
7 comments
  Jason Ross
    
 on 18 Mar 2011
    The way to tell whether you are swapping is to watch something like "top". It seems that you might want to look somewhere less than 10,000 and more than 100 to see if you can do better :)
Accepted Answer
  Jeremy Johnson
      
 on 18 Mar 2011
        If you have total control over the file format, storing the data in a binary file format would make reading the data out of the file much faster.
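As a rough sketch of what this could look like (not necessarily what was intended above): convert each text file to raw binary once with fwrite, then load it with fread on every later run. The file names and the random demo data are hypothetical, and the sketch assumes the binary files are written and read on the same machine, so byte order is not a concern.

```matlab
% Sketch: convert a numeric text file to raw binary once, then read the binary.
txtFile = [tempname '.txt'];      % hypothetical file names
binFile = [tempname '.bin'];
original = rand(1, 1000);

% Stand-in for one existing text input file.
fid = fopen(txtFile, 'wt');
fprintf(fid, '%.17g ', original);
fclose(fid);

% One-time conversion: parse the text, write the values as 8-byte doubles.
fid = fopen(txtFile, 'rt');
values = fscanf(fid, '%f', [1 Inf]);
fclose(fid);
fid = fopen(binFile, 'w');
fwrite(fid, values, 'double');
fclose(fid);

% Every later run: no text parsing, just a bulk read into memory.
fid = fopen(binFile, 'r');
fromBinary = fread(fid, [1 Inf], 'double');
fclose(fid);
```

The binary read skips number parsing entirely, which is usually where most of the time goes with text input. If the data does not need full double precision, writing it as 'single' would halve the file size, at the cost of precision.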
6 comments
More Answers (2)
  Matt Tearle
    
 on 18 Mar 2011
        If you have to read it as ASCII, your best option is textscan, which will read directly into whatever numeric format you specify (%f for double, %d for integer, etc.).
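A minimal sketch of the textscan route, assuming a whitespace-delimited file with one integer column followed by two floating-point columns (the file name and layout are made up for the demo):

```matlab
% Sketch: read a whitespace-delimited numeric file with textscan.
txtFile = [tempname '.txt'];      % hypothetical file name
fid = fopen(txtFile, 'wt');
fprintf(fid, '%d %f %f\n', [1 0.5 2.5; 2 1.5 3.5; 3 2.5 4.5].');
fclose(fid);

fid = fopen(txtFile, 'rt');
cols = textscan(fid, '%d %f %f');   % one cell per format specifier
fclose(fid);

ids = cols{1};              % %d yields an int32 column vector
data = [cols{2} cols{3}];   % %f yields double column vectors
```

Each format specifier produces one cell holding a column vector of the matching type, so no string-to-number conversion is needed afterwards.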
  bim
 on 25 Dec 2022
        I have been using importdata for text files, but it is very slow for text unless you rename all the files to '.txt'.
The function below seems to do the job for structured text files.
The structure the function can handle is shown at the bottom of the function; it only works for tables of floating-point numbers.
Let me know whether this works well.
%% READTEXTFILE reads text files without any checks
%  READTEXTFILE reads from file and immediately  filters out the selectColumns
%  READTEXTFILE can read any number of headerlines, but
%  the headerlines must contain both 'nRows=a number' and 'nColumns=a number' on separate lines
%  the lines containing nRows and nColumns must not contain any spaces
%
%  parameter filename = if the selected file is not a text file, the function will fail
%  the extension of the filename is ignored and does not need to be present
%  parameter selectedcolumns = header names of columns to be selected from the file
%  e.g., selectedcolumns = {'time', 'column_3'}
%
%  The data read is returned in a struct content
%  content.data = the actual data as a matrix
%  content.colheaders = the headers of the remaining columns
%  content.colheaders == selectedcolumns
%  content.textdata == content.colheaders
%  
function content=readtextfile(varargin) % filename,selectedcolumns
    tic
    selectedcolumns={};
    if nargin>2 || nargin ==0
        error('readtextfile: too many or too few arguments');
    elseif nargin ==2
        selectedcolumns=varargin{2};
    end
    filename=varargin{1};
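    fid = fopen(filename,'rt');
    if fid == -1
        % fopen returns -1 when the file cannot be opened
        error('readtextfile: cannot open file %s', filename);
    end
    file.title = fgetl(fid);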
    file.nrows=string([]);
    file.ncolumns=string([]);
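    line = fgetl(fid);
    % fgetl returns -1 (not char) at end of file, so stop there instead of looping forever
    while ischar(line) && ~strcmp(line,'endheader')
        if isempty(file.nrows)
            file.nrows=regexp(line,'^nRows=(?<nrows>\d+)$','tokens','once');
        end
        if isempty(file.ncolumns)
            file.ncolumns=regexp(line,'^nColumns=(?<ncolumns>\d+)$','tokens','once');
        end
        line = fgetl(fid);
    end
    if ~ischar(line) || isempty(file.nrows) || isempty(file.ncolumns)
        fclose(fid);
        error('readtextfile: header must contain nRows=, nColumns= and endheader lines');
    end
    file.nrows=str2double(file.nrows);
    file.ncolumns=str2double(file.ncolumns);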
    fsColHeaders = repmat([' %s'],1,file.ncolumns);
    colHeaders = textscan(fid,fsColHeaders,1,'EndOfLine','\r\n','MultipleDelimsAsOne',1); % 3rd param (N) == 1 --> read once
    fsData = repmat([' %f'],1,file.ncolumns);
    fileData = textscan(fid,fsData,'EndOfLine','\r\n','MultipleDelimsAsOne',1); % 3rd param (N) missing --> read until end of file
    colHeaders =cellfun(@char,colHeaders,'UniformOutput',false);
    [~,copiedColumns] = ismember(selectedcolumns,colHeaders);
    if ~isempty(copiedColumns)
        % drop requested names that are not present in the file
        copiedColumns = copiedColumns(copiedColumns>0);
        newMatrix = zeros(file.nrows,numel(copiedColumns));
        % newHeaders is not necessary, since it corresponds to selectedcolumns
        % but it is helpful in checking proper operation of the function
        newHeaders = cell(1,numel(copiedColumns));
        for iNewColumns = 1:numel(copiedColumns)
            newMatrix(:,iNewColumns) = fileData{copiedColumns(iNewColumns)};
            newHeaders(iNewColumns) = colHeaders(copiedColumns(iNewColumns));
        end
    else
         newMatrix = cell2mat(fileData);
         newHeaders = colHeaders;
    end
    fclose(fid);
    content.data = newMatrix;
    content.textdata = newHeaders;
    content.colheaders = newHeaders;
    toc
end
% example of possible headerlines
%{
the title
nRows=437
nColumns=17
any number of lines
endheader
time	column_1    column_2    column_3 ...
0.001   0.1234      0.3456      0.7891
0.002   0.2234      0.4456      0.8891
%}