How to read text files from each subfolder

Question

0 votes

Hi,

I have a main folder which contains several sub folders, now I want to read text files from each subfolder, save the data into ".xlsx" of each subfolders data by its subfolder name. For example read data from subfolder1 and save the data as "subfolder1.xlsx", and subfolder2 data as "subfolder2.xlsx".

Read text files, and extract the data as mentioned below:

1. Above the first dotted (-----------) line, extract the information: RainFallID, IINT, Rain Result, Start Time

2. Between two dotted lines(-------), the first column and 3rd column, in 3rd column if the data is mixed only keep the first part (for example 0.67 mm--> 0.67, 60.67+34e %-->60.67+34e, and if it is text like "False End"-->False End).

Please help some one kindly,

0 commentaires
Afficher -2 commentaires plus anciens Masquer -2 commentaires plus anciens

Connectez-vous pour commenter.

Connectez-vous pour répondre à cette question.

Follow Question

Answer 1

Cedric le 9 Sep 2017

Modifié(e) : Cedric le 9 Sep 2017

Ouvrir dans MATLAB Online

3 votes

Try something along this line:

 % - Define output header.
 header = {'RainFallID', 'IINT', 'Rain Result', 'Start Time', 'Param1.pipe', ...
    '10 Un Para2.pipe', 'Verti 2 mixing.dis', 'Rate.alarm times'} ;
 nHeaderCols = numel( header ) ;
 % - Build listing sub-folders of main folder.
 D_main = dir( 'Mainfolder' ) ;
 D_main = D_main(3:end) ;             % Eliminate "." and ".."
 % - Iterate through sub-folders and process.
 for dId = 1 : numel( D_main )
    % - Build listing files of sub-folder.
    D_sub = dir( fullfile( 'Mainfolder', D_main(dId).name, '*.txt' )) ;
    nFiles = numel( D_sub ) ;
    % - Prealloc output cell array.
    data = cell( nFiles, nHeaderCols ) ;
    % - Iterate through files and process.
    for fId = 1 : nFiles
        % - Read input text file.
        inLocator = fullfile( 'Mainfolder', D_main(dId).name, D_sub(fId).name ) ;
        content = fileread( inLocator ) ;
        % - Extract relevant data.
        rainfallId = str2double( regexp( content, '(?<=RainFallID\s+:\s*)\d+', 'match', 'once' )) ;
        iint       = regexp( content, '(?<=IINT\s+:\s*)\S+', 'match', 'once' ) ;
        rainResult = regexp( content, '(?<=Rain Result\s+:\s*)\S+', 'match', 'once' ) ;
        startTime  = strtrim( regexp( content, '(?<=Start Time\s+:\s*).*?(?= -)', 'match', 'once' )) ;
        param1Pipe = str2double( regexp( content, '(?<=Param1.pipe\s+[\d\.]+\s+\w+\s+)[\d\.]+', 'match', 'once' )) ;
        tenUn      = str2double( regexp( content, '(?<=10 Un Para2.pipe\s+[\d\.]+\s+\w+\s+)[\d\.]+', 'match', 'once' )) ;
        verti2     = regexp( content, '(?<=Verti 2 mixing.dis\s+\S+\s%\s+)\S+', 'match', 'once' ) ;
        rateAlarm  = strtrim( regexp( content, '(?<=Rate.alarm times\s+\S+\s+)[^\r\n]+', 'match', 'once' )) ;
        % - Populate data cell array.
        data(fId,:) = {rainfallId, iint, rainResult, startTime, ...
            param1Pipe, tenUn, verti2, rateAlarm} ;
    end
    % - Output to XLSX.
    outLocator = fullfile( 'OutputFolder', sprintf( '%s.xlsx', D_main(dId).name )) ;
    fprintf( 'Output XLSX: %s ..\n', outLocator ) ;
    xlswrite( outLocator, [header; data] ) ;
 end

Note that if you have a recent version of MATLAB, you can use the `folder` field of the struct outputed by DIR, and simplify most FULLFILE calls.

EDIT 4:09pm

Just a few extra comments. While it may look complicated, you should be fine with most of the code here. The general approach is

 Iterate through sub folders of 'Mainfolder'
      Iterate through files of sub folder
           Extract data from file and store in data array
      Export data array to relevant Excel file

The part that will likely be the most complex for you is the data extraction. One quick option for this is pattern matching using regular expressions. You can see a series of calls to REGEXP:

.. = regexp( content, pattern, option1, option2, .. )

This extract from content a string that matches the pattern. When you need to export a number we convert it to double using STR2DOUBLE. When it may capture extra white spaces we trim it using STRTRIM.

Regular expressions are a big topic, so it is normal if you don't really understand the patterns. In short,

 aAb,123 etc : are literals; they are simply matched and they don't have
               any special meaning 
 \s, \S, \d  : match a single white-space, non white-space, numeric digit respectively
 *, +        : mean zeros or more, and one or more respectively times the pattern that precedes
               \d+ hence means one or more numeric digit
 [..], [^..] : defines a set of characters (-sets) to match or not to match respectively
               [\d\s]+ hence means one or more element of either \d or \s
 (?<=..)     : defines a look behind
               (?<=hello )world matches 'world' when it is preceded by 'hello '
 (?=..)      : defines a look forward
               hello(?= world) matches 'hello' when it is followed by ' world'
 .           : matches any character. To match a the character '.', it has to be escaped with \
               .[\d\.]+ matches any character followed by one or more characters that
               are either a numeric digit or a '.'

Given this information, you can understand the pattern for extracting e.g. the value of IINT:

'(?<=IINT\s+:\s*)\S+'

which is, match

(?<=..)\S+ : one or more non white-space preceded by something

and the something is

 IINT\s+:\s*   : the literal 'IINT' followed by one or more white-spaces,
                 followed by the literal ':', followed by zero or more white-spaces

Cheers,

Cedric

22 commentaires
Afficher 20 commentaires plus anciens Masquer 20 commentaires plus anciens

Mekala balaji le 22 Sep 2017

Modifié(e) : Cedric le 22 Sep 2017

Ouvrir dans MATLAB Online

Sir,

Kindly forgive me, it really nice code one I have seen so far. One kind clarification:

1. Can I make it generalized form: like extract all parameter data in colum1 and colum3

to generalize: in each iteration, and each text file, just extract all parameter data:

Generalize the following lines:

          rainfallId = str2double( regexp( content, '(?<=RainFallID\s+:\s*)\d+', 'match', 'once' )) ;
          iint       = regexp( content, '(?<=IINT\s+:\s*)\S+', 'match', 'once' ) ;
          rainResult = regexp( content, '(?<=Rain Result\s+:\s*)\S+', 'match', 'once' ) ;
          startTime  = strtrim( regexp( content, '(?<=Start Time\s+:\s*).*?(?= -)', 'match', 'once' )) ;
          param1Pipe = str2double( regexp( content, '(?<=Param1.pipe\s+[\d\.]+\s+\w+\s+)[\d\.]+', 'match', 'once' )) ;
          tenUn      = str2double( regexp( content, '(?<=10 Un Para2.pipe\s+[\d\.]+\s+\w+\s+)[\d\.]+', 'match', 'once' )) ;
          verti2     = regexp( content, '(?<=Verti 2 mixing.dis\s+\S+\s%\s+)\S+', 'match', 'once' ) ;
          rateAlarm  = strtrim( regexp( content, '(?<=Rate.alarm times\s+\S+\s+)[^\r\n]+', 'match', 'once' )) ;

next step: filter the data according to the list:

The list:
RainFallID
IINT
Rain Result
Start Time
Param1.pipe
10 Un Para2.pipe
Verti 2 mixing.dis
Rate.alarm times

Mekala balaji le 23 Sep 2017

Modifié(e) : Cedric le 23 Sep 2017

Ouvrir dans MATLAB Online

Sir,

My main purpose is my text file size is even bigger and I have more than 100 parameters, I just want to know rather to write 100 line to extract each parameter, will there be any general code(in just a few lines) which parse all parameters, and then select the required one.

for example: Between two dotted lines(-------) Extracts the first columns and 3rd column of all existing paramters as below:

Original:
Param1.pipe                    0.56 cm       0.67 mm
10 Un Para2.pipe               15.4 mm       1.4 mm
Verti 2 mixing.dis             0.0e+1.9 %    60.67+34e %   
Rate.alarm times               yes           False End

Step1: extracts all parameters:

data_temp:
Param1.pipe            0.67
10 Un Para2.pipe       1.4
Verti 2 mixing.dis     60.67+34e
Rate.alarm times       False End
 Step2:Just extracts the required data (matching with list) the list is:
Param1.pipe
Rate.alarm times

So, final data:(may be using ismember function)

Param1.pipe            0.67
Rate.alarm times       False End

Many thanks Sir,

Cedric le 25 Sep 2017

Modifié(e) : Cedric le 25 Sep 2017

Ouvrir dans MATLAB Online

The problem is that we could match one number per row in the first version of your question, but now you want to extract multiple.

This is more complex and I cannot do the whole thing for you. I can give you hints but you will have to try programming that by yourself.

If I had to extract all parameter names and all values, plus the header, I would build some flexible code that is able to extract an Item array whatever its number of rows and columns. They are not really well formatted but I think that we can safely assume that we can use "more than one white space" as a column separator.

I can show you how to start but then you have to integrate the approach into your existing code.

Let's train with the content of one specific file:

>> content = fileread( 'RainFallReport3.txt' ) ;

we can proceed with a nest approach. First we try to target the item blocks:

>> blocks = regexp( content, 'Item.*?---.*?\s---', 'match' )

which outputs a cell array with two cells. If we look at the first we recognize the first Items block:

 >> blocks{1}
 ans =
 Item                           Open           Close       
 ----------------------------------------------------------
 Param1.pipe                    0.56 cm       0.67 mm   
 10 Un Para2.pipe               15.4 mm       1.4 mm      
 Verti 2 mixing.dis             0.0e+1.9 %    60.67+34e % 
 Rate.alarm times               yes           False End   
 ---

and if we look at blocks{2} we recognize the second Items block. Now we can develop code for extracting blocks content. First we try to split the block into a header and a content. We train with block 1:

>> parts = regexp( blocks{1}, '--+', 'split ') ;

if you look at parts{1} you will recognize the header, parts{2} the content, and parts{3} can be discarded. Now both the header string and the content string can be split using "more than one white space" as a separator. We trim them for extra white spaces first to avoid having extra empty cells:

 >> blockHeader = regexp( strtrim( parts{1} ), '\s{2,}', 'split' )
 blockHeader =
  1×4 cell array
    'Item'    'Open'    'Close'

and same for the content:

 >> blockContent = regexp( strtrim( parts{2} ), '\s{2,}', 'split' )
 blockContent =
  1×12 cell array
  Columns 1 through 6
    'Param1.pipe'    '0.56 cm'    '0.67 mm'    '10 Un Para2.pipe'    '15.4 mm'    '1.4 mm'
  Columns 7 through 12
    'Verti 2 mixing.dis'    '0.0e+1.9 %'    '60.67+34e %'    'Rate.alarm times'    'yes'    'False End'

which is a 1x12 cell array with all the entries. We can reshape it based on the relevant number of columns that we know from the header:

 >> nCols = numel( blockHeader ) ;
 >> blockContent = reshape( blockContent, nCols, [] ).'
 blockContent =
  4×3 cell array
    'Param1.pipe'           '0.56 cm'       '0.67 mm'    
    '10 Un Para2.pipe'      '15.4 mm'       '1.4 mm'     
    'Verti 2 mixing.dis'    '0.0e+1.9 %'    '60.67+34e %'
    'Rate.alarm times'      'yes'           'False End'

Then we can process the data. It seems that if it can be split into a number and a unit, you want to save the number, and otherwise you want to save whole string:

 nRows = size( blockContent, 1 ) ;
 for rowId = 1 : nRows
    for colId = 2 : nCols   % Starting at 2, first is item name.
         num = str2double( strsplit( blockContent{rowId,colId}, ' ' )) ;
         if ~isnan( num(1) )
            blockContent{rowId,colId} = num(1) ;
         end
    end
 end

After this, you can see the update of the content:

 >> blockContent
 blockContent =
  4×3 cell array
    'Param1.pipe'           [    0.5600]    [     0.6700]
    '10 Un Para2.pipe'      [   15.4000]    [     1.4000]
    'Verti 2 mixing.dis'    '0.0e+1.9 %'    '60.67+34e %'
    'Rate.alarm times'      'yes'           'False End'

and then you have to code the rest, which is more standard MATLAB, to create the row of data based on all information from all blocks store it in a cell array that you will output to CSV or XLSX ultimately.

Note that this approach answers one of your other question: how to process these Item tables without hard-coding patterns for every item.

So all in all, you already have a loop over folders and files, which reads a file into a char variable content. What you have to do is to integrate what I just explained in this loop, and implement what is missing:

 for folderId = ...
    ...
    for fileId = ....
       content = fileread( ... ) ;
       ...
       % - Code for processing Item blocks.
       blocks = regexp( content, 'Item.*?---.*?\s---', 'match' ) ;
       for blockId = 1 : numel( blocks )     
          parts = regexp( blocks{blockId}, '--+', 'split ') ;
          blockHeader = regexp( strtrim( parts{1} ), '\s{2,}', 'split' ) ;
          blockContent = regexp( strtrim( parts{2} ), '\s{2,}', 'split' ) ;
          nCols = numel( blockHeader ) ;
          blockContent = reshape( blockContent, nCols, [] ).' ;
          nRows = size( blockContent, 1 ) ;
          for rowId = 1 : nRows
             for colId = 2 : nCols   % Starting at 2, first is item name.
                num = str2double( strsplit( blockContent{rowId,colId}, ' ' )) ;
                if ~isnan( num(1) )
                   blockContent{rowId,colId} = num(1) ;
                end
             end
          end
          % * your job *
       end
    end
    ...
 end

Mekala balaji le 28 Sep 2017

Ouvrir dans MATLAB Online

I use the following code:
close all
clear all
clc
Maifolder='D:\Mekala_Backupdata\Matlab2010\Mainfolder4';
 D_main = dir('Mainfolder') ;
 D_main = D_main(3:end) ;             % Eliminate "." and ".."
% keyboard
% - Iterate through sub-folders and process.
 for dId = 1 : numel( D_main )
     % - Build listing files of sub-folder.
    D_sub = dir( fullfile('D:\Mekala_Backupdata\Matlab2010\Mainfolder4', D_main(dId).name, '*.txt' )) ;
    nFiles = numel( D_sub ) ;
%     keyboard
    % - Prealloc output cell array.
%     data = cell( nFiles, nHeaderCols ) ;
    % - Iterate through files and process.
    for fId = 1 : nFiles
        % - Read input text file.
        inLocator = fullfile( 'D:\Mekala_Backupdata\Matlab2010\Mainfolder4', D_main(dId).name, D_sub(fId).name ) ;
%         keyboard
        content = fileread( inLocator ) ;
        blocks = regexp( content, 'Item.*?---.*?\s---', 'match' );
%         blocks = regexp( content, 'Item.*?---.*?\s---', 'match' );
%         keyboard
        parts = regexp(blocks{1}, '--+', 'split') ;
        blockHeader = regexp( strtrim( parts{1} ), '\s{2,}', 'split' );
        keyboard
        blockContent = regexp( strtrim( parts{2} ), '\s{2,}', 'split' );
        nCols = numel( blockHeader ) ;
%         blockContent = reshape( blockContent, nCols, [] ).';
        blockContent = reshape( blockContent, 4, [] ).';
    end
 end

Cedric le 28 Sep 2017

Modifié(e) : Cedric le 28 Sep 2017

Ouvrir dans MATLAB Online

The problem is that your input files have a lot of discrepancies and differences. The last version had

 Item                           Min           Max            Std    Avg 
 --------------------------------------------------------------------------
 Param1.pipe                    0.56 cm       0.67 mm        52     102
 10 Un Para2.pipe               15.4 mm       1.4 mm         69     21
 Verti 2 mixing.dis             0.0e+1.9 %    60.67+34e %    63     12
 Rate.alarm times               yes           False End     102     201
 ------------------------------------------------------------------------------

where you can observe that we cannot assume that columns content is well aligned (column Std is right justified and column Avg is left justified apparently), so we cannot use positions as columns delimiters.

Also, columns have content that contain multiple parts separated by a white space. One a the few options that we had for separating columns was to count on the fact that more than one white space seemed to be separating columns. But in your new version we have:

 Item                           Open           Close end     
 --------------------------------------------------------------
 Param1.pipe                    0.56 cm       0.67 mm     3.3
 10 Un Para2.pipe               15.4 mm       1.4 mm      NA
 Verti 2 mixing.dis             0.0e+1.9 %    60.67+34e % 5.6 
 Rate.alarm times               yes           False End   8.9

where you apparently have four columns but sometimes a single space as a separator. Based on this, we could have defined that any white space is a separator in the header, but that can not work because sometimes you seems to have Std in the header and other times Std dev :

 Item                           Min           Program   Max         Std Dev    Avg 
 ----------------------------------------------------------------------------------
 Param1.pipe                    0.56 cm    60 kk     0.67 mm        52        102

So we are running short of options for implementing a generic approach, because the person who generated these files was not consistent.

I cannot spend much more time on this, but at this stage the simplest solution I guess is to pre-process the header and replace 'Std Dev' with 'Std' when found:

 parts{1} = strrep( parts{1}, 'Std Dev', 'Std' ) ;
 blockHeader = regexp( strtrim( parts{1} ), '\s{2,}', 'split' ) ;

and to pre-process the content as well to insert extract spaces in known faulty situations, e.g. '60.67+34e % 5.6' so it becomes '60.67+34e % 5.6', which can be achieved by replacing all '% ' by '% ':

 parts{2} = regexprep(  parts{2}, '% ', '%  ' ) ;
 blockContent = regexp( strtrim( parts{2} ), '\s{2,}', 'split' );

Finally, I see that you are using keyboard for stopping the code at strategical places, but you could use the debugger instead. If you click on the dash just aside a line number in the editor (or if you press F12 when the cursor is on this line), you toggle (on/off) a break point. You could set one at the beginning of your code or in the loop for example. Then you run the code and MATLAB will stop at the break point in the debugger (the command prompt will become K>>, for "keyboard"). You will see a green arrow that indicates where the you are in the code. When you are in the debugger, you can run your program line by line by pressing F10.

Mekala balaji le 28 Sep 2017

Ouvrir dans MATLAB Online

First part is working well, but 2nd part still have some problem, kindly help sir, I tried a lot, really need your very big-hearted help,

First part answer(OK):

blockContent = 
      'Param1.pipe'           '0.56 cm'       '0.67 mm'        '3.3'
      '10 Un Para2.pipe'      '15.4 mm'       '1.4 mm'         'NA' 
      'Verti 2 mixing.dis'    '0.0e+1.9 %'    '60.67+34e %'    '5.6'
      'Rate.alarm times'      'yes'           'False End'      '8.9'

2nd part answer:

blockContent2

blockContent2 =

Columns 1 through 14
    'Param1.pipe'    '0.56 cm'    '60 kk'    '0.67 mm'    '52'    '102'    '10 Un Para2.pipe'    '15.4 mm'    'NA'    '1.4 mm'    '69'    '21'    'Verti 2 mixing.dis'    '0.0e+1.9 %'
Columns 15 through 22
      [1x21 char]    '63'    '12'    'Rate.alarm times'    'yes'    'Unrelease False End'    '102'    '201'

Aafter this I got an error:

??? Error using ==> reshape Product of known dimensions, 6, not divisible into total number of elements, 22.

Error in ==> readtTextUnformatted at 43 blockContent2 = reshape( blockContent2, 6, [] ).';

it shoube devided into (4 by 6 matrix)

Cedric le 28 Sep 2017

Ouvrir dans MATLAB Online

Now you have to implement specific corrections before parsing using two or more white spaces as a separator. Each time RESHAPE crashes, it means that there isn't the correct number of elements, which means that at some place there were fewer than two white spaces for separating columns (or that inside a same column there was more than one white space). In the case above, you can see that 'Unrelease False End' was not split because there is just one white space after "Unrelease". This can be treated specifically by replacing all strings 'Unrelease' with 'Unrelease ' (notice the space at the end). See how I implemented specific corrections before parsing the header and before parsing the content (and listed the corrections in the comments to keep track):

    parts = regexp( blocks{blockId}, '--+', 'split') ;
    % Update header to address special cases:
    %   - 'Std Dev'   : replace with 'Std' to avoid generating 2 cols.
    parts{1} = strrep( parts{1}, 'Std Dev', 'Std' ) ;
    blockHeader = regexp( strtrim( parts{1} ), '\s+', 'split' ) ;
    % Update content to address special cases:
    %   - '0.0e+1.9 % Un'   : add extra space after all '%'.
    %   - 'Unrelease 60.'   : add extra space after all 'Unrelease'.
    parts{2} = regexprep( parts{2}, 'Unrelease', 'Unrelease ' ) ;
    parts{2} = regexprep( parts{2}, '% ', '%  ' ) ;
    blockContent = regexp( strtrim( parts{2} ), '\s{2,}', 'split' ) ;
    nCols = numel( blockHeader ) ;
    blockContent = reshape( blockContent, nCols, [] ).' ;
    nRows = size( blockContent, 1 ) ;
    for rowId = 1 : nRows
        for colId = 2 : nCols   % Starting at 2, first is item name.
            num = str2double( strsplit( blockContent{rowId,colId}, ' ' )) ;
            if ~isnan( num(1) )
                blockContent{rowId,colId} = num(1) ;
            end
        end
    end

Now you have to spot/analyze the next faulty cases and extend the number of corrections if needed.

Mekala balaji le 28 Sep 2017

It's my great privilege to you, Sir. It's really solved my problem,

Cedric le 28 Sep 2017

Awesome, congratulations for what you have learned in the process!

Connectez-vous pour commenter.

How to read text files from each subfolder

0 commentaires
Afficher -2 commentaires plus anciens Masquer -2 commentaires plus anciens

Réponse acceptée

22 commentaires
Afficher 20 commentaires plus anciens Masquer 20 commentaires plus anciens

Plus de réponses (0)

Catégories

Produits

Tags

Community Treasure Hunt

How to read text files from each subfolder

0 commentaires Afficher -2 commentaires plus anciens Masquer -2 commentaires plus anciens

Réponse acceptée

22 commentaires Afficher 20 commentaires plus anciens Masquer 20 commentaires plus anciens

Plus de réponses (0)

Catégories

Produits

Tags

Voir également

Community Treasure Hunt

0 commentaires
Afficher -2 commentaires plus anciens Masquer -2 commentaires plus anciens

22 commentaires
Afficher 20 commentaires plus anciens Masquer 20 commentaires plus anciens