How to read text files from each subfolder

Hi,
I have a main folder which contains several sub folders, now I want to read text files from each subfolder, save the data into ".xlsx" of each subfolders data by its subfolder name. For example read data from subfolder1 and save the data as "subfolder1.xlsx", and subfolder2 data as "subfolder2.xlsx".
Read text files, and extract the data as mentioned below:
1. Above the first dotted (-----------) line, extract the information: RainFallID, IINT, Rain Result, Start Time
2. Between two dotted lines(-------), the first column and 3rd column, in 3rd column if the data is mixed only keep the first part (for example 0.67 mm--> 0.67, 60.67+34e %-->60.67+34e, and if it is text like "False End"-->False End).
Please help some one kindly,

 Réponse acceptée

Cedric
Cedric le 9 Sep 2017
Modifié(e) : Cedric le 9 Sep 2017
Try something along this line:
% - Define output header.
header = {'RainFallID', 'IINT', 'Rain Result', 'Start Time', 'Param1.pipe', ...
'10 Un Para2.pipe', 'Verti 2 mixing.dis', 'Rate.alarm times'} ;
nHeaderCols = numel( header ) ;
% - Build listing sub-folders of main folder.
D_main = dir( 'Mainfolder' ) ;
D_main = D_main(3:end) ; % Eliminate "." and ".."
% - Iterate through sub-folders and process.
for dId = 1 : numel( D_main )
% - Build listing files of sub-folder.
D_sub = dir( fullfile( 'Mainfolder', D_main(dId).name, '*.txt' )) ;
nFiles = numel( D_sub ) ;
% - Prealloc output cell array.
data = cell( nFiles, nHeaderCols ) ;
% - Iterate through files and process.
for fId = 1 : nFiles
% - Read input text file.
inLocator = fullfile( 'Mainfolder', D_main(dId).name, D_sub(fId).name ) ;
content = fileread( inLocator ) ;
% - Extract relevant data.
rainfallId = str2double( regexp( content, '(?<=RainFallID\s+:\s*)\d+', 'match', 'once' )) ;
iint = regexp( content, '(?<=IINT\s+:\s*)\S+', 'match', 'once' ) ;
rainResult = regexp( content, '(?<=Rain Result\s+:\s*)\S+', 'match', 'once' ) ;
startTime = strtrim( regexp( content, '(?<=Start Time\s+:\s*).*?(?= -)', 'match', 'once' )) ;
param1Pipe = str2double( regexp( content, '(?<=Param1.pipe\s+[\d\.]+\s+\w+\s+)[\d\.]+', 'match', 'once' )) ;
tenUn = str2double( regexp( content, '(?<=10 Un Para2.pipe\s+[\d\.]+\s+\w+\s+)[\d\.]+', 'match', 'once' )) ;
verti2 = regexp( content, '(?<=Verti 2 mixing.dis\s+\S+\s%\s+)\S+', 'match', 'once' ) ;
rateAlarm = strtrim( regexp( content, '(?<=Rate.alarm times\s+\S+\s+)[^\r\n]+', 'match', 'once' )) ;
% - Populate data cell array.
data(fId,:) = {rainfallId, iint, rainResult, startTime, ...
param1Pipe, tenUn, verti2, rateAlarm} ;
end
% - Output to XLSX.
outLocator = fullfile( 'OutputFolder', sprintf( '%s.xlsx', D_main(dId).name )) ;
fprintf( 'Output XLSX: %s ..\n', outLocator ) ;
xlswrite( outLocator, [header; data] ) ;
end
Note that if you have a recent version of MATLAB, you can use the `folder` field of the struct outputed by DIR, and simplify most FULLFILE calls.
EDIT 4:09pm
Just a few extra comments. While it may look complicated, you should be fine with most of the code here. The general approach is
Iterate through sub folders of 'Mainfolder'
Iterate through files of sub folder
Extract data from file and store in data array
Export data array to relevant Excel file
The part that will likely be the most complex for you is the data extraction. One quick option for this is pattern matching using regular expressions. You can see a series of calls to REGEXP:
.. = regexp( content, pattern, option1, option2, .. )
This extract from content a string that matches the pattern. When you need to export a number we convert it to double using STR2DOUBLE. When it may capture extra white spaces we trim it using STRTRIM.
Regular expressions are a big topic, so it is normal if you don't really understand the patterns. In short,
aAb,123 etc : are literals; they are simply matched and they don't have
any special meaning
\s, \S, \d : match a single white-space, non white-space, numeric digit respectively
*, + : mean zeros or more, and one or more respectively times the pattern that precedes
\d+ hence means one or more numeric digit
[..], [^..] : defines a set of characters (-sets) to match or not to match respectively
[\d\s]+ hence means one or more element of either \d or \s
(?<=..) : defines a look behind
(?<=hello )world matches 'world' when it is preceded by 'hello '
(?=..) : defines a look forward
hello(?= world) matches 'hello' when it is followed by ' world'
. : matches any character. To match a the character '.', it has to be escaped with \
.[\d\.]+ matches any character followed by one or more characters that
are either a numeric digit or a '.'
Given this information, you can understand the pattern for extracting e.g. the value of IINT:
'(?<=IINT\s+:\s*)\S+'
which is, match
(?<=..)\S+ : one or more non white-space preceded by something
and the something is
IINT\s+:\s* : the literal 'IINT' followed by one or more white-spaces,
followed by the literal ':', followed by zero or more white-spaces
Cheers,
Cedric

22 commentaires

José-Luis
José-Luis le 21 Sep 2017
I wish I had the patience to write answers as detailed as this.
Accepting instead of +1 because Mekala doesn't seem to care.
Cedric
Cedric le 21 Sep 2017
Thank you José-Luis. Sadly, the experience demonstrates that you are actually very wise not having the patience!
OCDER
OCDER le 21 Sep 2017
Nice answer! Upvoted since it really explains a lot.
Mekala balaji
Mekala balaji le 22 Sep 2017
Modifié(e) : Cedric le 22 Sep 2017
Sir,
Kindly forgive me, it really nice code one I have seen so far. One kind clarification:
1. Can I make it generalized form: like extract all parameter data in colum1 and colum3
to generalize: in each iteration, and each text file, just extract all parameter data:
Generalize the following lines:
rainfallId = str2double( regexp( content, '(?<=RainFallID\s+:\s*)\d+', 'match', 'once' )) ;
iint = regexp( content, '(?<=IINT\s+:\s*)\S+', 'match', 'once' ) ;
rainResult = regexp( content, '(?<=Rain Result\s+:\s*)\S+', 'match', 'once' ) ;
startTime = strtrim( regexp( content, '(?<=Start Time\s+:\s*).*?(?= -)', 'match', 'once' )) ;
param1Pipe = str2double( regexp( content, '(?<=Param1.pipe\s+[\d\.]+\s+\w+\s+)[\d\.]+', 'match', 'once' )) ;
tenUn = str2double( regexp( content, '(?<=10 Un Para2.pipe\s+[\d\.]+\s+\w+\s+)[\d\.]+', 'match', 'once' )) ;
verti2 = regexp( content, '(?<=Verti 2 mixing.dis\s+\S+\s%\s+)\S+', 'match', 'once' ) ;
rateAlarm = strtrim( regexp( content, '(?<=Rate.alarm times\s+\S+\s+)[^\r\n]+', 'match', 'once' )) ;
next step: filter the data according to the list:
The list:
RainFallID
IINT
Rain Result
Start Time
Param1.pipe
10 Un Para2.pipe
Verti 2 mixing.dis
Rate.alarm times
Cedric
Cedric le 22 Sep 2017
Modifié(e) : Cedric le 22 Sep 2017
The code that you copied extracts relevant data and converts data to numeric when necessary. What do you mean by "filter"? Can you give an example?
Mekala balaji
Mekala balaji le 23 Sep 2017
Modifié(e) : Cedric le 23 Sep 2017
Sir,
My main purpose is my text file size is even bigger and I have more than 100 parameters, I just want to know rather to write 100 line to extract each parameter, will there be any general code(in just a few lines) which parse all parameters, and then select the required one.
for example: Between two dotted lines(-------) Extracts the first columns and 3rd column of all existing paramters as below:
Original:
Param1.pipe 0.56 cm 0.67 mm
10 Un Para2.pipe 15.4 mm 1.4 mm
Verti 2 mixing.dis 0.0e+1.9 % 60.67+34e %
Rate.alarm times yes False End
Step1: extracts all parameters:
data_temp:
Param1.pipe 0.67
10 Un Para2.pipe 1.4
Verti 2 mixing.dis 60.67+34e
Rate.alarm times False End
Step2:Just extracts the required data (matching with list) the list is:
Param1.pipe
Rate.alarm times
So, final data:(may be using ismember function)
Param1.pipe 0.67
Rate.alarm times False End
Many thanks Sir,
Cedric
Cedric le 23 Sep 2017
There is no generic way to extract everything if parameters can have mixed values (string, numeric, both), and sometimes multiple per column (e.g. the False End above).
Here I used pattern matching to match very specific parts because this is what you asked initially. If you want something more general, it is likely that you will have to program a little more and implement tests (e.g. trying to convert to numeric, save if ok, save as string if not).
The best you can do at this stage is to attach an original file with the more than 100 parameters, so I can look whether we can exploit some regularity or not.
Mekala balaji
Mekala balaji le 25 Sep 2017
Sir, I do follow your instructions, try by myself, and if I encounter any difficulty, I seek your help.
Mekala balaji
Mekala balaji le 25 Sep 2017
Modifié(e) : Cedric le 25 Sep 2017
Sir,
How to define to take the data from which column, here I need data from column1 and column3. Suppose if I want data from column1 & column4, suppose for parameter:
verti2 = regexp( content, '(?<=Verti 2 mixing.dis\s+\S+\s%\s+)\S+', 'match', 'once' ) ;
Which line do I need to change?
Cedric
Cedric le 25 Sep 2017
In the data set that you provided, there is no line with a column 4. Can you provide an example of such a line?
Sir,
For example:
from below: I want to extract the first column and & 3rd column (between first & 2nd dottoed lines, which is my previous requirement)
Param1.pipe 0.56 cm 0.67 mm
10 Un Para2.pipe 15.4 mm 1.4 mm
Verti 2 mixing.dis 0.0e+1.9 % 60.67+34e %
Rate.alarm times yes False End
and in the same text file,
Item Min Max Std Avg
-----------------------------------------------------------------------
Param1.pipe 0.56 cm 0.67 mm 52 102
10 Un Para2.pipe 15.4 mm 1.4 mm 69 21
Verti 2 mixing.dis 0.0e+1.9 % 60.67+34e % 63 12
Rate.alarm times yes False End 102 201
-----------------------------------------------------------------------
I want the data in first, 3rd,4th & 5th column.
I attached one text file output results.
Cedric
Cedric le 25 Sep 2017
Modifié(e) : Cedric le 25 Sep 2017
The problem is that we could match one number per row in the first version of your question, but now you want to extract multiple.
This is more complex and I cannot do the whole thing for you. I can give you hints but you will have to try programming that by yourself.
If I had to extract all parameter names and all values, plus the header, I would build some flexible code that is able to extract an Item array whatever its number of rows and columns. They are not really well formatted but I think that we can safely assume that we can use "more than one white space" as a column separator.
I can show you how to start but then you have to integrate the approach into your existing code.
Let's train with the content of one specific file:
>> content = fileread( 'RainFallReport3.txt' ) ;
we can proceed with a nest approach. First we try to target the item blocks:
>> blocks = regexp( content, 'Item.*?---.*?\s---', 'match' )
which outputs a cell array with two cells. If we look at the first we recognize the first Items block:
>> blocks{1}
ans =
Item Open Close
----------------------------------------------------------
Param1.pipe 0.56 cm 0.67 mm
10 Un Para2.pipe 15.4 mm 1.4 mm
Verti 2 mixing.dis 0.0e+1.9 % 60.67+34e %
Rate.alarm times yes False End
---
and if we look at blocks{2} we recognize the second Items block. Now we can develop code for extracting blocks content. First we try to split the block into a header and a content. We train with block 1:
>> parts = regexp( blocks{1}, '--+', 'split ') ;
if you look at parts{1} you will recognize the header, parts{2} the content, and parts{3} can be discarded. Now both the header string and the content string can be split using "more than one white space" as a separator. We trim them for extra white spaces first to avoid having extra empty cells:
>> blockHeader = regexp( strtrim( parts{1} ), '\s{2,}', 'split' )
blockHeader =
1×4 cell array
'Item' 'Open' 'Close'
and same for the content:
>> blockContent = regexp( strtrim( parts{2} ), '\s{2,}', 'split' )
blockContent =
1×12 cell array
Columns 1 through 6
'Param1.pipe' '0.56 cm' '0.67 mm' '10 Un Para2.pipe' '15.4 mm' '1.4 mm'
Columns 7 through 12
'Verti 2 mixing.dis' '0.0e+1.9 %' '60.67+34e %' 'Rate.alarm times' 'yes' 'False End'
which is a 1x12 cell array with all the entries. We can reshape it based on the relevant number of columns that we know from the header:
>> nCols = numel( blockHeader ) ;
>> blockContent = reshape( blockContent, nCols, [] ).'
blockContent =
4×3 cell array
'Param1.pipe' '0.56 cm' '0.67 mm'
'10 Un Para2.pipe' '15.4 mm' '1.4 mm'
'Verti 2 mixing.dis' '0.0e+1.9 %' '60.67+34e %'
'Rate.alarm times' 'yes' 'False End'
Then we can process the data. It seems that if it can be split into a number and a unit, you want to save the number, and otherwise you want to save whole string:
nRows = size( blockContent, 1 ) ;
for rowId = 1 : nRows
for colId = 2 : nCols % Starting at 2, first is item name.
num = str2double( strsplit( blockContent{rowId,colId}, ' ' )) ;
if ~isnan( num(1) )
blockContent{rowId,colId} = num(1) ;
end
end
end
After this, you can see the update of the content:
>> blockContent
blockContent =
4×3 cell array
'Param1.pipe' [ 0.5600] [ 0.6700]
'10 Un Para2.pipe' [ 15.4000] [ 1.4000]
'Verti 2 mixing.dis' '0.0e+1.9 %' '60.67+34e %'
'Rate.alarm times' 'yes' 'False End'
and then you have to code the rest, which is more standard MATLAB, to create the row of data based on all information from all blocks store it in a cell array that you will output to CSV or XLSX ultimately.
Note that this approach answers one of your other question: how to process these Item tables without hard-coding patterns for every item.
So all in all, you already have a loop over folders and files, which reads a file into a char variable content. What you have to do is to integrate what I just explained in this loop, and implement what is missing:
for folderId = ...
...
for fileId = ....
content = fileread( ... ) ;
...
% - Code for processing Item blocks.
blocks = regexp( content, 'Item.*?---.*?\s---', 'match' ) ;
for blockId = 1 : numel( blocks )
parts = regexp( blocks{blockId}, '--+', 'split ') ;
blockHeader = regexp( strtrim( parts{1} ), '\s{2,}', 'split' ) ;
blockContent = regexp( strtrim( parts{2} ), '\s{2,}', 'split' ) ;
nCols = numel( blockHeader ) ;
blockContent = reshape( blockContent, nCols, [] ).' ;
nRows = size( blockContent, 1 ) ;
for rowId = 1 : nRows
for colId = 2 : nCols % Starting at 2, first is item name.
num = str2double( strsplit( blockContent{rowId,colId}, ' ' )) ;
if ~isnan( num(1) )
blockContent{rowId,colId} = num(1) ;
end
end
end
% * your job *
end
end
...
end
Jan
Jan le 27 Sep 2017
@Cedric: "EDIT 4:09pm" is not necessarily clear to the readers due to the time zones. I prefer getting UTC from e.g. https://time.is/UTC .
Cedric
Cedric le 27 Sep 2017
Thank you for the link and for the boost, Jan!
Mekala balaji
Mekala balaji le 28 Sep 2017
Modifié(e) : Cedric le 28 Sep 2017
Sir,
The above code works perfectly, but I have a new type of text file, and I face difficulty in reading it. It gives me below Error
??? Error using ==> reshape
Product of known dimensions, 4, not divisible into total number of elements, 15.
Error in ==> readTextFromSubfolder2_multiLines_main at 32
blockContent = reshape( blockContent, 4, [] ).';
Because in blockHeader line: "Close end" treating as one word, then I specified manually nCols (as 4). But again, but still, it can not work, I seek your kind help in this regards,
I use the following code:
close all
clear all
clc
Maifolder='D:\Mekala_Backupdata\Matlab2010\Mainfolder4';
D_main = dir('Mainfolder') ;
D_main = D_main(3:end) ; % Eliminate "." and ".."
% keyboard
% - Iterate through sub-folders and process.
for dId = 1 : numel( D_main )
% - Build listing files of sub-folder.
D_sub = dir( fullfile('D:\Mekala_Backupdata\Matlab2010\Mainfolder4', D_main(dId).name, '*.txt' )) ;
nFiles = numel( D_sub ) ;
% keyboard
% - Prealloc output cell array.
% data = cell( nFiles, nHeaderCols ) ;
% - Iterate through files and process.
for fId = 1 : nFiles
% - Read input text file.
inLocator = fullfile( 'D:\Mekala_Backupdata\Matlab2010\Mainfolder4', D_main(dId).name, D_sub(fId).name ) ;
% keyboard
content = fileread( inLocator ) ;
blocks = regexp( content, 'Item.*?---.*?\s---', 'match' );
% blocks = regexp( content, 'Item.*?---.*?\s---', 'match' );
% keyboard
parts = regexp(blocks{1}, '--+', 'split') ;
blockHeader = regexp( strtrim( parts{1} ), '\s{2,}', 'split' );
keyboard
blockContent = regexp( strtrim( parts{2} ), '\s{2,}', 'split' );
nCols = numel( blockHeader ) ;
% blockContent = reshape( blockContent, nCols, [] ).';
blockContent = reshape( blockContent, 4, [] ).';
end
end
OCDER
OCDER le 28 Sep 2017
Modifié(e) : OCDER le 28 Sep 2017
Hi Mekala, this seems like a different problem than "How to read text files from each subfolder?" Post a new question like "Error reshaping data from text file". And then provide all the error message, necessary codes, and txt file like you did in these comments. HOWEVER, remove unused lines of codes - this is called code hoarding and should be avoided. Example of code hoarding:
% blockContent = reshape( blockContent, nCols, [] ).'; DELETE THIS IF UNUSED
blockContent = reshape( blockContent, 4, [] ).';
Remember, you want the reader to have an EASY time identifying your problem, and ONE specific problem. Otherwise, this thread will never stop, like an energizer bunny... Also, remember to thank the answerer by accepting the correct answer too - they earn reputation points and are not left to wonder if things worked out.
Cedric
Cedric le 28 Sep 2017
Modifié(e) : Cedric le 28 Sep 2017
The problem is that your input files have a lot of discrepancies and differences. The last version had
Item Min Max Std Avg
--------------------------------------------------------------------------
Param1.pipe 0.56 cm 0.67 mm 52 102
10 Un Para2.pipe 15.4 mm 1.4 mm 69 21
Verti 2 mixing.dis 0.0e+1.9 % 60.67+34e % 63 12
Rate.alarm times yes False End 102 201
------------------------------------------------------------------------------
where you can observe that we cannot assume that columns content is well aligned (column Std is right justified and column Avg is left justified apparently), so we cannot use positions as columns delimiters.
Also, columns have content that contain multiple parts separated by a white space. One a the few options that we had for separating columns was to count on the fact that more than one white space seemed to be separating columns. But in your new version we have:
Item Open Close end
--------------------------------------------------------------
Param1.pipe 0.56 cm 0.67 mm 3.3
10 Un Para2.pipe 15.4 mm 1.4 mm NA
Verti 2 mixing.dis 0.0e+1.9 % 60.67+34e % 5.6
Rate.alarm times yes False End 8.9
where you apparently have four columns but sometimes a single space as a separator. Based on this, we could have defined that any white space is a separator in the header, but that can not work because sometimes you seems to have Std in the header and other times Std dev :
Item Min Program Max Std Dev Avg
----------------------------------------------------------------------------------
Param1.pipe 0.56 cm 60 kk 0.67 mm 52 102
So we are running short of options for implementing a generic approach, because the person who generated these files was not consistent.
I cannot spend much more time on this, but at this stage the simplest solution I guess is to pre-process the header and replace 'Std Dev' with 'Std' when found:
parts{1} = strrep( parts{1}, 'Std Dev', 'Std' ) ;
blockHeader = regexp( strtrim( parts{1} ), '\s{2,}', 'split' ) ;
and to pre-process the content as well to insert extract spaces in known faulty situations, e.g. '60.67+34e % 5.6' so it becomes '60.67+34e % 5.6', which can be achieved by replacing all '% ' by '% ':
parts{2} = regexprep( parts{2}, '% ', '% ' ) ;
blockContent = regexp( strtrim( parts{2} ), '\s{2,}', 'split' );
Finally, I see that you are using keyboard for stopping the code at strategical places, but you could use the debugger instead. If you click on the dash just aside a line number in the editor (or if you press F12 when the cursor is on this line), you toggle (on/off) a break point. You could set one at the beginning of your code or in the loop for example. Then you run the code and MATLAB will stop at the break point in the debugger (the command prompt will become K>>, for "keyboard"). You will see a green arrow that indicates where the you are in the code. When you are in the debugger, you can run your program line by line by pressing F10.
First part is working well, but 2nd part still have some problem, kindly help sir, I tried a lot, really need your very big-hearted help,
First part answer(OK):
blockContent =
'Param1.pipe' '0.56 cm' '0.67 mm' '3.3'
'10 Un Para2.pipe' '15.4 mm' '1.4 mm' 'NA'
'Verti 2 mixing.dis' '0.0e+1.9 %' '60.67+34e %' '5.6'
'Rate.alarm times' 'yes' 'False End' '8.9'
2nd part answer:
blockContent2
blockContent2 =
Columns 1 through 14
'Param1.pipe' '0.56 cm' '60 kk' '0.67 mm' '52' '102' '10 Un Para2.pipe' '15.4 mm' 'NA' '1.4 mm' '69' '21' 'Verti 2 mixing.dis' '0.0e+1.9 %'
Columns 15 through 22
[1x21 char] '63' '12' 'Rate.alarm times' 'yes' 'Unrelease False End' '102' '201'
Aafter this I got an error:
??? Error using ==> reshape Product of known dimensions, 6, not divisible into total number of elements, 22.
Error in ==> readtTextUnformatted at 43 blockContent2 = reshape( blockContent2, 6, [] ).';
it shoube devided into (4 by 6 matrix)
Now you have to implement specific corrections before parsing using two or more white spaces as a separator. Each time RESHAPE crashes, it means that there isn't the correct number of elements, which means that at some place there were fewer than two white spaces for separating columns (or that inside a same column there was more than one white space). In the case above, you can see that 'Unrelease False End' was not split because there is just one white space after "Unrelease". This can be treated specifically by replacing all strings 'Unrelease' with 'Unrelease ' (notice the space at the end). See how I implemented specific corrections before parsing the header and before parsing the content (and listed the corrections in the comments to keep track):
parts = regexp( blocks{blockId}, '--+', 'split') ;
% Update header to address special cases:
% - 'Std Dev' : replace with 'Std' to avoid generating 2 cols.
parts{1} = strrep( parts{1}, 'Std Dev', 'Std' ) ;
blockHeader = regexp( strtrim( parts{1} ), '\s+', 'split' ) ;
% Update content to address special cases:
% - '0.0e+1.9 % Un' : add extra space after all '%'.
% - 'Unrelease 60.' : add extra space after all 'Unrelease'.
parts{2} = regexprep( parts{2}, 'Unrelease', 'Unrelease ' ) ;
parts{2} = regexprep( parts{2}, '% ', '% ' ) ;
blockContent = regexp( strtrim( parts{2} ), '\s{2,}', 'split' ) ;
nCols = numel( blockHeader ) ;
blockContent = reshape( blockContent, nCols, [] ).' ;
nRows = size( blockContent, 1 ) ;
for rowId = 1 : nRows
for colId = 2 : nCols % Starting at 2, first is item name.
num = str2double( strsplit( blockContent{rowId,colId}, ' ' )) ;
if ~isnan( num(1) )
blockContent{rowId,colId} = num(1) ;
end
end
end
Now you have to spot/analyze the next faulty cases and extend the number of corrections if needed.
Mekala balaji
Mekala balaji le 28 Sep 2017

It's my great privilege to you, Sir. It's really solved my problem,

Cedric
Cedric le 28 Sep 2017
Awesome, congratulations for what you have learned in the process!

Connectez-vous pour commenter.

Plus de réponses (0)

Catégories

En savoir plus sur Characters and Strings dans Centre d'aide et File Exchange

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by