Parsing a text file in MATLAB and accessing the contents of each section

yashvin on 10 Jun 2015
Commented: yashvin on 12 Jun 2015
Hi, I want to split a text file, which is quite big, into different sections in MATLAB.
- Ignore the first set of lines
- After that, the data set is repeated
- Access its contents for a particular set of conditions
For example, for a drag factor of 1.0 and fuel factor of 1.2, I want to find the corresponding alt for a particular weight.
Find attached the text file.
Thanks Yashvin
per isakson on 10 Jun 2015
Edited: per isakson on 10 Jun 2015
  • "quite big": how big compared to available memory?
  • "different sections": what defines the beginning of a section? Is "V2500_A5" a fixed string that marks the beginning of a new section?
yashvin on 10 Jun 2015
It is a 60 MB text file. As an example, I am attaching one full section from the file. The initial part, up to "Cruise at a given cost index", is unimportant.
Each section begins with "CLEAN CONFIGURATION" followed by a table.
For example, for drag factor = 1, fuel factor = 1.2 and ISA = 13, I want to access the table and get the corresponding weight.
I want to treat all the parameters in the 'CLEAN CONFIGURATION' block as fields so that I can select on different conditions.

Accepted Answer

per isakson on 10 Jun 2015
Edited: per isakson on 11 Jun 2015
Here is a function, which reads question2.txt and returns a struct vector. It might serve as a starting point.
>> out = cssm()
out =
1x2 struct array with fields:
>> out(abs([out.DRAG_FACTOR]-1)<1e-6 & abs([out.FUEL_FACTOR]-1)<1e-6).Table(1:5,1:3)
ans =
1.0e+04 *
4.0000 0.0000 0.0211
4.0500 0.0000 0.0212
4.1000 0.0000 0.0213
4.1500 0.0000 0.0214
4.2000 0.0000 0.0215
function out = cssm()
    % Read the whole file and split it at the section separator.
    str = fileread( 'question2.txt' );
    section_separator = 'CLEAN CONFIGURATION';
    cac = strsplit( str, section_separator );
    len = length( cac );
    out = struct( 'DRAG_FACTOR',nan(1,len-1), 'FUEL_FACTOR',[], 'Table',[] );
    % cac{1} is the text before the first separator and is skipped.
    for jj = 2 : len
        out(jj-1) = handle_one_section_( cac{jj} );
    end
end
function sas = handle_one_section_( str )
    sas = struct( 'DRAG_FACTOR',[], 'FUEL_FACTOR',[], 'Table',[] );
    sas.DRAG_FACTOR = excerpt_num_( str, 'DRAG FACTOR' );
    sas.FUEL_FACTOR = excerpt_num_( str, 'FUEL FACTOR' );
    sas.Table = excerpt_table_( str );
end
function val = excerpt_num_( str, name )
    % Match the number that follows the name, separated by spaces only.
    buf = regexp( str, [ '(?<=', name, ')', '[ ]+[\d\.]+' ], 'match', 'once' );
    val = str2double( buf );
end
function val = excerpt_table_( str )
    % Q&D, quick and dirty: search for a numeric sequence that is at least 100
    % characters long. PROBLEM: requires that the preceding line ends with a
    % "non-numeric" character and that the following line begins with a
    % "non-numeric" character.
    buf = regexp( str, '[\d\.\s]{100,}', 'match', 'once' );
    val = str2num( buf );
end
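To get from the struct vector to a single value for a given weight, one option is to select the section with a tolerance test (as in the query above) and then interpolate in its table. The following is only a minimal sketch: it assumes the weight is in column 1 and that the quantity of interest sits in a hypothetical column 3, and the numbers are made up for illustration rather than taken from the attached file.

```matlab
% Synthetic stand-in for the output of cssm(); values are illustrative only.
out(1).DRAG_FACTOR = 1.0;  out(1).FUEL_FACTOR = 1.0;
out(1).Table = [40000 0.70 35000; 40500 0.70 34800; 41000 0.71 34600];
out(2).DRAG_FACTOR = 1.0;  out(2).FUEL_FACTOR = 1.2;
out(2).Table = [40000 0.70 34000; 40500 0.70 33800; 41000 0.71 33600];

tol = 1e-6;                      % tolerance for comparing parsed doubles
is_wanted = abs( [out.DRAG_FACTOR] - 1.0 ) < tol ...
          & abs( [out.FUEL_FACTOR] - 1.2 ) < tol;
assert( nnz( is_wanted ) == 1, 'Expected exactly one matching section' );

tbl = out(is_wanted).Table;
weight_col = 1;                  % assumption: WGHT is the first column
alt_col    = 3;                  % assumption: hypothetical altitude column
weight_wanted = 40250;

% Linear interpolation between tabulated weights (NaN outside the table range).
alt = interp1( tbl(:,weight_col), tbl(:,alt_col), weight_wanted );
```

If only exact tabulated weights are of interest, a row lookup such as tbl(tbl(:,weight_col) == weight_wanted, alt_col) could replace the interpolation.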
Modified function based on comment
>> cssm
ans =
1x2 struct array with fields:
function out = cssm()
    % Read the whole file and split it at the section separator.
    str = fileread( 'question2.txt' );
    section_separator = 'CLEAN CONFIGURATION';
    cac = strsplit( str, section_separator );
    len = length( cac );
    out = struct( 'DRAG_FACTOR',nan(1,len-1), 'FUEL_FACTOR',[], 'Table',[] ...
                , 'COST_INDEX' ,[] , 'ALTITUDE' ,[], 'ISA' ,[] );
    % cac{1} is the text before the first separator and is skipped.
    for jj = 2 : len
        out(jj-1) = handle_one_section_( cac{jj} );
    end
end
function sas = handle_one_section_( str )
    sas = struct( 'DRAG_FACTOR',[], 'FUEL_FACTOR',[], 'Table',[] ...
                , 'COST_INDEX' ,[], 'ALTITUDE' ,[], 'ISA' ,[] );
    sas.DRAG_FACTOR = excerpt_num_( str, 'DRAG FACTOR' );
    sas.FUEL_FACTOR = excerpt_num_( str, 'FUEL FACTOR' );
    sas.COST_INDEX = excerpt_colon_separated_num_( str, 'COST INDEX' );
    sas.ALTITUDE = excerpt_colon_separated_num_( str, 'ALTITUDE' );
    sas.ISA = excerpt_colon_separated_num_( str, 'ISA' );
    sas.Table = excerpt_table_( str );
end
function val = excerpt_num_( str, name )
    % Match the number that follows the name, separated by spaces only.
    buf = regexp( str, [ '(?<=', name, ')', '[ ]+[\d\.]+' ], 'match', 'once' );
    val = str2double( buf );
end
function val = excerpt_table_( str )
    % Q&D, quick and dirty: search for a numeric sequence that is at least 100
    % characters long. PROBLEM: requires that the preceding line ends with a
    % "non-numeric" character and that the following line begins with a
    % "non-numeric" character.
    buf = regexp( str, '[\d\.\s]{100,}', 'match', 'once' );
    val = str2num( buf );
end
function val = excerpt_colon_separated_num_( str, name )
    % The quantifier is inside the token so that the whole number is captured,
    % not only its last character. Note that the separator class also consumes
    % '-', so a leading minus sign is not part of the captured token.
    buf = regexp( str, [ '(?<=', name, ')', '(?:[ \:\-]+)([\d\.]+)' ], 'tokens', 'once' );
    val = str2double( buf{:} );
end
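Since the ISA value can be negative, and the separator class [ \:\-]+ in excerpt_colon_separated_num_ swallows a minus sign, a variant that keeps the sign may be useful. The sketch below is an assumption about the layout ("NAME : value" with an optional sign); the sample text and the helper name make_pattern are hypothetical and are not taken from the attached file.

```matlab
% Illustrative header text; the real file layout may differ.
str = sprintf( 'COST INDEX : 30\nALTITUDE : 35000   ISA : -13\n' );

% Build a pattern: name, optional colon surrounded by whitespace, then a
% token holding an optionally signed integer or decimal number.
make_pattern = @(name) [ '(?<=', name, ')\s*:?\s*([+-]?\d+(?:\.\d+)?)' ];

tok      = regexp( str, make_pattern( 'ISA' ), 'tokens', 'once' );
isa_val  = str2double( tok{1} );     % sign is part of the token

tok      = regexp( str, make_pattern( 'COST INDEX' ), 'tokens', 'once' );
cost_val = str2double( tok{1} );
```

The trade-off is that a hyphen used purely as a separator (for example "ISA - 13") would now be read as a minus sign, so the pattern should be checked against the real file before relying on it.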
per isakson on 11 Jun 2015
@Guillaume, yes the two text files differed. The first is a stripped down version of the second. I attach the copies I used.
yashvin on 12 Jun 2015
Hi! Do you still have the file? Yes! Now it's clearer to me! Thanks so much! Yes, both of your answers were very helpful! I am getting used to it now. The first answer was at a higher level! Thank you both for your contributions!

More Answers (1)

Guillaume on 10 Jun 2015
Your text file is not really designed to be read by a computer. It's not very consistent (variable number of blank lines, variable number of spaces, inconsistent number format, etc.) which makes it difficult to parse efficiently.
So the first thing to look at is whether you can get the same data in a format designed to be parsed by a computer: binary, JSON, XML, etc.
Failing that, the following works on the attached file, but because of the inconsistencies may not work on a larger file:
dragwanted = 1.0;
fuelwanted = 1.2;
content = fileread('question.txt'); %get whole content of file
sections = regexp(content, 'DRAG FACTOR\s+([0-9.]+)\s+FUEL FACTOR\s+([0-9.]+)\s+([A-Z .]+\r\n[A-Z() ]+\r\n\s*\r\n([0-9. ]+\r\n)+)', 'tokens');
%sections is a cell array of 1x3 cell arrays of {drag factor, fuel factor, table}
dragfactors = cellfun(@(s) str2double(s{1}), sections);
fuelfactors = cellfun(@(s) str2double(s{2}), sections);
wanted = dragfactors == dragwanted & fuelfactors == fuelwanted;
assert(sum(wanted) > 0, 'No section match criteria');
assert(sum(wanted) == 1, 'More than one section match criteria');
section = sections{wanted}{3};
%parse the section:
sectionlines = strsplit(section, {'\n', '\r'});
sectionheader = strsplit(strtrim(sectionlines{1}))
sectionunits = strtrim(regexp(sectionlines{2}, '(?<=\().*?(?=\))', 'match'))
sectiontable = str2num(strjoin(sectionlines(4:end-1), '\n'))
  6 Comments
yashvin on 10 Jun 2015
Now I understand it better, thanks to you! So, in fact, the list of conditions before the table can contain any of them: it can also include the CG location percentage, the altitude value, the ISA number (positive or negative), the cost index value or the % of MCR thrust.
In the file, in each section, we only care about the part from CLEAN CONFIGURATION to the last value of the table. The rest can be discarded.
The table always starts with WGHT and the header stays the same. Yes, the units should be kept.
Thanks Yashvin
Guillaume on 10 Jun 2015
Your file is a real mess: sometimes you have empty lines with just one space, sometimes with no spaces; the header line starts with three spaces, the unit line with only two; the parameter section sometimes has one parameter on a line, sometimes two. You may be better off parsing the file line by line.
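As an illustration of the line-by-line approach, here is a minimal sketch of a small state machine. It assumes that a section starts at a line containing CLEAN CONFIGURATION, that the table header line starts with WGHT, and that table rows contain only digits, dots and whitespace. The sample text, the column names MACH and ALT, and the field names criteria and table are hypothetical.

```matlab
% Synthetic sample; the layout is an assumption based on the description above.
txt = sprintf( [ 'some preamble\r\n' ...
                 'CLEAN CONFIGURATION\r\n' ...
                 'DRAG FACTOR 1.0   FUEL FACTOR 1.2\r\n' ...
                 '   WGHT  MACH  ALT\r\n' ...
                 '  (KG)  (-)  (FT)\r\n' ...
                 ' \r\n' ...
                 ' 40000 0.70 34000\r\n' ...
                 ' 40500 0.70 33800\r\n' ...
                 '\r\n' ...
                 'trailing text\r\n' ] );

lines    = regexp( txt, '\r?\n', 'split' );
sections = struct( 'criteria', {}, 'table', {} );   % empty struct array
state    = 'skip';                                  % 'skip' | 'criteria' | 'table'

for k = 1 : numel( lines )
    ln = strtrim( lines{k} );
    if ~isempty( strfind( ln, 'CLEAN CONFIGURATION' ) )
        sections(end+1).criteria = {};              % start a new section
        sections(end).table = [];
        state = 'criteria';
        continue
    end
    switch state
        case 'criteria'
            if strncmp( ln, 'WGHT', 4 )
                state = 'table';                    % header line reached
            elseif ~isempty( ln )
                sections(end).criteria{end+1} = ln; % keep raw criteria line
            end
        case 'table'
            is_row = ~isempty( ln ) && isempty( regexp( ln, '[^\d.\s]', 'once' ) );
            if is_row
                sections(end).table = [ sections(end).table; sscanf( ln, '%f' ).' ];
            elseif ~isempty( ln ) && ~isempty( sections(end).table )
                state = 'skip';                     % first text line after the rows
            end                                     % unit and blank lines are ignored
    end
end
```

The criteria lines are kept as raw text here; they can be parsed afterwards with small regular expressions, which is usually easier than one large expression over the whole file.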
Otherwise, the following will get you the table and the criteria section, but will not parse the criteria:
sections = regexp(content, ...
'CLEAN CONFIGURATION\r\n((.*\r\n)+?)(\s+WGHT.*\r\n.*\r\n.*\r\n([0-9. ]+\r\n)+)', ...
'tokens', 'dotexceptnewline');
sections is a 1 x n (n = number of sections) cell array of cell arrays, whose first elements are the criteria part and whose second elements are the table part. You can parse the table with the same code as before. For reference, the above regular expression can be decoded as:
  1. match 'CLEAN CONFIGURATION' followed by '\r\n' (newline)
  2. start the first token (at the first '(')
  3. match any character but a newline followed by '\r' (the |(.*\r
