How do I parse this complex text file with textscan?

I have a text file that is in a rather funky format. The file comes out of a relational database (Antelope) and consists of earthquake location, dates, times, phase information, etc. I need to parse out and collect the 'data blocks' that are in between each header line. I need the header lines as well for each "block". I have edited the file to include an EOB (end of block) marker to make this task easier, but it's not as trivial as I thought. Here's an image of the first 68 or so lines (out of about 1 million).
I'd like to pull the 4 columns below each header.... for example the first section is:
2015 1 22 0 8 58.537 45.97929 -129.98717 1.184 0.0 1.039 3.621 0.036 1
AXCC1 0.843 1.00 P
AXAS2 1.263 1.00 P
AXEC1 0.923 1.00 P
AXEC2 1.103 1.00 P
AXEC3 1.088 1.00 P
AXCC1 1.873 0.25 S
AXAS1 2.728 0.06 S
AXAS2 2.168 0.25 S
AXEC1 1.708 0.33 S
AXEC2 2.043 0.25 S
AXEC3 2.113 0.25 S
and put those in an array. But I need to be able to associate the header line, specifically the last integer in the header line (1 in this case), with each code block.
So far my code looks like this, but obviously it is not working yet. I don't get any errors but it's missing and skipping data etc.
fid=fopen('ph2dt_catalog8_edit.dat');
Block=1;
while (~feof(fid))
InputText=textscan(fid,'%s',1,'delimiter','\n');
HeaderLines{Block,1}=InputText{1};
disp(HeaderLines{Block});
FormatString='%s%f%f%s';
InputText=textscan(fid, FormatString, 'delimiter','WhiteSpace','CollectOutput',1);
Data{Block,1} = cell2mat(InputText{2});
[NumRows,NumCols] = size(Data{Block});
eob=textscan(fid,'%s',1,'delimiter','\n');
Block=Block +1;
end
Can anyone offer any suggestions. Let me know if I need to clarify anything further.

14 Comments

dpb on 16 Nov 2016
Is there any way when generating the file to get the number of records in the block? If that could be added into either the existing header line as another variable or as a separate record for each block when generating the report it would make parsing "on the fly" much simpler; you could have a specific count for each block.
Will wait to find out if this is possible before thinking much about other options...
It appears that you start with a header to a block (and that header appears to start with a year). Now, within each block after the header, are there always the same number of reporting station entries for any one file? Is that number fixed (same for every file) ?
dpb on 17 Nov 2016
It's not in the sample, Walter. There are 11,10,11,12,10,... per block which is why I asked if he could get the database query to also return that value...
psprinks on 17 Nov 2016
Edited: psprinks on 17 Nov 2016
Thank you Walter and dpb. I will check in the morning to see if I can get the database to write the number of station records to the file. I can also add some screen shots of the output I'm getting as well if that would help.
dpb on 17 Nov 2016
Screen shots "not so much". A short section of the file attached would let others have a shot...
If you can get that additional info then writing a loop for each block and simply parsing that number to use in the textscan format repeat count should make parsing a "piece-o-cake".
How many different stations are represented in the entire file?
dpb on 17 Nov 2016
You mention there are about 1M records; the records are pretty short, how big is the file (in MB)? It's possible you could scarf it up as character image and do a search for the EOB markers to compute the blocksize. I'm guessing Per's thinking on the lines of regular expressions looking for the groups of lines containing one of the station IDs if that population is known (or knowable) a priori...
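The idea of reading the file as one character string and locating the EOB markers could be sketched roughly as below. This is only an illustration on a made-up in-memory sample (the variable names and the sample text are assumptions, not taken from the real file); with the real file, str would come from fileread.

```matlab
% Hypothetical sample standing in for the output of fileread on the real file
str = sprintf(['2015 1 22 0 8 58.537 45.97929 -129.98717 1.184 0.0 1.039 3.621 0.036 1\n' ...
               'AXCC1 0.843 1.00 P\nAXAS2 1.263 1.00 P\nEOB\n' ...
               '2015 1 22 0 9 12.000 45.9 -129.9 1.1 0.0 1.0 3.6 0.03 2\n' ...
               'AXEC1 0.923 1.00 P\nEOB\n']);

lineCells = regexp(str, '\n', 'split');           % one cell per line
lineCells(cellfun(@isempty, lineCells)) = [];     % drop the empty piece after the last newline
eob  = find(strcmp(lineCells, 'EOB'));            % line numbers of the EOB markers
neob = diff([0, eob]) - 2;                        % data rows per block (minus header and EOB line)
```

Each entry of neob could then serve as the repeat count for one textscan call per block.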
psprinks on 17 Nov 2016
Edited: psprinks on 17 Nov 2016
The entire file is about 23 MB. I've attached the first ~ 1000 lines so folks can play.
@per isakson ...there are 73884 "header" lines and 963369 total lines in the file including the lines with EOB at the end of each station block. There are 7 possible ocean bottom seismometer stations ( AXCC1 AXEC1 etc ). But they can be repeated in the blocks...as people have pointed out in some blocks there may be 10, 11, 12, 9, etc records.
Ok so I've switched things up and have some working code, but it's very slow because I'm using fgetl. I took out the EOB markers and added a hash (#) to the header lines. My new question I guess is optimizing this ....any ideas on faster method????
fid=fopen('ph2dt_catalog8.dat');
tline=fgetl(fid);
c=0;
ORG=[]; ARV=[]; STA=[]; PHA=[]; MATDAY_ARV=[]; EVID=[];
while ischar(tline)
% tline=fgetl(fid);
if strcmp(tline(1),'#')
% disp(tline)
S=textscan(tline, '%s %f %f %f %f %f %f %f %f %f %f %f %f %f %f','delimiter', ' ','MultipleDelimsAsOne',1, 'CollectOutput',1 );
ORG=cat(1,ORG, S{2});
X=S{2};
else
S=textscan(tline, '%s %f %f %s','delimiter', ' ','MultipleDelimsAsOne',1);
MATDAY_ARV=datenum(X(1),X(2),X(3),X(4),X(5),X(6)); + S{2}; % matlab time of arrival
ARV=cat(1,ARV,S{2});
STA=cat(1,STA,S{1});
PHA=cat(1,PHA,S{4});
EVID=cat(1,EVID,X(end));
end
tline=fgetl(fid);
end
dpb on 18 Nov 2016
No joy on being able to generate the block sizes from the database, I gather? That'd be the cat's meow, methinks... fgetl will be slow.
Unfortunately, no....ugh...
The fgetl code works....but as you guys all know it's SUPER slow...I left it running last night and it took ~10 hours. I have the output I need, but would still like to improve the time for future use.
Jan on 18 Nov 2016
Edited: Jan on 18 Nov 2016
I do not think that fgetl is the bottleneck here, but the iterative growing of the output. I expect that textscan is slower than fgetl and a specific parsing of the lines, because textscan is so much "smarter". Smartness costs time.
"strcmp(tline(1),'#')" ??? I do not see a "#" in the posted test file. Please post a real data file and explain, which output you want.
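To illustrate the point about iteratively growing output, here is a minimal sketch that fills the same column vector twice, once with cat and once into pre-allocated memory. The row count n is an arbitrary assumption, and the timings will differ between machines and Matlab releases.

```matlab
n = 1e4;                      % hypothetical number of data rows

% Growing: each cat() may reallocate and copy the whole array
tic
grown = [];
for k = 1:n
    grown = cat(1, grown, k);
end
tGrow = toc;

% Pre-allocated: memory is reserved once, rows are filled by index
tic
filled = nan(n, 1);
for k = 1:n
    filled(k) = k;
end
tFill = toc;

fprintf('growing: %.4f s, pre-allocated: %.4f s\n', tGrow, tFill);
```

Both loops produce the same vector; only the memory handling differs, and the gap typically widens as n grows.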
@Jan Wow...you are right. fgetl is not slow...I'm perpetuating false rumors.
my code took ~10 hours and yours took about 20 minutes...so that's pretty amazing.
My only concern now is working with the format of the output...cell arrays within cell arrays. Also this code didn't write the header lines to an array( but not a big deal because I have code I can splice in that does that), which I need.
I'm just a geophysicist hack when it comes to coding.
per isakson on 19 Nov 2016
Edited: per isakson on 19 Nov 2016
"I'd like to pull the 4 columns below each header". Your script doesn't extract the third column. And what is the intent of MATDAY_ARV=datenum(...)?

Accepted Answer

per isakson on 18 Nov 2016
Edited: per isakson on 30 Jan 2021
Assumptions
  • Speed is important - "any ideas on faster method?"
  • The text file fits in memory - "The entire file is about 23 MB."
  • The station names are exactly five characters - "5" appears in the code as a magic number
  • The value of PHA is exactly one character
  • The line separator is "\n", i.e. char(10)
  • The header lines begin with 2014,2015,2016 or 2017 (and are the only lines to begin so).
Approach
  • Read the entire file into a character string.
  • Split the string into a cell array of strings, with one block in each cell
  • Pre-allocate output variables based on the size of the string and the cell array
  • Loop over all blocks and parse one block at a time
I tested with community_edit_2.txt, which is community_edit.txt with the # removed.
STA and PHA are character arrays rather than cell arrays of strings. That's somewhat faster
function [ ORG, ARV, STA, PHA, EVD ] = cssm( filespec )
str = fileread( filespec );
xpr = '(?<=(^|\n))[ ]*201[4567].+?(?=($|[ ]*201[4567]))';
blocks = regexp( str, xpr, 'match' );
nnl = length( strfind( str, char(10) ) );
len = length( blocks );
ORG = nan(len,14);
%
N = nnl - len + 1;
STA = repmat( '-', [N,5] );
ARV = nan(N,1);
PHA = repmat( '-', [N,1] );
EVD = nan(N,1);
nextORG = 1;
nextSTA = 1;
for cac = blocks
S0 = regexp( cac{1}, '\n', 'split', 'once' );
S1 = textscan( S0{1}, '%f%f%f%f%f%f%f%f%f%f%f%f%f%f' ...
, 'CollectOutput',true );
ORG( nextORG, : ) = S1{1};
MATDAY_ARV = datenum( S1{1}(1:6) ); %#ok<NASGU>
nextORG = nextORG + 1;
%
S2 = textscan( S0{2}, '%5c%f%f%1c' );
N2 = size( S2{1}, 1 );
STA( nextSTA:nextSTA+N2-1, : ) = S2{1};
ARV( nextSTA:nextSTA+N2-1, 1 ) = S2{2};
PHA( nextSTA:nextSTA+N2-1, 1 ) = S2{4};
EVD( nextSTA:nextSTA+N2-1, : ) = S1{1}(end);
nextSTA = nextSTA + N2;
end
%
if N >= nextSTA % truncate the "memory", which isn't used.
STA( STA == '-' ) = [];
STA = reshape( STA, [],5 );
ARV( nextSTA : end ) = [];
PHA( nextSTA : end ) = [];
EVD( nextSTA : end ) = [];
end
end
Error handling: This file lacks error handling besides that of Matlab, e.g. fileread will tell if the text file is missing. If this function is intended for routine use it's important to handle especially the errors, which are caused by unexpected character strings in the input file.
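As a rough sketch of what such checks might look like for a single block, assuming the same header/data split as in the function above. The error identifiers and messages are made up for illustration, and the sample block is hypothetical.

```matlab
% Hypothetical block text, standing in for one element of "blocks"
blk = sprintf(['2015 1 22 0 8 58.537 45.97929 -129.98717 1.184 0.0 1.039 3.621 0.036 1\n' ...
               'AXCC1 0.843 1.00 P\n']);

S0 = regexp(blk, '\n', 'split', 'once');   % {header, remaining data rows}

% A block without data rows yields a 1x1 cell, so S0{2} would not exist
if numel(S0) < 2 || isempty(strtrim(S0{2}))
    error('cssm:emptyBlock', 'Block has a header but no data rows: "%s"', S0{1});
end

% The header is expected to hold exactly 14 numeric values
hdr = sscanf(S0{1}, '%f');
if numel(hdr) ~= 14
    error('cssm:badHeader', 'Expected 14 header values, found %d: "%s"', numel(hdr), S0{1});
end
```

Checks of this kind report the offending header line instead of a bare indexing error.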
2016-11-18, Performance test
  • Computer: eight year old vanilla desktop with 8GB RAM.
  • System: Windows7,64bit, Matlab R2016a,64bit
  • Test file: community_edit_1M.txt is 27.6MB, 95200 blocks, 1097181 lines. It's created by concatenating copies of community_edit.txt and removing the #.
>> filespec = 'h:\m\cssm\community_edit_1M.txt';
>> tic,[ORG0,ARV0,STA0,PHA0,EVD0] = cssm( filespec ); toc
Elapsed time is 22.443859 seconds.
Caveat: The text file was probably available in the system cache, since this was not cleared before the test.
Comparison: This is about 4.5 times faster than Jan's function, asd:
>> filespec = 'h:\m\cssm\community_edit_1M_EOB.txt';
>> tic, [Data, HeaderLines] = asd( filespec ); toc
Elapsed time is 101.202009 seconds.

11 Comments

thanks per
I'm trying to implement this. It's not throwing any errors but just dumps an empty 0x14 matrix.
Am I missing something?
per isakson on 18 Nov 2016
Edited: per isakson on 19 Nov 2016
Yes, you missed something, but based on your terse comment it's hard to tell anything for sure.
Here the function returns the same result as your script of the comment above. And that's in a fraction of the time.
I guess
  • 14 is the length of the header line
  • you tried the function on a text file, which contained # as the first character of the header lines. I.e. you misunderstood the sentence, "I tested with community_edit_2.txt, which is community_edit.txt with the # removed.", in my answer. Yes, I could have expressed myself clearer.
  • you called the function without output arguments
psprinks on 20 Nov 2016
Edited: psprinks on 20 Nov 2016
Hi Per,
Sorry for the short reply. I'm accepting your answer as it seems to be the fastest and cleanest code. I know it will work the best after I sort out the one error it's currently throwing. Yes, the first time I ran this was with the text file that still had the # as the first character of the header lines. However, when running the file with the # removed I receive this error :
Index exceeds matrix dimensions
Error in cssm (line 27)
S2 = textscan( S0{2}, '%5c%f%f%1c' );
I'm assuming the issue is that S0{2} is empty and so the call to textscan doesn't like that??? I'm not sure what you mean by calling the fcn without output arguments. I thought functions were called using input arguments and the code where you define the function calculates the output variables?
per isakson on 20 Nov 2016
Edited: per isakson on 20 Nov 2016
Hi psprinks
This form of communication isn't always effective. What may take a dozen comments back and forth could have been solved in seconds in front of a shared screen (and keyboard). Thank you for seeing my blunt marker from the positive side. (I might have expressed myself better in Swedish.)
Further on communication, am I correct that you don't use the debugging features of Matlab? I deduce that from the word "assuming" in "I'm assuming the issue is that S0{2} is empty". This question is intended to illustrate a dilemma rather than to ask.
Now, I will start a new comment with technical stuff.
per isakson on 20 Nov 2016
Edited: per isakson on 21 Nov 2016
Matlab has good debugging features, see Debug a MATLAB Program.
No, I cannot tell why you encounter this error. Instead of guessing, I'll try to help you find out.
  • set a breakpoint at line 18
  • start the function. Execution will halt at line 18
  • hover over the variable str. The tooltip will show the value of str, which should be "identical" to the content of the text file. There should not be any # or EOB.
  • double line spacing in the tooltip would indicate that the line separator is "\r\n", i.e. char(13)+char(10). If so, execute double(str(90:145)) and look for the pair 13 10 in the output.
  • hover over the variable blocks. (The exact content of the tooltip may differ between Matlab releases. I use R2016a.)
  • click Step three(?) times
  • hover over the variable S0
  • click Step until line 27
  • select the expression S0{2}
  • right-click and select Evaluate Selection
  • click Step and hover over S2
  • click Quit Debugging in the toolstrip
Now, I hope that you either were able to reproduce these steps or that you identified a difference, which explains why it went wrong.
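The line-separator check from the steps above can also be done programmatically. A minimal sketch on a made-up string with Windows line endings (the sample text is an assumption; with the real file, str would be the output of fileread):

```matlab
% Hypothetical sample with Windows line endings (CR LF)
str = sprintf('2015 1 22\r\nAXCC1 0.843 1.00 P\r\n');

hasCR = any(str == char(13));                       % true if any carriage return is present
str = strrep(str, [char(13) char(10)], char(10));   % normalise CR LF to a plain LF
```

After the replacement the string satisfies the assumption that the line separator is char(10).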
If you were able to reproduce these steps, the next steps are
  • click Breakpoints in the toolstrip (/toolbar)
  • click Clear All
  • click Stop on Errors
  • run the function and it will halt just in advance of throwing the error
  • hover over str, blocks, cac and S1
  • select the expression S0{2}, right-click and select Evaluate Selection
By now you should know more about the cause of the problem
/over
psprinks on 21 Nov 2016
Edited: psprinks on 21 Nov 2016
Per,
I can't thank you enough for your effort. You are correct this hasn't been the most effective way to communicate, and I'm forgetting that English isn't everyone's first language. (my bad)
So, I am familiar with debugging and I have isolated the problem.
When the code reaches the 997th header line it doesn't read the full line. It stops after it reads the longitude value. So S0 becomes a 1x1 cell instead of the 1x2 that it should be. Therefore, when Matlab gets to S2 it can't evaluate S0{2} because it doesn't exist.
Further, the issue is with:
xpr = '(?<=(^|\n))[ ]*201[4567].+?(?=($|[ ]*201[4567]))';
What's happening is that anytime the characters 2014, 2015, 2016 or 2017 appear in the header line after the beginning of the line the code is cutting the rest of the header line. This will happen at several points throughout the rest of the code. For example the last characters in the 12015th header line are 2015 and the xpr assignment says to drop those characters and so cac, blocks, S0 aren't correct. I hope this makes sense.
I am attaching a much larger portion of the text file so you can see what I'm seeing.
Also I've put an image of my workspace so you can see that S0{2} doesn't exist.
dpb on 21 Nov 2016
Is that the header line with 997 as the last entry or the 997th header line read? There doesn't appear to be a complete set, so not sure which.
psprinks on 21 Nov 2016
Edited: psprinks on 21 Nov 2016
@dpb please see my edited comment above...the 997th header line read. I've added a larger .txt file for people to play with.
The issue is the xpr assignment. If at any point after the first 2015 in each header line the numbers 2014, 2015, 2016, or 2017 appear, anything afterwards is dropped. I know what that line of code is trying to achieve, but the syntax is foreign to me, and therefore kind of hard for me to fix. (I do not have a CS background) I know the issue is with the second half of that line, but again I'm not currently able to understand what it's doing really.
per isakson on 21 Nov 2016
Edited: per isakson on 22 Nov 2016
You found a bug in my code and you spotted the erroneous expression: "the characters 2014, 2015, 2016 or 2017 appear in the header line". However, let me show you how I would track it down.
  • set Stop on Errors and run
  • execution halted at line 32
  • select cac{1} and evaluate. The block is truncated in the header line, as you already found: "reaches the 997th header line it doesn't read the full line."
  • search for the value 45.93929 in the file. There are hopefully only a few occurrences of it in the file. I use Notepad++ to inspect data files.
The block is truncated just before 2017. And that is done by
xpr = '(?<=(^|\n))[ ]*201[4567].+?(?=($|[ ]*201[4567]))';
blocks = regexp( str, xpr, 'match' );
The error is in the look-ahead part, (?=($|[ ]*201[4567])). It matches 2017 in any position, not only at the beginning of a line. A \n before 2017 is missing. Replace the expression by
xpr = '(?<=(^|\n))[ ]*201[4567].+?(?=($|(\n[ ]*201[4567])))';
which has an extra pair of parentheses for readability. Now the "look ahead" looks for either the end of the entire string or a new line followed by zero or more spaces followed by 201 followed by one of 4567.
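The difference between the two expressions can be seen on a tiny made-up string in which the first header line happens to contain the digits 2017 (the sample values are hypothetical and the headers are shortened for readability):

```matlab
% Hypothetical two-block sample; "1.2017" in the first header triggers the bug
str = sprintf(['2015 1 22 0 8 58.5 45.9 -129.9 1.2017 1\n' ...
               'AXCC1 0.843 1.00 P\n' ...
               '2015 1 22 0 9 12.0 45.9 -129.9 1.1 2\n' ...
               'AXEC1 0.923 1.00 P']);

xprOld = '(?<=(^|\n))[ ]*201[4567].+?(?=($|[ ]*201[4567]))';
xprNew = '(?<=(^|\n))[ ]*201[4567].+?(?=($|(\n[ ]*201[4567])))';

oldBlocks = regexp(str, xprOld, 'match');
newBlocks = regexp(str, xprNew, 'match');

disp(oldBlocks{1})   % header cut off just before "2017", data row lost
disp(newBlocks{1})   % full header followed by its data row
```

With the old expression the first block stops inside the header, so splitting it at the first newline gives a single piece and S0{2} does not exist.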
Now the function reads the current data file
>> filespec = 'h:\m\cssm\community_20161121.txt';
>> tic,[ORG0,ARV0,STA0,PHA0,EVD0] = cssm( filespec ); toc
Elapsed time is 0.199797 seconds.
>> whos ORG0
Name Size Bytes Class Attributes
ORG0 826x14 92512 double
AWESOME! This literally saved me days of processing time!!!
per isakson on 22 Nov 2016
Edited: per isakson on 22 Nov 2016
I'm glad the function is useful and will be used!
You had already spotted the expression with the bug: "the characters 2014, 2015, 2016 or 2017 appear in the header line". I could have saved myself the details in my last comment. However, I was kind of occupied with describing a complete debugging session, hopefully to the benefit of some other reader.

More Answers (2)

Jan on 17 Nov 2016
Edited: Jan on 18 Nov 2016
fscanf might be easier than textscan:
[EDITED: bugs removed]
function [Data, HeaderLines] = asd(FileName)
fid = fopen(FileName, 'r');
if fid == -1
error('Cannot open file: %s', FileName);
end
maxBlocks = 10000; % Is this sufficient? Better too large.
HeaderLines = cell(1, maxBlocks);
Data = cell(1, maxBlocks);
iBlock = 0;
aBlock = cell(1, 20); % Or largest number of lines per block
while ~feof(fid)
iBlock = iBlock + 1;
Line = fgetl(fid);
if ~ischar(Line)
break;
end
HeaderLines{iBlock} = Line;
isEOB = false;
iData = 0;
while ~isEOB && ~feof(fid)
Line = fgetl(fid);
if ~ischar(Line) || strncmp(Line, 'EOB', 3)
isEOB = true;
else
iData = iData + 1;
len = length(Line);
[s1, num, err, ind1] = sscanf(Line, '%s', 1);
[f, num, err, ind2] = sscanf(Line(ind1:len), '%f', 2);
s2 = sscanf(Line(ind1+ind2:len), '%s');
aBlock{iData} = {s1, f(1), f(2), s2};
% Parse = textscan(Line, ' %s %f %f %s');
% aBlock{iData} = {Parse{1}{1}, Parse{2:3}, Parse{4}{1}};
end
end
Data{iBlock} = aBlock(1:iData); % Crop the data block
end
fclose(fid);
Data = Data(1:iBlock);
HeaderLines = strtrim(HeaderLines(1:iBlock));
end

6 Comments

dpb on 17 Nov 2016
"fscanf might be easier than textscan:"
Don't see an fscanf call, anywhere, Jan... vbg
I was hoping to avoid the fgetl line-by-line scanning for the EOB marker...
Jan on 18 Nov 2016
Edited: Jan on 18 Nov 2016
@dpb: :-) While typing the code I decided that reading the single lines with 4 fscanf calls looks uglier than running textscan on the string.
Now that I can try to run the code (I had no Matlab available while typing the answer), I see that my textscan command fails completely. I still hate this command. See [EDITED]. I admit that the hard-coded sscanf lines are not beautiful. But at least they are faster than textscan.
fgetl is not super slow. The disk access is the bottleneck, or the iterative growing of the output if the initial guess was too low. Is the idea of a slow fgetl based on the output of the profiler? Otherwise I claim it is only a rumor. Unfortunately we cannot test this using the tiny test file, because it is held in the disk cache after the first reading, such that repeated trials do not show realistic timings.
dpb on 18 Nov 2016
fgetl isn't per se, no, but in general record-by-record reading is. If TMW or the OS has now implemented buffering that's effective, that'll certainly help, yes.
It's historically so, perhaps the technology has now advanced to make it less so; I've not tried to test recently, granted.
textscan has a lot not to like, granted; but it also has a lot of flexibility. It likely isn't going to compete with fscanf on speed, either.
@Jan Simon... I am pretty new to Matlab, and because I have a similar problem and it seems that your code should work, could you comment the lines for easier understanding? Thanks
Jan on 9 Nov 2017
@FishermanJack: This would be very inefficient. I could spend hours mentioning all the details I know about the code lines. Most of the commands are trivial and I cannot guess which commands are not clear to you. So better use the debugger to step through the code line by line, see what happens in which order, and read the documentation of the commands which are not clear. If any details are not clear afterwards, ask a specific question.

dpb on 18 Nov 2016
Edited: dpb on 18 Nov 2016
OK, for your file I used a grep utility first to find the EOB markers and then computed the numbers for each group...within Matlab it looked like--
>> cmd='grep -n EOB community_edit.txt >blocks.txt';
>> system(cmd); % run grep to write the EOB line numbers to blocks.txt
>> eob=textread('blocks.txt','%d:EOB');
>> neob=diff([0;eob])-2;
>> neob(1:10)' % see if looks ok...
ans =
11 10 11 12 10 12 13 12 8 8
That agrees with the number I get counting in editor.
Now, with that, read the first header and block, then repeat for the remaining 2:length(neob) blocks, each time skipping one line first (the EOB marker, which the first group doesn't have in front of it).
fmt1=repmat('%f',1,14); % header line
fmt2='%s%f%f%s'; % block data
fid=fopen(...
nblk=length(neob); % number of blocks
hdrs=zeros(nblk,14); % room for the headers
blks=cell(nblk,1); % one cell per data block
hdrs(1,:)=cell2mat(textscan(fid,fmt1,1,'collectoutput',1));
blks{1}=textscan(fid,fmt2,neob(1),'collectoutput',1);
for i=2:nblk
hdrs(i,:)=cell2mat(textscan(fid,fmt1,1,'headerlines',1,'collectoutput',1));
blks{i}=textscan(fid,fmt2,neob(i),'collectoutput',1);
end
fid=fclose(fid);
Should be quite a bit quicker reading over fgetl.

2 Comments

thanks dpb
I'm trying to implement your code but it's throwing this:
Error using diff
Function 'diff' is not supported for class 'cell'.
The output from
eob=textscan('blocks.txt','%d:EOB');
is a cell.
dpb on 18 Nov 2016
Edited: dpb on 18 Nov 2016
Oh, yeah, I forgot when I used the "approved" textscan over the deprecated textread that I use for simple cases to wrap the RHS in cell2mat to convert the cell to double array. Or, of course, you can use {:} to dereference the cell. But, my solutions in preferred order are--
  1. eob=textread('blocks.txt','%d:EOB'); % returns double directly
  2. eob=cell2mat(textscan(fid,'%d:EOB')); % ditto but cast req'd to do so(*) plus fopen/fclose hoopla
  3. neob=diff(eob{:}); % pain to dereference needless cell array w/o 1 or 2
(*) Actually, may also need 'collectoutput',1 as well; I forget what textscan does by default for a single value: whether it's a cell of Nx1 or N cells of 1x1 (or if that even matters in dereferencing; I try to avoid cell arrays like the plague so always have to 'spearmint to remember the rulez).
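A small sketch of the dereferencing step, run on a text string instead of a file so it can be tried without blocks.txt. The three line numbers are made up to match the first three block sizes shown above, and it assumes textscan accepts the literal ':EOB' in the format the same way as in option 2.

```matlab
% Hypothetical content of blocks.txt, as "grep -n EOB" would write it
txt = sprintf('13:EOB\n25:EOB\n38:EOB\n');

C    = textscan(txt, '%d:EOB');   % textscan returns a cell, here 1x1
eob  = double(C{1});              % dereference the cell and convert int32 to double
neob = diff([0; eob]) - 2;        % data rows per block (minus header and EOB line)
```

Passing a file identifier from fopen instead of txt would read the same data from the file; passing the file name itself would make textscan parse the name as text.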
