How do I parse this complex text file with textscan?
I have a text file that is in a rather funky format. The file comes out of a relational database (Antelope) and consists of earthquake location, dates, times, phase information, etc. I need to parse out and collect the 'data blocks' that are in between each header line. I need the header lines as well for each "block". I have edited the file to include an EOB (end of block) marker to make this task easier, but it's not as trivial as I thought. Here's an image of the first 68 or so lines (out of about 1 million).

I'd like to pull the 4 columns below each header... for example, the first section is:
2015 1 22 0 8 58.537 45.97929 -129.98717 1.184 0.0 1.039 3.621 0.036 1
AXCC1 0.843 1.00 P
AXAS2 1.263 1.00 P
AXEC1 0.923 1.00 P
AXEC2 1.103 1.00 P
AXEC3 1.088 1.00 P
AXCC1 1.873 0.25 S
AXAS1 2.728 0.06 S
AXAS2 2.168 0.25 S
AXEC1 1.708 0.33 S
AXEC2 2.043 0.25 S
AXEC3 2.113 0.25 S
and put those in an array. But I need to be able to associate the header line, specifically the last integer in the header line (1 in this case), with each code block.
So far my code looks like this, but it is obviously not working yet: I don't get any errors, but it misses and skips data.
fid = fopen('ph2dt_catalog8_edit.dat');
Block = 1;
while ~feof(fid)
    InputText = textscan(fid, '%s', 1, 'delimiter', '\n');
    HeaderLines{Block,1} = InputText{1};
    disp(HeaderLines{Block});
    FormatString = '%s%f%f%s';
    InputText = textscan(fid, FormatString, 'delimiter', 'WhiteSpace', 'CollectOutput', 1);
    Data{Block,1} = cell2mat(InputText{2});
    [NumRows, NumCols] = size(Data{Block});
    eob = textscan(fid, '%s', 1, 'delimiter', '\n');
    Block = Block + 1;
end
Can anyone offer any suggestions? Let me know if I need to clarify anything further.
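[Editor's note] One likely issue in the snippet above: the pair 'delimiter','WhiteSpace' makes textscan treat each literal character of the word "WhiteSpace" as a delimiter. Since textscan already splits on whitespace by default, the option can simply be dropped. A minimal, self-contained illustration on a single station line (the sample values come from the question):

```matlab
% Sketch only: textscan splits on whitespace by default, so no
% 'Delimiter' option is needed for the 4-column station lines.
line = 'AXCC1 0.843 1.00 P';                         % one data line
C = textscan(line, '%s%f%f%s', 'CollectOutput', 1);
% C{1} holds the station ID, C{2} the two numeric columns, C{3} the phase
```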
14 comments
dpb
on 16 Nov 2016
Is there any way when generating the file to get the number of records in the block? If that could be added into either the existing header line as another variable or as a separate record for each block when generating the report it would make parsing "on the fly" much simpler; you could have a specific count for each block.
Will wait to find out if this is possible before thinking much about other options...
Walter Roberson
on 16 Nov 2016
It appears that you start with a header to a block (and that header appears to start with a year). Now, within each block after the header, are there always the same number of reporting station entries for any one file? Is that number fixed (same for every file) ?
dpb
on 17 Nov 2016
It's not in the sample, Walter. There are 11,10,11,12,10,... per block which is why I asked if he could get the database query to also return that value...
dpb
on 17 Nov 2016
Screen shots "not so much". A short section of the file attached would let others have a shot...
If you can get that additional info then writing a loop for each block and simply parsing that number to use in the textscan format repeat count should make parsing a "piece-o-cake".
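[Editor's note] A sketch of that idea, assuming a per-block record count is available: with the count in hand, textscan's repeat-count argument pulls exactly one block per call. The block text is inlined here for illustration:

```matlab
% Hypothetical: n = number of station lines in one block, obtained from
% the database query as suggested above.
blockText = sprintf(['AXCC1 0.843 1.00 P\n', ...
                     'AXAS2 1.263 1.00 P\n', ...
                     'AXEC1 0.923 1.00 P\n']);
n = 3;                                               % count for this block
C = textscan(blockText, '%s%f%f%s', n, 'CollectOutput', 1);
% C{1} -> station IDs, C{2} -> n-by-2 [traveltime weight], C{3} -> phases
```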
per isakson
on 17 Nov 2016
How many different stations are represented in the entire file?
dpb
on 17 Nov 2016
You mention there are about 1M records; the records are pretty short, so how big is the file (in MB)? It's possible you could scarf it up as a character image and search for the EOB markers to compute the block sizes. I'm guessing Per is thinking along the lines of regular expressions looking for the groups of lines containing one of the station IDs, if that population is known (or knowable) a priori...
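[Editor's note] A sketch of the character-image idea, under the assumption that the file fits in memory (the filename is taken from the question):

```matlab
% Hypothetical: slurp the whole file and locate the EOB markers first,
% so each block's size is known before any textscan call.
txt = fileread('ph2dt_catalog8_edit.dat');
lines = strsplit(txt, '\n');                 % one cell per line
eobIdx = find(strncmp(lines, 'EOB', 3));     % line numbers of the markers
nPerBlock = diff([0, eobIdx]) - 2;           % data lines per block: the gap
                                             % minus header and EOB lines
```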
psprinks
on 17 Nov 2016
dpb
on 18 Nov 2016
No joy on being able to generate the block sizes from the database, I gather? That'd be the cat's meow, methinks... fgetl will be slow.
psprinks
on 18 Nov 2016
I do not think that fgetl is the bottleneck here, but rather the iterative growing of the output. I expect that textscan is slower than fgetl plus a specific parsing of the lines, because textscan is so much "smarter". Smartness costs time.
"strcmp(tline(1),'#')" ??? I do not see a "#" in the posted test file. Please post a real data file and explain which output you want.
psprinks
on 18 Nov 2016
per isakson
on 19 Nov 2016
Edited: per isakson on 19 Nov 2016
"I'd like to pull the 4 columns below each header": your script doesn't extract the third column. And what is the intent of MATDAY_ARV=datenum(...)?
Accepted Answer
More Answers (2)
fscanf might be easier than textscan:
[EDITED: bugs removed]
function [Data, HeaderLines] = asd(FileName)
fid = fopen(FileName, 'r');
if fid == -1
    error('Cannot open file: %s', FileName);
end
maxBlocks = 10000;       % Is this sufficient? Better too large.
HeaderLines = cell(1, maxBlocks);
Data = cell(1, maxBlocks);
iBlock = 0;
aBlock = cell(1, 20);    % Or largest number of lines per block
while ~feof(fid)
    iBlock = iBlock + 1;
    Line = fgetl(fid);
    if ~ischar(Line)
        break;
    end
    HeaderLines{iBlock} = Line;
    isEOB = false;
    iData = 0;
    while ~isEOB && ~feof(fid)
        Line = fgetl(fid);
        if ~ischar(Line) || strncmp(Line, 'EOB', 3)
            isEOB = true;
        else
            iData = iData + 1;
            len = length(Line);
            [s1, num, err, ind1] = sscanf(Line, '%s', 1);
            [f, num, err, ind2] = sscanf(Line(ind1:len), '%f', 2);
            s2 = sscanf(Line(ind1+ind2:len), '%s');
            aBlock{iData} = {s1, f(1), f(2), s2};
            % Parse = textscan(Line, ' %s %f %f %s');
            % aBlock{iData} = {Parse{1}{1}, Parse{2:3}, Parse{4}{1}};
        end
    end
    Data{iBlock} = aBlock(1:iData);   % Crop the data block
end
fclose(fid);
Data = Data(1:iBlock);
HeaderLines = strtrim(HeaderLines(1:iBlock));
end
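[Editor's note] A hypothetical call of the function above (the filename is an assumption), showing how the header line keeps the block's integer ID the question asks about:

```matlab
% Sketch only: read the file and pick apart the first block.
[Data, HeaderLines] = asd('ph2dt_catalog8_edit.dat');
hdr = sscanf(HeaderLines{1}, '%f');   % numeric header fields;
                                      % hdr(end) is the block's integer ID
firstLine = Data{1}{1};               % {station, time, weight, phase}
station = firstLine{1};               % the station ID of the first line
```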
6 comments
@dpb: :-) While typing the code I decided that reading the single lines with 4 sscanf calls looks uglier than running textscan on the string.
Now that I can try to run the code (I had no Matlab available while typing the answer), I see that my textscan command fails completely. I still hate this command. See [EDITED]. I admit that the hard-coded sscanf lines are not beautiful, but at least they are faster than textscan.
fgetl is not super slow. The bottleneck is the disk access, or the iterative growing of the output if the initial guess was too low. Is the idea of a slow fgetl based on the output of the profiler? Otherwise I claim it is a rumor only. Unfortunately we cannot test this using the tiny test file, because it is held in the disk cache after the first reading, so that repeated trials do not show realistic timings.
dpb
on 18 Nov 2016
fgetl isn't per se, no, but in general record-by-record reading is. If TMW or the OS has now implemented buffering that's effective, that'll certainly help, yes.
It's historically so, perhaps the technology has now advanced to make it less so; I've not tried to test recently, granted.
textscan has a lot not to like, granted; but it also has a lot of flexibility. It likely isn't going to compete with fscanf on speed, either.
psprinks
on 18 Nov 2016
FishermanJack
on 9 Nov 2017
@Jan Simon: I am pretty new to Matlab. I have a similar problem and it seems that your code should work; could you comment the lines for easier understanding? Thanks.
Jan
on 9 Nov 2017
@FishermanJack: This would be very inefficient. I could spend hours mentioning all the details I know about these code lines. Most of the commands are trivial, and I cannot guess which commands are not clear to you. So better use the debugger to step through the code line by line, see what happens in which order, and read the documentation of the commands which are not clear. If any details are not clear afterwards, ask a specific question.
OK, for your file I used a grep utility first to find the EOB markers and then computed the numbers for each group...within Matlab it looked like--
>> cmd='grep -n EOB community_edit.txt >blocks.txt';
>> system(cmd);
>> eob=textread('blocks.txt','%d:EOB');
>> neob=diff([0;eob])-2;
>> neob(1:10)' % see if looks ok...
ans =
    11    10    11    12    10    12    13    12     8     8
That agrees with the number I get counting in editor.
Now, with that, read the first header and block, then repeat for the remaining 2:length(neob) blocks, each time skipping the EOB marker line (which doesn't precede the first group).
fmt1=repmat('%f',1,14); % header line
fmt2='%s%f%f%s'; % block data
fid=fopen(...
nblk=length(neob);
hdrs=zeros(nblk,14); % room for the headers
blks=cell(nblk,1); % room for the blocks
hdrs(1,:)=cell2mat(textscan(fid,fmt1,1,'collectoutput',1));
blks{1}=textscan(fid,fmt2,neob(1),'collectoutput',1);
for i=2:nblk
  hdrs(i,:)=cell2mat(textscan(fid,fmt1,1,'headerlines',1,'collectoutput',1));
  blks{i}=textscan(fid,fmt2,neob(i),'collectoutput',1);
end
fid=fclose(fid);
Should be quite a bit quicker than reading record-by-record with fgetl.
2 comments
psprinks
on 18 Nov 2016
Oh, yeah, I forgot that when using the "approved" textscan over the deprecated textread (which I use for simple cases), you need to wrap the RHS in cell2mat to convert the cell to a double array. Or, of course, you can use {:} to dereference the cell. But, my solutions in preferred order are--
- eob=textread('blocks.txt','%d:EOB'); % returns double directly
- eob=cell2mat(textscan(fid,'%d:EOB')); % ditto but cast req'd to do so(*) plus fopen/fclose hoopla
- neob=diff(eob{:}); % pain to dereference needless cell array w/o 1 or 2
(*) Actually, you may also need 'collectoutput',1 as well; I forget what textscan does by default for a single value, whether it's a cell of Nx1 or N 1x1 cells (or if that even matters in dereferencing; I try to avoid cell arrays like the plague so always have to 'spearmint to remember the rulez).
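[Editor's note] For the second option, a sketch of what textscan hands back for a single %d field with literal text in the format (the filename is assumed from the grep command above):

```matlab
% Hypothetical: textscan returns a 1-by-1 cell even for one field,
% hence the cell2mat (or C{1}) dereference dpb mentions.
fid = fopen('blocks.txt');             % assumed grep -n output file
C = textscan(fid, '%d:EOB');           % 1x1 cell holding an int32 column
fclose(fid);
eob = double(C{1});                    % dereference, cast to double
neob = diff([0; eob]) - 2;             % station lines per block
```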