- tabs
- newlines
- a null character (char(0))
Textscan encountering unwanted character. How do I kill that line and move on without killing the script??
Hello:
I'm processing a large temporal dataset (data recorded every minute, with 60+ columns). Right now I'm using textscan() to parse it. Maybe 5 times in one file, an upside-down question mark (¿) appears within the data. This kills my script because textscan expects a float. I'd like to avoid that by skipping the column where the character is found, along with the remaining columns on that line, and treating them all as empty. I've attached a few minutes of the data that include good lines and one line with the ¿. Here's a bit of ugly code, run inside a loop, that reads each file:
filename = fileList(i).name;
delimiter = ' ';
formatSpec = '%*s%*s%*s%*s%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%[^\n\r]';
fileID = fopen(filename,'r');
dataArray = textscan(fileID, formatSpec, 'Delimiter', delimiter, ...
'TextType', 'string', 'ReturnOnError', false,'EmptyValue',-Inf);
fclose(fileID);
I know it's probably not the most efficient way to do it, but it's what I've got now. I've looked a bit into regular expression replacement, but I could never get that to work. Any advice is appreciated.
4 Comments
dpb
on 28 Mar 2019
How big is the actual file?
With today's memory, I'd be tempted to just load the whole thing in memory and clean up the offending lines, then process.
Or, since it's surprisingly fast, just write a quick filter that drops any line it finds with the bum character, or run a standalone grep utility first:
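magicchar=char(N); % N is the code of the offending character, whatever it is
fidi=fopen('yourfile.txt','r'); % input file
fido=fopen('newfile.txt','w'); % filtered output file
while ~feof(fidi)
l=fgets(fidi); % read one line, keeping its line terminator
if contains(l,magicchar),continue,end % skip lines holding the bad character
fprintf(fido,'%s',l); % copy clean lines through unchanged
end
fclose(fidi);
fclose(fido);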
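The load-everything-in-memory alternative mentioned at the top of this comment might look roughly like the sketch below. It is only an illustration under assumptions: `badChar` is set to char(191) as a stand-in for the ¿ described in the question (the real file may hold a different character), and the sample text `S` is made up here; in practice `S` would come from `fileread(filename)`.

```matlab
% Sketch: drop every line that contains the offending character, in memory.
% badChar is an ASSUMPTION; replace it with the character actually in the file.
badChar = char(191);

% Made-up sample text; in practice use S = fileread(filename);
S = ['1 2 3' newline '4 ' badChar ' 6' newline '7 8 9' newline];

lines  = regexp(S, '\r?\n', 'split');      % cell array, one entry per line
keep   = ~contains(lines, badChar);        % true for lines without badChar
cleanS = strjoin(lines(keep), newline);    % reassemble the surviving lines

% textscan accepts a character vector in place of a file identifier.
C = textscan(cleanS, '%f%f%f');
```

This avoids writing a temporary file, at the cost of holding the file contents in memory twice for a moment. Unlike the -inf substitution in the accepted answer, it removes the bad rows entirely, so any time axis built from row counts would shift.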
Accepted Answer
Walter Roberson
on 28 Mar 2019
Edited: Walter Roberson on 28 Mar 2019
filename = fileList(i).name;
delimiter = ' ';
formatSpec = '%*s%*s%*s%*s%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%f%[^\n\r]';
ncol = 72;
filler = repmat('-inf ', 1, ncol);
S = fileread(filename);
newS = regexprep(S, '\S*\x00.*?$', filler, 'lineanchors');
dataArray = textscan(newS, formatSpec, 'Delimiter', delimiter, 'TextType', 'string');
You already have a %[^\n\r] to eat to the end of the line. Typically that will get "0 0 " in it (that is, if you were trying to read all the numeric columns as numeric, then your count was off by 2). I take advantage of that by detecting the bad characters and substituting a full line's worth of the -inf pattern, knowing that the -inf values will be consumed by the %f formats and that any left-over -inf will be eaten by the %[^\n\r] pattern. For such a line, the corresponding dataArray{end} entry will hold a number of "-inf" occurrences. I figure that if the 0 0 were significant for something, you would have read it with %f%f.
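To see what the substitution does, here is a scaled-down sketch of the same regexprep call on made-up data. The four-column layout, the sample rows, and the shortened format string are assumptions for illustration; only the pattern, the filler construction, and the 'lineanchors' option are taken from the code above.

```matlab
% Toy version of the substitution: 4 numeric columns plus a trailing "0 0".
ncol   = 4;                                % toy column count (the answer uses 72)
filler = repmat('-inf ', 1, ncol);         % one full row's worth of -inf

% Made-up sample; the second row carries a null character after the 6.
S = ['1 2 3 4 0 0' newline ...
     '5 6' char(0) ' 7 8 0 0' newline ...
     '9 10 11 12 0 0' newline];

% Replace from the token holding \x00 through the end of that line only.
newS = regexprep(S, '\S*\x00.*?$', filler, 'lineanchors');

% Shortened analogue of the original format: 4 floats, then the rest of the line.
C = textscan(newS, '%f%f%f%f%[^\n\r]', 'Delimiter', ' ');
disp(newS)
```

The second row becomes `5 -inf -inf -inf -inf `: the good value before the bad token survives, and the filler supplies more -inf entries than there are remaining %f fields, so the surplus falls through to the trailing %[^\n\r] field.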
10 Comments
Walter Roberson
on 29 Mar 2019
Ah, I see it now: the 13.22.282. Unfortunately, textscan is happy to treat that as 13.22 0.282 without noticing anything wrong. So yes, a fair bit would have to be known about the correct representation of numbers on the system. For example, it helps to know for sure that it always puts a leading 0 on valid fractional values < 1; some systems would instead leave out the leading 0 and go directly to the period ('0.282' versus '.282').
Are the numbers certain to have 3 decimal places? And is it certain that a positive number will always have a single space after the comma but a negative value will have no space after the comma?
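If the answers to those questions turned out to be yes, malformed tokens such as 13.22.282 could be flagged before parsing. The sketch below is hypothetical: the pattern `strictNum` assumes an optional minus sign, at least one leading digit, and exactly three decimal places, none of which has been confirmed for this dataset, and the token list is made up.

```matlab
% Sketch: flag tokens that do not match a strict, ASSUMED number format.
% Assumed format: optional '-', one or more digits, '.', exactly 3 decimals.
strictNum = '^-?\d+\.\d{3}$';

% Made-up tokens, including the doubled-period case from this thread.
tokens = {'13.220', '0.282', '13.22.282', '.282', '-4.100'};

% regexp returns '' for tokens that fail to match the pattern.
matched = regexp(tokens, strictNum, 'match', 'once');
isValid = ~cellfun(@isempty, matched);

badTokens = tokens(~isValid);              % candidates for replacement or review
```

With these sample tokens, `13.22.282` and `.282` are the ones flagged. A line containing any flagged token could then be handled the same way as a line containing the null character.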