Check that *.txt file is really a TXT formatted file?

Marco on 19 May 2015
Commented: Walter Roberson on 23 Aug 2016
Hello!
How can I detect whether the content of a *.txt file is really text before processing that file with my data import parser? I search folders for all files with the file extension TXT in order to work with the data stored in each of them. In principle this works without problems. But sometimes a file has wrongly been saved under a *.TXT name although its content is not text but some binary format (e.g. it should have been named *.XLS).

Accepted Answer

Guillaume on 19 May 2015
It all depends on what you call a text file.
If it's an ASCII file, then the code values of the characters are limited to 0-127, so you could test whether any byte has a value > 127. The presence of code values in the range 0-31, with the exception of 9 (tab), 10 (line feed) and 13 (carriage return), would also be a strong indication that the content is not meant to be read as text. It's not a guarantee, though.
If it's an extended ASCII file, then the whole range 0-255 is used. Other than semantics, there's nothing distinguishing a text file from a binary file. Again, characters in the range [0-8, 11-12, 14-31] would be an indication.
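A minimal sketch of that control-code heuristic, not a definitive test. The 4 KiB inspection window, the helper name looksLikeText and the two temporary demo files are assumptions made for illustration only:

```matlab
% Heuristic: treat a file as "probably text" if its first bytes contain no
% control codes other than tab (9), line feed (10) and carriage return (13).
looksLikeText = @(bytes) ~any(bytes < 9 | bytes == 11 | bytes == 12 | ...
                              (bytes > 13 & bytes < 32));

% Demo data: two temporary files, both named *.txt (hypothetical content).
txtFile = [tempname '.txt'];
binFile = [tempname '.txt'];
fid = fopen(txtFile, 'w'); fwrite(fid, uint8(sprintf('x\ty\r\n1\t2\r\n'))); fclose(fid);
fid = fopen(binFile, 'w'); fwrite(fid, uint8([208 207 17 224 0 0 1 2]));   fclose(fid);

files  = {txtFile, binFile};
isText = false(size(files));
for k = 1:numel(files)
    fid   = fopen(files{k}, 'r');
    bytes = fread(fid, 4096, '*uint8')';   % inspect at most the first 4 KiB
    fclose(fid);
    isText(k) = looksLikeText(bytes);
end
disp(isText)

delete(txtFile); delete(binFile);
```

For strict ASCII you could additionally require all(bytes < 128). An empty file passes this check, so you may want to handle that case separately.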
If it's a UTF-8 file, some byte sequences are not allowed and you could try to detect them. Again, control codes in [0-8, 11-12, 14-31] are an indication that it's not meant to be text.
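A rough sketch of detecting disallowed UTF-8 sequences. It only checks that each lead byte is followed by the right number of continuation bytes (128-191). It is a simplification that does not reject every overlong form or surrogate code point, and the sample byte vectors are made up for illustration:

```matlab
samples = {uint8([72 105]), ...             % 'Hi' (plain ASCII)
           uint8([99 97 102 195 169]), ...  % 'café', with é encoded as C3 A9
           uint8([195 40]), ...             % lead byte followed by a non-continuation byte
           uint8([255 254 0 1])};           % FF and FE never occur in UTF-8

isValid = true(size(samples));
for s = 1:numel(samples)
    b = double(samples{s});
    k = 1;
    while k <= numel(b)
        if b(k) < 128
            n = 0;                          % single-byte code point
        elseif b(k) >= 194 && b(k) <= 223
            n = 1;                          % lead byte of a 2-byte sequence
        elseif b(k) >= 224 && b(k) <= 239
            n = 2;                          % lead byte of a 3-byte sequence
        elseif b(k) >= 240 && b(k) <= 244
            n = 3;                          % lead byte of a 4-byte sequence
        else
            isValid(s) = false; break;      % stray continuation or illegal lead byte
        end
        tail = b(k+1 : min(k+n, numel(b)));
        if numel(tail) < n || any(tail < 128 | tail > 191)
            isValid(s) = false; break;      % truncated or malformed sequence
        end
        k = k + n + 1;
    end
end
disp(isValid)
```

Note that any pure ASCII file is also valid UTF-8, so this check only helps to rule files out.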
Perhaps, instead of trying to discriminate text files against binary, what you should be discriminating is files conforming to the format your code expects and those that don't?
  5 Comments
Guillaume on 19 May 2015
Edited: Guillaume on 19 May 2015
@Walter,
Can MATLAB decode UTF-16? It's certainly not listed as an option for the encoding of fopen.
Also,
filestart = char(fread(fid, numel(expectedstart)))';
%or
filestart = char(fread(fid, [1 numel(expectedstart)]));
%or
filestart = fread(fid, [1 numel(expectedstart)], '*char');
would be more akin to fscanf. But fread only works if the characters are ASCII (or, more precisely, use just one byte per code point).
UTF-8 is identical to plain bytes for code points < 128. Anything above that uses more than one byte per character.
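As a small illustration of the one-byte versus multi-byte point, the following sketch encodes two short strings with unicode2native and decodes the result again with native2unicode. The sample strings are arbitrary:

```matlab
% Code points below 128 map to exactly one byte in UTF-8 ...
asciiBytes = unicode2native('abc', 'UTF-8');
% ... while char(233), i.e. 'é', needs two bytes.
accentedBytes = unicode2native(char(233), 'UTF-8');

disp(asciiBytes)
disp(accentedBytes)

% Decoding the raw bytes with the right encoding recovers the single character.
decoded = native2unicode(accentedBytes, 'UTF-8');
disp(numel(decoded))
```

So reading a fixed number of raw bytes is not the same as reading that number of characters once non-ASCII text is involved.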
Walter Roberson on 23 Aug 2016
Yes, MATLAB can decode UTF-16, both little endian and big endian. It can also decode UTF-32, little endian and big endian. For any of these, MATLAB will issue a warning when you fopen() the file, saying that the encoding is not supported, but what that really means is that MATLAB does not support writing files in those formats.
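If such files may occur, one possible way to recognise them is to look for a byte order mark (BOM) at the start of the file. This is only a sketch: files are not required to carry a BOM, and the helper name hasPrefix and the demo file are assumptions for illustration:

```matlab
% Demo file: UTF-16LE BOM (FF FE) followed by 'Hi' in UTF-16LE.
f = [tempname '.txt'];
fid = fopen(f, 'w'); fwrite(fid, uint8([255 254 72 0 105 0])); fclose(fid);

% Read at most the first four bytes.
fid = fopen(f, 'r'); head = fread(fid, [1 4], '*uint8'); fclose(fid);

hasPrefix = @(sig) numel(head) >= numel(sig) && ...
                   isequal(head(1:numel(sig)), uint8(sig));

% The UTF-32LE BOM starts with the UTF-16LE BOM, so test the longer one first.
if hasPrefix([255 254 0 0])
    enc = 'UTF-32LE';
elseif hasPrefix([0 0 254 255])
    enc = 'UTF-32BE';
elseif hasPrefix([239 187 191])
    enc = 'UTF-8';
elseif hasPrefix([255 254])
    enc = 'UTF-16LE';
elseif hasPrefix([254 255])
    enc = 'UTF-16BE';
else
    enc = '';                % no BOM found
end
disp(enc)

delete(f);
```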


More Answers (1)

Stephen23 on 19 May 2015
Edited: Stephen23 on 19 May 2015
It is important to note that files themselves have no semantic meaning: they are merely lots of bits that can be interpreted in a particular way, given a known encoding. To answer your question you really need to answer this question: What exactly is a text file?
Here are two methods that you could try:
  • Read the file data, and check that all of the "characters" are within the expected character range (e.g. alphanumeric, punctuation, spaces, etc). This would work best when the data is of a limited kind (e.g. numeric data) and uses only a small character set (e.g. ASCII). This is also dependent on character encoding/format, and several other factors so it is very fragile in practice.
  • Read the first few bytes and check if it matches any known file signature. This is also fragile in practice, as it would miss formats not covered by the list of signatures.
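The second method could be sketched as follows. The signature table is deliberately tiny and is an illustrative selection, not a complete list; the demo file imitates a legacy Excel file that was saved under a .txt name:

```matlab
% A few well-known file signatures ("magic numbers"), as byte values.
signatures = { ...
    'OLE2 (legacy .xls/.doc)',  [208 207 17 224 161 177 26 225]; ...
    'ZIP (.xlsx, .docx, .zip)', [80 75 3 4]; ...
    'PDF',                      [37 80 68 70]; ...
    'PNG',                      [137 80 78 71 13 10 26 10]};

% Demo file: OLE2 header bytes stored in a file named *.txt.
f = [tempname '.txt'];
fid = fopen(f, 'w');
fwrite(fid, uint8([208 207 17 224 161 177 26 225 0 0 0 0]));
fclose(fid);

% Read only the first 8 bytes, never the whole file.
fid = fopen(f, 'r'); head = fread(fid, [1 8], '*uint8'); fclose(fid);

detected = '';                              % empty means "no known signature"
for k = 1:size(signatures, 1)
    sig = uint8(signatures{k, 2});
    if numel(head) >= numel(sig) && isequal(head(1:numel(sig)), sig)
        detected = signatures{k, 1};
        break
    end
end
disp(detected)

delete(f);
```

An empty result only means that none of the listed signatures matched, which is exactly the fragility mentioned above.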
  3 Comments
Stephen23 on 19 May 2015
Edited: Stephen23 on 21 May 2015
It won't crash, but don't use fgetl: it reads up to the next newline character, and in a binary file there may be no byte sequence that looks like a newline. That simple "line" could then end up being 5 GiB of random data... or however big the file might be.
A better solution would be to use fscanf, as Guillaume explained, reading just the number of bytes that you need to identify the file. You can find more useful file reading functions here:
And because you already know the first characters, you can simply check that these are what the file contains.
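A minimal sketch of that check, using fread as in Guillaume's comment. The header string in expectedstart and the demo file content are hypothetical, so substitute whatever your own files are supposed to begin with:

```matlab
expectedstart = '# MyLogger v1';          % hypothetical header of a valid data file

% Demo file that does start with the expected header.
f = [tempname '.txt'];
fid = fopen(f, 'w'); fprintf(fid, '# MyLogger v1\n1 2 3\n'); fclose(fid);

% Read exactly as many bytes as the expected header has characters
% (this assumes a single-byte encoding such as ASCII).
fid = fopen(f, 'r');
filestart = fread(fid, [1 numel(expectedstart)], '*char');
fclose(fid);

% strcmp also returns false if the file is shorter than the header.
isExpected = strcmp(filestart, expectedstart);
disp(isExpected)

delete(f);
```

Because only a handful of bytes are read, this stays fast even if the file turns out to be a multi-gigabyte binary.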
Marco on 19 May 2015
Thanks a lot, really helpful! As I could only accept one answer, I at least gave you my vote.
