Directory listing of extended ASCII in Windows

EDIT: This question raised some interesting issues but I don't consider it to be answered. Based on feedback from this question I have asked a similar question with a much more specific task, http://www.mathworks.com/matlabcentral/answers/86186-working-with-unicode-paths.
ORIGINAL: Hi all,
I have a filename with 'é' in it. Dir() doesn't work and reports this as two separate characters, 'e´'. I'm using Win 7. Is there a setting I can change in Matlab or Windows to get this to work right? If I use Java things seem to work fine:
my_java_dir = java.io.File(my_dir);
file_list = my_java_dir.listFiles();
I'd rather things "just work" instead of using Java.
Thoughts?
Thanks, Jim
EDIT: This is a summary of some of the comments:
The code I am running is:
temp = dir(my_path);
file_name = temp([#]).name
For a file on windows automatically generated using a proprietary program, the file name includes the following character, 'é'
In Matlab however, file_name contains the following chars instead: 'e´'
From what I can tell, using native Matlab functionality, it is not possible to read a non 7-bit ascii file on a mac:
EDIT: I did not realize this was going to be as difficult to actually accomplish (i.e. to answer properly) as it has turned out to be. The details of some of the tests I have run have become a bit lost in the comments although at this point they are not relevant to a solution. At this point I don't consider the problem to be solved but I don't even have a test framework for trying to solve this problem! When I get a chance I'll be uploading an example file for people to test. Thanks.

10 Comments

Jan on 25 Aug 2013
Please post the code you used. It is not clear how you provide the special character as a string, and "Dir() doesn't work" is also unclear. How does "Dir()" report the input? Do you mean Matlab's dir() command, or do you call DIR of the operating system through system() or dos()?
Jim Hokanson on 25 Aug 2013
Edited: Jim Hokanson on 27 Aug 2013
EDIT: Jan has shown that the test below is not sufficient to recreate the problem. I thought it was a sufficient test but it turned out it was an issue with some settings; I had feature('DefaultCharacterSet','UTF8') but if you change this to feature('DefaultCharacterSet','Windows-1252') the test below is not a problem.
The file already existed on disk, so all I was doing was:
temp = dir(my_path)
I've added some more details below:
One of the files contains the letter 'é' (on disk) but in Matlab the string is coming back as 'e´' (temp(#).name) which is 101 180 after converting to numbers
The file name is somewhat long, but shortened as an example, the file on the disk would be:
tést.txt
and in Matlab I get:
te´st.txt
Some more testing code to debug.
str = 'test.txt';
str(2) = char(233);
Note, Matlab shows str now has the value: tést.txt
fid = fopen(str,'w');
fwrite(fid, 'x'); % fwrite needs data to write; 'x' is a placeholder
fclose(fid);
On a mac, writing yields on disk:
t%E9st.txt
37 69 57 <= %E9
Reading this back into Matlab the filename is the same.
On one windows computer (win 7), writing yields on disk:
tÃ©st.txt
But surprisingly, the file is read in correctly in Matlab as:
tést.txt
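One possible explanation for the two characters appearing in place of 'é', offered as a sketch and not a confirmed diagnosis: if the name is encoded as UTF-8 bytes and those bytes are then interpreted as Windows-1252, a single 'é' becomes two characters, and reversing the same steps recovers the original. The snippet assumes that unicode2native and native2unicode accept the encoding names 'UTF-8' and 'windows-1252' on the platform in use.

```matlab
% Sketch: encode 'é' (code point 233) as UTF-8, then misread the bytes as Windows-1252.
e_acute    = char(233);
utf8_bytes = unicode2native(e_acute, 'UTF-8');            % bytes of the UTF-8 encoding
misread    = native2unicode(utf8_bytes, 'windows-1252');  % same bytes, wrong code page

disp(double(utf8_bytes))   % byte values of the UTF-8 encoding
disp(double(misread))      % code points of the misread text

% Reversing the two steps gets back to the original character,
% which would be consistent with the name reading back correctly.
round_trip = native2unicode(unicode2native(misread, 'windows-1252'), 'UTF-8');
disp(double(round_trip))
```

If this is what happens, the write and the read are wrong in the same way and cancel out, while a name created by another program would not benefit from that.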
On that same computer, a file with an 'é' in the file system (NTFS) is read in as:
e followed by a box, with character codes 101 65533
Note, this is different than the first windows computer I tried. Unfortunately I don't have access to that computer now, but the reading of the file in Matlab yielded:
e´
101 180 (in answer to Walter's question)
Thoughts?
Jim Hokanson on 25 Aug 2013
Also, all of this is done using 2013a (both win and mac)
Jan on 25 Aug 2013
Edited: Jan on 25 Aug 2013
See my answer.
Jim Hokanson on 26 Aug 2013
Clarification: "yields on disk" could alternatively be written as "as seen in the Finder or Explorer", although I believe the results are the same using the "current folder" viewer in Matlab. In other words, the Matlab folder viewer matches Win Explorer or Mac Finder, and not the results returned from the dir command.
Jim Hokanson on 26 Aug 2013
Just to do a bit more testing, on my mac, the dir command returns:
eÌ 101 204
UPDATE: I had changed my default character set to UTF8 using:
feature('DefaultCharacterSet','UTF8')
When I change it to Windows-1252, my test of 'tést.txt' works fine (as Jan points out in his answer). Unfortunately, reading the file using dir is still a problem. Using Windows-1252, I get in Matlab:
'e´'
Using dir, I mentioned that this shows up as 101 180 in Matlab when converting to a numeric.
Interestingly, if I use the Java approach, the length of my file in Java is 130. When I convert the Java string to a Matlab string, I get what looks in the command window to be the correct letter:
é
However, when converting to a numeric the character is now: 101 769
where somehow, these two numbers are being rendered as a single character
The string is still of length 130.
Ok, I think this might be the problem.
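For reference, code point 769 is U+0301, the Unicode combining acute accent, so 101 769 is the decomposed spelling of 'é', and 233 is the precomposed spelling of the same letter. A minimal sketch of converting between the two from Matlab, assuming java.text.Normalizer is available in the bundled JVM; whether a normalized name is then accepted by dir or exist is not something this sketch establishes, since the name stored on disk is unchanged.

```matlab
% Sketch: compare the decomposed and precomposed spellings of 'é'.
decomposed = char([101 769]);   % 'e' followed by U+0301 COMBINING ACUTE ACCENT

% Nested Java enums cannot be written with dot syntax in Matlab, hence javaMethod.
nfc = javaMethod('valueOf', 'java.text.Normalizer$Form', 'NFC');
nfd = javaMethod('valueOf', 'java.text.Normalizer$Form', 'NFD');

% NFC composes base letter + combining mark; NFD splits them apart again.
composed   = char(java.text.Normalizer.normalize(java.lang.String(decomposed), nfc));
back_again = char(java.text.Normalizer.normalize(java.lang.String(composed), nfd));

disp(double(composed))     % code points after composition
disp(double(back_again))   % code points after decomposition
```

Comparing names only after normalizing both sides to the same form is one way to tell whether two strings that look identical in the command window really are the same text.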
Unfortunately, although the Java displays correctly, it turns out that the path is not valid. I originally ran into the problem with:
d = dir;
file_path = fullfile(base_path,d(#).name)
exist(file_path,'file') %false
In which the result was that the file did not exist. In that case however I was using dir where you get really strange results, not the Java code, where it looks like the file exists, until you convert it to a Matlab character string and it doesn't, i.e.
file_list = my_java_dir.listFiles();
file_path = char(file_list(#));
exist(file_path,'file') %is false
Note: I tried the same thing on the mac using Java, and Matlab says the file does not exist.
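One way to see where the name gets lost, sketched under the assumption that java.io.File itself handles the name correctly: keep each entry as a Java object, ask it directly whether it exists, and compare that with exist on the converted Matlab string. The scratch folder and the plain ASCII file name below are hypothetical stand-ins for the real directory.

```matlab
% Sketch: compare java.io.File.exists() with Matlab's exist() for each entry.
my_dir = tempname;                         % hypothetical scratch folder for the demo
mkdir(my_dir);
fid = fopen(fullfile(my_dir, 'test.txt'), 'w');
fclose(fid);

my_java_dir = java.io.File(my_dir);
file_list   = my_java_dir.listFiles();     % array of java.io.File objects

n           = length(file_list);
java_sees   = false(1, n);
matlab_sees = false(1, n);
for k = 1:n
    java_sees(k)   = file_list(k).exists();                            % no char conversion
    matlab_sees(k) = exist(char(file_list(k).getPath()), 'file') == 2; % after conversion
end
disp([java_sees; matlab_sees])

rmdir(my_dir, 's');                        % remove the scratch folder
```

For a plain ASCII name both rows should agree. A column where only the Java row is true would point at the conversion to a Matlab string, or at the call behind exist, as the step that fails.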
Walter Roberson on 26 Aug 2013
After scanning a bit through the decomposition / recomposition document, my head hurts!
Jan on 27 Aug 2013
Edited: Jan on 27 Aug 2013
@Jim: This is an important question and equivalent problems will occur in the work of many users. The humorous-looking part of my replies is caused by frustration after struggling with Unicode for too long. But the problem is serious, and so is my suggestion to avoid non-ASCII.


Answers (2)

Walter Roberson on 24 Aug 2013

1 vote

What is the underlying file system type of the directory you are trying to work with? If it is not NTFS then you have a problem; see http://msdn.microsoft.com/en-us/library/windows/desktop/dd317748%28v=vs.85%29.aspx

3 Comments

Jim Hokanson on 25 Aug 2013
I checked Disk Management and it is NTFS.
Walter Roberson on 25 Aug 2013
Could you show the result of adding (numeric) 0 to the name string ? (That will show the decimal equivalent of each character in the string). I'm thinking that possibly there is a "compose" or "dead byte", which is one of the ways of representing accented characters.
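A minimal sketch of that check, using a made-up name in place of the one returned by dir:

```matlab
% Sketch: show the numeric code of every character in a file name.
name  = ['t', char(233), 'st.txt'];   % made-up example name containing 'é'
codes = name + 0;                     % same values as double(name)
disp(codes)

% Positions of characters outside 7-bit ASCII, which are the ones worth inspecting.
non_ascii = find(codes > 127);
disp(non_ascii)
```

A precomposed 'é' shows up as a single value above 127, while a base letter followed by a separate accent shows up as two values.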
Jim Hokanson on 25 Aug 2013
Dead bytes, yikes! I like +0, easier to type than double(str). I've added some clarifications in response to Jan's question, see above. Thanks.


Jan on 25 Aug 2013
Edited: Jan on 25 Aug 2013
This sounds totally cruel. I've also struggled with UTF16 and UTF8 conversions for file access.
When I run this on my Win7/64 PC/local NTFS disk/Language = 'en_us.windows-1252' I get the expected correct results:
str = ['t', 233, 'st.txt'];
fid = fopen(str,'w');
fclose(fid);
a = dir('t*.txt'); % other patterns do not change the answer
double(a.name)
>> 116, 233, 115, 116, 46, 116, 120, 116
This is displayed in the Windows Explorer correctly also. But the DOS command DIR fails of course:
!dir t*st.txt
>> 25.08.13 23:20 8 tst.txt
It matters what "yields on disk" exactly means. How did you test this?

5 Comments

Jim Hokanson on 26 Aug 2013
Well I'm glad it works for someone! Can you clarify where you are seeing en_us.windows-1252? For me, all I see when I follow the instructions in the link you provided is the "Language" being set to "English" in the control panel.
But that link seems to reference Java, yet it seems like Java is working fine (as evidenced by the current folder viewer and my Java code working). Instead the problem seems to be with some native library interface (perhaps c-based or some system call) that is not working.
To partially answer my question, as a link in your link points out, this won't work on a mac, the characters used must be 7-bit ascii :/
Jan on 26 Aug 2013
Edited: Jan on 26 Aug 2013
The active language in Matlab can be obtained by:
get(0, 'Language');
Jim, I feel for you. The chaos of UTF8, UTF16, and 2- and 4-byte wchar is a disaster. The manufacturers have not even been able to agree on the 7-bit ASCII "standard", and Matlab users still suffer from \n versus \r\n line breaks when accessing files in text format. They also didn't take the chance to settle on UTF8 or any other single standard, so a huge library is required to reliably provide an mxChar string to _wfopen in a C program under Win, Mac and Linux. See Answers: Matlab string to wchar
So my advice is straight: Do not use special characters in file names.
Cedric on 26 Aug 2013
Well, it's difficult to force these darn French speakers not to use accents, which are everywhere on their keyboards ;-)
Bonne journée à tous !
Cédric
Jan on 26 Aug 2013
Edited: Jan on 26 Aug 2013
I have some dull English keyboards in my storage place. There is even one with a missing [shift]-key and if somebody wants to appear cool, I can even remove the vowels from a Swedish keyboard.
I had severe trouble reconstructing a backup under Windows, because the paths exceeded the magic 260 character length due to deeply nested folders with names like "Muskelzelle, 5 Proz. Kochsalzlösung, 2-60 Stunden Einwirkzeit, 60-fach vergrößert, Ethidium bromid, ausgewertet, ok". And here the trouble was not caused by the special characters.
30 years after MS-DOS, many important Windows API functions, such as deleting to the trash or showing a folder in Windows Explorer, are still limited to paths of only 260 characters. This is so cruel and unprofessional that I cannot understand why users discuss tiles and the missing start button of Win8.0. Some API functions accept long file names when the ridiculous "\\?\" is added in front of the name, so MS did recognize the need for this feature already. But long names are far from working reliably.
So my impression is that the NTFS file system, with its UTF16 strings and the possibility for long names, is mature and stable, but the Windows functions for accessing this format are still in their infancy, and the problems are so childish that I'd call them "bugs".
Maybe MS decided purposely to impede French, Chinese and speakers of Tagalog to increase its profit in a strange and obscure way. And while the French and the Chinese have developed Linux (with help from some Finnish), the Filipinos have written MacOS-X with the strange idea to use neither 2 byte nor 4 byte wchar's.
Using special characters in file names, especially when different operating systems access the files, is a bad idea, obviously. Do not let the childish OS ninjas involve you in their sandbox battle. 7-bit ASCII looks even good when written to durable pottery.
But seriously, unicode is nice and the way to go in the future. But currently it is neither supported reliably by the operating systems nor by Matlab. Problems like the destroyed accents will occur and can be expected. Therefore it is still a good idea to keep file names short and simple, while the interesting details in French should be hidden inside the data of the file.
[EDITED] Sorry, not the Chinese have participated in the development of Linux, but the Japanese decided to remove the \r from the line breaks for obvious reasons.
Jim Hokanson on 27 Aug 2013
Jan, I agree, don't use special characters in file names. I tend not to, but this particular example came from some file "in the wild." It would be nice to have a well documented set of rules of what can be done and what can't with respect to Unicode. For example, Matlab's usage of a 16-bit character means it is impossible to accurately handle UTF-8 data streams which are only well mapped to UTF32 (4 byte character) data. Like many things, I think the first step is probably well documented (centrally, i.e. by TMW) usage modes and failure points.
Cédric, the problem actually comes from a Hungarian name, Georg Von Békésy, so it's the Hungarians that are giving me problems, not the French :)


Asked: 24 Aug 2013
