Mixed character Tab delimited csv file read in

Hi I have a CSV file which has two tab seperated columns (according to excel).
I'm looking for a way to read this in. The problem is that there are a number of 'sections' within the file that follow on from one another, each with their varying number of header lines (please see screenshots).
I'd like to just read in the numeric data and be able to seperate these into columns as at the moment because of the text portions I'm unable to successully use importdata/readtable etc functions. I have also tried [ndata, text, alldata] = xlsread(filename); but the numeric data is read in as text as it is delimited.

2 commentaires

Stephen23
Stephen23 le 18 Mai 2021
@radha soni: please upload a sample file by clicking the paperclip button.
radha soni
radha soni le 19 Mai 2021
Sorry i've inserted the full file. thank you

Connectez-vous pour commenter.

 Réponse acceptée

fm1 = '#%s%[^\n\r]';
op1 = {'CollectOutput',true, 'Delimiter',':'};
fm2 = '"%f%f"';
op2 = {'CollectOutput',true};
hdr = {};
dat = {};
[fid,msg] = fopen('Lunghypo.csv','rt');
assert(fid>=3,msg)
while ~feof(fid)
hdr(end+1) = textscan(fid,fm1,op1{:});
dat(end+1) = textscan(fid,fm2,op2{:});
end
fclose(fid);
Checking:
hdr{1}
ans = 6×2 cell array
{'PatientName' } {'LungHypo005' } {'PatientId' } {'LungHypo005' } {'Dosename' } {'Plan dose: 30Gy in 5 (CT 1)'} {'RoiName' } {'External' } {'Roi volume fraction outside grid'} {'0%' } {'Dose unit' } {'cGy' }
dat{1}
ans = 402×2
0 100.0000 0 100.0000 0 99.5000 0 99.0000 0 98.5000 0 98.0000 0 97.5000 0 97.0000 0 96.5000 0 96.0000
hdr{end}
ans = 3×2 cell array
{'RoiName' } {'PRVcord'} {'Roi volume fraction outside grid'} {'0%' } {'Dose unit' } {'cGy' }
dat{end}
ans = 402×2
0 100.0000 0 100.0000 0 99.5000 0 99.0000 0 98.5000 0 98.0000 0 97.5000 0 97.0000 0 96.5000 0 96.0000

9 commentaires

radha soni
radha soni le 19 Mai 2021
Hi Stephen that's great I didn't know these functions could be used like this thank you. Do you know why if i comment out the header portions of the code it gets stuck in the while loop
Stephen23
Stephen23 le 19 Mai 2021
Modifié(e) : Stephen23 le 19 Mai 2021
"Do you know why if i comment out the header portions of the code it gets stuck in the while loop"
If you comment out the TEXTSCAN that reads the block header lines then the header lines do not get read. The TEXTSCAN that reads the data text cannot match the block header lines, and so the code probably gets stuck in an infinite loop of trying to match that data format to text (which it does not match) and cannot advance the file pointer. Therefore FEOF is probably never reached.
radha soni
radha soni le 19 Mai 2021
thank you
radha soni
radha soni le 19 Mai 2021
one last question - i don't fully understand the purpose of the assert(fid>=3,msg) line
Stephen23
Stephen23 le 19 Mai 2021
Modifié(e) : Stephen23 le 19 Mai 2021
"i don't fully understand the purpose of the assert(fid>=3,msg) line"
It prints an informative message if the file cannot be opened or accessed. If the FOPEN cannot find/open the file then your code will anyway throw an error (later, when TEXTSCAN tries to use an invalid file handle), but with that line you will get useful information telling you why the file could not be opened.
radha soni
radha soni le 19 Mai 2021
thank you - i've tried to run another file with the same formatting through as is unchanged and it is getting stuck in the loop for some reason, do you have any idea why at all? thanks for your help
radha soni
radha soni le 19 Mai 2021
hmm for some reason i have to go into excel to open the file if i haven't opened it before and accept this error message then it's ok
Stephen23
Stephen23 le 19 Mai 2021
Modifié(e) : Stephen23 le 19 Mai 2021
"i've tried to run another file with the same formatting through as is unchanged ..."
Nope, they are different. Do NOT rely on Excel to compare file formats.
"... and it is getting stuck in the loop for some reason, do you have any idea why at all?"
Because the format is not the same. Your original file Lunghypo.csv has data lines that are actually saved as strings in double quotes (which is incredibly awfully bad file creation by whatever app that created it):
..
"0.000 100.000"
"0.000 100.000"
"0.000 99.500"
..
Because of this very very very poor hiding of numeric data inside string delimiters I added the literal double quotes to the format string so that TEXTSCAN could correctly handle them.
Now your new file LtBreastRS.csv has data lines without double quotes:
..
0.000 100.000
0.000 100.000
0.000 99.500
0.000 99.000
0.000 98.500
..
Note:
  • use an actual text editor (e.g. notepad++) to look at the format of CSV files, as Excel does a lot of data mangling before showing you anything. It appears that every time you open in Excel and save you are changing the formatting.
  • requiring that you can import data from files with different formats using the same code makes your code more complex. The best solution is to actually stick to just one format. I recommend that numeric data should not be hidden inside strings,then you can simply remove the double quotes from the format string:
fm2 = '%f%f';
radha soni
radha soni le 20 Mai 2021
Thank you - that's VERY helpful I've learnt a lot. Unfortunately the export is from a piece of commercial software I have no control over, But i can make a note on the matlab script for users to not open export files in excel ensure consistent formatting, Thanks again for all your help.

Connectez-vous pour commenter.

Plus de réponses (2)

Jeremy Hughes
Jeremy Hughes le 19 Mai 2021

1 vote

Ahhh, that's because the data is in quotes.
readmatrix assumes quoted data is meant to be preserved, so the only way to get that data out would be as text:
A = readmatrix(filename,"OutputType","string");
Then parse the lines separately with textscan as in Stephen's answer. (you'd remove the double-quote from the format)

2 commentaires

radha soni
radha soni le 19 Mai 2021
brilliant thank you
oops meant that to be a comment on the other answer.

Connectez-vous pour commenter.

Jeremy Hughes
Jeremy Hughes le 18 Mai 2021

0 votes

Just a note: CSV means comma-separated-values, but if this is TAB separated, it's not really CSV.
If the data in the sections are important, I suggest you use READCELL, as that will bring everything in as the right type.
C = readcell(filename,"Delimiter","\t")
If you don't care about the headers, then readmatrix might be better, and you can skip those headers since they all seem to start with '#':
A = readmatrix(filename,"CommentStyle","#","Delimiter","\t")

2 commentaires

radha soni
radha soni le 19 Mai 2021
Hi Jeremy thanks for your resposnse, yes sorry it's not really a CSV it's just saved as .csv so that it's matlab accessible. When I use the text to columns wizard in excel it indicates that the data is tab delimited
i've just inserted the full file. that's right the headers do all start with # so it seems like readmatrix should work but when i use the function, A contains NaN for all cells
Ahhh, that's because the data is in quotes.
readmatrix assumes quoted data is meant to be preserved, so the only way to get that data out would be as text:
A = readmatrix(filename,"OutputType","string");
Then parse the lines separately with textscan as in Stephen's answer. (you'd remove the double-quote from the format)

Connectez-vous pour commenter.

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by