Split cvs on commas but prevent doing so for a string with a comma in it
Afficher commentaires plus anciens
My Excel csv file looks like this:
Data,test,04-12-2020 13:11,0,"8,2",1,2,3
Currently I use the following code to seperate the columns:
[~,~,dataCGM] = xlsread('file.csv');
outCGM = regexp(dataCGM, ',', 'split');
outCGM = outCGM(2:end-1);
This does split the columns on commas but also does so for the string "8,2" which is not what I want. Does anyone know how to prevent this issue and keep the value as a string in a single column?
Réponses (2)
Cris LaPierre
le 13 Déc 2020
0 votes
20 commentaires
Tycho Maas
le 13 Déc 2020
Modifié(e) : Tycho Maas
le 13 Déc 2020
Cris LaPierre
le 13 Déc 2020
Modifié(e) : Cris LaPierre
le 13 Déc 2020
Tycho Maas
le 13 Déc 2020
"This does not work, since I have a csv file and no text files."
Apparently some people incorrectly think of them as "Excel files", for reasons that are not clear to me.
Cris LaPierre
le 13 Déc 2020
It worked on the sample you have provided. I ony need xlsread, not the rest of your code.
[~,~,dataCGM] = xlsread("file.csv")

What is the desired end result?
Perhaps share your file so we can see what you see? Use the paperclip icon to attach it to your post.
Tycho Maas
le 13 Déc 2020
Cris LaPierre
le 13 Déc 2020
Just pointing out that this is not a csv file anymore. Should it be?
Tycho Maas
le 13 Déc 2020
There is something wrong with your csv file format. I've recreated it.
I see you told Walter you can't predefine anything about data types. That's a shame, as that's is a strength of MATLAB. With that in mind, here's a minimalist approach that recognizes every field (every comma) w/o making any assumptions about the data it contains.
opts = detectImportOptions('File_MATLAB_test.csv');
opts.TrailingDelimitersRule = 'keep';
opts.ConsecutiveDelimitersRule = 'split';
data = readtable('File_MATLAB_test.csv',opts,"ReadVariableNames",false)
Tycho Maas
le 13 Déc 2020
Cris LaPierre
le 13 Déc 2020
Yes, you can see in the top right corner of my post that the code was run in R2020b. There is no TrailingDelmitersRule in detectImportOptions in R2018a.
Tycho Maas
le 13 Déc 2020
Cris LaPierre
le 13 Déc 2020
Somehow the file you are using is different from what you've posted. Can you share your actual file?
Tycho Maas
le 13 Déc 2020
Cris LaPierre
le 13 Déc 2020
Are you sure? Each line is encased in quotes. That is probably where we are different.
Your file:
"FreeStyle LibreLink,data,04-12-2020 12:24,0,""5,3"",,,,,,,,,,,,,,"
"FreeStyle LibreLink,data,04-12-2020 12:41,0,""4,9"",,,,,,,,,,,,,,"
What I would expect:
Data,test,04-12-2020 12:24,0,"8,2",,,,,,,,,,,,,,
Data,test,04-12-2020 12:41,0,"5,9",,,,,,,,,,,,,,
If the raw file actually has the quotes, that will change things. Open it in a text editor to check, not Excel.
Tycho Maas
le 13 Déc 2020
Cris LaPierre
le 14 Déc 2020
Modifié(e) : Cris LaPierre
le 14 Déc 2020
You probably won't believe how much trouble a little quote can cause. I don't love it, but this works.
Maybe someone like @Stephen Cobeldick, who is a regexp ninja, can improve on this.
fid = fopen("Test_file.csv");
% capture variable names
str=fgetl(fid);
varnames = textscan(str,'%s','Delimiter',',');
% capture the remaining file contents (assumed to be uniform)
raw = textscan(fid,'%q','Delimiter',',');
fclose(fid);
% Split the raw data by delimiter, keeping quoted text together
ss = @(C) strsplit(C,'(?!\<"[^"]*),(?![^"]*"\>)','CollapseDelimiters',0,'DelimiterType','RegularExpression');
M=cellfun(ss,raw{1},'UniformOutput',false);
% convert to a table
T = cell2table(M);
% Make each column its own variable. Name columns using variable names from file.
T=splitvars(T,1,"NewVariableNames",varnames{1})
Tycho Maas
le 14 Déc 2020
"Maybe someone like @Stephen Cobeldick, who is a regexp ninja, can improve on this."
Thank you for the unique commendation.
Although it is probably not the fastest approach, I would try importing the entire file as one string, apply some string manipulation to it to remove the line-end quotation marks (e.g. REGEXPREP), and then write a new file which can then be directly imported using READTABLE. That has the benefit of importing all the different data classes correctly without much overhead and all of the standard READTABLE options.
It is not trivial because of course valid quotes around a string should not be removed.
This issue pops up enough to indicate that it would be nice for it to be handled natively:
Perhaps it would be a useful addition for READTABLE et al to include an option named e.g. LINEQUOTE which can be set to the required character (by default empty).
Cris LaPierre
le 14 Déc 2020
I can only make it work for what I see.
You can look into what settings are available from detectImportOptions. I suspect the NumHeaderLines is what you are looking for.
Walter Roberson
le 13 Déc 2020
readtable() with a format that is
'%s,%s,%{dd-MM-uuuu HH:mm}D,%f,%q,%f,%f,%f'
2 commentaires
Tycho Maas
le 13 Déc 2020
Image Analyst
le 13 Déc 2020
That makes no sense. A program will not "work on itself". You need to tell your code HOW to process the file. It won't magically figure it out. Attach your csv file if you need more help.
Catégories
En savoir plus sur Text Data Preparation dans Centre d'aide et File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!