How to internally change decimal separator while importing data from a csv?

Hello,
I am currently working with very large CSV files (roughly 15000 rows and a couple of hundred columns). Since the data is produced by European software, the decimal separator is a comma, which I currently replace with a dot using a small script. The script works well, but the corrected data has to be stored in a new file, because the original file must not be modified. What I would like instead is to import the data, replace the commas with dots in memory, and store the result in a workspace variable. That would save me from generating a new file and loading it in manually every time I process the data further, which is quite time-consuming.
Current Process:
%% Small script that replaces comma by dot and saves corrected data to new csv-file with name [...]_dots
[filenames,path]=uigetfile('*.csv');
cd(path);
disp(filenames)
oldfile=filenames(1:end-4);
NewFileName = sprintf('%s_Dots.csv',oldfile);
Data = fileread(filenames);
Data = strrep(Data, ',', '.');
% Save corrected data to new file
FID = fopen(NewFileName, 'w');
fwrite(FID, Data, 'char');
fclose(FID);
With a dot as the decimal separator, the import is very easy and generates a handy structure array that is used in the further evaluation of the data, as shown below.
i_start_import=1;
datastruct=importdata(filenames,';',i_start_import);
However, importdata does not generate such a structure array when a comma is used as the decimal separator.
Experiments:
What I have tried so far, to obtain the same struct but with dots as the decimal separator, is to read in the original data with importdata, convert the commas to dots, and then split the data strings to obtain tabular data. Unfortunately, this is very slow for the thousands of cells in my CSV (especially the splitting step).
[Screenshots of the data after applying importdata, and after splitting, omitted.]
data=importdata(filenames,';',100000);
% Create structure array
datastruct=struct
datastruct.data=[];
% Replace comma by dot
for i=1:length(data)
a=strrep(data{i},',','.');
data{i}=a;
end
% Split data string
for i=1:length(data) %index rows
for j=1:516 %since there are 516 columns, index columns
str=strsplit(data{i},';');
datastruct.textdata{i,j}=str(1,j);
if j==516
break
else
continue
end
end
end
Ideally, the data should be stored in a structure array with a field textdata holding the values of the first column (TimeStamp) and the first row (variable names), and a field data holding all the numeric data.
I am now looking for a way to read in the original data with comma as decimal separator and convert the comma to dot internally and save it as a variable (struct) that I can use in my further examination in the script.
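One way to do this conversion entirely in memory, without writing a corrected file, is to read the raw text with fileread, replace the commas, and parse the corrected string with textscan. This is only a sketch: it assumes commas occur exclusively as decimal separators, that the first column is a text timestamp followed by 515 numeric columns, and that there is one header line.

```matlab
str = fileread(filenames);        % read the raw file contents as one char vector
str = strrep(str, ',', '.');      % decimal comma -> dot, in memory only
% Assumed layout: one text column (TimeStamp) followed by 515 numeric columns
fmt = ['%s', repmat('%f', 1, 515)];
C = textscan(str, fmt, 'Delimiter',';', 'HeaderLines',1, 'CollectOutput',true);
datastruct.textdata = C{1};       % timestamps as a cell array of char vectors
datastruct.data     = C{2};       % numeric data as a double matrix
```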

2 comments

Some pitfalls:
  • Do not use the name of the important function "path" as a variable name. This can cause extremely strange effects during debugging.
  • Avoid using cd() to change the current directory. The callbacks of GUIs or timers can change it too, and then the assumed files are no longer found. This is a frequent source of bugs. Use absolute path names instead, built with fullfile(folder, filename).
  • In:
for j=1:516 %since there are 516 columns, index columns
str=strsplit(data{i},';');
datastruct.textdata{i,j}=str(1,j);
if j==516 % from here
break %
else %
continue %
end % to here
end
the marked block is useless. Simply omit it.
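With that block omitted (and strsplit hoisted out of the inner loop, since the original re-splits the same row on every column iteration), the splitting loop could be reduced to something like this sketch:

```matlab
for i = 1:numel(data)
    str = strsplit(data{i}, ';');     % split each row once, not 516 times
    datastruct.textdata(i,:) = str;   % store all 516 fields in one assignment
end
```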
Hello Jan,
Thank you for your hints, I will change that! Do you have any other ideas for changing the decimal separator internally without having to create a new file?


Accepted Answer

Stephen23
Stephen23 on 6 May 2021
Edited: Stephen23 on 6 May 2021
Do NOT use CD to access data files: it is more efficient to use absolute/relative filenames (with FULLFILE).
To import a CSV file (actually your file appears to have semicolon-delimited fields) with a decimal comma, simply select the appropriate options with READTABLE or READMATRIX:
[F,P] = uigetfile('*.csv');
T = readtable(fullfile(P,F), 'Delimiter',';', 'DecimalSeparator',',', 'VariableNamingRule','preserve')
Adapt to suit your file. If you had uploaded an actual data file (instead of screenshots) by clicking the paperclip button then I would have tested this as well.

17 comments

Thank you very much, Stephen. I have tried readtable, but unfortunately I was not able to specify the range in release R2019b. Being able to specify the range is important for me, since I don't want all the data from the table. I read the documentation and tried the 'Range' parameter, but it did not work in R2019b. Is there a way to specify the range for readtable in R2019b?
The Range option has been supported by readtable ever since the function was introduced in R2013b.
If you show the code you used, I can probably help you to debug it. Debugging invisible code is much harder.
Thanks.
I've tried
[F,P] = uigetfile('*.csv');
T = readtable(fullfile(P,F), 'Delimiter',';','Range','10:20', 'DecimalSeparator',',', 'VariableNamingRule','preserve')
but it returns
Error using readtable (line 223)
Invalid parameter name: Range.
Same for DecimalSeparator. The file I imported is an Excel spreadsheet.
Stephen23
Stephen23 on 11 May 2021
Edited: Stephen23 on 11 May 2021
"The file I imported is an excel spreadsheet."
This is confusing: do you mean that the file is actually in a proprietary Excel format (e.g. .XLSX, .XLSB), or is the file a standard text file (with semicolon-delimited fields), as your code and question both indicate?
Apparently the Range option applies only to Excel file formats. I do not see any obvious way to restrict the importing range for text files. Unless you have extremely large files (many GB) you could just import all data and select what you need from within MATLAB.
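That import-then-select approach can be sketched as follows (assuming the file imports cleanly; the row range 10:20 is purely illustrative):

```matlab
% Import everything, then pick out the rows of interest within MATLAB:
T = readtable(fullfile(P,F), 'Delimiter',';', 'DecimalSeparator',',');
subset = T(10:20, :);   % selection happens after import, not during it
```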
Sorry, my bad, it's not a spreadsheet but a standard delimited text file, as you pointed out correctly. I tried the Range option because the readtable function gave me the following error:
[F,P] = uigetfile('*.csv');
T = readtable(fullfile(P,F));
Error using readtable (line 223)
Reading failed at line 26733. All lines of a text file must have the same number of delimiters. Line 26733 has 0 delimiters, while preceding lines have 515.
This is because the lines after line 26732 have fewer columns than the preceding rows (all my other files have this format).
I didn't find a different solution and then tried to specify the range. Do you have a solution that does not involve deleting the last rows? It would also be great if that solution could handle the decimal separator as well. Thanks for your time!
@Dennis B: please upload a sample file by clicking on the paperclip button. You can use fake data if required.
Hello Stephen,
I have been working on this problem over the last few days and figured out that it only occurs when I try to read the data from "fresh" files that have been created by the machine and have not been saved manually. As soon as I open one of these files and save it as a CSV, the problem disappears, and importing the data as a table via readtable works correctly without the aforementioned error. Nonetheless, I can't use the 'DecimalSeparator' parameter and receive the following error:
%Used Code
[F,P] = uigetfile('*.csv');
T = readtable(fullfile(P,F), 'Delimiter',';', 'DecimalSeparator',',', 'VariableNamingRule','preserve')
%Error:
Error using readtable (line 223)
Invalid parameter name: DecimalSeparator.
Do you have any solution for that?
I have also attached a file for you.
Thanks
Stephen23
Stephen23 on 14 May 2021
Edited: Stephen23 on 15 May 2021
"As soon as I open one of these files and save it as a csv, the problem disappears..."
What tool/app are you using to open and save the file?
Can you please upload two otherwise identical files: one "fresh" and one hand-saved one? I can compare them.
"This is because the next lines after line 26732 have less columns than the rows before..."
The file has several tables of data, each with their own header and preceding lede.
Which of those blocks/tables of data do you want to import?
Is the number of columns in the block that you want constant?
str = fileread('Example_Data.csv');
str = strrep(str,',','.');
[hdr,idx] = regexp(str,'^TimeStamp;[^\n]+','lineanchors','match','end','once');
hdr = regexp(hdr,'[^;]+','match')
hdr = 1×43 cell array
{'TimeStamp'} {'70788.1.C670010_V1:1'} {'70788.1.C670011_V1:1'} {'70788.1.C670012_V1:1'} … {'70788.1.E604033_X1:10'} {'70788.1.E604034_X1:10'}
opt = {'Delimiter',';', 'CollectOutput',true};
fmt = repmat('%f',1,numel(hdr)-1);
fmt = ['%{dd.MM.yyyy HH:mm:ss.SSSSSSSSS}D',fmt]; %01.01.1900 16:37:21.0000000
out = textscan(str(idx:end),fmt,opt{:})
out = 1×2 cell array
{33×1 datetime} {32×42 double}
out{1}
ans = 33×1 datetime array
   01.01.1900 16:37:21.000000000
   01.01.1900 16:37:21.000000100
   01.01.1900 16:37:21.000000200
   …
   01.01.1900 16:37:21.000002800
   NaT
out{2}
ans = 32×42
   0.9545   0.1770   0.9406   0.9716  30.0000   1.9600        0   0.0300   0.0060   0.0010   …
   (remaining rows of the 32×42 numeric output omitted)
I generally open the 'fresh' files in Excel and don't use any other tool. I sadly can't upload a fresh file due to non-disclosure issues, and I can't use fake data because the problem does not occur after the file has been saved. In your example you chose exactly the block I want to examine, starting from 'TimeStamp'. The number of columns in the block is constant.
Thanks for providing your solution; I'll work my way through your code and try it out. Thanks for your time.
@Dennis B: unfortunately (or fortunately, depending on how you view things) Excel does a lot of magic when it imports text data, including ignoring quotation marks and even changing data: https://www.theverge.com/2020/8/6/21355674/human-genes-rename-microsoft-excel-misreading-dates
Once the file has been opened and saved via Excel, the original file format is lost: Excel simply overwrites the file with what it considers to be CSV format, which can be quite different from what the file originally contained.
  • if you want to compare the two files then use a reliable file comparison tool (e.g. WinMerge).
  • if you want to know what the file really looks like then use a reliable text editor (e.g. notepad++).
@Stephen Cobeldick: Okay, that was very helpful, thanks!
@Dennis B, excuse me, did you find a solution for your problem? If yes, it would be very nice if you shared the exact function. Thanks
@Ahmed Ghouma, I am currently still using the small script I mentioned in the introduction (using 'strrep' to replace comma with dot), because otherwise I would have had to change the whole workflow I had already developed, sorry.
Greetings
@Dennis you can also take a look at the answers here, because I think we had the same problem: https://www.mathworks.com/matlabcentral/answers/864680-problem-by-using-str2double-with-csv-file?s_tid=srchtitle
Thanks Stephen for your useful tip. I have one additional question: I also need to change the readtable options using data = readtable(filename, opts) AND to read decimal-comma numbers using data = readtable(filename, 'DecimalSeparator',',').
But I didn't succeed in mixing both strategies, like data = readtable(filename, opts, 'DecimalSeparator',',').
Is there a way to do so?
@Louis-Marie: presumably the OPTS variable in your code is the object returned by DETECTIMPORTOPTIONS, in which case you can specify the 'DecimalSeparator' option there. It does not really make sense to specify the 'DecimalSeparator' option in READTABLE after calling DETECTIMPORTOPTIONS, because DETECTIMPORTOPTIONS will also need to know the decimal separator to detect e.g. which data are numeric.
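A minimal sketch of that suggestion, assuming OPTS is the delimited-text import-options object for the file:

```matlab
% Tell detectImportOptions about the decimal comma up front, so that
% comma-decimal fields are already detected as numeric:
opts = detectImportOptions(filename, 'Delimiter',';', 'DecimalSeparator',',');
data = readtable(filename, opts);
```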


More Answers (0)

Products

Version

R2019b
