Hi

I have a csv file containing a large number of numbers and a few random strings like 'zgdf'. I need to find them and set them to zero. I cannot use 'csvread' (due to strings), so I use 'textscan' to read the file.

I then convert the data to numbers using str2double. MATLAB turns the string values into NaN, which is fine for me, but it takes a long time, especially because this has to be done for many similar files.

Is there a faster method to sort this out?

This is how I read the data (the original file has two columns and a large number of rows):

fileID = fopen(filename);

C = textscan(fileID,'%s %s','Delimiter',',');

fclose(fileID);

for i = 1: length (C{1})

D(i) = str2double(C{1}{i});

end

Thanks

Adam Danz
on 20 Nov 2019

Edited: Adam Danz
on 21 Nov 2019

[This answer has been reorganized following the discussion in the comment section under the question]

Method 1

fid = fopen('myCSVfile.csv');

C = textscan(fid,'%s %s','Delimiter',',');

fclose(fid);

A = str2double(C{1}); % Faster than doing the same thing in a loop.

[Update] The loop method below is actually faster:

A = zeros(size(C{1})); % <--- always pre-allocate!

for i = 1:numel(C{1})

A(i) = str2double(C{1}{i}); % index into A, otherwise A is overwritten on every iteration

end
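Which of the two variants is faster seems to depend on the MATLAB release and the data, so it is worth timing both on your own machine. The sketch below does that with tic/toc on synthetic data; the variable names (strs, A1, A2, tVec, tLoop) and the sample size are assumptions made for illustration, not part of the original answer.

```matlab
% Synthetic data (assumption): numeric strings with a few non-numeric entries.
n = 1e5;
strs = arrayfun(@(x) sprintf('%.5f', x), rand(n, 1), 'UniformOutput', false);
strs(1:1000:end) = {'zgdf'};

% Variant 1: one vectorized call on the whole cell array.
tic
A1 = str2double(strs);
tVec = toc;

% Variant 2: pre-allocated loop, one str2double call per element.
A2 = zeros(size(strs));
tic
for i = 1:numel(strs)
    A2(i) = str2double(strs{i});
end
tLoop = toc;

fprintf('vectorized: %.3f s, loop: %.3f s\n', tVec, tLoop);
```

Both variants should return the same values, with NaN wherever the input was not a number, so isequaln(A1, A2) is a quick sanity check. For more reliable numbers, timeit could be used instead of a single tic/toc pair.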

Method 2

Try this modification of the script produced by the Import Data tool. Rather than importing your data as text and then converting it with str2double(), it imports the data as numeric and replaces non-numeric elements with NaN. I think it should be faster than your approach, but I doubt it is much faster (or maybe it's not faster at all).

The only 2 variables you'll need to change to adapt to your data are

- file (the filename, or, preferably, the full path to your file)
- The NumVariables value (number of columns of data)

%% Setup the Import Options and import the data

file = "C:\Users\name\Documents\MATLAB\myCSVfile.csv"; % Full path to your file (or just file name)

opts = delimitedTextImportOptions("NumVariables", 2); % Number of columns of data

opts.VariableTypes(:) = {'double'}; % read in all data as double (nan for strings)

opts.Delimiter = ",";

opts.ExtraColumnsRule = "ignore";

opts.EmptyLineRule = "read";

Data = readtable(file, opts); % Read in as table

Data = Data{:,:}; % Convert to matrix

Method 3

D = zeros(size(C{1})); % <--- pre-allocate!

for j = 1: length (C{1})

s = sscanf(C{1}{j},'%f');

if ~isempty(s)

D(j) = s;

end

end

This is 4.5x faster than method 1.
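One thing to keep in mind with Method 3 is that sscanf is more lenient than str2double: it stops at the first character it cannot parse and keeps what it has read so far, and it can return more than one value. The sketch below shows this and a slightly more defensive loop; the sample column (fields) is hypothetical.

```matlab
% sscanf behaviour worth knowing before relying on Method 3:
a = sscanf('zgdf',  '%f');   % empty: nothing numeric to read
b = sscanf('12abc', '%f');   % 12: the trailing letters are silently ignored
c = sscanf('1 2',   '%f');   % two values: D(j) = c would throw an error

% Guarded variant (sketch): accept only fields that yield exactly one value.
fields = {'0.81472'; 'zgdf'; '12abc'; '1 2'};   % hypothetical sample column
D = zeros(size(fields));
for j = 1:numel(fields)
    s = sscanf(fields{j}, '%f');
    if isscalar(s)
        D(j) = s;
    end
end
```

Note that '12abc' still ends up as 12 here, whereas str2double('12abc') returns NaN. If such mixed fields can occur in your files and must be zeroed, Method 3 is not a drop-in replacement for Method 1.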

Method 4

This FEX function is designed to overcome the slow speed of str2double()

Method 5

A very fast solution is to read the data in using readmatrix(), which automatically converts non-numeric elements to NaN, but it requires R2019a or later.

file = 'myCSVfile.csv';

D = readmatrix(file); %that's it, just 2 lines
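Since the question asks for zeros rather than NaN, one extra line is needed after the import. Below is a minimal, self-contained sketch; the temporary file and its three rows are made up for illustration, and the 'Delimiter' and 'NumHeaderLines' options are only there so that this tiny sample does not depend on automatic detection.

```matlab
% Hypothetical sample file written to a temporary location.
file = [tempname '.csv'];
fid = fopen(file, 'w');
fprintf(fid, '0.81472,0.15761\nabc,0.97059\n0.27850,def\n');
fclose(fid);

% Import as numeric; non-numeric fields come back as NaN.
D = readmatrix(file, 'Delimiter', ',', 'NumHeaderLines', 0);

% Replace the NaN entries with zero, as requested in the question.
D(isnan(D)) = 0;

delete(file);   % clean up the temporary file
```

Be aware that this also zeroes any genuine NaN values that were already present in the file.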

Ridwan Alam
on 20 Nov 2019

Edited: Ridwan Alam
on 21 Nov 2019

Given that the list of noise strings is known in advance, e.g. {'a', 'b', 'ee'}:

C = cell2mat(textscan(fileID,'%f %f','Delimiter',',','TreatAsEmpty',{'a','b','ee'},'EmptyValue',0));

Try this!!
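To try this without a file, textscan can also parse a character vector directly. The following is a small sketch under that assumption; the sample content in chr is made up, and the approach only works when every noise string is listed in 'TreatAsEmpty'.

```matlab
% Hypothetical sample content: two columns, two noise entries.
chr = sprintf('1.5,2.5\na,3.5\n4.5,ee\n');

% Fields equal to 'a', 'b' or 'ee' are treated as empty and filled with 0.
C = textscan(chr, '%f %f', 'Delimiter', ',', ...
    'TreatAsEmpty', {'a', 'b', 'ee'}, 'EmptyValue', 0);

M = cell2mat(C);   % numeric matrix with one row per line of input
```

A string that is not in the list (such as 'zgdf') cannot be read as %f, so for truly random strings one of the other approaches is likely a better fit.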

%% Old Answer

Updated using Method 1 from Adam:

C = textscan(fileID,'%s %s','Delimiter',',');

C = [str2double(C{1}) str2double(C{2})];

C(isnan(C)) = 0;

per isakson
on 21 Nov 2019

Edited: per isakson
on 23 Nov 2019

"random strings like 'zgdf'" If that means only letters of the English alphabet (A-Z, a-z), this code is rather fast.

%%

chr = fileread('cssm.txt');

chr = regexprep( chr, '[A-Za-z]+', '0.0' );

cac = textscan( chr, '%f%f', 'Delimiter',',', 'CollectOutput',true );

num = cac{1};

result

>> num(1:10,:)

ans =

0.81472 0.15761

0 0.97059

0.12699 0.95717

0.91338 0.48538

0.63236 0.80028

0.09754 0.14189

0.2785 0

0.54688 0.91574

0 0.79221

0.96489 0.95949

Where cssm.txt contains

0.81472, 0.15761

abc , 0.97059

0.12699, 0.95717

0.91338, 0.48538

0.63236, 0.80028

0.09754, 0.14189

0.27850, def

0.54688, 0.91574

zgdf , 0.79221

0.96489, 0.95949

et cetera

In response to comments

See the caveat in the first line of my answer.

I fail to find a regular expression for "not a legal number" and if one exists it might not be that fast.

It's straightforward to add a few characters (many becomes impractical), e.g. '^â', and to make sure that the string is followed by a comma or the end of the line.

>> chr = regexprep( '12.3, abc, g^â, 1.0e5, def ', '(?m)[A-Za-zâ^]+(?=\x20*\r?(,|$))', '0.0' )

chr =

'12.3, 0.0, 0.0, 1.0e5, 0.0 '

>>

Lookahead, e.g. '(?=\x20*\r?(,|$))', is reasonably fast, but lookbehind sometimes ruins the performance.

The above regex fails for 'def1', '1deg' and '10a'.
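For mixed fields like these, one alternative is to turn the test around: split the text into fields and keep only those that match a pattern for a plain decimal number. This is only a sketch, not part of the approach above. The pattern numPat is an assumption that covers optional sign, decimal point and exponent but not Inf, NaN or hexadecimal input, and splitting into a cell array will be considerably slower than a single regexprep call on large files.

```matlab
chr = '12.3, abc, def1, 1deg, 10a, 1.0e5, -.5';   % hypothetical sample line

% Split on commas and trim surrounding blanks.
tok = strtrim(strsplit(chr, ','));

% A field counts as numeric only if the whole field matches the pattern.
numPat = '^[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?$';
isNum  = ~cellfun(@isempty, regexp(tok, numPat, 'once'));

% Replace every non-numeric field and put the line back together.
tok(~isNum) = {'0.0'};
cleaned = strjoin(tok, ', ');
```

Here cleaned is '12.3, 0.0, 0.0, 0.0, 0.0, 1.0e5, -.5', so 'def1', '1deg' and '10a' are replaced as well.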

fileread in combination with CRLF as newline character poses a problem when using regular expressions. The anchor $ doesn't recognise CRLF as newline. (Please tell me if I missed something.) The best way to avoid this problem is to replace fileread by a function that uses

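[fid, msg] = fopen( filespec, 'rt' ); % text mode: on Windows, CRLF is read as LF

chr = fread( fid, inf, '*char' )'; % transpose: fread returns a column, regexprep needs a row

fclose( fid );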
