Need to speed up a regexprep implementation
Hi All,
I use the following MATLAB code to parse a large text file with *'s used as repeat symbols (courtesy of previous MATLAB advice).
% Expand repeats
fun = @(n,c)repmat(sprintf(' %s ',c),1,str2double(n));
n=find(contains(file,'*'));
for m=n; file(m) = regexprep(file(m),'\s*(\d+)\*(\S+)','${fun($1,$2)}'); end
clear n m fun;
An example (small) file is as follows. The single file for VAR has 1,555,181 lines and will expand to 37,720,320 values. The full file (VAR + other variables) is a 2145956x1 string array which saves to a 100MB .mat file, so a bit too large to post. The regexprep takes about 10 minutes, and is the slowest part of the file read.
Can anyone (more experienced user!) suggest a faster method of parsing the data records?
My thanks!
Mike King
p.s. As requested, a small sample file (ZCORN.zip, about 10% of a real example) is now attached.
VAR
3294.03 2*3293.74 2*3293.45 2*3293.15 3292.93 3371.97 2*3376.36 2*3380.67 2*3384.95 2*3389.14 2*3393.22 2*3397.16 2*3400.97
2*3404.64 2*3408.21 2*3411.7 2*3415.13 2*3418.49 2*3421.92 2*3425.18 2*3428.28 2*3431.28 2*3434.12 2*3436.84 2*3439.46 3441.96
3441.96 2*3444.37 2*3446.73 2*3448.97 2*3451.2 2*3453.35 2*3455.48 2*3457.6 2*3459.71 2*3461.81 2*3463.92 2*3466.09 3468.29
3468.29 2*3470.52 2*3472.85 2*3475.24 2*3477.72 2*3480.28 2*3482.92 2*3485.67 2*3488.52 2*3491.45 2*3494.45 2*3497.53 3500.67
3500.67 2*3503.84 2*3507.07 2*3510.32 2*3513.63 2*3516.96 2*3520.36 2*3523.76 2*3527.22 2*3530.75 2*3534.3 2*3537.84 3541.26
3541.26 2*3544.57 2*3547.68 2*3550.42 2*3552.83 2*3554.56 2*3555.71 2*3556.22 2*3556.05 2*3555.35 2*3553.85 2*3551.87 3549.46
3549.46 2*3546.73 2*3543.75 2*3540.54 2*3537.31 2*3534.16 2*3531.15 2*3528.34 2*3525.79 2*3523.43 2*3521.4 2*3519.76 3518.43
3518.43 2*3517.4 2*3516.81 2*3516.51 2*3516.62 2*3517.12 2*3517.97 2*3519.17 2*3520.69 2*3522.5 2*3524.51 2*3526.68 3528.91
3528.91 2*3531.24 2*3533.62 2*3536.07 2*3538.58 2*3541.21 2*3543.93 2*3546.77 2*3549.71 2*3552.75 2*3555.9 2*3559.13 3562.45
8153.47 2*8155.84 2*8158.5 2*8161.45 2*8164.68 2*8168.03 2*8171.71 2*8175.57 2*8179.59 2*8183.75 2*8188.04 2*8192.41 8196.84
8196.84 2*8201.31 2*8205.94 2*8210.66 2*8215.46 2*8220.35 2*8225.3 2*8230.37 2*8235.46 2*8240.67 2*8245.88 2*8251.14 8256.38
8256.38 2*8261.66 8267.28 8281.77 2*8282.08 2*8282.04 2*8282.01 2*8281.99 2*8281.96 2*8281.92 2*8281.85 2*8281.79 2*8281.77
2*8281.8 2*8281.95 2*8282.23 2*8282.73 2*8283.63 8284.38 /
7 comments
Why are you using the FOR loop?
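To illustrate the point behind this question, here is a minimal sketch showing that regexprep accepts a whole string array, so the explicit loop is not needed. The three sample records are made up for illustration; the pattern and the helper fun are taken from the question. Note that this keeps the dynamic expression, so it is not expected to be much faster by itself.

```matlab
% Hypothetical three-record sample standing in for the real string array.
file = ["VAR"; "3294.03 2*3293.74 3292.93"; "3*8281.8 8284.38 /"];

% Helper from the question: repeat the value c, n times, padded with spaces.
fun = @(n,c) repmat(sprintf(' %s ', c), 1, str2double(n));

% Logical mask of the records that contain a repeat symbol.
m = contains(file, '*');

% One call over all selected records; no FOR loop required.
file(m) = regexprep(file(m), '\s*(\d+)\*(\S+)', '${fun($1,$2)}');
```

With a column vector n, "for m=n" runs a single iteration in which m is the whole vector, so the loop in the question already behaves like this one call.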
Mike
on 9 June 2023
Mike
on 12 June 2023
"Can anyone (more experienced user!) suggest a faster method of parsing the data records?"
I have done a fair bit of playing around with dynamic regular expressions (e.g. words2num)... they are very useful, but not fast. I would recommend trying to avoid the dynamic function call, perhaps by converting the REPMAT into either a pure regular expression (not dynamic) or pure MATLAB code.
Here is one approach:
str = '3294.03 2*3293.74 2*3293.45 2*3293.15 3292.93 3371.97 2*3376.36 2*3380.67 2*3384.95 2*3389.14 2*3393.22 2*3397.16 2*3400.97';
[T,S] = regexp(str,'\s*(\d+)\*(\S+)','tokens','split');
S = reshape(S,1,[]);
T = vertcat(T{:});
F = @(n,c)repmat(sprintf(' %s ',c{:}),1,n);
S(2,1:end-1) = arrayfun(F,str2double(T(:,1)),T(:,2),'uni',0); % FOR loop would be faster
out = sprintf('%s',S{:})
I will have a think about approaches using regular expressions.
Question: is there a limit to the value of n used in REPMAT? If so, what is that limit?
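As a rough illustration of the "pure MATLAB code" route mentioned above, the following sketch expands one record without any dynamic expression. It is only one possible approach, not a tested replacement for the original code: it assumes every token is either a plain value or has the form n*value, and the variable names are chosen here for illustration.

```matlab
% Shortened sample record in the format used in the question.
str = '3294.03 2*3293.74 2*3293.45 3292.93';

% Split into whitespace-separated tokens (1xN string array).
tok = string(regexp(str, '\S+', 'match'));

% Tokens of the form n*value carry a repeat count; all others repeat once.
isRep = contains(tok, "*");
counts = ones(size(tok));
counts(isRep) = str2double(extractBefore(tok(isRep), "*"));
tok(isRep) = extractAfter(tok(isRep), "*");

% Build an index that repeats each token position counts(k) times.
idx = repelem(1:numel(tok), counts);

% Join the repeated tokens back into one expanded record.
out = join(tok(idx), " ")
```

The regular expression here is static, so no function handle is called per match. The same idea extends to numeric output by calling str2double on tok(idx).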
Answers (1)
Piyush Dubey
on 30 August 2023
Hi Mike,
I understand that you are using “regexprep” in MATLAB to expand the repeat counts, and that parsing your large data file this way takes a long time.
Parsing large text files can be time consuming, especially when using regular expressions. You can try the following approaches to speed up the process:
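- Try “vectorization” and process the file record by record with plain string functions. Writing the results into preallocated memory also saves the time that dynamic allocation at runtime would consume. You can refer to the following sample code snippet, which splits each record on whitespace and expands every 'n*value' token:
fid = fopen('your_file.txt', 'r');
data = textscan(fid, '%s', 'Delimiter', '\n');   % one cell per line
fclose(fid);
file = data{1};
expandedFile = cell(size(file));                 % preallocate the output
for i = 1:numel(file)
    tokens = strsplit(strtrim(file{i}));         % split the record on whitespace
    for k = 1:numel(tokens)
        parts = strsplit(tokens{k}, '*');        % 'n*value' -> {'n','value'}
        if numel(parts) == 2
            tokens{k} = strjoin(repmat(parts(2), 1, str2double(parts{1})), ' ');
        end
    end
    expandedFile{i} = strjoin(tokens, ' ');      % expanded record as text
end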
- Faster file reading can be achieved by using the “fread” function in MATLAB to read the raw bytes of the file in a single call. This avoids the overhead of reading the text line by line, although the resulting character vector still has to be parsed. The following code snippet demonstrates this:
fid = fopen('your_file.txt', 'r');
binaryData = fread(fid, Inf, 'uint8=>char')';
fclose(fid);
The “fread” function reads the entire file and stores it in the “binaryData” variable. The “Inf” argument specifies that it should read until the end of the file. The “uint8=>char” precision reads each byte as an unsigned 8-bit integer and converts it to a character.
- Parallel processing can also be considered, to leverage multiple CPU cores and speed up the parsing process. A “parfor” loop can be used instead of a regular “for” loop to process multiple records in parallel. Please refer to the following MATLAB documentation link for more information on “parfor”: https://in.mathworks.com/help/parallel-computing/parfor.html
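As a minimal sketch of how the “fread” suggestion could be combined with a vectorized expansion, the example below reads a file in one call and converts it directly to a numeric vector. The sample file contents and variable names are made up for illustration. It assumes that every non-numeric token (such as the keyword VAR or the terminating /) is something “str2double” turns into NaN; a keyword that happens to look like a number would need separate handling.

```matlab
% Write a tiny sample file so the sketch is self-contained (hypothetical data).
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'VAR\n3294.03 2*3293.74 3292.93\n3*8281.8 8284.38 /\n');
fclose(fid);

% Read the whole file in one call as a 1xN character vector.
fid = fopen(fname, 'r');
txt = fread(fid, Inf, 'uint8=>char')';
fclose(fid);
delete(fname);

% Split the text into whitespace-separated tokens with a static regexp.
tok = string(regexp(txt, '\S+', 'match'));

% Separate the repeat counts from the values for tokens of the form n*value.
isRep = contains(tok, "*");
counts = ones(size(tok));
counts(isRep) = str2double(extractBefore(tok(isRep), "*"));
tok(isRep) = extractAfter(tok(isRep), "*");

% Convert to numbers; keywords and the trailing slash become NaN and are dropped.
vals = str2double(tok);
keep = ~isnan(vals);

% Repeat each value by its count in a single vectorized call.
values = repelem(vals(keep), counts(keep));
```

Because the expansion happens on numbers with “repelem”, no expanded text is ever built, which may matter when the result has tens of millions of values.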
I hope this helps.
1 comment
Mike
on 27 December 2023