Faster ways to deal with bigger data (1 to 10 TB ish)

Can Atalay
Can Atalay on 26 Oct 2021
Commented: Can Atalay on 26 Oct 2021
I have a few thousand large .csv files (each up to 8 GB) that I absolutely have to read top to bottom to do basic operations on them (they're on my hard drive; see the attachment to get an idea of what's in them). I want to convert them to .mat files after reading them with readtable(), but reading them takes days, and I need them fast. Could you help optimize my plan for converting them to a more manageable format via MATLAB in a short time on my ~30 USD budget? I'm not expecting y'all to teach me things from scratch or give long answers, but if you have any links I could check out, or even a single bit of improvement, I'd be grateful. I'm just looking for some direction.
My current plan is to:
1- Upload the .csv files to my cloud storage from my hard drive
2- Get an EC2 instance with ~32 GB RAM and download everything there
3- readtable() all of the .csv files in a for loop
4- convert the cell {"True";"False";..;"True"} columns to 1s and 0s for all tables (which would make everything a double)
5- split doubles by their columns for faster access in the future
6- save all (column) doubles as .mat files with a simple filename convention
7- upload all .mat files back to my cloud storage
8- download them back to my hard drive
Note 1: I have relatively fast upload/download speed but my PC overheats so I can't really split the files and read them manually without breaking something - hence the cloud + download idea, but open to suggestions otherwise.
Note 2: The 4th and 5th columns aren't always the same as each other, the 7th and 8th aren't always true or always false respectively. They're all random.
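Steps 3 to 6 of the plan could look roughly like the sketch below. It is only an illustration under assumptions: the folder name, the file name `part001.csv`, the column names (`flagA`, `flagB`), and the `<csv name>_<column>.mat` naming convention are all hypothetical, and a tiny generated file stands in for the real data. It stores the True/False columns as logical rather than double, which takes 1 byte per value instead of 8.

```matlab
% Hypothetical stand-in for one large CSV (the real files are not shown here).
srcDir = fullfile(tempdir, 'csv_plan_demo');
if ~isfolder(srcDir), mkdir(srcDir); end
sample = table([1;2;3], [2.5;3.5;4.5], ["True";"False";"True"], ...
    ["False";"False";"True"], 'VariableNames', {'id','price','flagA','flagB'});
writetable(sample, fullfile(srcDir, 'part001.csv'));

flagCols = {'flagA','flagB'};   % assumed names of the True/False columns
files = dir(fullfile(srcDir, '*.csv'));
for f = 1:numel(files)
    inFile = fullfile(files(f).folder, files(f).name);

    % Step 3: read the file, forcing the flag columns to string.
    opts = detectImportOptions(inFile);
    opts = setvartype(opts, flagCols, 'string');
    T = readtable(inFile, opts);

    % Step 4: "True"/"False" text -> logical.
    for c = 1:numel(flagCols)
        T.(flagCols{c}) = T.(flagCols{c}) == "True";
    end

    % Steps 5-6: one MAT-file per column, named <csv name>_<column>.mat.
    [~, base] = fileparts(inFile);
    names = T.Properties.VariableNames;
    for k = 1:numel(names)
        col = T.(names{k});
        save(fullfile(srcDir, [base '_' names{k} '.mat']), 'col');
    end
end
```

If a single column ever exceeds 2 GB, `save` needs the `'-v7.3'` flag to write it.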
4 comments
Ive J
Ive J on 26 Oct 2021
Tall arrays backed by a datastore can be much faster than readtable when you're dealing with big data, since the file is processed in chunks rather than loaded all at once. Consider the following:
ds = tabularTextDatastore('sample.txt', 'TextType', 'string'); % strings are much more convenient to handle than cell arrays of char
% make any other modifications to the datastore here
tt = tall(ds);
% do QC, filtering, etc. here (tall operations are deferred until gather, so these steps do not load the file into RAM)
% e.g.:
vars = tt.Properties.VariableNames;
tt.(vars{7}) = tt.(vars{7}) == "True"; % "True" -> true, anything else -> false; repeat for column 8
T = gather(tt); % now read the clean table into memory (it must fit in RAM)
% save to a MAT-file: by converting the table into a scalar struct and saving
% with '-struct', each table variable becomes its own MAT-file variable, so
% loading a single variable is easier and more efficient. E.g. when you need
% only the second variable:
% S = load("chunk1.mat", 'Var2'); Var2 = S.Var2;
S = table2struct(T, 'ToScalar', true);
save("chunk1.mat", '-struct', 'S')
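If a whole file does not fit in memory after gather, one alternative is to read the datastore in fixed-size chunks and save each chunk separately. The sketch below is a minimal illustration, not a tuned implementation: the folder, the file `sample.csv`, the column names (`id`, `price`, `flagA`), and the `chunk%d.mat` naming are hypothetical, and `ReadSize` is set to 2 only so that the tiny sample produces more than one chunk.

```matlab
% Hypothetical small file standing in for one large CSV.
workDir = fullfile(tempdir, 'datastore_chunk_demo');
if ~isfolder(workDir), mkdir(workDir); end
inFile = fullfile(workDir, 'sample.csv');
sample = table([1;2;3], [2.5;3.5;4.5], ["True";"False";"True"], ...
    'VariableNames', {'id','price','flagA'});
writetable(sample, inFile);

ds = tabularTextDatastore(inFile, 'TextType', 'string');
ds.ReadSize = 2;   % rows per read; tiny here, something like 1e6 for real files

k = 0;
rowsSeen = 0;
while hasdata(ds)
    T = read(ds);                    % at most ReadSize rows held in memory
    T.flagA = T.flagA == "True";     % text -> logical
    k = k + 1;
    S = table2struct(T, 'ToScalar', true);
    save(fullfile(workDir, sprintf('chunk%d.mat', k)), '-struct', 'S');
    rowsSeen = rowsSeen + height(T);
end

% Load a single variable back; load returns a struct, so index into it.
L = load(fullfile(workDir, 'chunk1.mat'), 'flagA');
flagA = L.flagA;
```

Because each chunk is saved with `'-struct'`, any one column can be loaded on its own later without reading the others.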
Can Atalay
Can Atalay on 26 Oct 2021
Thanks a bunch! This will help me big time working through the bigger ones :)

Answers (0)

Release

R2021b
