speed up renamecats/categorical multiple columns
1 vue (au cours des 30 derniers jours)
Afficher commentaires plus anciens
I have a huge csv file of about 16GB which over 9k columns. Each column is initially filled with some codes (either integer or string), and I have a code book with code and meaning for each column. What I'm trying to do is to translate the table and finally have a table that has readable texts instead of codes.
I can use either categorical or renamecats to "translate" them, but the issue is that it takes substentially long time to loop through these columns. I'm thinking if there is a way to speed this up.
See below an example
tbl = table(["a1", "b2", "c3", "d4", "e5"]', ...
["123", "234", "345", "456", "567"]', ...
'VariableNames', {'A', 'B'});
dictionary.A = table(["a1", "b2", "c3", "d4", "e5"]', ...
["apple", "banana", "cat", "dog", "elephont"]', ...
'VariableNames', {'Code', 'Meaning'});
dictionary.B = table(["123", "234", "345", "456", "567"]', ...
["East", "West", "North", "South", "Middle"]', ...
'VariableNames', {'Code', 'Meaning'});
Vars = tbl.Properties.VariableNames;
for iC = 1:width(tbl)
tbl.(iC) = categorical(tbl.(iC), dictionary.(Vars{iC}).Code, ...
dictionary.(Vars{iC}).Meaning);
end
Is that possible to avoid this loop, or any suggestions to speed this up (considering that I have over 500k rows and 9k columns).
Thank you!
0 commentaires
Réponses (1)
Campion Loong
le 9 Oct 2020
Hi Peng,
It seems you have the Dictionary code book to boot, and you already know which sets of code go wtih which field/name in the Dictionary (i.e. you can designate "VariableNames" in the first table(...) call).
In this case, why not create the table with categorical to begin with:
tbl = table(categorical(["a1"; "b2"; "c3"; "d4"; "e5"], dictionary.A.Code, dictionary.A.Meaning),...
categorical(["123"; "234"; "345"; "456"; "567"], dictionary.B.Code, dictionary.B.Meaning),...
'VariableNames', {'A', 'B'});
There is no loop, faster and much more readable.
3 commentaires
Campion Loong
le 9 Oct 2020
If you have thousands of columns, are you actually reading it from a file or a source somewhere? I struggle to imagine that could be manageable if you're making the first table call manually on thousands of columns.
If you are reading or importing, check out ImportOptions -- it gives you much more flexibility before actually reading the data in:
Voir également
Catégories
En savoir plus sur Spreadsheets dans Help Center et File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!