Store different data types efficiently
Hi all,
I have a dataset of company announcements. After parsing, I end up with several variables for every announcement, such as the body text (string array), the company name (string), the announcement date (double), etc. Until now, I have stored these variables for all documents in one large struct array (I have 55000 documents, so the struct array is huge). Unfortunately, it takes very long to load this struct into the workspace, and MATLAB becomes very slow. Do you have a recommendation for how to solve this problem?
I would be grateful for any hint.
Thank you!
6 comments
per isakson
on 29 Oct 2020
Edited: per isakson on 29 Oct 2020
"company announcements" is that pure text and numbers?
"it takes very long to load this struct in the workspace" Do you need the entire "database" in memory simultaneously?
How do you use that huge structure?
You might want to look at SQLite. There is support in the Database toolbox and in the File Exchange.
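A minimal sketch of what the SQLite route could look like with the Database Toolbox's `sqlite` interface. The database file, table name, and column names here are made up for illustration, and the exact method names may vary by release:

```matlab
% Build an SQLite database once, then append each parsed announcement.
% Requires the Database Toolbox; 'docs' and its columns are hypothetical.
conn = sqlite('announcements.db', 'create');
exec(conn, 'CREATE TABLE docs (company TEXT, annDate REAL, body TEXT)');

% Insert one record (body text joined into a single char vector).
insert(conn, 'docs', {'company', 'annDate', 'body'}, ...
    {'ACME Corp', 738093, 'Full body text of the announcement ...'});

% Later: pull back only the rows you need, not the whole dataset.
subset = fetch(conn, 'SELECT company, annDate FROM docs WHERE annDate > 738000');
close(conn);
```

The point is random access: a `SELECT` with a `WHERE` clause reads only matching rows from disk, instead of loading all 55000 documents into the workspace.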
Moritz Scherrmann
on 29 Oct 2020
Edited: Moritz Scherrmann on 29 Oct 2020
per isakson
on 29 Oct 2020
"I use mat-file version '-v7.3'" Version '-7.3' is slow. How large is the structure? If it's less than 2GB you should test with version 7.0 and maybe even 6.0. See MAT-File Versions.
Moritz Scherrmann
on 29 Oct 2020
Mario Malic
on 29 Oct 2020
Edited: Mario Malic on 29 Oct 2020
You can also take a look at the datastore function, which is designed for large collections of data. It might be more efficient, especially when you only need some files in memory. Unfortunately, I can't give specific hints, as I haven't worked with it yet.
Reading 55000 documents is probably what takes most of the time; parallelisation would speed things up, if applicable.
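A minimal datastore sketch, assuming each announcement was saved to its own MAT-file (the folder path is hypothetical):

```matlab
% Point a datastore at a folder of per-announcement MAT-files and
% process them one at a time instead of holding all 55000 in memory.
fds = fileDatastore('announcements/*.mat', 'ReadFcn', @load);

while hasdata(fds)
    doc = read(fds);   % loads one file's variables as a scalar struct
    % ... process doc.body, doc.company, doc.annDate here ...
end
```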
J. Alex Lee
on 29 Oct 2020
I also vote for sqlite if you need fast/random access into the data and don't mind a slow one-time import, plus insertions every time you get a new announcement file.
I've played a bit with datastore, but I find it's not really built for performance (it's slow). And if your data are mostly text and you don't need to do aggregate computations or otherwise operate on large virtual arrays/tables in native MATLAB syntax, it's not really clear to me (out of my own ignorance) that datastore will be any more useful than just creating an sqlite database.
I've tried to study how a non-mat HDF file could help in my own application; even if you did achieve better control over how the data are saved, I have not been able to figure out how to randomly access only chunks of the HDF file.
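For what it's worth, MATLAB's `h5read` does accept start/count arguments for reading only a hyperslab of an HDF5 dataset; the file and dataset names below are made up for illustration:

```matlab
% Read elements 1001-2000 of a 1-D dataset from an HDF5 file without
% loading the rest; 'docs.h5' and '/annDates' are hypothetical names.
chunk = h5read('docs.h5', '/annDates', 1001, 1000);
```

For string-heavy, record-style data, though, a database with indexed queries is usually the more natural fit than raw HDF5 hyperslabs.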
Accepted Answer
More Answers (1)
Peter Perkins
on 19 Nov 2020
Moritz, if "I stored these variables for all documents in a large struct array" is literally true, then that's your problem. I mean, you can use HDF or datastore or whatever, but you should consider using a table in a mat file instead of a struct array in a mat file. The struct array (assuming you do not mean a scalar struct OF arrays) is not an efficient way to store homogeneous "records". Consider the following:
>> t = array2table(rand(55000,10));
>> s = table2struct(t);
>> whos t s
  Name       Size           Bytes     Class     Attributes

  s       55000x1        61600640     struct
  t       55000x10        4402930     table
A factor of 10. I have no idea what your data really look like or how fast they would load as a table in a mat file, but it's worth looking at. You will also find that a table makes selecting subsets of your data much easier than a struct array.
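Going the other way, an existing struct array converts directly with `struct2table`, and subsetting then becomes one-line logical indexing (the variable and field names below are assumed from the question):

```matlab
% Convert the existing 55000x1 struct array to a table and save it.
t = struct2table(docs);
save('docs_table.mat', 't', '-v7');

% Row subsets are simple logical indexing on the table:
recent = t(t.annDate > 738000, :);             % announcements after some date
acme   = t(strcmp(t.company, 'ACME Corp'), :); % one company's announcements
```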