data management for large datasets
I will have two sets of field data -- one taken for six weeks last year, and another taken for two months this year. For EACH dataset I have variables collected from 4-8 different sources, for (up to) 50 days, collected at up to 3 different sites. Both datasets, in their entirety, span about 200 - 300 columns and between 8,000 - 15,000 rows.
Within that, I'm trying to figure out how to set up my code for analysing both sets of data. I want to do some different things --
- Analyse the data from each source separately to check for errors
- Filter out a large quantity (up to 25%) of data which is poor quality
- Check all the filtered data from ONE dataset for trends between days (rows) and variables (columns)
- Compare filtered data in one dataset between three sites (e.g. all collected at the same time, on the same days)
- Compare different (filtered) variables within a single dataset over time, and
- Perform analysis on the changes between both (filtered) datasets.
I have no idea how to structure and maintain my code to allow me to do all of these things. I know some of the tests I want to do, but others I haven't thought of yet. At the moment I have about 10 different programs which load and structure my raw datafiles in different ways (one comprised of an array of structs, another where the data is subset into variables, etc.), but this is incredibly confusing and has led to a lot of errors and enormous amounts of repetition. Deeply nested structs became impossible to work with last year.
I will also have a set of images I want to analyse at the same time, taken from the same days, so I need to take that into account too.
Matlab is so powerful and there are so many ways of managing data. Does anyone have any ideas on organising such a large dataset to be able to analyse so many different parts of it?
Richard Willey on 16 Feb 2012
Have you looked into the dataset array that ships with Statistics Toolbox?
The dataset array is a special data type that can store heterogeneous data. (You can have a column of strings, followed by a column of categoricals, followed by a column of doubles, ...)
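As a rough sketch of what that looks like in practice (the variable names and values here are made up, not from your data):

```matlab
% Hypothetical field observations: a categorical site label, a date
% string, and a numeric measurement, all in one dataset array
site = nominal({'A'; 'B'; 'A'; 'C'});                 % categorical column
day  = {'2011-06-01'; '2011-06-01'; '2011-06-02'; '2011-06-02'};
temp = [21.3; 22.1; 20.8; 23.4];                      % doubles

ds = dataset(site, day, temp);   % one heterogeneous container
ds.Properties.VarNames           % variable names travel with the data
```

Because the column names and types are stored alongside the values, you avoid the parallel-arrays-of-structs bookkeeping you described.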
The dataset array ships with a variety of built-in methods that are designed to simplify data analysis. For example, there is a built-in method for "joins", just like you'd find in a relational database. There's also a built-in method for converting your data from a tall format to a wide format (and vice versa).
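The tall/wide conversion is done with `stack` and `unstack`. A minimal sketch, again with invented data:

```matlab
% Tall format: one row per (day, site) observation
day  = {'d1'; 'd1'; 'd1'; 'd2'; 'd2'; 'd2'};
site = nominal({'A'; 'B'; 'C'; 'A'; 'B'; 'C'});
temp = [21; 22; 20; 23; 24; 19];
tallDs = dataset(day, site, temp);

% Wide format: one row per day, one temp column per site
wideDs = unstack(tallDs, 'temp', 'site');

% ...and back to tall again
tallAgain = stack(wideDs, {'A', 'B', 'C'});
```

The wide form is convenient for comparing sites row by row; the tall form is what most of the fitting and grouping functions expect.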
As a practical example, you cite a requirement to "Compare filtered data in one dataset between three sites (eg all collected at the same time, on the same days)". The join operation would make that a lot easier...
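For instance, if each site's filtered measurements live in their own dataset array keyed by day, a join lines them up for a direct comparison (names here are hypothetical):

```matlab
% Measurements from two sites, keyed by the shared 'day' variable
day   = {'d1'; 'd2'; 'd3'};
tempA = [21; 23; 22];
siteA = dataset(day, tempA);

tempB = [19; 20; 24];
siteB = dataset(day, tempB);

% Join on the common key: one row per day, with both sites'
% measurements side by side, ready for paired comparisons
both = join(siteA, siteB, 'Keys', 'day');
```

From there, a difference like `both.tempA - both.tempB` is guaranteed to compare measurements taken on the same day, regardless of the row order in the original files.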