data management for large datasets
I will have two sets of field data -- one taken for six weeks last year, and another taken for two months this year. For EACH dataset I have variables collected from 4-8 different sources, for (up to) 50 days, collected at up to 3 different sites. Both datasets, in their entirety, span about 200 - 300 columns and between 8,000 - 15,000 rows.
Within that, I'm trying to figure out how to set up my code for analysing both sets of data. I want to do some different things --
- Analyse the data from each source separately to check for errors
- Filter out a large quantity (up to 25%) of data which is poor quality
- Check all the filtered data from ONE dataset for trends between days (rows) and variables (columns)
- Compare filtered data in one dataset between three sites (e.g. all collected at the same time, on the same days)
- Compare different (filtered) variables within a single dataset over time, and
- Perform analysis on the changes between both (filtered) datasets.
I have no idea how to structure and maintain my code to allow me to do all of these things. I know some of the tests I want to do, but others I haven't thought of yet. At the moment I have about 10 different programs which load and structure my raw datafiles in different ways (one comprised of an array of structs, another where the data is subset into variables, etc.), but this is incredibly confusing and has led to a lot of errors and enormous amounts of repetition. Deeply nested structs became impossible to work with last year.
I will also have a set of images I want to analyse at the same time, taken from the same days, so I need to take that into account too.
Matlab is so powerful and there are so many ways of managing data. Does anyone have any ideas on organising such a large dataset to be able to analyse so many different parts of it?
Richard Willey on 16 Feb 2012
Have you looked into the dataset array that ships with Statistics Toolbox?
The dataset array is a special data type that can store heterogeneous data. (You can have a column of strings, followed by a column of categoricals, followed by a column of doubles, ...)
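As a rough sketch of what that looks like in practice (the variable names and values here are made up, not from your data):

```matlab
% Hypothetical field observations: a categorical site label, a date
% string, and a numeric measurement, all in one dataset array
site = nominal({'A'; 'B'; 'A'; 'C'});                 % categorical column
day  = {'2011-06-01'; '2011-06-01'; '2011-06-02'; '2011-06-02'};
temp = [21.3; 22.1; 20.8; 23.4];                      % doubles

ds = dataset(site, day, temp);   % one heterogeneous container
ds.Properties.VarNames           % variable names travel with the data
```

Because the column names and types are stored alongside the values, you avoid the parallel-arrays-of-structs bookkeeping you described.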
The dataset array ships with a variety of built-in methods that are designed to simplify data analysis. For example, there is a built-in method for "joins", just like you'd find in a relational database. There's also a built-in method for converting your data from a tall format to a wide format (and vice versa).
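The tall/wide conversion is done with `stack` and `unstack`. A minimal sketch, again with invented data:

```matlab
% Tall format: one row per (day, site) observation
day  = {'d1'; 'd1'; 'd1'; 'd2'; 'd2'; 'd2'};
site = nominal({'A'; 'B'; 'C'; 'A'; 'B'; 'C'});
temp = [21; 22; 20; 23; 24; 19];
tallDs = dataset(day, site, temp);

% Wide format: one row per day, one temp column per site
wideDs = unstack(tallDs, 'temp', 'site');

% ...and back to tall again
tallAgain = stack(wideDs, {'A', 'B', 'C'});
```

The wide form is convenient for comparing sites row by row; the tall form is what most of the fitting and grouping functions expect.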
As a practical example, you cite a requirement to "Compare filtered data in one dataset between three sites (eg all collected at the same time, on the same days)". The join operation would make that a lot easier...
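For instance, if each site's filtered measurements live in their own dataset array keyed by day, a join lines them up for a direct comparison (names here are hypothetical):

```matlab
% Measurements from two sites, keyed by the shared 'day' variable
day   = {'d1'; 'd2'; 'd3'};
tempA = [21; 23; 22];
siteA = dataset(day, tempA);

tempB = [19; 20; 24];
siteB = dataset(day, tempB);

% Join on the common key: one row per day, with both sites'
% measurements side by side, ready for paired comparisons
both = join(siteA, siteB, 'Keys', 'day');
```

From there, a difference like `both.tempA - both.tempB` is guaranteed to compare measurements taken on the same day, regardless of the row order in the original files.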