Big data question: how to generate a variable efficiently and aggregate
I have a file with tens of millions of observations, each with a string identifier, which I load as a datastore:
V1     V2          V3  V4
KLM88  2001-06-30  10  COMPANY1
KLM88  2000-12-31  20  COMPANY1
MNH7C  2001-09-30  23  COMPANY1
MNH7C  2001-06-30  15  COMPANY1
MNH7C  2000-12-31   6  COMPANY1
HG9LB  2000-12-31   2  COMPANY1
I also have a MAT-file with some extra information that maps the first variable (V1) to a country:
KLM88  COUNTRYA
MNH7C  COUNTRYA
HG9LB  COUNTRYB
I want to aggregate my dataset by country, date, and company (summing V3), so that the end result is:
2001-09-30  23  COMPANY1  COUNTRYA
2001-06-30  25  COMPANY1  COUNTRYA
2000-12-31  26  COMPANY1  COUNTRYA
2000-12-31   2  COMPANY1  COUNTRYB
I know I can do this by reading the data chunk by chunk and assigning the country in a for loop. However, that takes a huge amount of time. Are there any other suggestions for how to do this? I am fairly new to the concepts of tall arrays, mapreduce, etc., so I am not sure how to get what I want more efficiently.
Accepted Answer
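A minimal sketch of one way to approach this, not necessarily the method the original answer used: join the country lookup onto the data once, then aggregate with `groupsummary`, instead of assigning countries in a per-chunk loop. The sample rows from the question stand in for the datastore here, and the variable names `Id`, `Date`, `Value`, `Company`, and `Country` are assumptions, since the question only labels the columns V1 to V4.

```matlab
% Sample rows from the question (in-memory stand-in for the datastore).
% Variable names are hypothetical; the question only calls them V1..V4.
Id      = ["KLM88"; "KLM88"; "MNH7C"; "MNH7C"; "MNH7C"; "HG9LB"];
Date    = datetime(["2001-06-30"; "2000-12-31"; "2001-09-30"; ...
                    "2001-06-30"; "2000-12-31"; "2000-12-31"], ...
                   'InputFormat', 'yyyy-MM-dd');
Value   = [10; 20; 23; 15; 6; 2];
Company = repmat("COMPANY1", 6, 1);
data    = table(Id, Date, Value, Company);

% Lookup table, as it might be loaded from the MAT-file
% (one row per identifier).
lookup = table(["KLM88"; "MNH7C"; "HG9LB"], ...
               ["COUNTRYA"; "COUNTRYA"; "COUNTRYB"], ...
               'VariableNames', {'Id', 'Country'});

% Attach the country to every observation in one vectorized step.
joined = join(data, lookup, 'Keys', 'Id');

% Sum Value within each (Country, Date, Company) group.
result = groupsummary(joined, {'Country', 'Date', 'Company'}, 'sum', 'Value');

disp(result)
```

The summed values land in `result.sum_Value`, alongside a `GroupCount` column. As far as I know, both `join` and `groupsummary` are documented as supporting tall tables, so on the full file `data` could presumably be replaced with `tall(ds)` and the result collected with `gather`; that path is untested here.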