Indirectly addressing a tall datastore -- MATLAB Golf

Question

Michael le 2 Sep 2019

0
Lien

Utiliser le lien direct vers cette question

https://fr.mathworks.com/matlabcentral/answers/478611-indirectly-addressing-a-tall-datastore-matlab-golf

Commenté : Michael le 3 Sep 2019

Ouvrir dans MATLAB Online

Hello,

I'm hoping to clean up a bit of ugly code.

I have a tall array in this format with DATE and TICKER being the indexes:

>> head(tds)
ans =  8×4 tall table
        DATE           TICKER                    FIELD                  VALUE   
    ____________    _____________    _____________________________    __________
    '09/30/1997'    ARBABAL INDEX    ACTUAL_RELEASE                   '-581'    
    '09/30/1997'    ARBABAL INDEX    BN_SURVEY_NUMBER_OBSERVATIONS    '0'       
    '09/30/1997'    ARBABAL INDEX    ECO_RELEASE_DT                   '19971106'
    '10/31/1997'    ARBABAL INDEX    ACTUAL_RELEASE                   '-585'    
    '10/31/1997'    ARBABAL INDEX    BN_SURVEY_MEDIAN                 '-553'    
    '10/31/1997'    ARBABAL INDEX    BN_SURVEY_NUMBER_OBSERVATIONS    '0'       
    '10/31/1997'    ARBABAL INDEX    ECO_RELEASE_DT                   '19971205'
    '11/30/1997'    ARBABAL INDEX    ACTUAL_RELEASE                   '-907'    

Keeping the column names in the FIELD column allows for a lot of fexibility but it's really not that useful unless it is pivoted using unstack.

>> unstack(tds,'VALUE','FIELD')
ans =  3×6 table
        DATE           TICKER        ACTUAL_RELEASE    BN_SURVEY_MEDIAN    BN_SURVEY_NUMBER_OBSERVATIONS    ECO_RELEASE_DT
    ____________    _____________    ______________    ________________    _____________________________    ______________
    '09/30/1997'    ARBABAL INDEX        '-581'             ''                          '0'                   '19971106'  
    '10/31/1997'    ARBABAL INDEX        '-585'             '-553'                      '0'                   '19971205'  
    '11/30/1997'    ARBABAL INDEX        '-907'             ''                          ''                    ''            

Unfortunately, the tall array is very tall, much taller than 8, and cannot be held in memory so I'm using a datastore. Also the pivoted table would be very wide, much wider than 6.

I need to do something like this:

    select DATE, TICKER, ACTUAL_RELEASE, BN_SURVEY_MEDIAN from tds where str2double(BN_SURVEY_NUMBER_OBSERVATIONS) > 0

It would be easy in the pivoted table, but is pretty ugly in the tall table.

Could someone find an elegant soultion, or at least less ugly?

Thanks,

Michael

PS, this is what I have:

ds = datastore('talltable.csv');                        % CREATE A DATASTORE
ds.SelectedFormats={'%q','%C','%C','%s'};               % DEFINE COLUMN FORMATS
ds.VariableNames = {'DATE','TICKER','FIELD','VALUE'};   % DEFINE COLUMN NAMES
tds = tall(ds);                                         % CREATE A TALL ARRAY
d.summary = summary(tds);                               % GET SUMMARY STATISTICS
i   = find(tds.FIELD=='BN_SURVEY_NUMBER_OBSERVATIONS'); % WHICH ROWS CONTAIN THE NUMBER OF OBSERVATIONS?
j   = i(str2double(tds.VALUE( i)) >10);                 % WHICH OF THOSE ROWS HAVE MORE THAN 10 OBSERVATIONS?
J   = gather(j);
idx = [(tds.DATE) cellstr(tds.TICKER)];                 % CREATE AN INDEX OF DATE AND TICKER COMBINATIONS
IDX = gather(idx);
I   = strcat(IDX(:,1),IDX(:,2)); 
ten = unique(I(J));                                     % WHICH OF THESE CORRESPOND TO J
[I, ten] =gather(I,ten);
big = ismember(I, ten);                                 % WHICH ELEMENTS IN THE TALL ARRAY CORRESPOND TO TEN
ss  = tds(big, :);                                      % EXTRACT THE TABLE
SS=gather(ss);
t = unstack(SS,'VALUE','FIELD');                        % UNSTACK

0 commentaires
Afficher -2 commentaires plus anciensMasquer -2 commentaires plus anciens

Connectez-vous pour commenter.

Connectez-vous pour répondre à cette question.

Answer 1

Guillaume le 2 Sep 2019

0
Lien

Utiliser le lien direct vers cette réponse

https://fr.mathworks.com/matlabcentral/answers/478611-indirectly-addressing-a-tall-datastore-matlab-golf#answer_390126

Modifié(e) : Guillaume le 2 Sep 2019

Ouvrir dans MATLAB Online

Is your FIELD column imported as a string array (as opposed to cellstr)? As a cellstr the == 'BN_SURVEY...' wouldn't work, you'd have to use strcmp.

What's very ugly in my opinion is that you store numbers as strings/char vectors, which is going to slow things down significantly.

Anyway, this sounds like a job for findgroups and splitapply.

First, the selection function:

function [release, surveymed, observations] = getselectedfield(fields, values)
    release = str2double(values(strcmp(fields, 'ACTUAL_RELEASE')));
    median = str2dobule(values(strcmp(fields, 'BN_SURVEY_MEDIAN')));
    observations = str2double(values(strcmp(fields, 'BN_SURVEY_NUMBER_OBSERVATIONS'))
end

then:

[group, groupdate, groupticker] = findgroups(tds.DATE, tds.TICKER);
[releases, surveymed, observations] = splitapply(@getselectedfields, tds.FIELD, tds.VALUE, group);
selected = table(groupdate, groupticker, releases, surveymed, observations);
selected = gather(selected(selected.observations > 0, :));

Untested since you haven't provided example data as a mat file. It may be that some of these functions don't work with a tall array.

4 commentaires
Afficher 2 commentaires plus anciensMasquer 2 commentaires plus anciens

Guillaume le 3 Sep 2019

Modifié(e) : Guillaume le 3 Sep 2019

Yes, it probably would be more optimal to do the filtering for the number of operations earlier, but I don't think it's possible. There's no way to tell splitapply to just discard the current group. So, yes you'll have temporarilly more rows than needed but I don't think it will have much impact on speed (although I haven't tested).

Even if you can't provide your dataset, it's useful to have some dummy data that follows the same format for testing against.

With regards to the error, that would happen if for the same group (same date/ticker combination) you have 0 or more than 1 'ACTUAL_RELEASE', 'BN_SURVEY_MEDIAN' or 'BN_SURVEY_NUMBER_OBSERVATIONS'. I haven't guarded against that as I didn't think it would be the case. I'm not sure what should be done in that case.

Michael le 3 Sep 2019

Sorry, I accidentally answered instead of commented.

Connectez-vous pour commenter.

Answer 2

Michael le 3 Sep 2019

0
Lien

Utiliser le lien direct vers cette réponse

https://fr.mathworks.com/matlabcentral/answers/478611-indirectly-addressing-a-tall-datastore-matlab-golf#answer_390233

tallEcon_Anon.zip

Dear Guillaume,

Thank you.

1) It would be more than optimal because the data set is huge, but I think I can manage this way. Right now, I'm breaking up the data into several smaller files but I'd prefer to have one large file for simplicity.

2) I am attaching a dummy data set but I'd like to use this technique on different files with different field names and diferent queries as well. This file format is very flexible and convenient if only I could make the queries nicer. I'm also working on a database solution but I'd like to do it all in Matlab if I can.

3) Yes, there are very many repetitions for FIELD but each has row a unique FIELD, TICKER and DATE combination. It will be apparent in the dummy data.

Thanks for the help,

Michael

0 commentaires
Afficher -2 commentaires plus anciensMasquer -2 commentaires plus anciens

Connectez-vous pour commenter.

Indirectly addressing a tall datastore -- MATLAB Golf

0 commentaires
Afficher -2 commentaires plus anciensMasquer -2 commentaires plus anciens

Réponses (2)

4 commentaires
Afficher 2 commentaires plus anciensMasquer 2 commentaires plus anciens

0 commentaires
Afficher -2 commentaires plus anciensMasquer -2 commentaires plus anciens

Voir également

Catégories

Tags

Produits

Version

Community Treasure Hunt

Indirectly addressing a tall datastore -- MATLAB Golf

0 commentaires Afficher -2 commentaires plus anciensMasquer -2 commentaires plus anciens

Réponses (2)

4 commentaires Afficher 2 commentaires plus anciensMasquer 2 commentaires plus anciens

0 commentaires Afficher -2 commentaires plus anciensMasquer -2 commentaires plus anciens

Voir également

Catégories

Tags

Produits

Version

Community Treasure Hunt

0 commentaires
Afficher -2 commentaires plus anciensMasquer -2 commentaires plus anciens

4 commentaires
Afficher 2 commentaires plus anciensMasquer 2 commentaires plus anciens

0 commentaires
Afficher -2 commentaires plus anciensMasquer -2 commentaires plus anciens