Summing values with database "group by" functionality

Question

0 votes

Tables in databases can easily be transformed by use of the "Group by" function.

Groups usually occur on a common label (such as state), while the data columns can be summed, averaged, counted, etc.

In the past, I've taken matrices in MATLAB and dumped them into SQL tables, where I can easily use this functionality.

Writing the data can take a long time, and if the grouping could be done in MATLAB before writing the data, it would save considerable time.

Is there a way to do this in MATLAB directly?


Answer 1

Sean de Wolski on 4 May 2012


1 vote

doc accumarray
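
As a minimal sketch of how accumarray can stand in for a SQL "group by", assuming a hypothetical two-column matrix in which column 1 is the group label and column 2 is the value to aggregate (the toy numbers below are made up for illustration):

```matlab
% Hypothetical toy data: column 1 is a group label, column 2 a value
data = [ 1949 10
         3216 20
         1949  5
          638  7
         3216  1 ];

% Map each label to a group index 1..nGroups (unique returns sorted labels)
[labels, ~, idx] = unique(data(:,1));

% Sum, mean, and count per group
groupSum   = accumarray(idx, data(:,2));
groupMean  = accumarray(idx, data(:,2), [], @mean);
groupCount = accumarray(idx, 1);

% One row per group: label, sum, mean, count
result = [labels, groupSum, groupMean, groupCount];
disp(result)
```

The third output of unique supplies the subscripts that accumarray needs, so the labels do not have to be consecutive integers.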


Answer 2

per isakson on 4 May 2012


0 votes

In the Statistics Toolbox there is a class named dataset. It has a method grpstats.

    grpstats
    Class: dataset
    Summary statistics by group for dataset arrays

I think that does exactly what you ask for and more.

With plain MATLAB, I'm convinced that data in a cell array, a for loop, and some logical indexing will do the job. If you provide a toy example of the data, someone here will give you a piece of code that demonstrates the approach.
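
A rough sketch of that plain-MATLAB approach, assuming hypothetical state labels stored in a cell array and a matching vector of values (none of these names or numbers come from the thread):

```matlab
% Hypothetical toy data: state labels in a cell array, values in a vector
state = {'NY'; 'CA'; 'NY'; 'TX'; 'CA'};
value = [10; 20; 5; 7; 1];

groups   = unique(state);            % sorted list of distinct labels
groupSum = zeros(numel(groups), 1);  % preallocate one sum per group

for k = 1:numel(groups)
    inGroup     = strcmp(state, groups{k});  % logical index for this group
    groupSum(k) = sum(value(inGroup));       % aggregate the selected rows
end
```

Replacing sum with mean or nnz in the loop body gives the other usual "group by" aggregates.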

--- CONT. ---

Study the Statistics Toolbox example "Using Dataset Arrays".

--- CONT. 2012-05-07 ---

@David: This "file" is a mess.

There are seven column headers or is it five?
The first and second data row each contains eight values.
The third data row contains nine values.
Both comma and space are used as list separators.

Additional information is needed: how to identify "missing values", etc. Or is space not a separator but part of text values? You tell me.


David on 4 May 2012

This looks like it's headed in the right direction, but I'm still stumbling through the syntax. From your suggestions, I think "datasets" offer the most promising solution.

Here is a sample data set:

Event ID Loss Rate EL Relative Rate

1949 4,468,871,680 0.000000% 14.2 0.0000009%

3216 3,484,544,256 0.000016% 574.8 0.0000470%

6443 3,387,036,160 0.000865% 29,301.4 0.0024670%

3721 3,143,826,176 0.000604% 18,994.9 0.0017230%

638 3,033,682,176 0.000025% 757.4 0.0000712%

2341 2,886,927,616 0.000727% 20,980.8 0.0020725%

2966 2,596,929,792 0.000077% 1,997.2 0.0002193%

3844 2,356,046,336 0.000051% 1,209.6 0.0001464%

5757 2,310,576,640 0.000650% 15,027.7 0.0018547%

3150 2,264,359,168 0.000208% 4,709.1 0.0005931%

1024 2,264,254,464 0.000589% 13,335.2 0.0016795%

5180 2,148,002,816 0.000861% 18,498.1 0.0024558%

6180 2,148,002,816 0.000861% 18,498.1 0.0024558%

2507 2,122,473,472 0.000019% 393.8 0.0000529%

570 2,027,159,296 0.000041% 831.3 0.0001169%

3860 2,012,201,088 0.000258% 5,188.2 0.0007353%

742 2,007,396,736 0.000617% 12,379.9 0.0017587%

1949 2,000,215,424 0.000439% 8,784.5 0.0012524%

3216 1,844,881,664 0.000096% 1,766.2 0.0002730%

6443 1,841,915,648 0.000149% 2,747.5 0.0004254%

3721 1,835,679,360 0.000190% 3,487.5 0.0005418%

638 1,795,809,024 0.000120% 2,149.3 0.0003413%

2341 1,731,981,312 0.005341% 92,510.9 0.0152320%

2966 1,703,602,432 0.000181% 3,091.1 0.0005174%

3844 1,631,297,920 0.000636% 10,367.2 0.0018123%

5757 1,616,631,936 0.001162% 18,783.5 0.0033134%

3150 1,603,798,272 0.002974% 47,698.8 0.0084814%

1024 1,591,393,280 0.010651% 169,495.7 0.0303731%

5180 1,591,393,280 0.010651% 169,495.7 0.0303731%

6180 1,576,680,576 0.002123% 33,470.1 0.0060537%

2507 1,556,411,008 0.009640% 150,039.4 0.0274909%

570 1,543,776,000 0.000003% 50.7 0.0000094%

3860 1,524,221,696 0.000029% 435.2 0.0000814%

Walter Roberson on 7 May 2012

Looks to me as if comma is used as a digit-grouping (thousands) separator in this file.

per isakson on 8 May 2012

@Walter, yes indeed.


Answer 3

Peter Perkins on 7 May 2012


0 votes

David, your example data has a few problems, notably the percent signs. Without having any details about what you are trying to do (and in particular how you might want to define groups in your data), here is an example of what you might do. This example uses a dataset array, but since you have nothing but numeric data, there's no reason why you could not use grpstats on a matrix.

>> % remove embedded spaces from last column header in file
>> % remove stray space from end of next to last line
>> data = dataset('File','tmp.dat','Delimiter',' ');
>> % remove the percent signs
>> data.ID = uint64(data.ID)/100;
>> data.Loss = str2double(strrep(data.Loss,'%',''))/100;
>> data.EL = str2double(strrep(data.EL,'%',''));
>> data
data = 
    Event    ID         Loss          Rate    EL
    1949     44688717          0       14.2      9e-07     
    3216     34845443    1.6e-07      574.8    4.7e-05     
    6443     33870362   8.65e-06      29301   0.002467     
    3721     31438262   6.04e-06      18995   0.001723     
[snip]
     570     15437760      3e-08       50.7    9.4e-06     
    3860     15242217    2.9e-07      435.2   8.14e-05     
>> data.LossGroup = ...
      ordinal(data.Loss,{'High' 'Low'},[],[0,median(data.Loss),Inf])
data = 
     Event    ID          Loss      Rate   EL         LossGroup
     1949     44688717          0    14.2     9e-07   High     
     3216     34845443    1.6e-07   574.8   4.7e-05   High     
     6443     33870362   8.65e-06   29301  0.002467   Low      
     3721     31438262   6.04e-06   18995  0.001723   Low      
    [snip]
      570     15437760      3e-08    50.7   9.4e-06   High     
     3860     15242217    2.9e-07   435.2  8.14e-05   High     
>> lossGroupMeans = grpstats(data,'LossGroup','mean', ...
                             'DataVars',{'Rate' 'ELRelativeRate'})
lossGroupMeans = 
            LossGroup    GroupCount    mean_Rate    mean_EL
    High    High         16            1837.7       0.00026079         
    Low     Low          17             49862        0.0082853


per isakson on 7 May 2012

Ok, comma serves as a thousand separator.


Answer 4

per isakson on 8 May 2012


0 votes

Here is a solution that doesn't require the Statistics Toolbox.

Separators:

list separator: space
thousand separator: comma
trailing: %

Approach:

read the first line (header) to a separate variable.
read the rest of the file to a string buffer
remove "," and "%" from the string buffer
read the string buffer with textscan
convert the cell array of double vectors to a double array

[ hdr, M ] = Read_text_file();

The whole numbers in the file have been converted to "flints" (floating-point integers; see Floating Points). The mean of the rows for which Event==638 can be calculated with logical indexing:

mean( M( M(:,1)==638, : ), 1 )

With "flints" it is safe to use "==". With floating-point numbers one needs to allow for rounding errors:

mean( M( abs(M(:,5)-0.0024558) < epsilon, : ), 1 )

where epsilon is some appropriate small number.
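
To make the tolerance comparison concrete, here is a small sketch. The matrix below is a hypothetical stand-in for M built from three rows of the sample data, and the value of epsilon is an assumption that should be chosen to suit the precision of the data:

```matlab
% Hypothetical stand-in for the matrix M returned by Read_text_file
M = [ 5180 2148002816 0.000861 18498.1 0.0024558
      6180 2148002816 0.000861 18498.1 0.0024558
      2507 2122473472 0.000019   393.8 0.0000529 ];

target  = 0.0024558;   % value to match in column 5
epsilon = 1e-9;        % assumed tolerance, small relative to the data

% Rows whose fifth column equals the target within the tolerance
rows    = abs(M(:,5) - target) < epsilon;

% Column-wise mean over the selected rows
rowMean = mean(M(rows, :), 1);
```

The first two rows are selected and the third is not, since its fifth column differs from the target by far more than epsilon.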

--- Attachment ---

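    function   [ hdr, M ] = Read_text_file()
        % Read the header line and the numeric data from the text file
        fid = fopen( 'Read_text_file.txt', 'r' );
        hdr = fgetl( fid );                 % first line: column headers
        str = fread( fid, '*char' )';       % rest of file as a row vector of char
        sts = fclose( fid );       %#ok<NASGU>
        str( str == ',' ) = [];             % remove thousands separators
        str( str == '%' ) = [];             % remove trailing percent signs
        cac = textscan( str, '%f%f%f%f%f' );    % five numeric columns
        M   = [ cac{:} ];                   % cell array of columns -> double matrix
    end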
