Processing Tall Arrays Taking Too Long

I have a tall array with about 1.7 billion rows of data and 14 columns. I want to process this data the same way several examples (using the airline data) do. For now I am just trying to extract one column and find its mean. My code is something like:
ds = datastore('some-file.csv');
tt = tall(ds); %Mx14 tall table (M should be about 1.7 billion)
a = tt.V; %Mx1 tall double %(M should be the same as above)
m = mean(a); %scalar (1x1 tall double)
gather_m = gather(m);
The gather step is taking far too much time; I have never seen it complete. In the examples I have seen, this step finishes in a few seconds. Eventually I want to make calculations and plots, but I want to get this simple step working first. Can anyone recognize the problem and recommend a solution? I have a parallel pool running with two workers.
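For context on what I will need later: tall operations are deferred, and each call to gather triggers a full pass over the data. A sketch of how I plan to request several statistics in one gather so the file is only read once (reusing my column V; the other names are just placeholders):

```matlab
ds = datastore('some-file.csv');
tt = tall(ds);      % Mx14 tall table
a = tt.V;           % Mx1 tall double
% One gather with multiple tall inputs is evaluated in a single
% pass over the data, instead of one full pass per statistic.
[m, s, mn, mx] = gather(mean(a), std(a), min(a), max(a));
```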
Thank you very much.

2 comments

Walter Roberson
Walter Roberson on 1 Nov 2017
I had not realized that tall arrays used parallel if available, but I see that they do; https://www.mathworks.com/help/distcomp/run-tall-arrays-on-a-parallel-pool.html
Avinash Rajendra
Avinash Rajendra on 1 Nov 2017
They do, but it still takes far too long to run. I'd be in good shape if the run time were more manageable.

Sign in to comment.

Answers (1)

Kojiro Saito
Kojiro Saito on 2 Nov 2017
Configuring the datastore's read size may speed this up. You can check the default read size with
ds.ReadSize
This is the amount of data MATLAB reads from the file at one time. Setting it higher than the default reduces file I/O. Please set ds.ReadSize, for example:
ds = datastore('some-file.csv');
ds.ReadSize = 100000; % Or higher
tt = tall(ds); %Mx14 tall table (M should be about 1.7 billion)
a = tt.V; %Mx1 tall double %(M should be the same as above)
m = mean(a); %scalar (1x1 tall double)
gather_m = gather(m);
Hope this helps.
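One more thing that may help, since you only need one of the 14 columns: a tabular text datastore can be told to read just that variable, which cuts the file I/O roughly in proportion. A sketch, assuming the column really is named V in your file:

```matlab
ds = datastore('some-file.csv');      % creates a TabularTextDatastore for CSV
ds.SelectedVariableNames = {'V'};     % read only the column you need
ds.ReadSize = 100000;                 % larger chunks, fewer reads
tt = tall(ds);                        % Mx1 tall table (only V)
m = gather(mean(tt.V));               % single pass over one column
```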

4 comments

Avinash Rajendra
Avinash Rajendra on 6 Nov 2017
Edited: Avinash Rajendra on 6 Nov 2017
Thanks for the answer. ds.ReadSize was originally 20,000, and I set it to 1,000,000,000, but unfortunately, the gather step was still taking too long. Is there another method to make this run faster? I feel like there must be a solution because datastores and tall arrays are built for big data with millions or billions of points.
Update: I switched to a set of CSV files containing about 170 million rows in total, just to see whether the gather worked for a smaller amount of data. With a ds.ReadSize of 100,000,000, it completes in 12.633 minutes. Does anyone know a way to speed this up, or a workaround that bypasses the issue? This kind of run time will not work for the program I am developing.
minomi
minomi on 1 Aug 2018
Hi Avinash, did you find a solution to your problem? I am interested in knowing the answer if you did.
Avinash Rajendra
Avinash Rajendra on 1 Aug 2018
No, I didn't get a satisfactory answer to this. I ended up switching to Python and Spark to get what I wanted.
Dominique Ingala
Dominique Ingala on 7 Apr 2021
I came from Python and R, with the same struggle, so I'm now trying MATLAB. If you managed to fix this, please share some secrets. Thanks.

Sign in to comment.
