Calculating group statistics using frequency weights

I have a large table of survey data with about 3 million observations and 120 variables. The survey also contains a variable called "weights" with integers between 250 and 250,000. These weights are intended to make the sample representative, so I have to weight every observation in every calculation. I need a number of (weighted) group means, medians and percentiles.
How can I calculate summary statistics and group statistics, weighting the observations with the frequency weights?
Considering the size of the dataset, I am looking for a solution that scales to the dimensions mentioned above. (If I needed the unweighted statistics, they could be computed easily and efficiently with the groupsummary() function. However, I have not found an option that allows for weights.)
Thanks for your help
Chris

5 Comments

If your weights are integers, one option can be to repeat the observations, using the weights, and then calculate the statistics.
v = [1 2 3 4];
w = [2 2 3 3];
u = repelem(v,w);
% this will repeat the observations in v as specified in w.
If your weights are not integers, you could also try converting them to integers by multiplying by a suitable factor, though I'm not sure that would work correctly.
Thanks! This works in principle. But with 3 million observations and weights that go up to the order of 10^6, this approach will break my code.
Since this is a fairly standard approach in survey analysis, I was hoping to find this implemented somewhere.
Adam Danz on 17 Dec 2019
Edited: Adam Danz on 17 Dec 2019
My first thoughts when I saw this question yesterday were the same as Mohammad Sami's but I didn't offer that suggestion because of the problem described above.
"Since this is a fairly standard approach in survey analysis..."
Could you provide an example? Your question only refers to statistics. What kind of statistics are you aiming for? Means? Std? Variance? What do your weights look like? Are they integers? What range do they have? If you could describe one of the standard approaches you mention, I'm sure there's a simple way to implement weights in whatever statistics you're aiming for.
Thanks for your comment. I have updated the question to give more context.
I added an answer that shows how to scale your inputs according to your weights.


Answers (1)

Since your integer weights are much too large to merely replicate values based on the weights, you can scale your data according to your weights. This is only one of many interpretations of applying weights.
There are several ways around this and the best method depends on how you're using the weights, what those weights mean, and the bounds of those weights. That's something you'll need to think about.
Here's my proposal.
% Create demo data
x = randi(10,1,20); % Main data: 20 random integers
w = randi([250, 250000], size(x)); % random integer weights between 250 and 250000
This is the part you'll need to consider. The idea is to normalize your weights to the interval [0,1], but keep in mind that a weight of 0 will completely eliminate a value.
% Normalize the weights
% Pick ONE of the following definitions of the bounds.
% If you know the upper and lower limits of the weights (safer than the alternatives)
knownWeightBounds = [250,250000];
% Or maybe use
% knownWeightBounds = [0,250000];
% or maybe
% knownWeightBounds = [min(w), max(w)];
% Scale the weights to [0,1]
% (diff gives upper minus lower bound, same as range() but without the Statistics Toolbox)
wNorm = (w-knownWeightBounds(1))/diff(knownWeightBounds);
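To see why the choice of bounds matters, here is a small sketch with made-up weights (wDemo and the variable names below are only for illustration). When the lower bound equals the smallest weight, that observation gets a normalized weight of exactly 0 and drops out. With a lower bound of 0, every observation keeps a positive weight.

```matlab
% Hypothetical weights, chosen only to illustrate the effect of the bounds
wDemo = [250 1000 250000];

% Lower bound = smallest weight: the first observation is weighted 0
boundsMinMax = [min(wDemo), max(wDemo)];
wNormMinMax = (wDemo - boundsMinMax(1)) / diff(boundsMinMax);

% Lower bound = 0: all weights stay positive and keep their ratios
boundsFromZero = [0, max(wDemo)];
wNormFromZero = (wDemo - boundsFromZero(1)) / diff(boundsFromZero);

disp(wNormMinMax)
disp(wNormFromZero)
```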
Now scale your data according to the weights.
% Scale the data by normalized weights.
xScaled = x .* wNorm;
% compute whatever stats you want on xScaled
mu = mean(xScaled)
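For comparison, here is a minimal sketch of the frequency-weight interpretation from the question, where each observation counts w times. This is not the scaling approach above. It computes the same quantities that repelem would give, without building the repeated vector. The demo data, the grouping variable g, and the percentile level p are assumptions made up for illustration. The percentile uses the simple inverse-CDF rule (smallest value whose cumulative weight reaches p of the total), which does not interpolate and can therefore differ from median() or prctile() on the expanded data.

```matlab
% Hypothetical demo data: values, integer frequency weights, group labels
x = [1 2 3 4 10 20 30]';
w = [2 2 3 3 1 1 2]';
g = [1 1 1 1 2 2 2]';

% Use double so that sums of large weights do not overflow an integer type
x = double(x);
w = double(w);

% Weighted group means: sum(w.*x) / sum(w) within each group
[G, groupIDs] = findgroups(g);
wMean = accumarray(G, w .* x) ./ accumarray(G, w);

% Weighted percentile per group (p = 0.5 gives a weighted median)
p = 0.5;
nGroups = numel(groupIDs);
wPrct = zeros(nGroups, 1);
for k = 1:nGroups
    inGroup = (G == k);
    [xs, order] = sort(x(inGroup));   % sort the values of this group
    wk = w(inGroup);
    cw = cumsum(wk(order));           % cumulative weight in sorted order
    % smallest value whose cumulative weight reaches the fraction p
    wPrct(k) = xs(find(cw >= p * cw(end), 1, 'first'));
end

disp(table(groupIDs, wMean, wPrct))
```

For group 1 the weighted mean is 2.7, which matches mean(repelem([1 2 3 4],[2 2 3 3])). The loop over groups keeps the sketch readable. With very many groups, sorting once by group and value would likely be faster.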

Asked: 16 Dec 2019

Commented: 18 Dec 2019
