Random vector v from a uniform distribution on (0,1) with sum(v)=1

Hello,
How can I generate a uniformly distributed random vector whose elements sum to 1?
Thank you

Accepted Answer

Too many people think that generating a uniform sample and then normalizing by the sum will produce a uniform sample. In fact, this is NOT true.
A good way to visualize this is to generate such a sample in the 2-d case. For example, suppose we do it the wrong way first:
xy = rand(100,2);
plot(xy(:,1),xy(:,2),'.')
Now, let's do the sum projection that virtually everyone proposes. (Yes, it is the obvious choice. Now we will see why it is the wrong approach.)
xys = bsxfun(@rdivide,xy,sum(xy,2));
hold on
plot(xys(:,1),xys(:,2),'ro')
axis equal
axis square
The sum-projected points lie along the diagonal line. Note the distribution seems to be biased towards the middle of the line. A uniform sample would have points uniformly distributed along that line.
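The bias is easy to quantify. As a sketch (the variable names here are illustrative, not from the thread), one can histogram the position of each normalized point along the line, parameterized by t = x1/(x1+x2):

```matlab
% Position along the line x1+x2 = 1, parameterized by t = x1/(x1+x2).
xy = rand(100000,2);
t = xy(:,1) ./ sum(xy,2);
edges = 0:0.02:1;
counts = histc(t, edges);
plot(edges(1:end-1), counts(1:end-1), 'b.-')
% The counts peak at t = 0.5 and fall off toward 0 and 1. For iid
% uniform x1, x2 the density of t works out to 1/(2*(1-t)^2) for
% t <= 1/2 and 1/(2*t^2) for t >= 1/2: it is 2 at the center and
% only 1/2 at the ends, so the projected sample is clearly not flat.
```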
In a low number of dimensions there are some nice tricks to generate a sample that is indeed uniform. I tend to use Roger Stafford's submission to the file exchange, randfixedsum. It is efficient, and works in any number of dimensions.
figure
xyr = randfixedsum(2,100,1,0,1)';
plot(xyr(:,1),xyr(:,2),'ro')
axis equal
axis square

17 comments

I think this is a good illustration of something that is counter-intuitive, though I don't know that it really explains why this counter-intuitive result happens. Why does a sample taken from a uniform distribution, divided by a constant, end up no longer uniform?
For small dimensions you are right; however, in larger dimensions there is no such problem. Still, the information you provided was something I didn't know, and it is very useful.
Thank you.
Youssef Khmou on 14 Mar 2014
Edited: Youssef Khmou on 14 Mar 2014
Image Analyst's question is reasonable; in fact, intuition suggests that multiplication by a constant yields a sort of translation (of the segment [a,b] of U)....
To John D'Errico: why can't we say that nothing happens to the distribution, that it stays the same, and that multiplication merely flattens/widens or narrows/heightens it?
Isn't the reason that you don't get a uniform distribution basically that the chance of getting a value close to the mean is higher? You can compare it to throwing two dice and calculating the sum of their values. The chance of getting a 7 is much higher than of getting a 12, because you can get a 7 with a 3 and a 4, or a 2 and a 5, etc., while you can only get a 12 with two 6s. See: http://calculus-geometry.hubpages.com/hub/How-to-Compute-the-Probability-of-Rolling-a-Sum-with-Two-Dice
Exactly, Paul. In the limiting case, with X a vector of uniformly distributed components, sum(X) approaches a normal distribution.
Paul - That is one way of looking at it. Another way of looking at it is to look at the first plot I generated. There are fewer points in the 2-d original sample that map into a point on one of the ends of the line than those that map to a point at the center.
Or for another way that is asymptotically what Walter suggested, consider that the sum of two uniform random variables has a triangular distribution.
In any case, the point is that a normalization scheme of dividing by the sum will not yield a uniform sample in the projected subspace.
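As a quick numerical sketch of the triangular distribution mentioned above (variable names are illustrative):

```matlab
% The sum of two independent U(0,1) variables has a triangular
% density on (0,2), peaking at 1: the convolution of two boxes.
s = sum(rand(100000,2), 2);
edges = 0:0.05:2;
counts = histc(s, edges);
plot(edges(1:end-1), counts(1:end-1), 'ro-')
% The counts rise roughly linearly up to s = 1, then fall back down.
```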
I'm curious why a single dimension, uniformly distributed array of values ceases to be uniform when scaled with a linear transformation. Relative spacing between points is preserved. The "shape" of the data is unchanged, just the scale changes, no?
Your example is averaging pairs of points in a two-dimensional distribution, which will obviously tend toward the mean of the distribution. I don't see how that would happen when no averaging is being done.
Would you mind explaining?
The PDF of a sum of two random variables is the convolution of the two individual PDFs. So if you take two uniform variables and convolve them, you get a triangle, which you can see in the red circles in John's plot above. Of course, by the Central Limit Theorem, if you do it for tons of rv's you get a normal distribution, as Walter mentioned. You can observe this triangle for the simple case of the sum of a pair of dice, as in the sample below.
% Roll a pair of dice a million times.
d = randi(6, 1000000,2);
s = sum(d,2); % Sum the two dice
% Get distribution of the sums.
edges = 2:12;
counts = histc(s, edges);
% Plot distribution of the sum, which will be a triangle
% which is the convolution of two uniformly distributed rv's.
plot(counts, 'ro-')
But when you just have a single rv and just divide the values by the sum so that they sum to 1 instead of what they used to sum to, I don't think the shape of the PDF will change. This little script seems to back that up:
% Roll one die 100,000 times.
d = randi(6, 100000, 1);
% Find the sum
theSum = sum(d)
% Get distribution of the rolls.
edges = 1:6;
counts = histc(d, edges);
max(counts)
% Plot distribution of the uniformly distributed rv's.
subplot(1,2,1);
plot(counts, 'ro-')
ylim([0, 1.5*max(counts)]);
title('Original', 'FontSize', 15);
grid on;
% Normalize by dividing by theSum so that new sum will = 1
d2 = d / theSum;
% Find the sum
theSum2 = sum(d2)
% Get distribution of the rolls.
edges2 = [1:6] / theSum;
counts2 = histc(d2, edges2);
max(counts2)
% Plot distribution of the uniformly distributed rv's.
subplot(1,2,2);
plot(counts2, 'ro-')
ylim([0, 1.5*max(counts2)]);
title('Normalized', 'FontSize', 15);
grid on;
% d and d2 are different but the PDFs (counts and counts2) are the same.
So I guess I can see Benjamin's point and would like clarification from John.
Matt J on 14 May 2014
Edited: Matt J on 14 May 2014
Another way to understand it (in 2D) is to remember that the random vector [x1,x2]=rand(1,2) is drawn uniformly from the unit square, but the line segments intersecting the unit square are not all of equal length. The length of these segments affects the probability mass of each outcome for the normalized random vector v=[x1,x2]/(x1+x2).
As an example, consider the case where v=[.5,.5]. This result for v is obtained by any x1=x2, i.e. any pair on the main diagonal of the unit square, which has length sqrt(2).
Conversely, for v=[.8,.2], any pair (x1,x2) in the unit square and on the line x2=x1/4 will achieve this v. However, this line segment only has length 1.0308<sqrt(2). I.e., it has lower probability mass than v=[0.5,0.5] does.
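These chord lengths can be checked with a small sketch (the formula below is a straightforward geometric derivation, not something posted in the thread):

```matlab
% Chord length of the line x2 = x1*(1-a)/a inside the unit square,
% where a = v(1) is the first coordinate of the normalized vector v.
a = 0.8;                 % corresponds to v = [.8, .2]
m = (1-a)/a;             % slope of the line through the origin
if m <= 1
    L = sqrt(1 + m^2);   % line exits the square through x1 = 1
else
    L = sqrt(1 + 1/m^2); % line exits the square through x2 = 1
end
disp(L)                  % 1.0308 for a = 0.8; sqrt(2) for a = 0.5
```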
I'm curious why a single dimension, uniformly distributed array of values ceases to be uniform when scaled with a linear transformation. Relative spacing between points is preserved. The "shape" of the data is unchanged, just the scale changes, no?
@Ben,
I'm not sure where the notion that this is a linear transformation is coming from. We start with a sequence of i.i.d uniformly distributed variables x(i), i=1...N and transform to the non-i.i.d variables
v(j)=x(j)/(x(1)+x(2)+...x(N)), j=1,2,...N
The right hand side of the above is a highly nonlinear, coupled function of the x(i).
Perhaps the idea was to view 1/sum(x) like a simple scaling constant? We can't. It is a random variable, dependent in part on the very x(j) that we are scaling.
I'm not sure that it is correct to view x as a random variable, since x is fully known and unchanging once it has been computed. I believe at that point, it is merely a distribution and can be scaled without affecting uniformity.
I've had some trouble finding online documentation specifically regarding this, but here's an excerpt (and link) from a JMP page on the subject:
*****
Random Uniform
Generates random numbers uniformly between 0 and 1. This means that any number between 0 and 1 is as likely to be generated as any other. The result is an approximately even distribution. You can shift the distribution and change its range with constants. For example, 5 + Random Uniform()*20 generates uniform random numbers between 5 and 25.
*****
The MATLAB documentation claims that rand() produces an approximately uniform distribution. It would stand to reason that this distribution should also maintain its uniformity if shifted or scaled. Intentionally selecting a scaling factor a posteriori which results in the sum of the elements of the distribution equalling 1 does not appear to be a special case, and should still be a linear transformation.
However, what has occurred to me is that the process of scaling would alter the range of the distribution such that the range is no longer (0,1). If the range does not need to be maintained, my suggestion should be valid. If the range must be maintained, then another approach would be required.
Based on the original question, it does not appear to me that maintaining the range is a requirement.
I'm not sure that it is correct to view x as a random variable, since x is fully known and unchanging once it has been computed
It is not unchanging, because multiple realizations of x are to be computed. The "constant" you propose to scale/shift by is not a true constant because it is derived from x and therefore varies with realization also.
It is true, however, that if you scale/shift a uniform random variable by a (realization-independent) constant, the result will also be a uniform random variable, though with a different range, as you noted.
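That realization-independent case is easy to demonstrate; here is a sketch using the shift/scale from the JMP excerpt above:

```matlab
% Scaling and shifting by true constants preserves uniformity;
% only the interval changes: 5 + 20*U(0,1) is uniform on (5,25).
x = rand(100000,1);
y = 5 + 20*x;
edges = 5:1:25;
counts = histc(y, edges);
plot(edges(1:end-1), counts(1:end-1), 'ro-')
% The counts are flat across (5,25), up to sampling noise.
```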
To look at it another way, what would you say is the distribution of x in the following?
u=rand;
v=randn;
x=u*v;
Is x normally distributed because u can be viewed as a constant scale factor? Or is it uniformly distributed because v can be viewed as a constant scale factor?
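For what it's worth, a numerical sketch suggests the answer is "neither"; the kurtosis check here is my own illustration, not from the thread:

```matlab
% The product of a uniform and an independent normal variable is
% neither uniform nor normal. One quick check: its kurtosis.
u = rand(100000,1);
v = randn(100000,1);
x = u .* v;
k = mean(x.^4) / mean(x.^2)^2
% k comes out near 5.4; a normal variable gives 3, and a uniform
% variable gives 1.8, so x matches neither parent distribution.
```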
Consider the set of points that gets mapped to any point along the line. There are simply MORE points that get mapped to the midpoint of the line, than those that get mapped to an end point of the line.
This must tell you that the distribution of points obtained by the renormalizing scheme is NOT uniform along the projected line.
There are schemes that DO generate a uniform distribution along that line, and they are absolutely trivial to write, at least in low numbers of dimensions. For example, in two dimensions,
A = [0 1];
B = [1 0];
t = rand(1000,1);
P = (1-t)*A + t*B;
Here each row of the array P can be interpreted as a point in 2-dimensions. Those points have the property that they MUST sum to 1. And most importantly, they are clearly uniformly distributed along that line. Any such point on the line is as likely to result as any other, to the extent that the function rand produces truly uniform pseudo-random deviates, something The MathWorks has spent a fair amount of effort to ensure.
Note that in higher dimensions there are also schemes much like the one I show; however, Roger's randfixedsum is well written, fast, and simply the best tool to use.
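For reference, one classical higher-dimensional construction (not necessarily what randfixedsum does internally) uses the spacings of sorted uniform draws:

```matlab
% The gaps between n-1 sorted U(0,1) draws, padded with 0 and 1, are
% uniformly distributed over the simplex {x : x >= 0, sum(x) = 1}.
n = 5;                                   % dimension
m = 1000;                                % number of random vectors
u = sort(rand(m, n-1), 2);               % sorted uniforms, row-wise
P = diff([zeros(m,1), u, ones(m,1)], 1, 2);
% Each row of P is nonnegative and sums to 1, and the rows are
% uniform over the simplex, unlike the divide-by-sum scheme.
```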
Let me start by saying that I appreciate this discussion and I hope you'll be patient enough with me to continue.
@Matt, to clarify what I was saying, I was referencing x as the total distribution, not as individual points within the distribution. Once generated and scaled, a single distribution would need to remain unchanged, or the sum would change. It would be impossible to scale each point in the distribution as it was generated, because the sum of all generated points would be unknown.
@John, I generated a uniform distribution and scaled it to sum up to 1. I plotted each of these distributions (original and scaled) in their original order to compare shape and density. I then sorted each distribution and plotted them to look for grouping of points towards the mean of the distribution. The plots are posted below. I do not see any change of shape, density, or uniformity. The unsorted plots are of 10,000 points and the sorted plots are of 1,000 (for clear visualization). Is the effect you are describing too subtle to notice at these point densities or am I missing something?
[Image: Original order plot. Unscaled on left, scaled on right.]
[Image: Sorted plot, unscaled.]
[Image: Sorted plot, scaled.]
Code that generated the above plots:
points = rand(10000,1);
s = sum(points);
spoints = points ./ s;
scatter(1:10000,points);
figure;
scatter(1:10000,spoints);
points = rand(1000,1);
s = sum(points);
spoints = points ./ s;
p1sort = sort(points);
p2sort = sort(spoints);
scatter(1:1000,p1sort,1);
figure;
scatter(1:1000,p2sort,1);
Matt J on 15 May 2014
Edited: Matt J on 15 May 2014
@Ben,
The shape of spoints, when plotted, is not what is germane to the posted topic. The spoints vector you've generated is just one draw from the set S={x| sum(x)=1}. The idea of the post is to draw multiple such vectors from S repeatedly and in a uniformly randomized manner (uniformly over S).
If that's the case, I concede the point. I had interpreted the post to be asking for a single vector with uniform distribution and a total sum of 1 derived from a uniform distribution with range (0,1). I was assuming @jimaras was simply asking for a way to convert a uniform distribution (perhaps generated using the rand function) into another uniform distribution with a total sum of 1.
Further, @John stated that my approach does not yield a uniformly distributed result. I suppose this is true if you are trying to maintain uniformity in the (0,1) range, but that did not seem to be his argument. Within the new range of the scaled distribution, I believe I have shown that uniformity is maintained.
I rely on shifting and scaling pseudo-random numbers in some of my work and I felt it was important to understand if my methods were in fact impacting the uniformity of those numbers. So far, it does not seem to be the case.
I appreciate your and @John's willingness to discuss this topic at length.


More Answers (1)

You could use rand() to create a uniform distribution then divide each element by the sum.
v = rand(10,1);
vSum = sum(v);
v = v ./ vSum;

3 comments

John D'Errico on 14 Mar 2014
Edited: John D'Errico on 14 Mar 2014
Except that this does NOT yield a uniformly distributed result. It is a common mistake that people make. See the answer I'm posting for an explanation.
John,
Your answer does not explain why my suggestion would not work. Please read my comment on your answer and explain it for me. I would like to understand why this approach is not valid.
Read my answer, which does show that the simple renormalizing scheme fails to yield a uniform result.
A good way to look at it is if you think of projecting the domain from a square region onto a diagonal straight line crossing the square, you can see that the ends of the line will have fewer points that can contribute to those regions.
Your renormalizing scheme is a terribly common mistake. After all, it is simple, and at first glance it seems to get the job done. It is only when you look more carefully at the actual distribution along the line that you see it is wrong. Wrong here means non-uniform.

