Fit a curve on scatter data (main behaviour)

1 view (last 30 days)
Poria Divshali on 9 Jul 2020
Commented: Poria Divshali on 31 Jul 2020
I need to fit a line to some data, plotted as a weighted scatter diagram below. The larger points have bigger weights. I used the 'fit' function and the result is shown by the blue line in the figure. However, I want to find the green line, which I believe is the expected behaviour.
The code and data are attached. I would appreciate any ideas that help me fit the green line.
  1 comment
Poria Divshali on 10 Jul 2020
Maybe one possible solution would be to use some kind of clustering algorithm. Visually, I can see two lines with different slopes (the green one and one with a lower slope). However, I could not find a good clustering method to separate these two groups.


Answers (2)

Image Analyst on 9 Jul 2020
I'd first of all filter out known bad data, data that you know should not be included in the fit. I'd make a tentative fit for the green line, like going between the points (20,3000) and (60,11000). Then for any data where the y value is more than 4000 above the line or 4000 below the line, I'd throw those out. Then I'd fit a line through what remains. Something like this untested code:
coefficients = polyfit([20, 60], [3000, 11000], 1);
keepers = true(1, length(x));
for k = 1 : length(x)
    % Get the point on the green line for this particular x value.
    yFit = polyval(coefficients, x(k));
    % If it's farther away than 4000, mark it for deletion.
    if abs(yFit - y(k)) > 4000
        keepers(k) = false;
    end
end
% Extract only the good data.
goodx = x(keepers);
goody = y(keepers);
% Now find a fit for only the good data.
goodCoefficients = polyfit(goodx, goody, 1);
yFit = polyval(goodCoefficients, x);
% Plot it.
plot(x, yFit, 'g-', 'LineWidth', 3);
It would be even better to use RANSAC. If you have the Computer Vision Toolbox, you can use fitPolynomialRANSAC(). This type of fit will ignore the other major clump of data between x=40 to 60 and y = 5000 to 6000 and give you a fit for just the larger, longer elliptical cluster that you want.
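For reference, a minimal sketch of what that RANSAC fit could look like, assuming the data is in vectors x and y as in the code above and that 2000 is just a guessed inlier distance (both are assumptions, not values from the question):
% Sketch only: x and y are assumed to hold the scatter data; 2000 is a guessed inlier distance.
xyPoints = [x(:), y(:)];                 % M-by-2 matrix of [x y] points
N = 1;                                   % degree-1 polynomial, i.e. a straight line
maxDistance = 2000;                      % max distance for a point to count as an inlier
[P, inlierIdx] = fitPolynomialRANSAC(xyPoints, N, maxDistance);
% Plot the data, the inliers RANSAC kept, and the fitted line.
plot(x, y, 'b.');
hold on
plot(x(inlierIdx), y(inlierIdx), 'ko');
xFit = linspace(min(x), max(x), 100);
plot(xFit, polyval(P, xFit), 'g-', 'LineWidth', 3);
The maxDistance value plays the same role as the 4000 threshold above and would need tuning to the data.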
  1 comment
Poria Divshali on 10 Jul 2020
Thanks for your suggestion. I need to find a more automatic way to remove the data considered bad, since the data will change from run to run.
Thanks for the suggestion to use fitPolynomialRANSAC(). I will check its algorithm to find out how I can select the appropriate distance for this problem. It may work well.
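One possible, more automatic way to pick that distance (a sketch only, assuming x and y are the raw data vectors) is to scale it from a robust estimate of the residual spread around a first-pass fit:
% Sketch: derive maxDistance from the spread of residuals about a rough initial line.
p0 = polyfit(x, y, 1);                       % first-pass fit through all of the data
res = y - polyval(p0, x);                    % residuals from that line
madRes = median(abs(res - median(res)));     % median absolute deviation of the residuals
robustScale = 1.4826 * madRes;               % robust estimate of the residual standard deviation
maxDistance = 3 * robustScale;               % e.g. keep points within ~3 robust sigmas
The factor of 3 is an arbitrary choice and would need checking against the actual data.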



John D'Errico on 9 Jul 2020
Edited: John D'Errico on 9 Jul 2020
You "want" the green line. I want the Buffalo Bills to win the Super bowl. Probably not gonna happen in either case. ;-)
You want to solve a weighted linear least squares problem, but you don't really understand linear least squares. And that is the fundamental problem.
plot(WX,WY,'.')
opts = fitoptions( 'Method', 'LinearLeastSquares' );
opts.Weights = W;
f = fit(WX,WY,'poly1',opts);
hold on
plot(f)
Linear least squares looks ONLY at errors in the y variable. Large positive or negative residuals, thus points that fall far above or below the line, will have more importance in the fit. They will drag the curve around. In this case, it is a straight line.
Now, consider the green line that you "want" to see produced. Do you see a problem in this context? Down near x == 0, ALL of the errors will be above the line, at least if the green curve were the one we expected. Near x == 100, ALL of the errors will be BELOW the line.
In both cases, the line will be drawn into a position so the slope is reduced. While you see data that makes you WANT something, this is something the data tells me cannot happen. What you WANT is not relevant, because the tool you are using looks only at errors in the y variable. Sadly, I am pretty sure the tool cannot read your mind, nor does it really care about what you want to see happen. Computers do what they are programmed to do.
Worse, we have another problem that is just as serious: your two variables have wildly different variation.
stdx = std(WX)
stdx =
15.2824457612656
stdy = std(WY)
stdy =
1649.76961766046
We can also see that in the axis scaling.
We can see the vast difference by forcing the two axes to have the same scaling.
axis equal
Can the problem be solved to do what you want to see? For that, you are probably thinking in terms of what is called the total least squares problem. And of course, you have weights, so that will force me to actually write code. And since the two variables have hugely different variations, we need to deal with that too. Bah, humbug. I hate writing code. It makes me think.
The trick for total least squares is to use the SVD to do the work; this is sometimes called orthogonal regression. Other people call this principal component regression. As you can see here, I computed weighted means, subtracted them off the data variables, then weighted that matrix using W and rescaled the variables to have unit variances. The SVD does the hard work though.
mux = sum(WX.*W)./sum(W);
muy = sum(WY.*W)./sum(W);
A = [(WX - mux).*W/stdx,(WY - muy).*W/stdy];
[U,S,V] = svd(A,0);
S
S =
240.408960729392 0
0 167.877291251367
V
V =
0.246669782511878 -0.969099591577431
0.969099591577431 0.246669782511878
We choose the singular vector with the SMALLER singular value. That the two singular values are so close in magnitude tells me this regression is poorly posed, in the sense that the data ended up as a vaguely circular point cloud. Seriously, any line is arguably almost as good as any other in this case.
The weighted orthogonal regression line becomes...
V(1,2)*(x - mux)/stdx + V(2,2)*(y - muy)/stdy = 0
I'll use MATLAB to display the equation in a standard form, mainly because I am feeling too lazy to do basic algebra. Just too hot out today.
syms x y
y = vpa(solve(V(1,2)*(x - mux)/stdx + V(2,2)*(y - muy)/stdy,y),5)
y =
424.11*x - 11482.0
Now replot things, and see what happens.
plot(f)
hold on
plot(WX,WY,'b.')
H = fplot(matlabFunction(y),[35,55]);
H.Color = 'g';
H.LineWidth = 3;
So the total least squares regression, using weights and a rescaling of the variables to have unit variances, works. Again, it was very close. I could almost have chosen the regression line orthogonal to the one we got. Your data is NOT well posed for regression, being an almost circular point cloud.
The Rolling Stones said it for me. "You can't always get what you want." Of course, you can freely decide the answer is exactly what you want to see. It is a scheme that works well for many politicians these days. :( But today, you got lucky. Que sera, sera... (I know somebody said that, but who? One "day" I'll remember.)
  9 comments
John D'Errico on 11 Jul 2020
As Image Analyst points out, the resulting line is of moderately little value in terms of information content. When the point cloud is so roughly circular, the slope of the resulting line is virtually a random number.
Poria Divshali on 31 Jul 2020
You helped me a lot, thanks. Maybe the slope looks like a random number at first glance, but it is not. It is what I would expect from the theory behind the data. It is electricity market data: y is the price and x the total production. I clustered the data and found at least two nice slopes. It is still not very accurate, but maybe good enough for a rough estimation of the system behaviour.
For better clustering, I may need to add some other features, which makes the work more complicated :)
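For illustration, one rough way to recover two slopes like that (a sketch only, assuming WX and WY are the data vectors from the answer above, ignoring the weights W, and not necessarily the clustering used here) is to alternately assign each point to whichever of two lines it is closer to and refit:
% Sketch of a "k-lines" style loop: assign points to the nearer of two lines, refit, repeat.
% WX and WY are assumed to be the data vectors; the result depends on the initial split.
in1 = WY > median(WY);                        % arbitrary initial split of the points
for iter = 1 : 20
    p1 = polyfit(WX(in1),  WY(in1),  1);      % refit line 1 to its current points
    p2 = polyfit(WX(~in1), WY(~in1), 1);      % refit line 2 to its current points
    d1 = abs(WY - polyval(p1, WX));           % vertical distance of every point to line 1
    d2 = abs(WY - polyval(p2, WX));           % vertical distance of every point to line 2
    in1 = d1 <= d2;                           % reassign each point to its nearer line
end
slopes = [p1(1), p2(1)]                       % the two candidate slopes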

