Force coefficients in multivariate and multiple regression
Afficher commentaires plus anciens
I have a dataset from physical experiments where I control 2 independent variables x and y and I measure 2 dependent variables u and v, and according to a physical model, they depend on parameters p and q as follows:
u = p*q^4*x/y
v = p*q^3*x*y
I make my life a lot easier by converting this to logarithmic scales:
log(u) = log(x) - log(y) + A with 10^A = p*q^4
log(v) = log(x) + log(y) + B with 10^B = p*q^3
And then I can use mvregress, the only problem is: mvregress doens't allow me to force the slopes for x and y to a specific value right out of the box. I really want to force these slopes, which will have an impact on my confidence intervals for A and B. And on top of that: I know I can back calculate the variance of A and B by dividing half of the confidence interval by the t-value (which depends on the confidence level and the degrees of freedom for error), but I am unsure whether A and B will have a covariance and how I can calculate that, because I need that covariance in order to calculate my confidence intervals for p and q.
Réponses (1)
John D'Errico
le 8 Mar 2023
Modifié(e) : John D'Errico
le 8 Mar 2023
There has been much unsaid here that I think you don't understand. Given these models:
u = p*q^4*x/y
v = p*q^3*x*y
where x and y are known independent variables, and u and v are measured, you want to estimate p and q.
You have stated first that you logged the models. In doing so, you made an implicit assumption that you completely missed.
If u and v are measured data, then what is the error structure on the noise in that data? Is the noise additive, Normally distributed? That is, is the true model for these processes:
u + noise = p*q^4*x/y
v + noise = p*q^3*x*y
where the noise is additive gaussian noise? The problem is, when you log the data, it is NOT true that log(a+b) has any simple behavior.
Or, is the noise in u and v really more likely to be proportional noise? If it is, then most likely the noise in your model was something more like lognormally distributed. And that is the good news. It suggests that logging the model is really the proper way to estimate things, because it transforms proportional lognormal noise into additive gaussian noise.
So the very first thing I would suggest is to look at your data. We don't see it, so I cannot help you in that respect. If u and v vary by multiple orders of magnitude, but the noise seems to be roughly the same magitude at all levels of u and v, that would suggest it is additive noise. But if the noise magnitude seems to scale with the size of u and v themselves, that suggests proportional noise, in whic hcase logging the data is a good thing.
The point is, least squares techniques are designed to solve problems where the error structure is gaussian, normally distributed, with a homogeneous variance, so the same at all levels of your data. If that fails, then expect to see poor estimates of your parameters. (Do I need to give an example of what happens, and why logging a model can be right, or wrong? I can, if that would be useful.)
Now. On to the problem at hand. IF you would log those models to estimate p and q...
log(u) = log(p) + 4*log(q) + log(x) - log(y)
log(v) = log(p) + 3*log(q) + log(x) + log(y)
What is known here? We know u,v,x,y. So move EVERYTHING THAT IS KNOWN to one side. We can now write the problem as:
log(u) - log(x) + log(y) = log(p) + 4*log(q) + gaussian noise
log(u) - log(x) - log(y) = log(p) + 3*log(q) + gaussian noise
This reduces to two simple models. The right hand side, IF we knew p and q, would be a simple constant. So the best estimator of that right hand side is just the mean. In fact if we assume the noise if additive normally distributed, we don't need to use any regression tool at all. Just use the mean.
RHS1 = mean(log(u) - log(x) + log(y)); % Best estimatar for: log(p)+4*log(q)
RHS2 = mean(log(u) - log(x) - log(y)); % Best estimatar for: log(p)+3*log(q)
Now we have two estimates. One for log(p)+4*log(q), and a second for log(p)+3*log(q)
That means we can find log(q) as
log(q) = (log(p)+4*log(q)) - (log(p)+3*log(q)) = RHS1 - RHS2
And therefore, the best estimate of p is just
q = exp(RHS1 - RHS2);
Next, we can recover p given that we know the value of q, because we have those same relations. Or, we can get p by taking the linear combination:
4*(log(p)+3*log(q)) = 3*(log(p)+4*log(q)) = log(p)
I'll use the latter approach.
p = exp(4*RHS2 - 3*RHS1);
Having said all of that, We could also have solved thos problem in another way. Consider the ratio:
u./v = q/y.^2
Do you see that p is gone? Only the constant q remains. And, because we used the ratio of those variables, now the proportional error again changes its structure. Still though, the ratio of two lognormal random variables is still lognormally distributed. But now we can estimate q using a regression. Even so, what matters is to understand the error striucture. I'd still probably go back to using the means as I describe above, s long as that is viable.
Again, I NEVER used MVNREGRESS at all. There is absolutely no need, since p and q enter into the model as constant terms there. Again though, the very first thing I would do is to look at the error structure.
4 commentaires
Rogier Delporte
le 8 Mar 2023
John D'Errico
le 8 Mar 2023
You say it is dead easy to estimate p and q, but you did not seem to know how to do that. I'm a little confused. Anyway, IF all you need at this point is time know the uncertainty on p and q, that too is rather simple.
Compute, as I showed you:
RHS1 = mean(log(u) - log(x) + log(y)); % Best estimator for: log(p)+4*log(q)
RHS2 = mean(log(u) - log(x) - log(y)); % Best estimator for: log(p)+3*log(q)
Now, we presume these to be normally distributed. Just compute the variances!
Var1 = var(log(u) - log(x) + log(y)); % The variance of log(p)+4*log(q)
Var2 = var(log(u) - log(x) - log(y)); % The variance of log(p)+3*log(q)
So now, what is the VARIANCE of log(q)? Again, simple. If log(q) is the difference of the two means, then the variance of log(q) is just the sum of those variances. Easy, peasy.
varlogq = Var1 + Var2;
What does this tell us about the variance of q itself? That is more difficult, since exponentiating that will create a lognormal distribution. That is, we see log(q) as having mean = RHS1-RHS2, and we just computed the variance. So now you might compute a 95% confidence interval from the lognormal CDF, based on that mean and variance. At least, this is the easy solution. It would depend on how much data you have. Because really to be statistically appropriate, we need to appreciate why the degrees of freedom come into this problem in confidence intervals. In the end, if you have more than 25 data points or so, I would just use logncdf directly to compute the confidence intervals, since at that point, the t-statistic is not that different from a Normal distribution. But if you have only 8 or 10 data points, or even less, then it will be important to worry about the difference.
We can do the same computations for p. Thus
varlogp = 3*Var1 + 4*Var2;
Note that p will have a bit wider confidence interval than q. Such is life. It depends on your model. And of course, p and q are not independent of each other, but that independence is often ignored when computing confidence intervals.
var(X+Y) = var(X)+var(Y)+2*cov(X,Y)
And I don't think that
log(u) - log(x) + log(y)
and
log(v) - log(x) - log(y)
are independent.
Rogier Delporte
le 9 Mar 2023
Catégories
En savoir plus sur Generalized Linear Regression dans Centre d'aide et File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!