stepwiselm does not respect 'Upper' 'linear' limit during multiple iterations
Hi
I am running a regression analysis for a public company to figure out which variables are important in determining overall sales (using stepwiselm for this), but I don't want interaction terms. To test time lags on the different factors, I take the original raw data and run multiple calls to stepwiselm with various time lags for each factor (the data table is generated in another function and the result is stored in the variable tab). My ultimate goal is to find the regression equation with the highest adjusted R².
What I noticed is that when stepwiselm is called many times in succession (for example, more than 400 runs), it ends up bringing interaction terms into the final regression equation. This is the call to stepwiselm in my for loop. (Note that I get the same result whether I use a "for" or a "parfor" loop for execution.)
mdl=stepwiselm(tab,'Upper','linear','Verbose',0)
I compile the results from each stepwiselm iteration in a cell array called models (first column holds the model, the third column has the equation). This is one of the results which contains the interaction terms:
>> models{445,1}
ans =
Linear regression model:
HomeSales_4 ~ [Linear formula with 6 terms in 3 predictors]
Estimated Coefficients:
                                             Estimate         SE         tStat      pValue
                                            __________    __________    _______    _________
    (Intercept)                                 480.89        135.42     3.5511    0.0007463
    HousingStartsTotal_4                    -0.0035197     0.0013978    -2.5179     0.014445
    NewHomeOrders_4                           -0.70127       0.32902    -2.1314     0.037094
    HomeBacklog                                -0.1529      0.083674    -1.8273      0.07255
    HousingStartsTotal_4:NewHomeOrders_4    8.9666e-06    3.2585e-06     2.7518    0.0077939
    HousingStartsTotal_4:HomeBacklog        2.0849e-06    8.3874e-07     2.4858     0.015682
Number of observations: 67, Error degrees of freedom: 61
Root Mean Squared Error: 41.8
R-squared: 0.652, Adjusted R-Squared: 0.624
F-statistic vs. constant model: 22.9, p-value = 7.53e-13
>> models{445,3}
ans =
"HomeSales_4 ~ 1 + HousingStartsTotal_4*NewHomeOrders_4 + HousingStartsTotal_4*HomeBacklog"
However, if I run the same regression manually (i.e. just one iteration with the same input X and y), stepwiselm does not generate the interaction terms.
>> mdl=stepwiselm(tab2,'Upper','linear','Verbose',0)
mdl =
Linear regression model:
HomeSales_4 ~ 1 + HousingStartsTotal_4 + NewHomeOrders_4 + HomeBacklog
Estimated Coefficients:
                            Estimate         SE         tStat       pValue
                            _________    __________    _______    __________
    (Intercept)               -82.309        43.518    -1.8914       0.06317
    HousingStartsTotal_4    0.0023333    0.00040806      5.718    3.1831e-07
    NewHomeOrders_4           0.16748      0.061989     2.7018     0.0088514
    HomeBacklog              0.048105      0.014275     3.3698     0.0012884
Number of observations: 67, Error degrees of freedom: 63
Root Mean Squared Error: 47.1
R-squared: 0.544, Adjusted R-Squared: 0.522
F-statistic vs. constant model: 25, p-value = 8.93e-11
I am at a loss as to what is going on. I tried manually defining the equation in Wilkinson notation (instead of using 'Upper', 'linear'), but I still end up with the same results. I would appreciate any input you may have on the matter.
Thanks!
7 comments
the cyclist
on 17 Oct 2021
That does seem strange!
How reliably reproducible is the behavior? Would you be able to upload a small dataset and code that allows us to reproduce the issue?
James Craig
on 19 Oct 2021
the cyclist
on 20 Oct 2021
I was able to reproduce this on my machine. I used the debugger to step into the MATLAB routines, and there is definitely something odd going on. I can't say that I've reached a complete understanding, but it looks to me as if the code is ignoring the input formula completely, and instead allowing the "upper" terms to include interactions, just the same as if you had simply called it as
stepwiselm(tab)
This definitely seems like a bug to me. I don't see any reported bugs in stepwiselm in R2021b on the bug report page. I would recommend you submit a bug report. (You can do it from that same page.) If you do, please keep us updated here!
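For anyone trying to narrow this down, here is a rough sketch of how the stored models could be scanned for interaction terms and ranked by adjusted R². It assumes a cell array laid out like the models array described in the question (fitted model in column 1). The data here is synthetic, the variable names are made up for illustration, and it needs the Statistics and Machine Learning Toolbox.

```matlab
% Hypothetical stand-in for the poster's loop: three synthetic tables,
% one stepwiselm fit each, stored in column 1 of a cell array.
rng(1);                                   % reproducible synthetic data
n = 100;
models = cell(3, 1);
for k = 1:3
    x1 = randn(n, 1);
    x2 = randn(n, 1);
    y  = 1 + x1 + x2 + 0.2*randn(n, 1);
    tab = table(x1, x2, y);               % last variable is the response
    % 'constant' is passed explicitly as the starting model
    models{k, 1} = stepwiselm(tab, 'constant', ...
        'Upper', 'linear', 'Verbose', 0);
end

% Interaction coefficients are named like 'x1:x2', so look for a colon
hasInteraction = cellfun(@(m) any(contains(m.CoefficientNames, ':')), ...
    models(:, 1));
flagged = find(hasInteraction);           % runs that picked up interactions

% Adjusted R-squared of every stored model, and the index of the best one
adjR2 = cellfun(@(m) m.Rsquared.Adjusted, models(:, 1));
[~, best] = max(adjR2);
```

Any index in flagged points at a run whose call is worth inspecting, in particular whether 'Upper' was actually passed in that call.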
James Craig
on 22 Oct 2021
I can't see how this is a bug; your examples are entirely consistent with the stepwiselm documentation. For the first two commands in your comment above, you still need to set 'Upper' to 'linear', because by default stepwiselm uses 'interactions' as the upper bound:
stepwiselm(tab,'HomeSales_4 ~ -1 + Inventory_1 + HousingStartsTotal_4 + NewHomeOrders_4 + HomeBacklog', 'Upper', 'linear') % intercept -1?
stepwiselm(tab,'HomeSales_4 ~ 1 + Inventory_1 + HousingStartsTotal_4 + NewHomeOrders_4 + HomeBacklog', 'Upper', 'linear')
Note that modelspec is only used as the starting model. A starting model that contains only linear terms does not bind the algorithm to linear terms!
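The difference can be seen in a small sketch on synthetic data (not the poster's dataset; the variable names are made up for illustration). The same linear starting formula is fitted twice, once with the default upper bound and once with 'Upper' set to 'linear'. It needs the Statistics and Machine Learning Toolbox.

```matlab
% Synthetic data whose true model contains an x1:x2 interaction
rng(0);                                   % reproducible
n  = 200;
x1 = randn(n, 1);
x2 = randn(n, 1);
y  = 1 + 2*x1 - x2 + 3*x1.*x2 + 0.1*randn(n, 1);
tbl = table(x1, x2, y);                   % last variable is the response

startModel = 'y ~ 1 + x1 + x2';           % starting model only, not a limit

% Default upper bound is 'interactions', so the search may add x1:x2
mdlDefault = stepwiselm(tbl, startModel, 'Verbose', 0);

% Explicit upper bound: the search cannot go beyond linear terms
mdlLinear = stepwiselm(tbl, startModel, 'Upper', 'linear', 'Verbose', 0);

% Interaction coefficients are named like 'x1:x2', so look for a colon
hasInteraction = @(m) any(contains(m.CoefficientNames, ':'));
disp(hasInteraction(mdlDefault))          % interaction picked up by default
disp(hasInteraction(mdlLinear))           % none when 'Upper' is 'linear'
```

Because the interaction is strong in this data, the first fit should add x1:x2 even though the starting formula is purely linear, while the second fit stays within the linear terms.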
the cyclist
on 24 Oct 2021
Ahhh, of course.
Sorry for the misinformation!
James Craig
on 26 Oct 2021
Answers (0)