## Recursive Algorithms for Online Parameter Estimation

The recursive estimation algorithms in the System Identification Toolbox™ can be separated into two categories:

• Infinite-history algorithms — These algorithms aim to minimize the error between the observed and predicted outputs for all time steps from the beginning of the simulation. The System Identification Toolbox supports infinite-history estimation in:

• Recursive command-line estimators for the least-squares linear regression, AR, ARX, ARMA, ARMAX, OE, and BJ model structures

• Simulink® Recursive Least Squares Estimator and Recursive Polynomial Model Estimator blocks

• Finite-history algorithms — These algorithms aim to minimize the error between the observed and predicted outputs for a finite number of past time steps. The toolbox supports finite-history estimation for linear-in-parameters models:

• Recursive command-line estimators for the least-squares linear regression, AR, ARX, and OE model structures

• Simulink Recursive Least Squares Estimator block

• Simulink Recursive Polynomial Model Estimator block, for AR, ARX, and OE structures only

Finite-history algorithms are typically easier to tune than the infinite-history algorithms when the parameters have rapid and potentially large variations over time.

### Recursive Infinite-History Estimation

#### General Form of Infinite-History Recursive Estimation

The general form of the infinite-history recursive estimation algorithm is as follows:

`$\hat{\theta}(t)=\hat{\theta}(t-1)+K(t)\left(y(t)-\hat{y}(t)\right)$`

$\hat{\theta}(t)$ is the parameter estimate at time t. y(t) is the observed output at time t, and $\hat{y}(t)$ is the prediction of y(t) based on observations up to time t-1. The gain, K(t), determines how much the current prediction error $y(t)-\hat{y}(t)$ affects the update of the parameter estimate. The estimation algorithms minimize the prediction-error term $y(t)-\hat{y}(t)$.

The gain has the following form:

`$K(t)=Q(t)\psi(t)$`

The recursive algorithms supported by the System Identification Toolbox product differ in how they choose the form of Q(t) and compute $\psi(t)$. Here, $\psi(t)$ represents the gradient of the predicted model output $\hat{y}(t|\theta)$ with respect to the parameters $\theta$.

The simplest way to visualize the role of the gradient $\psi(t)$ is to consider models with a linear-regression form:

`$y(t)=\psi^{T}(t)\theta_{0}(t)+e(t)$`

In this equation, $\psi(t)$ is the regression vector that is computed based on previous values of measured inputs and outputs. $\theta_{0}(t)$ represents the true parameters. e(t) is the noise source (innovations), which is assumed to be white noise. The specific form of $\psi(t)$ depends on the structure of the polynomial model.

For linear regression equations, the predicted output is given by the following equation:

`$\hat{y}(t)=\psi^{T}(t)\hat{\theta}(t-1)$`

For models that do not have the linear-regression form, it is not possible to compute the predicted output and the gradient $\psi(t)$ exactly for the current parameter estimate $\hat{\theta}(t-1)$. To learn how you can compute approximations of $\psi(t)$ and $\hat{y}(t)$ for general model structures, see the section on recursive prediction-error methods in [1].
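To make the linear-regression case concrete, the following sketch (plain NumPy, with an assumed first-order ARX structure and hypothetical coefficient values, not toolbox code) forms the regressor $\psi(t)$ and the prediction $\hat{y}(t)$:

```python
import numpy as np

# Hypothetical first-order ARX model (for illustration only):
#   y(t) = -a*y(t-1) + b*u(t-1) + e(t)
# so theta = [a, b] and psi(t) = [-y(t-1), u(t-1)].

def arx_regressor(y_prev, u_prev):
    """Regression vector psi(t) for the assumed first-order ARX structure."""
    return np.array([-y_prev, u_prev])

def predict(theta_hat, psi):
    """Predicted output y_hat(t) = psi(t)^T theta_hat(t-1)."""
    return psi @ theta_hat

theta_hat = np.array([0.5, 1.2])             # current estimate [a_hat, b_hat]
psi = arx_regressor(y_prev=0.8, u_prev=1.0)  # regressor from past data
y_hat = predict(theta_hat, psi)              # -0.5*0.8 + 1.2*1.0 = 0.8
```

For other polynomial structures (ARMAX, OE, BJ), the regressor additionally involves past prediction errors or simulated outputs, which is what makes those cases only approximately linear in the parameters.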

#### Types of Infinite-History Recursive Estimation Algorithms

The System Identification Toolbox software provides the following infinite-history recursive estimation algorithms for online estimation:

• Forgetting factor

• Kalman filter

• Normalized gradient

• Unnormalized gradient

The forgetting factor and Kalman filter formulations are more computationally intensive than the normalized and unnormalized gradient methods. However, they typically have better convergence properties.

Forgetting Factor.  The following set of equations summarizes the forgetting factor adaptation algorithm:

`$\hat{\theta}(t)=\hat{\theta}(t-1)+K(t)\left(y(t)-\hat{y}(t)\right)$`
`$\hat{y}(t)=\psi^{T}(t)\hat{\theta}(t-1)$`
`$K(t)=Q(t)\psi(t)$`
`$Q(t)=\frac{P(t-1)}{\lambda+\psi^{T}(t)P(t-1)\psi(t)}$`
`$P(t)=\frac{1}{\lambda}\left(P(t-1)-\frac{P(t-1)\psi(t)\psi^{T}(t)P(t-1)}{\lambda+\psi^{T}(t)P(t-1)\psi(t)}\right)$`

The software ensures that P(t) is a positive-definite matrix by using a square-root algorithm to update it [2]. The software computes P assuming that the residuals (the difference between estimated and measured outputs) are white noise with variance 1. R2/2 * P is approximately equal to the covariance matrix of the estimated parameters, where R2 is the true variance of the residuals.

Q(t) is obtained by minimizing the following function at time t:

`$\sum_{k=1}^{t}\lambda^{t-k}\left(y(k)-\hat{y}(k)\right)^{2}$`

See section 11.2 in [1] for details.

This approach discounts old measurements exponentially, such that an observation that is $\tau$ samples old carries a weight equal to $\lambda^{\tau}$ times the weight of the most recent observation. $\tau=\frac{1}{1-\lambda}$ represents the memory horizon of this algorithm. Measurements older than $\tau$ typically carry a weight of less than about 0.3.

$\lambda$ is called the forgetting factor and typically has a positive value between `0.98` and `0.995`; for example, $\lambda=0.98$ corresponds to a memory horizon of about 50 samples. Set $\lambda=1$ to estimate time-invariant (constant) parameters. Set $\lambda<1$ to estimate time-varying parameters.

Note

The forgetting factor algorithm for $\lambda$ = 1 is equivalent to the Kalman filter algorithm with R1=0 and R2=1. For more information about the Kalman filter algorithm, see Kalman Filter.
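The forgetting-factor equations above can be sketched in a few lines of plain NumPy. This is an illustrative textbook-form implementation, not the toolbox's square-root algorithm, so the P update can lose positive definiteness numerically; the data and initial values are assumed for the demonstration:

```python
import numpy as np

def ff_update(theta, P, psi, y, lam):
    """One forgetting-factor update step (textbook form, not square-root)."""
    denom = lam + psi @ P @ psi                  # lambda + psi' P psi
    K = P @ psi / denom                          # gain K(t) = Q(t) psi(t)
    theta = theta + K * (y - psi @ theta)        # parameter update
    P = (P - np.outer(P @ psi, psi @ P) / denom) / lam
    return theta, P

# Recover theta0 from noiseless linear-regression data (illustrative setup).
rng = np.random.default_rng(0)
theta0 = np.array([2.0, -1.0])
theta, P = np.zeros(2), 100.0 * np.eye(2)
for _ in range(200):
    psi = rng.standard_normal(2)
    theta, P = ff_update(theta, P, psi, psi @ theta0, lam=0.98)
# theta is now close to theta0
```

With noiseless data and persistently exciting regressors, the estimate converges to the true parameters; with $\lambda<1$ the algorithm keeps adapting, which is what allows it to track time-varying parameters.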

Kalman Filter.  The following set of equations summarizes the Kalman filter adaptation algorithm:

`$\hat{\theta}(t)=\hat{\theta}(t-1)+K(t)\left(y(t)-\hat{y}(t)\right)$`
`$\hat{y}(t)=\psi^{T}(t)\hat{\theta}(t-1)$`
`$K(t)=Q(t)\psi(t)$`
`$Q(t)=\frac{P(t-1)}{R_{2}+\psi^{T}(t)P(t-1)\psi(t)}$`
`$P(t)=P(t-1)+R_{1}-\frac{P(t-1)\psi(t)\psi^{T}(t)P(t-1)}{R_{2}+\psi^{T}(t)P(t-1)\psi(t)}$`

The software ensures that P(t) is a positive-definite matrix by using a square-root algorithm to update it [2]. The software computes P assuming that the residuals (the difference between estimated and measured outputs) are white noise with variance 1. R2 * P is the covariance matrix of the estimated parameters, and R1/R2 is the covariance matrix of the parameter changes, where R1 is the covariance matrix of parameter changes that you specify.

This formulation assumes the linear-regression form of the model:

`$y(t)=\psi^{T}(t)\theta_{0}(t)+e(t)$`

Q(t) is computed using a Kalman filter.

This formulation also assumes that the true parameters $\theta_{0}(t)$ are described by a random walk:

`$\theta_{0}(t)=\theta_{0}(t-1)+w(t)$`

w(t) is Gaussian white noise with the following covariance matrix, or drift matrix, R1:

`$E\,w(t)w^{T}(t)=R_{1}$`

R2 is the variance of the innovations e(t) in the following equation:

`$y(t)=\psi^{T}(t)\theta_{0}(t)+e(t)$`

The Kalman filter algorithm is entirely specified by the sequence of data y(t), the gradient $\psi(t)$, R1, R2, and the initial conditions $\theta(t=0)$ (initial guess of the parameters) and $P(t=0)$ (covariance matrix that indicates the error in the initial parameter guess).

Note

It is assumed that the R1 and P(t = 0) matrices are scaled such that R2 = 1. This scaling does not affect the parameter estimates.
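A minimal sketch of the Kalman-filter parameter update, assuming R2 = 1 and a hypothetical drift covariance R1 (plain NumPy, not the toolbox's square-root implementation):

```python
import numpy as np

def kf_update(theta, P, psi, y, R1, R2=1.0):
    """One Kalman-filter parameter update step (R2 scaled to 1)."""
    denom = R2 + psi @ P @ psi                   # R2 + psi' P psi
    K = P @ psi / denom                          # gain K(t) = Q(t) psi(t)
    theta = theta + K * (y - psi @ theta)        # parameter update
    P = P + R1 - np.outer(P @ psi, psi @ P) / denom
    return theta, P

# Illustrative run; the data, initial values, and R1 are assumed.
rng = np.random.default_rng(1)
theta0 = np.array([1.0, 0.5])
theta, P = np.zeros(2), 10.0 * np.eye(2)
R1 = 1e-4 * np.eye(2)                            # assumed drift covariance
for _ in range(300):
    psi = rng.standard_normal(2)
    theta, P = kf_update(theta, P, psi, psi @ theta0, R1)
```

Because R1 > 0 keeps P (and hence the gain) bounded away from zero, the estimator never stops adapting, which is how the random-walk assumption translates into tracking capability.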

Normalized and Unnormalized Gradient.  In the linear regression case, the gradient methods are also known as the least mean squares (LMS) methods. The following set of equations summarizes the gradient adaptation algorithm:

`$\hat{\theta}(t)=\hat{\theta}(t-1)+K(t)\left(y(t)-\hat{y}(t)\right)$`
`$\hat{y}(t)=\psi^{T}(t)\hat{\theta}(t-1)$`
`$K(t)=Q(t)\psi(t)$`

In the unnormalized gradient approach, Q(t) is given by:

`$Q(t)=\gamma$`

In the normalized gradient approach, Q(t) is given by:

`$Q(t)=\frac{\gamma}{\left|\psi(t)\right|^{2}+Bias}$`

The normalized gradient algorithm scales the adaptation gain, γ, at each step by the square of the two-norm of the gradient vector. If the gradient is close to zero, this can cause jumps in the estimated parameters. To prevent these jumps, a bias term is introduced in the scaling factor.

These choices of Q(t) for the gradient algorithms update the parameters in the negative gradient direction, where the gradient is computed with respect to the parameters. See pg. 372 in [1] for details.
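Both gradient variants can be sketched with a single update function. The step size `gamma` and `bias` values here are illustrative assumptions, not toolbox defaults:

```python
import numpy as np

def lms_update(theta, psi, y, gamma, bias=1e-8, normalized=True):
    """One gradient (LMS) update step.

    Unnormalized: Q(t) = gamma.
    Normalized:   Q(t) = gamma / (|psi(t)|^2 + bias), where the bias
    term prevents jumps when the gradient is close to zero.
    """
    err = y - psi @ theta                        # prediction error
    Q = gamma / (psi @ psi + bias) if normalized else gamma
    return theta + Q * psi * err                 # step along the gradient

# Illustrative run with noiseless data and an assumed step size.
rng = np.random.default_rng(2)
theta0 = np.array([3.0, -2.0])
theta = np.zeros(2)
for _ in range(2000):
    psi = rng.standard_normal(2)
    theta = lms_update(theta, psi, psi @ theta0, gamma=0.5)
```

Note that the gradient methods update only the parameter vector, with no P matrix; this is why they are cheaper per step than the forgetting factor and Kalman filter formulations but converge more slowly.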

### Recursive Finite-History Estimation

The finite-history estimation methods find parameter estimates θ(t) by minimizing

`$\sum_{k=t-N+1}^{t}\left(y(k)-\hat{y}(k|\theta)\right)^{2},$`

where y(k) is the observed output at time k, and $\hat{y}(k|\theta)$ is the predicted output at time k. This approach is also known as sliding-window estimation. Finite-history estimation approaches minimize prediction errors for the last N time steps. In contrast, infinite-history estimation methods minimize prediction errors starting from the beginning of the simulation.

The System Identification Toolbox supports finite-history estimation for linear-in-parameters models (AR and ARX) where the predicted output has the form $\hat{y}(k|\theta)=\psi^{T}(k)\theta(k-1)$. The software constructs and maintains a buffer of regressors $\psi(k)$ and observed outputs y(k) for k = t-N+1, t-N+2, … , t-2, t-1, t. These buffers contain the necessary matrices for the underlying linear regression problem of minimizing ${\|\Psi_{buffer}\theta-y_{buffer}\|}_{2}^{2}$ over $\theta$. The software solves this linear regression problem using QR factorization with column pivoting [3].
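A sketch of sliding-window estimation under these assumptions, with a hypothetical window length N. For simplicity it uses `np.linalg.lstsq`, which is SVD-based rather than the pivoted QR factorization described above, but it solves the same linear regression problem over the buffered data:

```python
from collections import deque

import numpy as np

def sliding_window_fit(psi_buffer, y_buffer):
    """Solve min over theta of ||Psi_buffer theta - y_buffer||_2^2."""
    Psi = np.vstack(psi_buffer)                  # stack regressors row-wise
    y = np.asarray(y_buffer)
    theta, *_ = np.linalg.lstsq(Psi, y, rcond=None)
    return theta

N = 20                                           # assumed window length
psi_buf, y_buf = deque(maxlen=N), deque(maxlen=N)
theta0 = np.array([1.5, -0.7])
rng = np.random.default_rng(3)
for _ in range(50):                              # stream data; deques keep last N
    psi = rng.standard_normal(2)
    psi_buf.append(psi)
    y_buf.append(psi @ theta0)
theta = sliding_window_fit(psi_buf, y_buf)
```

Because every sample inside the window carries equal weight and samples outside it carry none, a sliding-window estimator forgets abrupt parameter changes after exactly N steps, which is why this approach is easier to tune for rapidly varying parameters.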

## References

[1] Ljung, L. System Identification: Theory for the User. Upper Saddle River, NJ: Prentice-Hall PTR, 1999.

[2] Carlson, N.A. "Fast triangular formulation of the square root filter." AIAA Journal, Vol. 11, Number 9, 1973, pp. 1259-1265.

[3] Zhang, Q. "Some Implementation Aspects of Sliding Window Least Squares Algorithms." IFAC Proceedings. Vol. 33, Issue 15, 2000, pp. 763-768.