Feature extraction is a set of methods that map input features to new output features. Many feature extraction methods use unsupervised learning to extract features. Unlike some feature extraction methods such as PCA and NNMF, the methods described in this section can increase dimensionality (and decrease dimensionality). Internally, the methods involve optimizing nonlinear objective functions. For details, see Sparse Filtering Algorithm or Reconstruction ICA Algorithm.

One typical use of feature extraction is finding features in images. Using these features can lead to improved classification accuracy. For an example, see Feature Extraction Workflow. Another typical use is extracting individual signals from superpositions, which is often termed blind source separation. For an example, see Extract Mixed Signals.

There are two feature extraction functions: `rica`

and `sparsefilt`

.
Associated with these functions are the objects that they create: `ReconstructionICA`

and `SparseFiltering`

.

The sparse filtering algorithm begins with a data matrix `X`

that
has `n`

rows and `p`

columns. Each
row represents one observation and each column represents one measurement.
The columns are also called the features or predictors. The algorithm
then takes either an initial random `p`

-by-`q`

weight
matrix `W`

or uses the weight matrix passed in the `InitialTransformWeights`

name-value
pair. `q`

is the requested number of features that `sparsefilt`

computes.

The algorithm attempts to minimize the Sparse Filtering Objective Function by using a standard limited
memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS) quasi-Newton optimizer.
See Nocedal and Wright [2]. This optimizer takes up to `IterationLimit`

iterations.
It stops iterating earlier when it takes a step whose norm is less
than `StepTolerance`

, or when it computes that the
norm of the gradient at the current point is less than `GradientTolerance`

times
a scalar *τ*, where

$$\tau =\mathrm{max}\left(1,\mathrm{min}\left(\left|f\right|,{\Vert {g}_{0}\Vert}_{\infty}\right)\right).$$

|*f*| is the norm of the objective function,
and $${\Vert {g}_{0}\Vert}_{\infty}$$ is
the infinity norm of the initial gradient.

The objective function attempts to simultaneously obtain few nonzero features for each data point, and for each resulting feature to have nearly equal weight. To understand how the objective function attempts to achieve these goals, see Ngiam, Koh, Chen, Bhaskar, and Ng [1].

Frequently, you obtain good features by setting a relatively
small value of `IterationLimit`

, from as low as 5
to a few hundred. Allowing the optimizer to continue can result in
overtraining, where the extracted features do not generalize well
to new data.

After constructing a `SparseFiltering`

object, use the `transform`

method to map input data to the
new output features.

To compute an objective function, the sparse filtering algorithm
uses the following steps. The objective function depends on the `n`

-by-`p`

data
matrix `X`

and a weight matrix `W`

that
the optimizer varies. The weight matrix `W`

has dimensions `p`

-by-`q`

,
where `p`

is the number of original features and `q`

is
the number of requested features.

Compute the

`n`

-by-`q`

matrix`X*W`

. Apply the approximate absolute value function $$\varphi \left(u\right)=\sqrt{{u}^{2}+{10}^{-8}}$$ to each element of`X*W`

to obtain the matrix`F`

.*ϕ*is a smooth nonnegative symmetric function that closely approximates the absolute value function.Normalize the columns of

`F`

by the approximate*L*^{2}norm. In other words, define the normalized matrix $$\tilde{F}(i,j)$$ by$$\begin{array}{c}\Vert F(j)\Vert =\sqrt{{\displaystyle \sum _{i=1}^{n}{\left(F(i,j)\right)}^{2}+{10}^{-8}}}\\ \tilde{F}(i,j)=F(i,j)/\Vert F(j)\Vert .\end{array}$$

Normalize the rows of $$\tilde{F}(i,j)$$ by the approximate

*L*^{2}norm. In other words, define the normalized matrix $$\widehat{F}(i,j)$$ by$$\begin{array}{c}\Vert \tilde{F}(i)\Vert =\sqrt{{\displaystyle \sum _{j=1}^{q}{\left(\tilde{F}(i,j)\right)}^{2}+{10}^{-8}}}\\ \widehat{F}(i,j)=\tilde{F}(i,j)/\Vert \tilde{F}(i)\Vert .\end{array}$$

The matrix $$\widehat{F}$$ is the matrix of converted features in

`X`

. Once`sparsefilt`

finds the weights`W`

that minimize the objective function*h*(see below), which the function stores in the output object`Mdl`

in the`Mdl.TransformWeights`

property, the`transform`

function can follow the same transformation steps to convert new data to output features.Compute the objective function

*h*(`W`

) as the 1–norm of the matrix $$\widehat{F}(i,j)$$, meaning the sum of all the elements in the matrix (which are nonnegative by construction):$$h\left(W\right)={\displaystyle \sum _{j=1}^{q}{\displaystyle \sum _{i=1}^{n}\widehat{F}(i,j)}.}$$

If you set the

`Lambda`

name-value pair to a strictly positive value,`sparsefilt`

uses the following modified objective function:$$h\left(W\right)={\displaystyle \sum _{j=1}^{q}{\displaystyle \sum _{i=1}^{n}\widehat{F}(i,j)}}+\lambda {\displaystyle \sum _{j=1}^{q}{w}_{j}^{T}{w}_{j}}.$$

Here,

*w*is the jth column of the matrix_{j}`W`

and*λ*is the value of`Lambda`

. The effect of this term is to shrink the weights`W`

. If you plot the columns of`W`

as images, with positive`Lambda`

these images appear smooth compared to the same images with zero`Lambda`

.

The Reconstruction Independent Component Analysis (RICA) algorithm is based on minimizing an objective function. The algorithm maps input data to output features.

The ICA source model is the following. Each observation *x* is
generated by a random vector *s* according to

$$x=\mu +As.$$

*x*is a column vector of length`p`

.*μ*is a column vector of length`p`

representing a constant term.*s*is a column vector of length`q`

whose elements are zero mean, unit variance random variables that are statistically independent of each other.*A*is a mixing matrix of size`p`

-by-`q`

.

You can use this model in `rica`

to estimate *A* from
observations of *x*. See Extract Mixed Signals.

The RICA algorithm begins with a data matrix `X`

that
has `n`

rows and `p`

columns consisting
of the observations *x _{i}*:

$$X=\left[\begin{array}{c}{x}_{1}^{T}\\ {x}_{2}^{T}\\ \vdots \\ {x}_{n}^{T}\end{array}\right].$$

Each row represents one observation and each column represents
one measurement. The columns are also called the features or predictors.
The algorithm then takes either an initial random `p`

-by-`q`

weight
matrix `W`

or uses the weight matrix passed in the `InitialTransformWeights`

name-value
pair. `q`

is the requested number of features that `rica`

computes.
The weight matrix `W`

is composed of columns *w _{i}* of
size

`p`

-by-1:$$W=\left[\begin{array}{cccc}{w}_{1}& {w}_{2}& \dots & {w}_{q}\end{array}\right].$$

The algorithm attempts to minimize the Reconstruction ICA Objective Function by using a standard
limited memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS) quasi-Newton
optimizer. See Nocedal and Wright [2]. This optimizer takes up to `IterationLimit`

iterations.
It stops iterating when it takes a step whose norm is less than `StepTolerance`

,
or when it computes that the norm of the gradient at the current point
is less than `GradientTolerance`

times a scalar *τ*,
where

$$\tau =\mathrm{max}\left(1,\mathrm{min}\left(\left|f\right|,{\Vert {g}_{0}\Vert}_{\infty}\right)\right).$$

|*f*| is the norm of the objective function,
and $${\Vert {g}_{0}\Vert}_{\infty}$$ is
the infinity norm of the initial gradient.

The objective function attempts to obtain a nearly orthonormal weight matrix that minimizes
the sum of elements of *g*(`XW`

), where
*g* is a function (described below) that is applied elementwise
to `XW`

. To understand how the objective function attempts to
achieve these goals, see Le, Karpenko, Ngiam, and Ng [3].

After constructing a `ReconstructionICA`

object, use the `transform`

method to map input data to the
new output features.

The objective function uses a contrast function, which you specify
by using the `ContrastFcn`

name-value pair. The contrast
function is a smooth convex function that is similar to an absolute
value. By default, the contrast function is $$g=\frac{1}{2}\mathrm{log}\left(\mathrm{cosh}\left(2x\right)\right)$$.
For other available contrast functions, see `ContrastFcn`

.

For an `n`

-by-`p`

data matrix `X`

and `q`

output
features, with a regularization parameter *λ* as
the value of the `Lambda`

name-value
pair, the objective function in terms of the `p`

-by-`q`

matrix `W`

is

$$h=\frac{\lambda}{n}{\displaystyle \sum _{i=1}^{n}{\Vert W{W}^{T}{x}_{i}-{x}_{i}\Vert}_{2}^{2}}+\frac{1}{n}{\displaystyle \sum _{i=1}^{n}{\displaystyle \sum _{j=1}^{q}{\sigma}_{j}g\left({w}_{j}^{T}{x}_{i}\right)}}$$

The *σ _{j}* are known
constants that are ±1. When

`rica`

`NonGaussianityIndicator`

name-value pair.The objective function *h* can have a spurious
minimum of zero when *λ* is zero. Therefore, `rica`

minimizes *h* over *W* that
are normalized to 1. In other words, each column *w _{j}* of

$${w}_{j}=\frac{{v}_{j}}{\sqrt{{v}_{j}^{T}{v}_{j}+{10}^{-8}}}.$$

`rica`

minimizes over the *v _{j}*.
The resulting minimal matrix

`W`

provides the transformation
from input data `X`

to output features `XW`

.[1] Ngiam, Jiquan, Zhenghao Chen, Sonia A. Bhaskar,
Pang W. Koh, and Andrew Y. Ng. “Sparse Filtering.” *Advances
in Neural Information Processing Systems.* Vol. 24, 2011,
pp. 1125–1133. `https://papers.nips.cc/paper/4334-sparse-filtering.pdf`

.

[2] Nocedal, J. and S. J. Wright. *Numerical
Optimization*, Second Edition. Springer Series in Operations
Research, Springer Verlag, 2006.

[3] Le, Quoc V., Alexandre Karpenko, Jiquan
Ngiam, and Andrew Y. Ng. “ICA with Reconstruction Cost for
Efficient Overcomplete Feature Learning.” *Advances
in Neural Information Processing Systems.* Vol. 24, 2011,
pp. 1017–1025. `https://papers.nips.cc/paper/4467-ica-with-reconstruction-cost-for-efficient-overcomplete-feature-learning.pdf`

.

`ReconstructionICA`

| `rica`

| `sparsefilt`

| `SparseFiltering`