rankfeatures

Rank key features by class separability criteria

Syntax

``IDX = rankfeatures(X,GROUP)``
``IDX = rankfeatures(X,GROUP,Name=Value)``
``[IDX,Z] = rankfeatures(X,GROUP,___)``

Description

`IDX = rankfeatures(X,GROUP)` ranks the features in `X` using an independent evaluation criterion for binary classification. `X` is a matrix where every column is an observed vector and the number of rows corresponds to the original number of features. `GROUP` contains the class labels. `IDX` is a list of indices to the rows of `X` with the most significant features.

`IDX = rankfeatures(X,GROUP,Name=Value)` uses additional options specified by one or more name-value arguments.

`[IDX,Z] = rankfeatures(X,GROUP,___)` also returns a list of absolute values of the criterion used for every feature.

Examples

Find a reduced set of genes that is sufficient for differentiating breast cancer cells from all other types of cancer in the t-matrix NCI60 data set.

Load sample data.

`load NCI60tmatrix`

Get a logical index vector to the breast cancer cells.

`BC = GROUP == 8;`

Select features.

`I = rankfeatures(X,BC,NumberOfIndices=12);`

Test features with a linear discriminant classifier.

```
C = classify(X(I,:)',X(I,:)',double(BC));
cp = classperf(BC,C);
cp.CorrectRate
```
```
ans = 1
```

Use cross-correlation weighting to further reduce the required number of genes.

```
I = rankfeatures(X,BC,'CCWeighting',0.7,'NumberOfIndices',8);
C = classify(X(I,:)',X(I,:)',double(BC));
cp = classperf(BC,C);
cp.CorrectRate
```
```
ans = 1
```

Find the discriminant peaks of two groups of signals with Gaussian pulses modulated by two different sources.

Load data.

`load GaussianPulses`

Specify the regional information that outweighs the Z-value of features as a function handle. Set the number of output indices to 5.

```
f = rankfeatures(y',grp,NWeighting=@(x) x/10+5,NumberOfIndices=5);
plot(t,y(grp==1,:),'b',t,y(grp==2,:),'g',t(f),1.35,'vr');
```

Input Arguments

`X`: Sample data, specified as a numeric matrix. Each column is an observed vector, and each row is a feature.

Data Types: `double`

`GROUP`: Class labels, specified as a numeric vector, string vector, or cell array of character vectors. `numel(GROUP)` must equal the number of columns in `X`. `GROUP` must have only two unique values. If it contains any `NaN` values, the function ignores the corresponding observation vectors in `X`.

Data Types: `double` | `string` | `cell`

Name-Value Arguments

Specify optional pairs of arguments as `Name1=Value1,...,NameN=ValueN`, where `Name` is the argument name and `Value` is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Example: `[idx,x] = rankfeatures(x,groups,Criterion="entropy",NWeighting=0.2)` uses relative entropy as the criterion for assessing feature significance and a regional information value of 0.2 to outweigh the Z-value of potential features.

Before R2021a, use commas to separate each name and value, and enclose `Name` in quotes.

Example: `[idx,x] = rankfeatures(x,groups,'Criterion',"entropy",'NWeighting',0.2)`

`Criterion`: Criterion to assess the significance of each feature for separating two labeled groups, specified as one of the following:

• `"ttest"` — Absolute value two-sample t-test with pooled variance estimate.

• `"entropy"` — Relative entropy, also known as Kullback-Leibler distance or divergence.

• `"bhattacharyya"` — Minimum attainable classification error or Chernoff bound.

• `"roc"` — Area between the empirical receiver operating characteristic (ROC) curve and the random classifier slope.

• `"wilcoxon"` — Absolute value of the standardized u-statistic of a two-sample unpaired Wilcoxon test, also known as Mann-Whitney.

Note

`"ttest"`, `"entropy"`, and `"bhattacharyya"` assume normally distributed classes, while `"roc"` and `"wilcoxon"` are nonparametric tests. All tests evaluate each feature independently of the others.

Data Types: `char` | `string`
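
To make the `"ttest"` criterion concrete, the following is a minimal sketch of an absolute two-sample t-statistic with a pooled variance estimate, computed per feature with base MATLAB functions only. The toy data and the variable names (`A`, `B`, `sp2`, `z`, `idx`) are hypothetical, and the sketch illustrates the statistic itself rather than the internal implementation of `rankfeatures`.

```matlab
% Hypothetical toy data: 3 features (rows) by 6 observations (columns).
X = [1 2 3 7 8 9;
     5 5 6 5 6 5;
     2 4 6 3 5 7];
group = [1 1 1 2 2 2];          % two class labels, one per column

A = X(:, group == 1);           % observations in the first class
B = X(:, group == 2);           % observations in the second class
nA = size(A, 2);
nB = size(B, 2);

% Pooled variance estimate for each feature (row).
sp2 = ((nA - 1) * var(A, 0, 2) + (nB - 1) * var(B, 0, 2)) / (nA + nB - 2);

% Absolute two-sample t-statistic for each feature.
z = abs(mean(A, 2) - mean(B, 2)) ./ sqrt(sp2 * (1/nA + 1/nB));

% Sort features from most to least separable.
[~, idx] = sort(z, 'descend');
```

In this toy data, the first feature separates the two classes most clearly, and the second feature has identical class means, so it ranks last.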

`CCWeighting`: Correlation information to outweigh the Z-value of potential features, specified as a numeric scalar between `0` and `1`.

The function uses $Z \times \left(1 - \alpha \times \rho\right)$ to calculate the weight, where ρ is the average of the absolute values of the cross-correlation coefficient between the candidate feature and all previously selected features. α is the `CCWeighting` value that sets the weighting factor.

By default, α is `0`, and the function does not weight the potential features. A large value of ρ (close to 1) outweighs the significance statistic, meaning that features that are highly correlated with the features already picked are less likely to be included in the output list.

Data Types: `double`
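
The weighting formula above can be sketched as a small greedy selection loop in base MATLAB. The loop structure, the toy data, and the names (`z`, `alpha`, `R`, `selected`) are assumptions made for illustration; the source states only the formula, not how `rankfeatures` applies it internally.

```matlab
% Hypothetical toy data: 4 features (rows) by 5 observations (columns).
X = [1 2 3 4 5;
     2 4 6 8 10.5;     % nearly a scaled copy of feature 1
     5 1 4 2 3;
     1 3 2 5 4];
z = [10; 9; 6; 3];     % hypothetical criterion values, one per feature
alpha = 0.7;           % plays the role of the CCWeighting value
numToPick = 2;

R = abs(corrcoef(X'));             % |correlation| between all feature pairs
selected = zeros(1, numToPick);
available = true(size(z));

% Assumed greedy scheme: pick the feature with the largest weighted value.
for k = 1:numToPick
    if k == 1
        rho = zeros(size(z));      % nothing selected yet, so no penalty
    else
        % Average |correlation| with the features selected so far.
        rho = mean(R(:, selected(1:k-1)), 2);
    end
    w = z .* (1 - alpha * rho);    % Z * (1 - alpha * rho)
    w(~available) = -Inf;          % never pick the same feature twice
    [~, best] = max(w);
    selected(k) = best;
    available(best) = false;
end
```

Without weighting, the two largest criterion values belong to features 1 and 2. With `alpha = 0.7`, feature 2 is penalized for being almost perfectly correlated with feature 1, so the sketch picks features 1 and 3 instead.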

`NWeighting`: Regional information to outweigh the Z-value of potential features, specified as a nonnegative scalar or function handle.

The function uses $Z \times \left(1 - e^{-\left(\frac{D}{\beta}\right)^{2}}\right)$ to calculate the weight, where D is the distance (in rows) between the candidate feature and previously selected features. β is the `NWeighting` value that sets the weighting factor. β must be greater than or equal to `0`.

By default, β is `0`, and the function does not weight the potential features. A small value of D (close to `0`) outweighs the significance statistics of only close features. This means that features that are close to already picked features are less likely to be included in the output list. This option is useful for extracting features from time series with temporal correlation.

β can also be a function of the feature location, specified using `@` or an anonymous function. In both cases `rankfeatures` passes the row position of the feature to the specified function and expects back a value greater than or equal to `0`.

Note

You can use `CCWeighting` and `NWeighting` together.

Data Types: `double` | `function_handle`
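
As an illustration of the regional weighting formula for `NWeighting`, the following minimal sketch applies it to a row of adjacent features after one feature has already been picked. The toy values and names (`z`, `beta`, `picked`, `w`) are hypothetical, and the sketch covers only the single-selected-feature case with a scalar β.

```matlab
z = [5 9 10 8 4 7];            % hypothetical criterion values for 6 adjacent features
beta = 2;                      % plays the role of the NWeighting value
picked = 3;                    % row index of the feature picked first (largest z)

D = abs((1:numel(z)) - picked);            % distance in rows to the picked feature
w = z .* (1 - exp(-(D ./ beta).^2));       % Z * (1 - exp(-(D/beta)^2))
w(picked) = -Inf;                          % exclude the feature already picked

[~, next] = max(w);                        % best remaining candidate
```

Here the immediate neighbors of the picked feature have high criterion values but are strongly down-weighted, so the next candidate is the more distant feature 6 rather than feature 2.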

`NumberOfIndices`: Number of output indices in `IDX`, specified as a positive scalar. By default, the number of indices is the same as the number of features when α and β are `0`. Otherwise, the number of indices is set to `20`.

Data Types: `double`

`CrossNorm`: Method for independent normalization across observations for every feature, specified as one of the following:

• `"none"` (default): No normalization.

• `"meanvar"`: $X_{\mathrm{new}} = \frac{X - \mu}{\sigma}$

• `"softmax"`: $X_{\mathrm{new}} = \frac{1}{1 + e^{\left(\frac{\mu - X}{\sigma}\right)}}$

• `"minmax"`: $X_{\mathrm{new}} = \frac{X - X_{\mathrm{min}}}{X_{\mathrm{max}} - X_{\mathrm{min}}}$

In these equations, `μ = mean(X)`, `σ = std(X)`, `Xmin = min(X)`, and `Xmax = max(X)`.

Cross-normalization ensures comparability among different features although it is not always necessary because the selected criterion might already account for this.

Data Types: `char` | `string`
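
The three normalization formulas can be sketched in base MATLAB as follows. The toy matrix is hypothetical, and the sketch assumes that the statistics are taken along each row (one feature across all observations), which matches the description of normalizing across observations for every feature. It relies on implicit expansion, so it assumes R2016b or later.

```matlab
% Hypothetical toy data: 2 features (rows) by 4 observations (columns).
X = [1 2 3 4;
     10 20 30 60];

mu    = mean(X, 2);            % per-feature mean across observations
sigma = std(X, 0, 2);          % per-feature standard deviation
xmin  = min(X, [], 2);         % per-feature minimum
xmax  = max(X, [], 2);         % per-feature maximum

Xmeanvar = (X - mu) ./ sigma;                    % "meanvar"
Xsoftmax = 1 ./ (1 + exp((mu - X) ./ sigma));    % "softmax"
Xminmax  = (X - xmin) ./ (xmax - xmin);          % "minmax"
```

After `"meanvar"`, each row has zero mean and unit standard deviation. After `"minmax"`, each row spans the interval from 0 to 1. After `"softmax"`, every value lies strictly between 0 and 1.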

Output Arguments

`IDX`: List of indices to the rows of `X` with the most significant features, returned as a numeric vector.

`Z`: List of absolute values of the `Criterion` used for the features, returned as a numeric vector.

References

[1] Theodoridis, Sergios, and Konstantinos Koutroumbas. Pattern Recognition. San Diego: Academic Press, 1999: 341-342.

[2] Liu, Huan, and Hiroshi Motoda. Feature Selection for Knowledge Discovery and Data Mining. Kluwer International Series in Engineering and Computer Science 454. Boston: Kluwer Academic Publishers, 1998.

[3] Ross, Douglas T., Uwe Scherf, Michael B. Eisen, Charles M. Perou, Christian Rees, Paul Spellman, Vishwanath Iyer, et al. “Systematic Variation in Gene Expression Patterns in Human Cancer Cell Lines.” Nature Genetics 24, no. 3 (March 2000): 227–35.

Version History

Introduced before R2006a