Screen Risk Factors by Custom Criteria
This example shows how to use the Screen Risk Factors task to automatically exclude risk factors from a table based on their predictive power. The example also shows how to set up the screening criteria.
Feature selection is an important step in the development of a statistical model. Input data can have hundreds or thousands of variables. Discarding some variables often improves model interpretability, training times, and other important attributes.
In this example, you load the Screen Risk Factors data set, which contains a table of customer information such as age, income, and employment status. You use predefined metrics to assess risk factors individually and analyze the predictive power of each variable relative to a binary response variable. You then select variables automatically or semi-automatically using the Screen Risk Factors task. Finally, you customize the screening criteria that the software uses to assess the risk factors.
Load Data and Predefined Screening Criteria
Load the example data from the ScreenRiskFactorsData MAT file.
load ScreenRiskFactorsData.mat
Construct predefined screening criteria in your workspace. The software implements these criteria:
For each variable, calculate the information value and the chi-square p-value.
Compare the values against threshold values to assign a pass, failure, or unclassified status.
The software classifies the data on a worst-of basis, in which the function returns a failure if the status of either the information value or chi-square p-value is a failure.
The task also displays the percentage of missing entries. This value does not affect the overall rating.
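The two metrics and the worst-of rule can be sketched in base MATLAB as follows. This is a minimal illustration, not the Modelscape implementation: the bin counts, the threshold values, and the status labels are all assumptions chosen for the example.

```matlab
% Hypothetical counts of nondefaults and defaults in four bins of one predictor.
nGood = [400 300 200 100];
nBad  = [ 20  30  40  60];

% Information value: sum over bins of (pGood - pBad).*log(pGood./pBad).
pGood = nGood/sum(nGood);
pBad  = nBad/sum(nBad);
infoValue = sum((pGood - pBad).*log(pGood./pBad));

% Chi-square test of independence between bin membership and response.
observed = [nGood; nBad];
expected = sum(observed,2)*sum(observed,1)/sum(observed,'all');
chi2Stat = sum((observed - expected).^2./expected,'all');
dof = (size(observed,1) - 1)*(size(observed,2) - 1);
% Upper tail of the chi-square distribution, written with gammainc so that
% the sketch does not need Statistics and Machine Learning Toolbox.
pValue = gammainc(chi2Stat/2,dof/2,'upper');

% Illustrative thresholds (assumptions, not the Modelscape defaults).
% Status codes: 0 = fail, 1 = undecided, 2 = pass.
ivStatus = (infoValue >= 0.1) + (infoValue >= 0.3);
pStatus  = (pValue <= 0.05) + (pValue <= 0.01);

% Worst-of rule: the overall status is the lower of the two statuses.
overall = min(ivStatus,pStatus);
labels = ["Fail" "Undecided" "Pass"];
overallLabel = labels(overall + 1)
```

Encoding the statuses as ordered numbers makes the worst-of rule a single call to min.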
Create a screening criteria object named myCriteria by calling the ExampleScreeningCriteria function. This function returns a ScreeningCriteria object defined by an mrm.data.selection.TestSuiteFactory function.
import mrm.data.selection.*
myCriteria = ExampleScreeningCriteria;
Launch Screen Risk Factors Task
Open a new live script and launch the Screen Risk Factors task by typing screen and selecting the task from the dropdown menu.
Alternatively, search for Screen Risk Factors in the Live Task gallery.
The task opens in a reduced view until you specify these required inputs:
Input table must be a table or a timetable. The dropdown shows all such objects in the workspace. For this example, select data.
Response variable must be one of the binary variables in the input table. For this example, select defaultIndicator.
Criteria must be the ScreeningCriteria object to apply. For this example, select myCriteria.
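Before you open the task, you can check the first two requirements programmatically. The sketch below uses a small stand-in table with made-up values so that it runs on its own. In the example, you would apply the same checks to data and defaultIndicator.

```matlab
% Stand-in table with hypothetical values; in the example, use data instead.
tbl = table([25;40;33;58],[0;1;0;1],'VariableNames',{'age','defaultIndicator'});
responseName = "defaultIndicator";

% The input must be a table or a timetable.
isValidInput = istable(tbl) || istimetable(tbl);

% The response variable must take exactly two distinct values.
isBinaryResponse = numel(unique(tbl.(responseName))) == 2;

% With Modelscape installed, a similar check for the criteria object is
% isa(myCriteria,'mrm.data.selection.ScreeningCriteria')
```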
Analyze and Remove Risk Factors
The task calculates the screening metrics for each risk factor in the input table and shows the results in the Analyze data variables section. The table contains one row for each variable in the input table and includes these columns:
Status — Overall classification of the variable based on the screening metrics.
Exclude — Option to remove the variable from the data set.
Comment — Reasons for excluding or including the variable.
The task populates the Exclude and Comment columns based on the criteria. In this example, the task automatically excludes the failures and includes the passes with automatically generated comments. The software leaves the undecided risk factors blank for you to analyze. You can overwrite these automatically completed values and sort the table according to any of these columns.
The area under the table is specific to the risk-factor variable and displays the screening metrics, as well as a double histogram that demonstrates how well the variable discriminates between the two possible responses. To switch the view to another variable, click the variable name in the table.
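To show what a double histogram conveys, this sketch overlays the distribution of one predictor for each response class using base MATLAB graphics. The data is synthetic and the variable names are hypothetical. The plot that the task draws may differ in binning and styling.

```matlab
rng(0)  % make the synthetic data reproducible

% Synthetic predictor values for the two response classes.
ageNoDefault = 45 + 10*randn(500,1);
ageDefault   = 35 + 10*randn(100,1);

% Overlay the two normalized histograms on common bin widths. The less the
% histograms overlap, the better the variable separates the two responses.
figure
h1 = histogram(ageNoDefault,'BinWidth',5,'Normalization','probability');
hold on
h2 = histogram(ageDefault,'BinWidth',5,'Normalization','probability');
hold off
legend('No default','Default')
xlabel('age (synthetic)')
ylabel('Fraction of observations')
```

Normalizing each histogram to probabilities keeps the classes comparable even when one response is much rarer than the other.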
Document with Modelscape Reporting
The live task dynamically produces two outputs:
filteredTable — Subtable of the input table without the excluded risk factors. Use this subtable in the next step of the model development process.
exclusionTable — Table that includes all the data of the input table together with the exclusion flags and comments in the task. To view this information, check the Preview summary tables box in the Display results section. The software stores this information in the exclusionTable.Properties.CustomProperties variable.
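The relationship between the two outputs can be illustrated with the custom properties that MATLAB tables support. In this sketch, the property names Exclude and Comment and all values are hypothetical. The names that the task actually stores in exclusionTable.Properties.CustomProperties may differ, so inspect that structure in your own workspace.

```matlab
% Stand-in table with hypothetical values.
T = table([25;40;33],[52000;61000;48000],'VariableNames',{'age','income'});

% Attach per-variable metadata to the table as custom properties.
T = addprop(T,{'Exclude','Comment'},{'variable','variable'});
T.Properties.CustomProperties.Exclude = [false true];
T.Properties.CustomProperties.Comment = ["Passes screening" "Low information value"];

% Derive a filtered subtable by keeping only the variables that are not flagged.
keep = ~T.Properties.CustomProperties.Exclude;
filtered = T(:,keep);
```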
Use Modelscape Reporting to document the findings of your analysis using the metadata in exclusionTable. Save the summarized exclusion and progress preview tables with the names ExclusionSummary and ProgressSummary, respectively, using the summarizeExclusionTable function.
import mrm.data.filter.*
[ExclusionSummary,ProgressSummary] = summarizeExclusionTable(exclusionTable)
Create holes in a Word document, titled ExclusionSummary and ProgressSummary, and insert the corresponding variables from MATLAB into the Word document using fillReportFromWorkspace. For more information about creating holes and using fillReportFromWorkspace, see Model Documentation in Modelscape.
Set Up Custom Criteria
To learn about test metrics, thresholds, and handlers that the screening criteria object uses, see Test Metrics in Modelscape and Metrics Handlers.
You can customize the criteria that the Screen Risk Factors task uses to screen variables. The criteria must be in an mrm.data.selection.ScreeningCriteria object. To see the class definition, run this command.
edit mrm.data.selection.ScreeningCriteria
This class holds a handle to a function f that you call with this syntax.
f(inputData,"PredictorVar",varName,"ResponseVar",respVar)
The function f must be well-defined and produce an mrm.data.validation.MetricsHandler object for any table or timetable inputData, any predictor variable varName, and any binary response variable respVar. The TestSuiteFactory function has this call signature.
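The calling convention can be tried out with a stand-in function handle. This sketch is deliberately incomplete. The handle returns the fraction of missing entries in the predictor as a plain number, not an mrm.data.validation.MetricsHandler object, so it cannot be used as screening criteria as written. The table values are hypothetical.

```matlab
% Stand-in table with hypothetical values, including one missing entry.
tbl = table([25;NaN;33;58],[0;1;0;1],'VariableNames',{'age','defaultIndicator'});

% Stand-in for f. It accepts the same five inputs as the call signature but
% skips the two name arguments and does not use the response variable.
f = @(inputData,~,varName,~,respVar) mean(ismissing(inputData.(varName)));

% Call the handle in the same way as the signature shown above.
missingFraction = f(tbl,"PredictorVar","age","ResponseVar","defaultIndicator");
```

A real criteria function would parse the name-value arguments properly and wrap its metrics in a MetricsHandler object.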
To see examples of these functions in the Modelscape package, run these commands.
edit mrm.data.selection.ExampleScreeningCriteria
edit mrm.data.selection.TestSuiteFactory
edit mrm.data.selection.overallScreeningStatus