
    Feature Selection

    Giuseppe Casalicchio, published 2025-05-29 00:00:00
    [This article was first published on mlr-org, and kindly contributed to R-bloggers].

    Goal

    After this exercise, you should understand and be able to perform feature selection using wrapper functions with mlr3fselect. You should also be able to integrate various performance measures and calculate the generalization error.

    Wrapper Methods

    Besides filtering, wrapper methods are a second approach to feature selection. Filter methods keep or drop features based on conditions placed on a score computed for each feature, whereas wrapper methods apply the learner to different subsets of the feature set and compare the resulting performance. Because a model has to be refitted for every subset, wrapper methods are computationally expensive.
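To make the wrapper idea concrete, here is a minimal base-R sketch of the loop that a wrapper method performs. It is not how mlr3fselect is implemented; the data set (mtcars), the three candidate features, the linear model as learner and the holdout RMSE as measure are all assumptions chosen for illustration.

```r
# Illustrative wrapper loop in base R (assumed setup, not mlr3fselect code).
set.seed(1)
data(mtcars)
candidates = c("wt", "hp", "qsec")  # hypothetical candidate features
target = "mpg"

# One fixed holdout split so that every subset is scored on the same rows.
train_idx = sample(nrow(mtcars), size = 22)
train = mtcars[train_idx, ]
test = mtcars[-train_idx, ]

# Enumerate all non-empty subsets of the candidate features.
subsets = unlist(
  lapply(seq_along(candidates), function(k) combn(candidates, k, simplify = FALSE)),
  recursive = FALSE
)

# Refit the learner (here: lm) once per subset and record the holdout RMSE.
rmse = vapply(subsets, function(feats) {
  fit = lm(reformulate(feats, response = target), data = train)
  pred = predict(fit, newdata = test)
  sqrt(mean((test[[target]] - pred)^2))
}, numeric(1))

results = data.frame(
  features = vapply(subsets, paste, character(1), collapse = ", "),
  rmse = rmse
)

# The subset with the lowest holdout error is the wrapper's choice.
best = results[which.min(results$rmse), ]
print(results[order(results$rmse), ])
```

With three candidates there are only seven subsets, so exhaustive search is cheap here. The number of subsets doubles with every additional feature, which is why search strategies such as sequential or random search are used in practice.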

    For wrapper methods, we need the package mlr3fselect, which is built around the following R6 classes:

    • FSelectInstanceSingleCrit, FSelectInstanceMultiCrit: These two classes describe the feature selection problem and store the results.
    • FSelector: This class is the base class for implementations of feature selection algorithms.

    Prerequisites

    We load the most important packages and use a fixed seed for reproducibility.

    library(mlr3verse)
    library(data.table)
    library(mlr3fselect)
    set.seed(7891)

    In this exercise, we will use the german_credit data and the learner classif.ranger:

    task_gc = tsk("german_credit")
    lrn_ranger = lrn("classif.ranger")

    1 Basic Application

    1.1 Create the Framework

    Create an FSelectInstanceSingleCrit object using fsi(). The instance should use 3-fold cross-validation, classification accuracy as the measure, and terminate after 20 evaluations. For simplicity, consider only the features age, amount, credit_history and duration.

    Hint 1:
    task_gc$select(...)
    
    instance = fsi(
      task = ...,
      learner = ...,
      resampling = ...,
      measure = ...,
      terminator = ...
    )
    Solution
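One possible solution is sketched below. It assumes that mlr3verse, mlr3fselect and ranger are installed, and that fsi() names its measure argument `measures`, which is an assumption about the installed version (the hint's `measure` would then resolve through R's partial argument matching). Depending on the mlr3fselect version, the class of the returned object may carry a slightly different name than FSelectInstanceSingleCrit.

```r
library(mlr3verse)
library(mlr3fselect)
set.seed(7891)

# Restrict the task to the four features named in the exercise.
task_gc = tsk("german_credit")
task_gc$select(c("age", "amount", "credit_history", "duration"))

lrn_ranger = lrn("classif.ranger")

# Describe the feature selection problem: what to evaluate, how, and for how long.
instance = fsi(
  task = task_gc,
  learner = lrn_ranger,
  resampling = rsmp("cv", folds = 3),      # 3-fold cross-validation
  measures = msr("classif.acc"),           # classification accuracy
  terminator = trm("evals", n_evals = 20)  # stop after 20 evaluations
)
instance
```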

    1.2 Start the Feature Selection

    Start the feature selection by choosing the sequential strategy: create the corresponding FSelector object via fs(), then pass the FSelectInstanceSingleCrit object to the $optimize() method of that FSelector object.

    Hint 1:
    fselector = fs(...)
    Hint 2:
    fselector = fs(...)
    fselector$optimize(...)
    Solution
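A possible solution, sketched under the same assumptions as in 1.1. The instance is rebuilt here so that the snippet runs on its own; "sequential" is the dictionary key for sequential selection in mlr3fselect.

```r
library(mlr3verse)
library(mlr3fselect)
set.seed(7891)

task_gc = tsk("german_credit")
task_gc$select(c("age", "amount", "credit_history", "duration"))

instance = fsi(
  task = task_gc,
  learner = lrn("classif.ranger"),
  resampling = rsmp("cv", folds = 3),
  measures = msr("classif.acc"),
  terminator = trm("evals", n_evals = 20)
)

# Sequential selection: grow the feature set step by step.
fselector = fs("sequential")

# Runs the search and stores all evaluations in instance$archive.
fselector$optimize(instance)

# Best feature set found and its cross-validated accuracy.
instance$result_feature_set
instance$result_y
```

With only four features, the sequential search runs out of candidate sets before the budget of 20 evaluations is reached, so the terminator never has to intervene.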

    1.3 Evaluate

    View the four features and the accuracy from the instance archive for each of the first two batches.

    Hint 1:
    instance$archive$data[...]
    Hint 2:
    instance$archive$data[batch_nr == ..., ...]
    Solution
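A possible solution, again rebuilding the objects from 1.1 and 1.2 so the snippet is self-contained. The archive is a data.table with one logical column per feature, one column per measure and a `batch_nr` column; the helper name `cols` is introduced here for illustration only.

```r
library(mlr3verse)
library(mlr3fselect)
library(data.table)
set.seed(7891)

task_gc = tsk("german_credit")
task_gc$select(c("age", "amount", "credit_history", "duration"))

instance = fsi(
  task = task_gc,
  learner = lrn("classif.ranger"),
  resampling = rsmp("cv", folds = 3),
  measures = msr("classif.acc"),
  terminator = trm("evals", n_evals = 20)
)
fs("sequential")$optimize(instance)

# Columns of interest: the four feature indicators plus the accuracy.
cols = c("age", "amount", "credit_history", "duration", "classif.acc")

# Each row is one evaluated feature set; TRUE marks an included feature.
batch_1 = instance$archive$data[batch_nr == 1, ..cols]
batch_2 = instance$archive$data[batch_nr == 2, ..cols]

print(batch_1)
print(batch_2)
```

In a forward search, the first batch should contain the single-feature sets and the second batch the best single feature combined with each remaining feature.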

    1.4 Model Training

    Which feature(s) should be selected? Train the model.

    Hint 1:

    Compare the accuracy values for the different feature combinations and select the feature(s) accordingly.

    Hint 2:
    task_gc = ...
    task_gc$select(...)
    lrn_ranger$train(...)
    Solution
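A possible solution, assuming the feature set with the highest cross-validated accuracy in the archive is the one to keep. Instead of typing the names by hand, the sketch reads them from `instance$result_feature_set`; the objects from the previous steps are rebuilt so the snippet runs on its own.

```r
library(mlr3verse)
library(mlr3fselect)
set.seed(7891)

task_gc = tsk("german_credit")
task_gc$select(c("age", "amount", "credit_history", "duration"))
lrn_ranger = lrn("classif.ranger")

instance = fsi(
  task = task_gc,
  learner = lrn_ranger,
  resampling = rsmp("cv", folds = 3),
  measures = msr("classif.acc"),
  terminator = trm("evals", n_evals = 20)
)
fs("sequential")$optimize(instance)

# Feature set with the best accuracy according to the search.
selected = instance$result_feature_set
print(selected)

# Start from a fresh task, keep only the selected features, and train.
task_gc = tsk("german_credit")
task_gc$select(selected)
lrn_ranger$train(task_gc)
lrn_ranger$model
```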

    2 Multiple Performance Measures

    To optimize several performance measures at once, follow the same steps as above, but pass multiple measures when creating the instance. Create an instance object as above, this time with the measures classif.tpr and classif.tnr. In the second step, use random search; in the third step, take a look at the results.

    We again use the german_credit data:

    task_gc = tsk("german_credit")
    Hint 1:
    instance = fsi(...)
    fselector = fs(...)
    fselector$...(...)
    features = unlist(lapply(...))
    cbind(features,...)
    Solution
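A possible solution is sketched below. The exercise does not fix the resampling or the budget for this part, so 3-fold cross-validation and 20 evaluations are assumptions carried over from section 1. With more than one measure, the result is a set of Pareto-optimal feature sets instead of a single best one.

```r
library(mlr3verse)
library(mlr3fselect)
set.seed(7891)

task_gc = tsk("german_credit")

# Passing two measures turns this into a multi-criteria problem.
instance = fsi(
  task = task_gc,
  learner = lrn("classif.ranger"),
  resampling = rsmp("cv", folds = 3),              # assumed, as in section 1
  measures = msrs(c("classif.tpr", "classif.tnr")),
  terminator = trm("evals", n_evals = 20)          # assumed budget
)

# Random search draws feature subsets at random.
fselector = fs("random_search")
fselector$optimize(instance)

# Collapse each selected feature set into one string and show it
# next to its true positive and true negative rate.
features = unlist(lapply(instance$result$features, paste, collapse = ", "))
cbind(features, instance$result_y)
```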

    3 Nested Resampling

    Nested resampling makes it possible to obtain an unbiased performance estimate for a learner whose features are selected as part of training. In mlr3 this is done with the class AutoFSelector, instances of which can be created with the function auto_fselector().

    3.1 Create an AutoFSelector Instance

    Implement an AutoFSelector object that uses random search to find a feature selection that gives the highest accuracy for a logistic regression with holdout resampling. It should terminate after 10 evaluations.

    Hint 1:
    afs = auto_fselector(
      fselector = ...,
      learner = ...,
      resampling = ...,
      measure = ...,
      terminator = ...
    )
    Solution
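A possible solution, assuming mlr3verse and mlr3fselect are installed and that the logistic regression learner is available under the key "classif.log_reg" (provided by mlr3learners).

```r
library(mlr3verse)
library(mlr3fselect)
set.seed(7891)

# The AutoFSelector wraps a learner together with its feature selection.
afs = auto_fselector(
  fselector = fs("random_search"),
  learner = lrn("classif.log_reg"),
  resampling = rsmp("holdout"),            # inner resampling used for selection
  measure = msr("classif.acc"),
  terminator = trm("evals", n_evals = 10)  # stop after 10 evaluations
)
afs
```

Calling `afs$train()` on a task would first run the feature selection on the inner holdout split and then fit the logistic regression on the selected features.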

    3.2 Benchmark

    Compare the AutoFSelector with a plain logistic regression using 3-fold cross-validation.

    Hint 1:

    The AutoFSelector inherits from the Learner base class, which is why it can be used like any other learner.

    Hint 2:

    Implement a benchmark grid and aggregate the result.

    Solution
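A possible solution, rebuilding the AutoFSelector from 3.1 so the snippet runs on its own. The outer 3-fold cross-validation estimates performance, while the holdout inside the AutoFSelector is used only for choosing features. Accuracy as the aggregated measure is an assumption, chosen to match 3.1.

```r
library(mlr3verse)
library(mlr3fselect)
set.seed(7891)

afs = auto_fselector(
  fselector = fs("random_search"),
  learner = lrn("classif.log_reg"),
  resampling = rsmp("holdout"),
  measure = msr("classif.acc"),
  terminator = trm("evals", n_evals = 10)
)

# Both learners are evaluated on the same outer cross-validation splits.
grid = benchmark_grid(
  tasks = tsk("german_credit"),
  learners = list(afs, lrn("classif.log_reg")),
  resamplings = rsmp("cv", folds = 3)
)

bmr = benchmark(grid)

# One row per learner with the accuracy averaged over the outer folds.
agg = bmr$aggregate(msr("classif.acc"))
print(agg[, c("learner_id", "classif.acc")])
```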

    Summary

    • Wrapper methods calculate performance measures for various combinations of features in order to perform feature selection.
    • They are computationally expensive since several models need to be fitted.
    • The AutoFSelector inherits from the Learner base class, which is why it can be used like any other learner.