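R embed step_woe: Weight of evidence transformation

step_woe() creates a specification of a recipe step that will transform nominal data into a numerical representation based on weights of evidence against a binary outcome.

Usage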

step_woe(
  recipe,
  ...,
  role = "predictor",
  outcome,
  trained = FALSE,
  dictionary = NULL,
  Laplace = 1e-06,
  prefix = "woe",
  keep_original_cols = FALSE,
  skip = FALSE,
  id = rand_id("woe")
)

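Arguments

recipe: A recipe object. The step will be added to the sequence of operations for this recipe.
...: One or more selector functions to choose which variables will be used to compute the components. See selections() for more details. For the tidy method, these are not currently used.
role: For model terms created by this step, what analysis role should they be assigned? By default, the function assumes that the new woe component columns created from the original variables will be used as predictors in a model.
outcome: The bare name of the binary outcome, wrapped in vars().
trained: A logical indicating whether the quantities for preprocessing have been estimated.
dictionary: A tbl. A map of levels and woe values. It must have the same layout as the output returned by dictionary(). If NULL, the function will build a dictionary from the variables passed to ... . See dictionary() for details.
Laplace: The Laplace smoothing parameter, also known as the 'pseudocount' parameter of the Laplace smoothing technique. It is typically used to avoid -Inf/Inf values for predictor categories that contain only one outcome class. Set it to 0 to allow Inf/-Inf. The default is 1e-6.
prefix: A character string that will be the prefix of the resulting new variables. See notes below.
keep_original_cols: A logical to keep the original variables in the output. Defaults to FALSE.
skip: A logical. Should the step be skipped when the recipe is baked by recipes::bake()? While all operations are baked when recipes::prep() is run, some operations may not be able to be conducted on new data (e.g. processing the outcome variable(s)). Care should be taken when using skip = TRUE, as it may affect the computations for subsequent operations.
id: A character string that is unique to this step to identify it.

Value

An updated version of recipe with the new step added to the sequence of existing steps (if any). For the tidy method, a tibble with the woe dictionary used to map categories to woe values.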

Details

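WoE is a transformation of a group of variables that produces a new set of features. The formula is

$$woe_c = \log\left(\frac{P(X = c \mid Y = 1)}{P(X = c \mid Y = 0)}\right)$$

where c goes from 1 to C levels of a given nominal predictor variable X.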
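As a worked illustration, the sketch below computes WoE by hand, taking the WoE of a level to be the log of the ratio between the share of one outcome class that falls in that level and the share of the other class. The counts are copied from the Home rows of the tidy() output in the examples. The data frame name `home`, the label "missing" for the NA level, and the choice of "bad" as the Y = 1 class are assumptions made for illustration; no smoothing is applied.

```r
# Counts per level of Home, taken from the tidy() output in the examples.
home <- data.frame(
  value  = c("ignore", "other", "owner", "parents", "priv", "rent", "missing"),
  n_bad  = c(4, 78, 192, 98, 42, 188, 3),
  n_good = c(4, 83, 739, 238, 71, 258, 2)
)

# Share of each outcome class that falls in each level.
home$p_bad  <- home$n_bad / sum(home$n_bad)
home$p_good <- home$n_good / sum(home$n_good)

# WoE per level: log ratio of the two shares (assumes "bad" plays the role of Y = 1).
home$woe <- log(home$p_bad / home$p_good)

print(home[, c("value", "woe")], digits = 3)
```

Levels where the "bad" class is over-represented get a positive value, and levels where it is under-represented get a negative one.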

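These components are designed to transform nominal variables into numerical ones whose order and magnitude reflect the association with a binary outcome. To apply the transformation to numerical predictors, it is advisable to discretize the variables before running WoE. Each variable is binned first, so that a woe value can later be associated with each bin. This can be achieved with step_discretize().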
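A minimal sketch of that workflow might look like the following, assuming the recipes, embed, and modeldata packages are installed. The choice of Age as the numeric predictor and num_breaks = 4 are arbitrary illustrations, not recommendations from the package authors.

```r
library(recipes)
library(embed)
library(modeldata)
data("credit_data")

# Bin the numeric predictor Age first, then encode the bins with WoE.
# num_breaks = 4 is an arbitrary illustrative choice.
rec_num <- recipe(Status ~ Age, data = credit_data) %>%
  step_discretize(Age, num_breaks = 4) %>%
  step_woe(Age, outcome = vars(Status))

rec_num_prepped <- prep(rec_num, training = credit_data)

# Each row now carries the WoE value of the age bin it falls into.
head(bake(rec_num_prepped, new_data = NULL))
```

The order of the steps matters: step_woe() needs the binned (nominal) version of the variable, so step_discretize() has to come first.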

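The argument Laplace is a small quantity added to the proportions of 1's and 0's in order to avoid log(p/0) or log(0/p) results. The numerical woe versions have names that begin with woe_ followed by the original name of the respective variable. See Good (1985).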
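The effect of the smoothing can be sketched in base R with the Job counts from the tidy() output in the examples, where the NA level has no "good" cases. The helper `woe_smoothed()` and the label "missing" are hypothetical, and adding the pseudocount to the counts before taking shares is an assumption; it happens to reproduce the value of roughly 14.7 shown for that level in the example output, but the package's internals may differ.

```r
# Counts per level of Job, taken from the tidy() output in the examples.
# The "missing" level has no "good" cases, so its unsmoothed WoE is infinite.
n_bad  <- c(fixed = 273, freelance = 159, others = 39, partime = 133, missing = 1)
n_good <- c(fixed = 988, freelance = 304, others = 35, partime = 68,  missing = 0)

# Hypothetical helper: adds the pseudocount to the counts before
# taking shares (an assumption, not necessarily the package's internals).
woe_smoothed <- function(n_bad, n_good, laplace = 1e-6) {
  p_bad  <- (n_bad + laplace) / (sum(n_bad) + laplace)
  p_good <- (n_good + laplace) / (sum(n_good) + laplace)
  log(p_bad / p_good)
}

woe_raw    <- woe_smoothed(n_bad, n_good, laplace = 0)
woe_smooth <- woe_smoothed(n_bad, n_good, laplace = 1e-6)

woe_raw["missing"]     # infinite: log(p / 0)
round(woe_smooth, 3)   # finite for every level
```

With such a small pseudocount, the levels that already contain both outcome classes are essentially unchanged; only the degenerate level moves from Inf to a large finite value.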

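A custom dictionary tibble can be passed to step_woe(). It must have the same structure as the output of dictionary() (see the examples). If it is not provided, it is created automatically. The role of this tibble is to store the map between the levels of the nominal predictors and their woe values. You may want to tweak this object, for example to fix the order between the levels of a given predictor. One easy way to do this is to adjust the output returned by dictionary().

Tidying

When you tidy() this step, a tibble is returned with columns terms (the selectors or variables selected), value, n_tot, n_bad, n_good, p_bad, p_good, woe, and outcome (the example output below also includes an id column). See dictionary() for more information.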

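Tuning Parameters

This step has 1 tuning parameter:

Laplace: Laplace Correction (type: double, default: 1e-06)

Case weights

The underlying operation does not allow for case weights.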

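References

Kullback, S. (1959). Information Theory and Statistics. Wiley, New York.

Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning, Second Edition. Springer.

Good, I. J. (1985), "Weight of evidence: A brief survey", Bayesian Statistics, 2, pp. 249-270.

Examples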

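library(recipes) # recipe(), prep(), bake(); also attaches dplyr
library(embed)   # step_woe(), dictionary()
library(modeldata)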
data("credit_data")

set.seed(111)
in_training <- sample(1:nrow(credit_data), 2000)

credit_tr <- credit_data[in_training, ]
credit_te <- credit_data[-in_training, ]

rec <- recipe(Status ~ ., data = credit_tr) %>%
  step_woe(Job, Home, outcome = vars(Status))

woe_models <- prep(rec, training = credit_tr)
#> Warning: Some columns used by `step_woe()` have categories with less than 10 values: 'Home', 'Job'

# the encoding:
bake(woe_models, new_data = credit_te %>% slice(1:5), starts_with("woe"))
#> # A tibble: 5 × 2
#>   woe_Job woe_Home
#>     <dbl>    <dbl>
#> 1  -0.451   0.519 
#> 2   0.187  -0.512 
#> 3  -0.451  -0.512 
#> 4   0.187  -0.512 
#> 5   1.51   -0.0519
# the original data
credit_te %>%
  slice(1:5) %>%
  dplyr::select(Job, Home)
#>         Job    Home
#> 1     fixed    rent
#> 2 freelance   owner
#> 3     fixed   owner
#> 4 freelance   owner
#> 5   partime parents
# the details:
tidy(woe_models, number = 1)
#> # A tibble: 12 × 10
#>    terms value    n_tot n_bad n_good   p_bad  p_good     woe outcome id   
#>    <chr> <chr>    <int> <dbl>  <dbl>   <dbl>   <dbl>   <dbl> <chr>   <chr>
#>  1 Job   fixed     1261   273    988 0.451   0.708   -0.451  Status  woe_…
#>  2 Job   freelan…   463   159    304 0.263   0.218    0.187  Status  woe_…
#>  3 Job   others      74    39     35 0.0645  0.0251   0.944  Status  woe_…
#>  4 Job   partime    201   133     68 0.220   0.0487   1.51   Status  woe_…
#>  5 Job   NA           1     1      0 0.00165 0       14.7    Status  woe_…
#>  6 Home  ignore       8     4      4 0.00661 0.00287  0.835  Status  woe_…
#>  7 Home  other      161    78     83 0.129   0.0595   0.773  Status  woe_…
#>  8 Home  owner      931   192    739 0.317   0.530   -0.512  Status  woe_…
#>  9 Home  parents    336    98    238 0.162   0.171   -0.0519 Status  woe_…
#> 10 Home  priv       113    42     71 0.0694  0.0509   0.310  Status  woe_…
#> 11 Home  rent       446   188    258 0.311   0.185    0.519  Status  woe_…
#> 12 Home  NA           5     3      2 0.00496 0.00143  1.24   Status  woe_…

# Example of custom dictionary + tweaking
# custom dictionary
woe_dict_custom <- credit_tr %>% dictionary(Job, Home, outcome = "Status")
woe_dict_custom[4, "woe"] <- 1.23 # tweak

# passing custom dict to step_woe()
rec_custom <- recipe(Status ~ ., data = credit_tr) %>%
  step_woe(
    Job, Home,
    outcome = vars(Status), dictionary = woe_dict_custom
  ) %>%
  prep()
#> Warning: Some columns used by `step_woe()` have categories with less than 10 values: 'Home', 'Job'

rec_custom_baked <- bake(rec_custom, new_data = credit_te)
rec_custom_baked %>%
  dplyr::filter(woe_Job == 1.23) %>%
  head()
#> # A tibble: 6 × 14
#>   Seniority  Time   Age Marital Records Expenses Income Assets  Debt
#>       <int> <int> <int> <fct>   <fct>      <int>  <int>  <int> <int>
#> 1         0    48    41 married no            90     80      0     0
#> 2         0    18    21 single  yes           35     50      0     0
#> 3         0    36    23 single  no            45    122   2500     0
#> 4        14    24    51 married no            75    198   1000     0
#> 5         1    60    26 single  no            35    120      0     0
#> 6         1    36    24 married no            76    164      0     0
#> # ℹ 5 more variables: Amount <int>, Price <int>, Status <fct>,
#> #   woe_Job <dbl>, woe_Home <dbl>

Source code: R/woe.R

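Note: This page was compiled by 纯净天空 from the original English documentation "Weight of evidence transformation" by Max Kuhn et al. Unless otherwise stated, copyright in the original code belongs to the original authors. Do not reproduce or copy this translation without permission or authorization.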