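R recipes step_cut: Cut a numeric variable into a factor

step_cut() creates a specification of a recipe step that cuts a numeric variable into a factor based on the provided boundary values.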

Usage

step_cut(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  breaks,
  include_outside_range = FALSE,
  skip = FALSE,
  id = rand_id("cut")
)

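Arguments

recipe: A recipe object. The step will be added to the sequence of operations for this recipe.
...: One or more selector functions to choose variables for this step. See selections() for more details.
role: Not used by this step since no new variables are created.
trained: A logical to indicate whether the quantities for preprocessing have been estimated.
breaks: A numeric vector with at least one cut point.
include_outside_range: Logical, indicating whether values outside the range seen in the training set should be included in the lowest or highest bucket. Defaults to FALSE, in which case values outside the original range are set to NA.
skip: A logical. Should the step be skipped when the recipe is baked by bake()? While all operations are baked when prep() is run, some operations may not be able to be conducted on new data (e.g. processing the outcome variable(s)). Care should be taken when using skip = TRUE as it may affect the computations for subsequent operations.
id: A character string that is unique to this step to identify it.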
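
As a rough illustration of the include_outside_range argument described above, the two behaviours can be mimicked with base::cut() alone. This is a sketch, not the package's implementation: the variable names (train_x, new_x, user_breaks) are made up for the example, and the use of -Inf/Inf with hand-written labels for the open-ended case is an assumption chosen to resemble the labels shown in the examples.

```r
# Training values and a single user-supplied break (hypothetical names).
train_x <- 1:10
new_x <- 1:11
user_breaks <- 5

# Sketch of include_outside_range = FALSE: the outer boundaries come from
# the training range, so 11 falls outside every interval and becomes NA.
train_breaks <- c(min(train_x), user_breaks, max(train_x))
strict <- cut(new_x, breaks = train_breaks, include.lowest = TRUE)

# Sketch of include_outside_range = TRUE: open-ended outer buckets.
# Using -Inf/Inf and these labels is an assumption for illustration only.
open_breaks <- c(-Inf, user_breaks, Inf)
open <- cut(new_x, breaks = open_breaks, labels = c("[min,5]", "(5,max]"))

# Compare the two results side by side.
data.frame(new_x, strict, open)
```

In the strict variant the last value (11) is NA, while in the open-ended variant it lands in the highest bucket.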

Value

An updated version of recipe with the new step added to the sequence of any existing operations.

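Details

Unlike the base::cut() function, there is no need to specify the minimum and maximum values in the breaks. All values below the lowest break point end up in the first bucket, and all values above the highest break point end up in the last bucket.

step_cut() calls base::cut() in the baking step with include.lowest set to TRUE.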
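
To make the contrast with base::cut() concrete, the following base-R sketch spells out the outer boundaries by hand and shows what include.lowest = TRUE changes. It only illustrates the behaviour described above and is not the package's code; the names user_breaks and full_breaks are hypothetical.

```r
x <- 1:10
user_breaks <- 5

# Plain base::cut() needs the outer boundaries spelled out explicitly.
full_breaks <- sort(unique(c(min(x), user_breaks, max(x))))

# Without include.lowest, the first interval is (1, 5], so the minimum
# value itself is not covered and becomes NA.
without_lowest <- cut(x, breaks = full_breaks)

# With include.lowest = TRUE the first interval is closed: [1, 5].
with_lowest <- cut(x, breaks = full_breaks, include.lowest = TRUE)

# Count observations per bucket, showing NA only if present.
table(with_lowest, useNA = "ifany")
```

With include.lowest = TRUE every training value is assigned a bucket, which matches the [1,5] and (5,10] levels seen in the examples.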

Case weights

The underlying operation does not allow for case weights.

See also

Other discretization steps: step_discretize()

Examples

library(recipes)

df <- data.frame(x = 1:10, y = 5:14)
rec <- recipe(df)

# The min and max of the variable are used as boundaries
# if they exceed the breaks
rec %>%
  step_cut(x, breaks = 5) %>%
  prep() %>%
  bake(df)
#> # A tibble: 10 × 2
#>    x          y
#>    <fct>  <int>
#>  1 [1,5]      5
#>  2 [1,5]      6
#>  3 [1,5]      7
#>  4 [1,5]      8
#>  5 [1,5]      9
#>  6 (5,10]    10
#>  7 (5,10]    11
#>  8 (5,10]    12
#>  9 (5,10]    13
#> 10 (5,10]    14

# You can use the same breaks on multiple variables
# then for each variable the boundaries are set separately
rec %>%
  step_cut(x, y, breaks = c(6, 9)) %>%
  prep() %>%
  bake(df)
#> # A tibble: 10 × 2
#>    x      y     
#>    <fct>  <fct> 
#>  1 [1,6]  [5,6] 
#>  2 [1,6]  [5,6] 
#>  3 [1,6]  (6,9] 
#>  4 [1,6]  (6,9] 
#>  5 [1,6]  (6,9] 
#>  6 [1,6]  (9,14]
#>  7 (6,9]  (9,14]
#>  8 (6,9]  (9,14]
#>  9 (6,9]  (9,14]
#> 10 (9,10] (9,14]

# You can keep the original variables using `step_mutate` or
# `step_mutate_at`, for transforming multiple variables at once
rec %>%
  step_mutate(x_orig = x) %>%
  step_cut(x, breaks = 5) %>%
  prep() %>%
  bake(df)
#> # A tibble: 10 × 3
#>    x          y x_orig
#>    <fct>  <int>  <int>
#>  1 [1,5]      5      1
#>  2 [1,5]      6      2
#>  3 [1,5]      7      3
#>  4 [1,5]      8      4
#>  5 [1,5]      9      5
#>  6 (5,10]    10      6
#>  7 (5,10]    11      7
#>  8 (5,10]    12      8
#>  9 (5,10]    13      9
#> 10 (5,10]    14     10

# It is up to you if you want values outside the
# range learned at prep to be included
new_df <- data.frame(x = 1:11, y = 5:15)
rec %>%
  step_cut(x, breaks = 5, include_outside_range = TRUE) %>%
  prep() %>%
  bake(new_df)
#> # A tibble: 11 × 2
#>    x           y
#>    <fct>   <int>
#>  1 [min,5]     5
#>  2 [min,5]     6
#>  3 [min,5]     7
#>  4 [min,5]     8
#>  5 [min,5]     9
#>  6 (5,max]    10
#>  7 (5,max]    11
#>  8 (5,max]    12
#>  9 (5,max]    13
#> 10 (5,max]    14
#> 11 (5,max]    15

rec %>%
  step_cut(x, breaks = 5, include_outside_range = FALSE) %>%
  prep() %>%
  bake(new_df)
#> # A tibble: 11 × 2
#>    x          y
#>    <fct>  <int>
#>  1 [1,5]      5
#>  2 [1,5]      6
#>  3 [1,5]      7
#>  4 [1,5]      8
#>  5 [1,5]      9
#>  6 (5,10]    10
#>  7 (5,10]    11
#>  8 (5,10]    12
#>  9 (5,10]    13
#> 10 (5,10]    14
#> 11 NA        15

Source: R/cut.R

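Note: This article was compiled and translated by 纯净天空 from the original English work "Cut a numeric variable into a factor" by Max Kuhn et al. Unless otherwise stated, copyright in the original code belongs to the original authors; this translation may not be reproduced or copied without permission.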