R recipes step_cut: Cut a numeric variable into a factor

step_cut() creates a specification of a recipe step that cuts a numeric variable into a factor based on the provided boundary values.

Usage

step_cut(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  breaks,
  include_outside_range = FALSE,
  skip = FALSE,
  id = rand_id("cut")
)

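Arguments

recipe: A recipe object. The step will be added to the sequence of operations for this recipe.
...: One or more selector functions to choose variables for this step. See selections() for more details.
role: Not used by this step since no new variables are created.
trained: A logical indicating whether the quantities for preprocessing have been estimated.
breaks: A numeric vector with at least one cut point.
include_outside_range: A logical indicating whether values outside the range seen in the training set should be included in the lowest or highest bucket. Defaults to FALSE, in which case values outside the original range are set to NA.
skip: A logical. Should the step be skipped when the recipe is baked by bake()? While all operations are baked when prep() is run, some operations may not be able to be conducted on new data (e.g. processing the outcome variable). Care should be taken when using skip = TRUE, as it may affect the computations of subsequent operations.
id: A character string that is unique to this step, used to identify it.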
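
The effect of include_outside_range can be imitated in base R. The following is only a rough sketch under stated assumptions, not the recipes implementation: the variable names (train_x, new_x, user_break, trained_breaks) are hypothetical, and the open-ended intervals are emulated with -Inf and Inf plus hand-written labels.

```r
# Base R only; this imitates the documented behaviour and is not the
# recipes implementation.
train_x <- 1:10
new_x <- 1:11          # 11 lies above the training range
user_break <- 5

# Assumption: the boundaries are fixed from the training data.
trained_breaks <- c(min(train_x), user_break, max(train_x))

# Like include_outside_range = FALSE: 11 falls outside every interval,
# so cut() returns NA for it.
strict <- cut(new_x, breaks = trained_breaks, include.lowest = TRUE)

# Like include_outside_range = TRUE: the outer intervals are open-ended.
# Labels are supplied by hand to mirror the "[min,5]" / "(5,max]" naming.
open_ended <- cut(
  new_x,
  breaks = c(-Inf, user_break, Inf),
  labels = c("[min,5]", "(5,max]"),
  include.lowest = TRUE
)

is.na(strict[11])
as.character(open_ended[11])
```

In the strict version the new value 11 becomes NA, while in the open-ended version it lands in the highest bucket.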

Value

An updated version of recipe with the new step added to the sequence of any existing operations.

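Details

Unlike the base::cut() function, there is no need to specify the minimum and maximum values in breaks. All values below the lowest break point end up in the first bucket, and all values above the highest break point end up in the last bucket.

step_cut() calls base::cut() in the baking step with include.lowest set to TRUE.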
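
For comparison, here is a minimal base R sketch of what calling base::cut() directly involves. It assumes that the complete break set is the user-supplied break plus the observed minimum and maximum; the names full_breaks, binned, and binned_default are hypothetical and used only for illustration.

```r
# Plain base R; no recipes functions are called here.
x <- 1:10
breaks <- 5

# base::cut() needs the outer boundaries spelled out explicitly.
# Assumption: user breaks plus the observed range of the data.
full_breaks <- sort(unique(c(min(x), breaks, max(x))))

# include.lowest = TRUE closes the first interval on the left, so the
# minimum value (1) is kept rather than turned into NA.
binned <- cut(x, breaks = full_breaks, include.lowest = TRUE)

# With the default include.lowest = FALSE, the first interval is (1,5]
# and the minimum value is dropped to NA.
binned_default <- cut(x, breaks = full_breaks)

levels(binned)
table(binned)
is.na(binned_default[1])
```

With include.lowest = TRUE the levels are "[1,5]" and "(5,10]", which matches the labels shown in the examples in this page.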

Case weights

The underlying operation does not allow for case weights.

See also

Other discretization steps: step_discretize()

Examples

library(recipes)

df <- data.frame(x = 1:10, y = 5:14)
rec <- recipe(df)

# The min and max of the variable are used as boundaries
# if they exceed the breaks
rec %>%
  step_cut(x, breaks = 5) %>%
  prep() %>%
  bake(df)
#> # A tibble: 10 × 2
#>    x          y
#>    <fct>  <int>
#>  1 [1,5]      5
#>  2 [1,5]      6
#>  3 [1,5]      7
#>  4 [1,5]      8
#>  5 [1,5]      9
#>  6 (5,10]    10
#>  7 (5,10]    11
#>  8 (5,10]    12
#>  9 (5,10]    13
#> 10 (5,10]    14

# You can use the same breaks on multiple variables
# then for each variable the boundaries are set separately
rec %>%
  step_cut(x, y, breaks = c(6, 9)) %>%
  prep() %>%
  bake(df)
#> # A tibble: 10 × 2
#>    x      y     
#>    <fct>  <fct> 
#>  1 [1,6]  [5,6] 
#>  2 [1,6]  [5,6] 
#>  3 [1,6]  (6,9] 
#>  4 [1,6]  (6,9] 
#>  5 [1,6]  (6,9] 
#>  6 [1,6]  (9,14]
#>  7 (6,9]  (9,14]
#>  8 (6,9]  (9,14]
#>  9 (6,9]  (9,14]
#> 10 (9,10] (9,14]

# You can keep the original variables using `step_mutate` or
# `step_mutate_at`, for transforming multiple variables at once
rec %>%
  step_mutate(x_orig = x) %>%
  step_cut(x, breaks = 5) %>%
  prep() %>%
  bake(df)
#> # A tibble: 10 × 3
#>    x          y x_orig
#>    <fct>  <int>  <int>
#>  1 [1,5]      5      1
#>  2 [1,5]      6      2
#>  3 [1,5]      7      3
#>  4 [1,5]      8      4
#>  5 [1,5]      9      5
#>  6 (5,10]    10      6
#>  7 (5,10]    11      7
#>  8 (5,10]    12      8
#>  9 (5,10]    13      9
#> 10 (5,10]    14     10

# It is up to you if you want values outside the
# range learned at prep to be included
new_df <- data.frame(x = 1:11, y = 5:15)
rec %>%
  step_cut(x, breaks = 5, include_outside_range = TRUE) %>%
  prep() %>%
  bake(new_df)
#> # A tibble: 11 × 2
#>    x           y
#>    <fct>   <int>
#>  1 [min,5]     5
#>  2 [min,5]     6
#>  3 [min,5]     7
#>  4 [min,5]     8
#>  5 [min,5]     9
#>  6 (5,max]    10
#>  7 (5,max]    11
#>  8 (5,max]    12
#>  9 (5,max]    13
#> 10 (5,max]    14
#> 11 (5,max]    15

rec %>%
  step_cut(x, breaks = 5, include_outside_range = FALSE) %>%
  prep() %>%
  bake(new_df)
#> # A tibble: 11 × 2
#>    x          y
#>    <fct>  <int>
#>  1 [1,5]      5
#>  2 [1,5]      6
#>  3 [1,5]      7
#>  4 [1,5]      8
#>  5 [1,5]      9
#>  6 (5,10]    10
#>  7 (5,10]    11
#>  8 (5,10]    12
#>  9 (5,10]    13
#> 10 (5,10]    14
#> 11 NA        15

Source code: R/cut.R

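Note: This article was selected and compiled by 純淨天空 from the original English work "Cut a numeric variable into a factor" by Max Kuhn et al. Unless otherwise stated, copyright of the original code belongs to the original authors. Do not reproduce or copy this translation without permission or authorization.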