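R embed step_collapse_stringdist Collapse factor levels using stringdist

step_collapse_stringdist() creates a specification of a recipe step that will collapse factor levels that have a low string distance between them.

Usage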

step_collapse_stringdist(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  distance = NULL,
  method = "osa",
  options = list(),
  results = NULL,
  columns = NULL,
  skip = FALSE,
  id = rand_id("collapse_stringdist")
)

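Arguments

recipe: A recipe object. The step will be added to the sequence of operations for this recipe.
...: One or more selector functions to choose which variables are affected by the step. See selections() for more details. For the tidy method, these are not currently used.
role: Not used by this step since no new variables are created.
trained: A logical to indicate if the quantities for preprocessing have been estimated.
distance: Integer, the value that determines which strings should be collapsed with which. The value is inclusive, so 2 will collapse levels that have a string distance of 2 or lower.
method: Character, the method used for the distance calculation. The default is "osa"; see stringdist::stringdist-metrics.
options: List, other arguments passed to stringdist::stringdistmatrix(), such as weight, q, p, and bt, which are used for different values of method.
results: A list denoting how the labels are collapsed is stored here once this preprocessing step has been trained by prep().
columns: A character string of variable names that will be populated (eventually) by the terms argument.
skip: A logical. Should the step be skipped when the recipe is baked by bake()? While all operations are baked when prep() is run, some operations may not be able to be conducted on new data (e.g. processing the outcome variable). Care should be taken when using skip = TRUE as it may affect the computations for subsequent operations.
id: A character string that is unique to this step, used to identify it.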
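
To make the inclusive distance threshold concrete, here is a minimal sketch that computes pairwise OSA distances directly with stringdist::stringdistmatrix(). The vector lvls is an illustrative set of levels and is not something the step exposes. The sketch covers only the distance calculation, not how the step decides which level the others are merged into.

```r
library(stringdist)

# Illustrative factor levels (hypothetical input, chosen for this sketch)
lvls <- c("ak", "b", "djj", "e", "hjhgfgjgr")

# Pairwise OSA distances between every pair of levels
d <- stringdistmatrix(lvls, lvls, method = "osa")
dimnames(d) <- list(lvls, lvls)
d

# "b" and "e" differ by one substitution, so they fall within distance = 1
d["b", "e"]    # 1

# "ak" and "b" need one substitution and one deletion, so they are only
# within reach once distance is 2 or more
d["ak", "b"]   # 2
```

Because the threshold is inclusive, a pair at distance 2 is eligible for collapsing when distance = 2 but not when distance = 1.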

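Value

An updated version of recipe with the new step added to the sequence of existing steps (if any). For the tidy method, a tibble with columns terms (the columns that will be affected), from, to, and id.

Tidying

When you tidy() this step, a tibble is returned with columns "terms" (the columns being modified), "from" (the old levels), "to" (the new levels), and "id".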
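
As a small sketch of how this tibble might be used, the rows where from differs from to list the levels that were actually merged into another level. The object names mapping and changed are illustrative only.

```r
library(recipes)
library(embed)
library(tibble)

data0 <- tibble(
  x1 = c("a", "b", "d", "e", "sfgsfgsd", "hjhgfgjgr")
)

rec <- recipe(~., data = data0) %>%
  step_collapse_stringdist(all_predictors(), distance = 1) %>%
  prep()

# One row per original level, with the level it was mapped to
mapping <- tidy(rec, number = 1)

# Keep only the levels that were replaced by a different level
changed <- mapping[mapping$from != mapping$to, ]
changed
```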

Case weights

The underlying operation does not allow for case weights.

Examples

library(recipes)
library(embed) # provides step_collapse_stringdist()
library(tibble)
data0 <- tibble(
  x1 = c("a", "b", "d", "e", "sfgsfgsd", "hjhgfgjgr"),
  x2 = c("ak", "b", "djj", "e", "hjhgfgjgr", "hjhgfgjgr")
)

rec <- recipe(~., data = data0) %>%
  step_collapse_stringdist(all_predictors(), distance = 1) %>%
  prep()

rec %>%
  bake(new_data = NULL)
#> # A tibble: 6 × 2
#>   x1        x2       
#>   <fct>     <fct>    
#> 1 a         ak       
#> 2 a         b        
#> 3 a         djj      
#> 4 a         b        
#> 5 sfgsfgsd  hjhgfgjgr
#> 6 hjhgfgjgr hjhgfgjgr

tidy(rec, 1)
#> # A tibble: 11 × 4
#>    terms from      to        id                       
#>    <chr> <chr>     <chr>     <chr>                    
#>  1 x1    a         a         collapse_stringdist_q2VL4
#>  2 x1    b         a         collapse_stringdist_q2VL4
#>  3 x1    d         a         collapse_stringdist_q2VL4
#>  4 x1    e         a         collapse_stringdist_q2VL4
#>  5 x1    hjhgfgjgr hjhgfgjgr collapse_stringdist_q2VL4
#>  6 x1    sfgsfgsd  sfgsfgsd  collapse_stringdist_q2VL4
#>  7 x2    ak        ak        collapse_stringdist_q2VL4
#>  8 x2    b         b         collapse_stringdist_q2VL4
#>  9 x2    e         b         collapse_stringdist_q2VL4
#> 10 x2    djj       djj       collapse_stringdist_q2VL4
#> 11 x2    hjhgfgjgr hjhgfgjgr collapse_stringdist_q2VL4

rec <- recipe(~., data = data0) %>%
  step_collapse_stringdist(all_predictors(), distance = 2) %>%
  prep()

rec %>%
  bake(new_data = NULL)
#> # A tibble: 6 × 2
#>   x1        x2       
#>   <fct>     <fct>    
#> 1 a         ak       
#> 2 a         ak       
#> 3 a         djj      
#> 4 a         ak       
#> 5 sfgsfgsd  hjhgfgjgr
#> 6 hjhgfgjgr hjhgfgjgr

tidy(rec, 1)
#> # A tibble: 11 × 4
#>    terms from      to        id                       
#>    <chr> <chr>     <chr>     <chr>                    
#>  1 x1    a         a         collapse_stringdist_NRh52
#>  2 x1    b         a         collapse_stringdist_NRh52
#>  3 x1    d         a         collapse_stringdist_NRh52
#>  4 x1    e         a         collapse_stringdist_NRh52
#>  5 x1    hjhgfgjgr hjhgfgjgr collapse_stringdist_NRh52
#>  6 x1    sfgsfgsd  sfgsfgsd  collapse_stringdist_NRh52
#>  7 x2    ak        ak        collapse_stringdist_NRh52
#>  8 x2    b         ak        collapse_stringdist_NRh52
#>  9 x2    e         ak        collapse_stringdist_NRh52
#> 10 x2    djj       djj       collapse_stringdist_NRh52
#> 11 x2    hjhgfgjgr hjhgfgjgr collapse_stringdist_NRh52
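
The examples above use the default method. As a hedged sketch of the method and options arguments, the call below switches to the q-gram distance and forwards q through options. The values q = 1 and distance = 2 are arbitrary choices for illustration, and the names rec_qgram, baked_qgram, and mapping_qgram are hypothetical. No output is shown because the resulting grouping depends on the metric.

```r
library(recipes)
library(embed)
library(tibble)

data0 <- tibble(
  x1 = c("a", "b", "d", "e", "sfgsfgsd", "hjhgfgjgr"),
  x2 = c("ak", "b", "djj", "e", "hjhgfgjgr", "hjhgfgjgr")
)

# Same recipe as above, but with a q-gram distance; `q` is passed on to
# stringdist::stringdistmatrix() via `options`
rec_qgram <- recipe(~., data = data0) %>%
  step_collapse_stringdist(
    all_predictors(),
    distance = 2,
    method = "qgram",
    options = list(q = 1)
  ) %>%
  prep()

baked_qgram <- bake(rec_qgram, new_data = NULL)

# Inspect which levels were mapped to which under this metric
mapping_qgram <- tidy(rec_qgram, number = 1)
mapping_qgram
```

Since distances are on different scales for different metrics, a distance value that works well for "osa" will not necessarily be appropriate for another method.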

Source: R/collapse_stringdist.R

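Note: This page was compiled by 純淨天空 from the original English work "Collapse factor levels using stringdist" by Max Kuhn et al. Unless otherwise stated, copyright in the original code belongs to the original authors; this translation may not be reproduced or copied without permission.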