R embed step_collapse_stringdist: Collapse factor levels using stringdist

step_collapse_stringdist() creates a specification of a recipe step that will collapse factor levels that have a low string distance between them.
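To make the idea of collapsing concrete, here is a minimal sketch of one way levels could be grouped by string distance. The helper collapse_levels() is hypothetical and is not the package's internal code; it assumes a greedy rule in which each level not yet assigned absorbs every unassigned level within the given distance.

```r
library(stringdist)

# Hypothetical helper for illustration only (not part of embed or recipes):
# walk the levels in order and let each unassigned level absorb every other
# unassigned level whose distance to it is at most `distance`.
collapse_levels <- function(lvls, distance, method = "osa") {
  d <- stringdistmatrix(lvls, lvls, method = method)
  to <- rep(NA_character_, length(lvls))
  for (i in seq_along(lvls)) {
    if (!is.na(to[i])) next  # already absorbed by an earlier level
    close <- which(d[i, ] <= distance & is.na(to))
    to[close] <- lvls[i]     # includes level i itself (distance 0)
  }
  data.frame(from = lvls, to = to)
}

collapse_levels(c("ak", "b", "djj", "e", "hjhgfgjgr"), distance = 1)
```

Under this rule, "e" is mapped to "b" at distance 1, and both "b" and "e" are mapped to "ak" at distance 2, which agrees with the mappings shown in the examples on this page.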

Usage

step_collapse_stringdist(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  distance = NULL,
  method = "osa",
  options = list(),
  results = NULL,
  columns = NULL,
  skip = FALSE,
  id = rand_id("collapse_stringdist")
)

Arguments

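recipe: A recipe object. The step will be added to the sequence of operations for this recipe.
...: One or more selector functions to choose which variables are affected by the step. See selections() for more details. For the tidy method, these are not currently used.
role: Not used by this step since no new variables are created.
trained: A logical to indicate if the quantities for preprocessing have been estimated.
distance: Integer, the value that determines which strings should be collapsed with which. The value is inclusive, so 2 will collapse levels that have a string distance of 2 or lower.
method: Character, the method for distance calculation. The default is "osa"; see stringdist::stringdist-metrics.
options: List, other arguments passed to stringdist::stringdistmatrix(), such as weight, q, p, and bt, which are used for different values of method.
results: A list denoting how the labels should be collapsed is stored here once this preprocessing step has been trained by prep().
columns: A character string of variable names that will be populated (eventually) by the terms argument.
skip: A logical. Should the step be skipped when the recipe is baked by bake()? While all operations are baked when prep() is run, some operations may not be able to be conducted on new data (e.g. processing the outcome variable(s)). Care should be taken when using skip = TRUE, as it may affect the computations for subsequent operations.
id: A character string that is unique to this step to identify it.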
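The distance, method, and options arguments can be explored outside a recipe by calling stringdist::stringdistmatrix() directly. The sketch below uses illustrative level names taken from the examples on this page. It assumes that an entry such as options = list(q = 1) reaches stringdistmatrix() as the argument q = 1.

```r
library(stringdist)

lvls <- c("ak", "b", "djj", "e", "hjhgfgjgr")

# Pairwise distances under the default method, "osa"
d_osa <- stringdistmatrix(lvls, lvls, method = "osa")
dimnames(d_osa) <- list(lvls, lvls)
d_osa

# Because the threshold is inclusive, these logical matrices show which
# pairs of levels are close enough to collapse at distance 1 and 2
d_osa <= 1
d_osa <= 2

# A different metric with an extra argument: q-gram distance with q = 1
# (in the step this would be method = "qgram", options = list(q = 1))
d_qgram <- stringdistmatrix(lvls, lvls, method = "qgram", q = 1)
dimnames(d_qgram) <- list(lvls, lvls)
d_qgram
```

Here "b" and "e" are at OSA distance 1, while "ak" is at distance 2 from both. That is why "ak" only joins that group once distance is raised to 2.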

Value

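An updated version of recipe with the new step added to the sequence of existing steps (if any). For the tidy method, a tibble with columns terms (the columns being modified), from, to, and id.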

Tidying

When you tidy() this step, a tibble is returned with columns "terms" (the columns being modified), "from" (the old levels), "to" (the new levels), and "id".
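Since the tidy() result lists every level, including those that map to themselves, it can help to keep only the rows where a level was actually changed. The sketch below assumes that step_collapse_stringdist() is provided by the embed package, as the page title suggests. The data and object names are illustrative.

```r
library(recipes)
library(embed)
library(tibble)

data0 <- tibble(
  x2 = c("ak", "b", "djj", "e", "hjhgfgjgr", "hjhgfgjgr")
)

rec <- recipe(~., data = data0) %>%
  step_collapse_stringdist(all_predictors(), distance = 1) %>%
  prep()

# Full from/to mapping for the first (and only) step
mapping <- tidy(rec, number = 1)

# Keep only the levels that were collapsed into a different level
changed <- mapping[mapping$from != mapping$to, ]
changed
```

Going by the example output on this page, the only remaining row for x2 at distance = 1 should be the one that maps "e" to "b".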

Case weights

The underlying operation does not allow for case weights.

Examples

library(recipes)
library(embed)
library(tibble)
data0 <- tibble(
  x1 = c("a", "b", "d", "e", "sfgsfgsd", "hjhgfgjgr"),
  x2 = c("ak", "b", "djj", "e", "hjhgfgjgr", "hjhgfgjgr")
)

rec <- recipe(~., data = data0) %>%
  step_collapse_stringdist(all_predictors(), distance = 1) %>%
  prep()

rec %>%
  bake(new_data = NULL)
#> # A tibble: 6 × 2
#>   x1        x2       
#>   <fct>     <fct>    
#> 1 a         ak       
#> 2 a         b        
#> 3 a         djj      
#> 4 a         b        
#> 5 sfgsfgsd  hjhgfgjgr
#> 6 hjhgfgjgr hjhgfgjgr

tidy(rec, 1)
#> # A tibble: 11 × 4
#>    terms from      to        id                       
#>    <chr> <chr>     <chr>     <chr>                    
#>  1 x1    a         a         collapse_stringdist_q2VL4
#>  2 x1    b         a         collapse_stringdist_q2VL4
#>  3 x1    d         a         collapse_stringdist_q2VL4
#>  4 x1    e         a         collapse_stringdist_q2VL4
#>  5 x1    hjhgfgjgr hjhgfgjgr collapse_stringdist_q2VL4
#>  6 x1    sfgsfgsd  sfgsfgsd  collapse_stringdist_q2VL4
#>  7 x2    ak        ak        collapse_stringdist_q2VL4
#>  8 x2    b         b         collapse_stringdist_q2VL4
#>  9 x2    e         b         collapse_stringdist_q2VL4
#> 10 x2    djj       djj       collapse_stringdist_q2VL4
#> 11 x2    hjhgfgjgr hjhgfgjgr collapse_stringdist_q2VL4

rec <- recipe(~., data = data0) %>%
  step_collapse_stringdist(all_predictors(), distance = 2) %>%
  prep()

rec %>%
  bake(new_data = NULL)
#> # A tibble: 6 × 2
#>   x1        x2       
#>   <fct>     <fct>    
#> 1 a         ak       
#> 2 a         ak       
#> 3 a         djj      
#> 4 a         ak       
#> 5 sfgsfgsd  hjhgfgjgr
#> 6 hjhgfgjgr hjhgfgjgr

tidy(rec, 1)
#> # A tibble: 11 × 4
#>    terms from      to        id                       
#>    <chr> <chr>     <chr>     <chr>                    
#>  1 x1    a         a         collapse_stringdist_NRh52
#>  2 x1    b         a         collapse_stringdist_NRh52
#>  3 x1    d         a         collapse_stringdist_NRh52
#>  4 x1    e         a         collapse_stringdist_NRh52
#>  5 x1    hjhgfgjgr hjhgfgjgr collapse_stringdist_NRh52
#>  6 x1    sfgsfgsd  sfgsfgsd  collapse_stringdist_NRh52
#>  7 x2    ak        ak        collapse_stringdist_NRh52
#>  8 x2    b         ak        collapse_stringdist_NRh52
#>  9 x2    e         ak        collapse_stringdist_NRh52
#> 10 x2    djj       djj       collapse_stringdist_NRh52
#> 11 x2    hjhgfgjgr hjhgfgjgr collapse_stringdist_NRh52

Source code: R/collapse_stringdist.R

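Note: This article was selected and compiled by 纯净天空 from the original English work "Collapse factor levels using stringdist" by Max Kuhn et al. Unless otherwise stated, copyright in the original code belongs to the original authors. Do not reproduce or copy this translation without permission or authorization.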