R textrecipes step_clean_levels Clean Categorical Levels

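step_clean_levels() creates a specification of a recipe step that will clean nominal data (character or factor) so that the levels consist only of letters, numbers, and the underscore.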
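As a rough illustration of what a level made up only of letters, numbers, and underscores looks like, the base R sketch below lowercases a label and collapses every other run of characters into a single underscore. The helper name clean_level_sketch is hypothetical, and the step's actual cleaning rules may differ (for example in how accented characters or duplicated results are handled).

```r
# Hypothetical helper, not part of textrecipes: an approximation of
# "clean" levels using base R only.
clean_level_sketch <- function(x) {
  x <- tolower(x)                     # lowercase everything
  x <- gsub("[^a-z0-9]+", "_", x)     # runs of other characters become "_"
  gsub("^_+|_+$", "", x)              # drop leading/trailing underscores
}

clean_level_sketch(c("Freer Gallery of Art", "Arthur M. Sackler Gallery"))
#> [1] "freer_gallery_of_art"     "arthur_m_sackler_gallery"
```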

Usage

step_clean_levels(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  clean = NULL,
  skip = FALSE,
  id = rand_id("clean_levels")
)

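Arguments

recipe: A recipe object. The step will be added to the sequence of operations for this recipe.
...: One or more selector functions to choose which variables are affected by the step. See recipes::selections() for more details.
role: Not used by this step, since no new variables are created.
trained: A logical indicating whether the quantities for preprocessing have been estimated.
clean: A named character vector used to clean and recode the categorical levels. This is NULL until computed by recipes::prep.recipe(). Note that if the original variable is a character vector, it will be converted to a factor.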
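skip: A logical. Should the step be skipped when the recipe is baked by recipes::bake.recipe()? While all operations are baked when recipes::prep.recipe() is run, some operations may not be able to be conducted on new data (e.g., processing the outcome variable(s)). Care should be taken when using skip = TRUE, since it may affect the computations of subsequent operations.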
id: A character string that is unique to this step, used to identify it.

Value

An updated version of recipe with the new step added to the sequence of existing steps (if any).

Details

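The new levels are cleaned and then reset with dplyr::recode_factor(). When the data to be processed contains novel levels (i.e., levels not contained in the training set), they are converted to missing.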
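The effect on novel levels can be sketched with base R alone. The example below is a stand-in, not the step's implementation: it uses factor() rather than dplyr::recode_factor(), and clean_map is a hypothetical mapping (names are original levels, values are cleaned levels) playing the role of the trained clean vector.

```r
# Hypothetical mapping learned from "training" data:
# names = original levels, values = cleaned levels.
clean_map <- c(
  "Freer Gallery of Art"      = "freer_gallery_of_art",
  "National Portrait Gallery" = "national_portrait_gallery"
)

# New data: one known level and one value that is not in the mapping.
new_values <- c("National Portrait Gallery", "National Zoological Park")

# factor() relabels known levels and turns anything else into NA.
recoded <- factor(new_values,
                  levels = names(clean_map),
                  labels = unname(clean_map))

as.character(recoded)
#> [1] "national_portrait_gallery" NA
```

If novel values should survive baking, one option is to add recipes::step_novel() before this step so that unseen values are assigned a dedicated level first.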

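Tidying

When you tidy() this step, a tibble is returned with the columns terms (the selectors or variables selected), original (the original levels), and value (the cleaned levels), along with an id column identifying the step.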

Case weights

The underlying operation does not allow for case weights.

See also

step_clean_names(), recipes::step_factor2string(), recipes::step_string2factor(), recipes::step_regex(), recipes::step_unknown(), recipes::step_novel(), recipes::step_other()

Other steps for text cleaning: step_clean_names()

Examples

library(recipes)
library(textrecipes) # provides step_clean_levels()
library(modeldata)
data(Smithsonian)

smith_tr <- Smithsonian[1:15, ]
smith_te <- Smithsonian[16:20, ]

rec <- recipe(~., data = smith_tr)

rec <- rec %>%
  step_clean_levels(name)
rec <- prep(rec, training = smith_tr)

cleaned <- bake(rec, smith_tr)

tidy(rec, number = 1)
#> # A tibble: 15 × 4
#>    terms original                                              value id   
#>    <chr> <chr>                                                 <chr> <chr>
#>  1 name  Anacostia Community Museum                            anac… clea…
#>  2 name  Arthur M. Sackler Gallery                             arth… clea…
#>  3 name  Arts and Industries Building                          arts… clea…
#>  4 name  Cooper Hewitt, Smithsonian Design Museum              coop… clea…
#>  5 name  Freer Gallery of Art                                  free… clea…
#>  6 name  George Gustav Heye Center                             geor… clea…
#>  7 name  Hirshhorn Museum and Sculpture Garden                 hirs… clea…
#>  8 name  National Air and Space Museum                         nati… clea…
#>  9 name  National Museum of African American History and Cult… nati… clea…
#> 10 name  National Museum of African Art                        nati… clea…
#> 11 name  National Museum of American History                   nati… clea…
#> 12 name  National Museum of Natural History                    nati… clea…
#> 13 name  National Museum of the American Indian                nati… clea…
#> 14 name  National Portrait Gallery                             nati… clea…
#> 15 name  Steven F. Udvar-Hazy Center                           stev… clea…

# novel levels are replaced with missing
bake(rec, smith_te)
#> # A tibble: 5 × 3
#>   name  latitude longitude
#>   <fct>    <dbl>     <dbl>
#> 1 NA        38.9     -77.0
#> 2 NA        38.9     -77.0
#> 3 NA        38.9     -77.0
#> 4 NA        38.9     -77.0
#> 5 NA        38.9     -77.1

Source: R/clean_levels.R
