R recipes step_classdist_shrunken: Compute shrunken centroid distances for classification models

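step_classdist_shrunken creates a specification of a recipe step that converts numeric data into Euclidean distances to the regularized class centroids. This is done for each value of a categorical class variable.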

Usage

step_classdist_shrunken(
  recipe,
  ...,
  class = NULL,
  role = NA,
  trained = FALSE,
  threshold = 1/2,
  sd_offset = 1/2,
  log = TRUE,
  prefix = "classdist_",
  keep_original_cols = TRUE,
  objects = NULL,
  skip = FALSE,
  id = rand_id("classdist_shrunken")
)

Arguments

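recipe: A recipe object. The step will be added to the sequence of operations for this recipe.
...: One or more selector functions to choose variables for this step. See selections() for more details.
class: A single character string that specifies a single categorical variable to be used as the class.
role: Not used by this step since no new variables are created.
trained: A logical to indicate if the quantities for preprocessing have been estimated.
threshold: A regularization parameter between 0 and 1. Zero means that no regularization is used; one means that the centroids are shrunk all the way to the global centroid.
sd_offset: A value between 0 and 1 giving the quantile used to stabilize the pooled standard deviation.
log: A logical: should the distances be transformed by the natural log function?
prefix: A character string for the prefix of the resulting new variables. See the notes below.
keep_original_cols: A logical to keep the original variables in the output. The signature above sets the default to TRUE.
objects: Statistics are stored here once this step has been trained by prep().
skip: A logical. Should the step be skipped when the recipe is baked by bake()? While all operations are baked when prep() is run, some operations may not be able to be conducted on new data (e.g. processing the outcome variable). Care should be taken when using skip = TRUE as it may affect the computations for subsequent operations.
id: A character string that is unique to this step to identify it.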

Details

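Class-specific centroids are the multivariate averages of each predictor, computed from the training set data for each class. When preprocessing a new data point, this step computes the distance from the new point to each of the class centroids. These distance features can be very effective at capturing linear class boundaries, so they can be useful additions to the existing predictor set of a nonlinear model. If the true boundary is actually linear, the model will have an easier time learning the training data patterns.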
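The idea of unshrunken centroid distances can be sketched in base R. This is an illustration only, using the built-in iris data; it is not the recipes implementation, and the names `centroids` and `dist_to_centroid` are chosen here for the example.

```r
# Illustration only: class centroids and Euclidean distances in base R.
predictors <- as.matrix(iris[, 1:4])
classes <- iris$Species

# One row per class, one column per predictor (multivariate class means).
centroids <- apply(predictors, 2, function(col) tapply(col, classes, mean))

# Euclidean distance from every row to every class centroid.
dist_to_centroid <- sapply(rownames(centroids), function(cl) {
  diffs <- sweep(predictors, 2, centroids[cl, ])
  sqrt(rowSums(diffs^2))
})

# Log-transform the distances, mirroring the step's default of log = TRUE.
log_dists <- log(dist_to_centroid)
head(log_dists)
```

Each row gets one distance column per class, which is the same shape of output that the step adds to the data.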

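Shrunken centroids use a form of regularization in which the class-specific centroids are contracted toward the overall, class-independent centroid. If a predictor is uninformative, shrinking may move its class centroids all the way to the overall centroid, which removes that predictor's effect on the new distance features. In many cases, however, shrinking does not move all of the class-specific values to the center, which means that some features will only affect the classification of specific classes.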
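A simplified sketch of this kind of shrinkage uses soft thresholding on standardized class centroids, in the spirit of Tibshirani et al. (2002). The helper `shrink_centroids()` is hypothetical and written only for this illustration; in particular, scaling the cutoff by the largest absolute centroid value is an assumption, and the exact computation inside recipes may differ.

```r
# Hypothetical helper: soft-threshold standardized class centroids.
# Assumption: the cutoff is `threshold` times the largest absolute deviation.
shrink_centroids <- function(x, classes, threshold = 0.5) {
  x <- scale(as.matrix(x))  # standardize, so the global centroid is zero
  # Class means on the standardized scale are deviations from the global centroid.
  by_class <- apply(x, 2, function(col) tapply(col, classes, mean))
  cutoff <- threshold * max(abs(by_class))
  # Soft thresholding: reduce each magnitude by `cutoff`, flooring at zero.
  sign(by_class) * pmax(abs(by_class) - cutoff, 0)
}

shrunk_none <- shrink_centroids(iris[, 1:4], iris$Species, threshold = 0)
shrunk_half <- shrink_centroids(iris[, 1:4], iris$Species, threshold = 0.5)
shrunk_all  <- shrink_centroids(iris[, 1:4], iris$Species, threshold = 1)
shrunk_half
```

Under these assumptions, a threshold of 0 leaves the centroids unchanged and a threshold of 1 moves every class centroid onto the global centroid; intermediate values zero out only the weaker predictor and class combinations.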

The threshold argument can be used to tune how much regularization is applied.
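One way to see the effect of threshold is to prep the same recipe at several values and count how many shrunken centroid values are exactly zero. This is a minimal sketch that reuses the penguins setup from the examples; the grid of thresholds and the names `thresholds` and `n_zero` are arbitrary choices for illustration.

```r
library(recipes)

data(penguins, package = "modeldata")
penguins <- penguins[complete.cases(penguins), ]
penguins$island <- NULL
penguins$sex <- NULL

# Arbitrary grid of regularization values to compare.
thresholds <- c(0.1, 0.25, 0.5, 0.75, 0.9)

# For each value, train the step and count zeroed-out shrunken centroids.
n_zero <- sapply(thresholds, function(th) {
  rec <- recipe(species ~ ., data = penguins) %>%
    step_classdist_shrunken(all_numeric_predictors(),
      class = "species", threshold = th
    )
  centroids <- tidy(prep(rec, training = penguins), number = 1)
  sum(centroids$type == "shrunken" & centroids$value == 0)
})

data.frame(threshold = thresholds, n_zero = n_zero)
```

In practice the threshold would be chosen by resampling against a model performance metric rather than by inspecting zero counts.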

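step_classdist_shrunken will create a new column for every unique value of the class variable. The resulting variables will not replace the original values and, by default, have the prefix classdist_. The naming format can be changed using the prefix argument.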

Tidying

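When you tidy() this step, a tibble is returned with columns terms (the selectors or variables selected), value (the centroid), class, and type; the example output below also shows threshold and id columns. The type column has values "global", "by_class", and "shrunken". The first two types of centroids are in the original units, while the last has been standardized.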
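Because the tidy output is in long format, it can help to pull out one centroid type and lay it out as a predictor-by-class table. The following is a small sketch using base R subsetting and xtabs(); the object names `centroids`, `shrunken`, and `shrunken_wide` are illustrative.

```r
library(recipes)

data(penguins, package = "modeldata")
penguins <- penguins[complete.cases(penguins), ]
penguins$island <- NULL
penguins$sex <- NULL

rec_dists <- recipe(species ~ ., data = penguins) %>%
  step_classdist_shrunken(all_numeric_predictors(), class = "species") %>%
  prep(training = penguins)

centroids <- tidy(rec_dists, number = 1)

# Keep only the standardized, shrunken centroids.
shrunken <- centroids[centroids$type == "shrunken", c("terms", "class", "value")]

# One row per predictor, one column per class.
shrunken_wide <- xtabs(value ~ terms + class, data = shrunken)
shrunken_wide
```

Rows that are zero across all classes correspond to predictors that no longer influence any of the distance features.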

Case weights

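This step performs a supervised operation that can utilize case weights. As a result, case weights are used with frequency weights as well as importance weights. For more information, see the documentation in case_weights and the examples on tidymodels.org.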
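A minimal sketch of supplying case weights follows. It assumes that the hardhat package is installed and that a column created with hardhat::frequency_weights() is detected by recipe() and assigned the case weights role; the equal weights and the column name `wts` are made up purely to show the mechanics.

```r
library(recipes)

data(penguins, package = "modeldata")
penguins <- penguins[complete.cases(penguins), ]
penguins$island <- NULL
penguins$sex <- NULL

# Assumption: equal integer weights, added only to demonstrate the mechanics.
penguins$wts <- hardhat::frequency_weights(rep(1L, nrow(penguins)))

rec_wts <- recipe(species ~ ., data = penguins) %>%
  step_classdist_shrunken(all_numeric_predictors(), class = "species")

# Inspect the roles; the weights column should not be listed as a predictor.
roles <- summary(rec_wts)
roles

rec_wts_trained <- prep(rec_wts, training = penguins)
```

With real data, the weights would come from the sampling design or from how much each row should count, and the centroids would then be computed as weighted averages.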

References

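Tibshirani, R., Hastie, T., Narasimhan, B., & Chu, G. (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences, 99(10), 6567-6572.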

See also

Other multivariate transformation steps: step_classdist(), step_depth(), step_geodist(), step_ica(), step_isomap(), step_kpca_poly(), step_kpca_rbf(), step_kpca(), step_mutate_at(), step_nnmf_sparse(), step_nnmf(), step_pca(), step_pls(), step_ratio(), step_spatialsign()

Examples

library(recipes)

data(penguins, package = "modeldata")
penguins <- penguins[complete.cases(penguins), ]
penguins$island <- NULL
penguins$sex <- NULL

# define naming convention
rec <- recipe(species ~ ., data = penguins) %>%
  step_classdist_shrunken(all_numeric_predictors(),
    class = "species",
    threshold = 1 / 4, prefix = "centroid_"
  )

# default naming
rec <- recipe(species ~ ., data = penguins) %>%
  step_classdist_shrunken(all_numeric_predictors(),
    class = "species",
    threshold = 3 / 4
  )

rec_dists <- prep(rec, training = penguins)

dists_to_species <- bake(rec_dists, new_data = penguins, everything())
## on log scale:
dist_cols <- grep("classdist", names(dists_to_species), value = TRUE)
dists_to_species[, c("species", dist_cols)]
#> # A tibble: 333 × 4
#>    species classdist_Adelie classdist_Gentoo classdist_Chinstrap
#>    <fct>              <dbl>            <dbl>               <dbl>
#>  1 Adelie             1.49            1.72                 1.49 
#>  2 Adelie             1.03            1.35                 1.03 
#>  3 Adelie             1.56            1.93                 1.56 
#>  4 Adelie             1.42            1.78                 1.42 
#>  5 Adelie             1.11            1.48                 1.11 
#>  6 Adelie             1.61            1.86                 1.61 
#>  7 Adelie             0.602           0.0916               0.602
#>  8 Adelie             2.02            2.29                 2.02 
#>  9 Adelie             0.898           1.26                 0.898
#> 10 Adelie             0.756           0.673                0.756
#> # ℹ 323 more rows

tidy(rec, number = 1)
#> # A tibble: 1 × 6
#>   terms                    value class type  threshold id                 
#>   <chr>                    <dbl> <chr> <chr>     <dbl> <chr>              
#> 1 all_numeric_predictors()    NA NA    NA           NA classdist_shrunken…
tidy(rec_dists, number = 1)
#> # A tibble: 36 × 6
#>    terms          value class     type     threshold id                   
#>    <chr>          <dbl> <chr>     <chr>        <dbl> <chr>                
#>  1 bill_length_mm  44.0 Adelie    global        0.75 classdist_shrunken_u…
#>  2 bill_length_mm  38.8 Adelie    by_class      0.75 classdist_shrunken_u…
#>  3 bill_length_mm   0   Adelie    shrunken      0.75 classdist_shrunken_u…
#>  4 bill_length_mm  44.0 Gentoo    global        0.75 classdist_shrunken_u…
#>  5 bill_length_mm  47.6 Gentoo    by_class      0.75 classdist_shrunken_u…
#>  6 bill_length_mm   0   Gentoo    shrunken      0.75 classdist_shrunken_u…
#>  7 bill_length_mm  44.0 Chinstrap global        0.75 classdist_shrunken_u…
#>  8 bill_length_mm  48.8 Chinstrap by_class      0.75 classdist_shrunken_u…
#>  9 bill_length_mm   0   Chinstrap shrunken      0.75 classdist_shrunken_u…
#> 10 bill_depth_mm   17.2 Adelie    global        0.75 classdist_shrunken_u…
#> # ℹ 26 more rows

Source code: R/classdist_shrunken.R

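Note: This article was compiled by 純淨天空 from the original English work "Compute shrunken centroid distances for classification models" by Max Kuhn and others. Unless otherwise stated, copyright in the original code belongs to the original authors; do not reproduce or copy this translation without permission.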