R rsample make_strata: Create or Modify Stratification Variables

This function can create strata from numeric data and make non-numeric data more suitable for stratification.

Usage

make_strata(x, breaks = 4, nunique = 5, pool = 0.1, depth = 20)

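Arguments

x: An input vector.
breaks: A single number giving the number of bins desired to stratify a numeric stratification variable.
nunique: An integer giving the unique-value threshold used by the algorithm.
pool: A proportion of the data used to determine whether a particular group is too small and should be pooled into another group. We do not recommend decreasing this argument below its default of 0.1 because of the dangers of stratifying groups that are too small.
depth: An integer used to determine the best number of percentiles to use. The number of bins is based on min(5, floor(n / depth)), where n = length(x). If x is numeric, the data set must have at least 40 rows (when depth = 20) for stratified sampling to be carried out.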
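
The arithmetic behind the depth argument can be checked by hand. The following is a minimal sketch in base R that evaluates only the documented formula; documented_bins is a hypothetical helper written for illustration and is not part of rsample.

```r
# Hypothetical helper (not part of rsample): evaluates the documented
# formula min(5, floor(n / depth)) for a data set with n rows.
documented_bins <- function(n, depth = 20) {
  min(5, floor(n / depth))
}

documented_bins(100)  # 5
documented_bins(40)   # 2, the smallest result that still gives two strata
documented_bins(20)   # 1, too few rows to stratify
```

With depth = 20, fewer than 40 rows leaves at most one bin, which is consistent with the 20-row x5 example under Examples falling back to a single stratum. Note that the x6 example there requests breaks = 10 on 200 rows and gets ten groups, so in practice the requested breaks, rather than the constant 5, appears to act as the upper bound.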

Value

A factor vector.

Details

For numeric data, if the number of unique levels is less than nunique, the data are treated as categorical data.
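
As a rough illustration of this decision, the sketch below counts the distinct non-missing values of x and compares the count with nunique. The helper treated_as_categorical is hypothetical and not part of rsample. The comparison uses <= as an assumption: in the x2 example under Examples, a vector with exactly five unique values is handled as categorical at the default nunique = 5.

```r
# Hypothetical helper (not part of rsample). Assumption: a numeric vector
# whose unique-value count equals nunique is still treated as categorical.
treated_as_categorical <- function(x, nunique = 5) {
  n_unique <- length(unique(stats::na.omit(x)))
  is.character(x) || is.factor(x) || n_unique <= nunique
}

treated_as_categorical(c(0, 1, 2, 3, 4, 1, 0))  # TRUE: five unique values
treated_as_categorical(1:11)                    # FALSE: eleven unique values
treated_as_categorical(factor(1:11))            # TRUE: factors are categorical
```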

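For categorical inputs, the function finds the levels of x that account for less than pool of the data. Values in these groups are randomly assigned to the remaining strata (as are data points with missing values in x).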
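
The pooling step described above can be approximated in base R. This is a sketch under stated assumptions, not the rsample implementation: pool_small_levels is a hypothetical helper, and it assumes that at least one level reaches the pool threshold.

```r
# Hypothetical helper (not part of rsample): reassign rare levels and
# missing values at random to the levels that are common enough to keep.
pool_small_levels <- function(x, pool = 0.1) {
  x <- as.character(x)
  pcts <- table(x) / length(x)        # share of each level; NAs are not counted as a level
  small <- names(pcts)[pcts < pool]   # levels below the pooling threshold
  keep <- setdiff(names(pcts), small) # levels that remain as strata
  to_fill <- is.na(x) | x %in% small
  x[to_fill] <- sample(keep, size = sum(to_fill), replace = TRUE)
  factor(x)
}

x4 <- rep(LETTERS[1:7], c(37, 26, 3, 7, 11, 10, 2))
set.seed(1)
pooled <- pool_small_levels(x4)
levels(pooled)  # "A" "B" "E" "F"
```

Levels C, D and G each make up less than 10% of the 96 values, so only A, B, E and F remain. The per-level counts depend on the random draw and will not generally match the make_strata output shown under Examples.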

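For numeric data with more unique values than nunique, the data are converted to categorical data based on percentiles of the data. Each percentile group can contain no more than 20% of the data. Again, missing values in x are randomly assigned to groups.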
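
Percentile-based binning of this kind can be sketched with quantile() and cut() from base R. The helper percentile_strata is hypothetical and only illustrates the idea. It assumes equally spaced percentiles and omits the depth check and the warnings that make_strata issues.

```r
# Hypothetical helper (not part of rsample): bin a numeric vector at
# equally spaced percentiles and assign missing values to a random bin.
percentile_strata <- function(x, breaks = 4) {
  probs <- (0:breaks) / breaks
  cuts <- unique(stats::quantile(x, probs = probs, na.rm = TRUE))
  out <- cut(x, breaks = cuts, include.lowest = TRUE)
  miss <- is.na(x)
  if (any(miss)) {
    out[miss] <- sample(levels(out), size = sum(miss), replace = TRUE)
  }
  out
}

set.seed(483)
x6 <- rnorm(200)
table(percentile_strata(x6, breaks = 10))
```

For continuous data without ties, ten decile bins on 200 values hold 20 observations each. Duplicate cut points, which can occur with heavily tied data, are collapsed by unique(), so fewer bins than requested may result.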

Examples

library(rsample)

set.seed(61)
x1 <- rpois(100, lambda = 5)
table(x1)
#> x1
#>  1  2  3  4  5  6  7  8  9 10 11 
#>  3 16  8 19 14 18 11  4  5  1  1 
table(make_strata(x1))
#> 
#>  [1,3]  (3,5]  (5,6] (6,11] 
#>     27     33     18     22 

set.seed(554)
x2 <- rpois(100, lambda = 1)
table(x2)
#> x2
#>  0  1  2  3  4 
#> 36 34 19  6  5 
table(make_strata(x2))
#> 
#>  0  1  2 
#> 38 40 22 

# small groups are randomly assigned
x3 <- factor(x2)
table(x3)
#> x3
#>  0  1  2  3  4 
#> 36 34 19  6  5 
table(make_strata(x3))
#> 
#>  0  1  2 
#> 41 35 24 

# `oilType` data from `caret`
x4 <- rep(LETTERS[1:7], c(37, 26, 3, 7, 11, 10, 2))
table(x4)
#> x4
#>  A  B  C  D  E  F  G 
#> 37 26  3  7 11 10  2 
table(make_strata(x4))
#> 
#>  A  B  E  F 
#> 40 27 14 15 
table(make_strata(x4, pool = 0.1))
#> 
#>  A  B  E  F 
#> 38 29 12 17 
table(make_strata(x4, pool = 0.0))
#> Warning: Stratifying groups that make up 0% of the data may be statistically risky.
#> • Consider increasing `pool` to at least 0.1
#> 
#>  A  B  C  D  E  F  G 
#> 37 26  3  7 11 10  2 

# not enough data to stratify
x5 <- rnorm(20)
table(make_strata(x5))
#> Warning: The number of observations in each quantile is below the recommended threshold of 20.
#> • Stratification will use 1 breaks instead.
#> Warning: Too little data to stratify.
#> • Resampling will be unstratified.
#> 
#> strata1 
#>      20 

set.seed(483)
x6 <- rnorm(200)
quantile(x6, probs = (0:10) / 10)
#>         0%        10%        20%        30%        40%        50% 
#> -2.9114060 -1.4508635 -0.9513821 -0.6257852 -0.3286468 -0.0364388 
#>        60%        70%        80%        90%       100% 
#>  0.2027140  0.4278573  0.7050643  1.2471852  2.6792505 
table(make_strata(x6, breaks = 10))
#> 
#>    [-2.91,-1.45]   (-1.45,-0.951]  (-0.951,-0.626]  (-0.626,-0.329] 
#>               20               20               20               20 
#> (-0.329,-0.0364]  (-0.0364,0.203]    (0.203,0.428]    (0.428,0.705] 
#>               20               20               20               20 
#>     (0.705,1.25]      (1.25,2.68] 
#>               20               20

Source code: R/make_strata.R

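Note: This article was selected and compiled by 純淨天空 from the original English work Create or Modify Stratification Variables by Hannah Frick and others. Unless otherwise stated, copyright in the original code belongs to the original authors. Do not reproduce or copy this translation without permission or authorization.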