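R rsample make_strata: Create or Modify Stratification Variables

This function can create strata from numeric data and make non-numeric data more suitable for stratification.

Usage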

make_strata(x, breaks = 4, nunique = 5, pool = 0.1, depth = 20)

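Arguments

x: An input vector.
breaks: A single number giving the number of bins desired to stratify a numeric stratification variable.
nunique: An integer giving the unique-value threshold used by the algorithm.
pool: A proportion of the data used to determine whether a particular group is too small and should be pooled into another group. We do not recommend decreasing this argument below its default of 0.1 because of the dangers of stratifying groups that are too small.
depth: An integer used to determine the best number of percentiles to use. The number of bins is based on min(5, floor(n / depth)), where n = length(x). If x is numeric, the data set must have at least 40 rows (when depth = 20) for stratified sampling to be carried out.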
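The arithmetic behind depth can be sketched in a few lines of base R. This is an illustration only: the helper name usable_bins is hypothetical and not part of rsample. The formula above caps the bin count at 5, while the examples further down (ten groups from breaks = 10 on 200 values) suggest the cap is the requested breaks, so the sketch assumes min(breaks, floor(n / depth)).

```r
# Hypothetical helper (not an rsample function): how many bins the data
# can support, assuming the cap is the requested number of breaks.
usable_bins <- function(n, breaks = 4, depth = 20) {
  min(breaks, floor(n / depth))
}

usable_bins(20)                 # 1: fewer than 2 bins, so no stratification
usable_bins(40)                 # 2: the smallest n that allows stratification
usable_bins(200, breaks = 10)   # 10
```

Under this assumption, any numeric x with fewer than 2 * depth values yields fewer than two bins, which matches the 40-row minimum stated for depth = 20.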

Value

A factor vector.

Details

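For numeric data, if the number of unique levels is less than nunique, the data are treated as categorical data. In the x2 example below, five unique values with the default nunique = 5 are handled as categorical, which suggests the threshold is inclusive.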

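For categorical inputs, the function finds the levels of x whose share of the data is less than pool. The values in these groups are randomly assigned to the remaining strata (as are data points with missing values in x).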
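The pooling rule for categorical inputs can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not rsample's implementation: the function name pool_small_levels is hypothetical, and it assumes at least one level meets the pool threshold.

```r
# Hypothetical re-implementation for illustration; not rsample's code.
pool_small_levels <- function(x, pool = 0.1) {
  x <- as.character(x)
  pcts <- table(x) / length(x)        # share of the data held by each level
  small <- names(pcts)[pcts < pool]   # levels considered too small
  keep <- setdiff(names(pcts), small) # levels that remain as strata
  x[x %in% small] <- NA               # treat small groups like missing values
  out <- factor(x, levels = keep)
  n_miss <- sum(is.na(out))
  if (n_miss > 0) {
    # Randomly assign small-group and missing values to the remaining strata
    out[is.na(out)] <- sample(keep, size = n_miss, replace = TRUE)
  }
  out
}

set.seed(1)
x <- rep(LETTERS[1:7], c(37, 26, 3, 7, 11, 10, 2))
pooled <- pool_small_levels(x)
levels(pooled)
table(pooled)
```

With these counts, levels C, D, and G each hold less than 10% of the 96 values, so only A, B, E, and F survive as strata. The exact counts per stratum depend on the random seed.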

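For numeric data with more unique values than nunique, the data are converted to categorical data based on percentiles of the data. The percentile groups will have no more than 20 percent of the data in each group. Again, missing values in x are randomly assigned to groups.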
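The percentile-based conversion can be approximated with base R's quantile() and cut(). This is a minimal sketch, assuming equally spaced quantile cut points (which the interval labels in the examples suggest); it is not rsample's actual code, and it omits the random assignment of missing values.

```r
set.seed(483)
x <- rnorm(200)
breaks <- 4

# Cut points at equally spaced percentiles; unique() guards against
# duplicated cut points when the data contain many ties.
pctls <- unique(quantile(x, probs = (0:breaks) / breaks, na.rm = TRUE))

# include.lowest = TRUE keeps the minimum value inside the first interval
strata <- cut(x, breaks = pctls, include.lowest = TRUE)
table(strata)
```

For continuous data without ties, each of the four intervals holds a quarter of the values. With heavily tied data, such as Poisson counts, the groups can be unequal in size.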

Examples

set.seed(61)
x1 <- rpois(100, lambda = 5)
table(x1)
#> x1
#>  1  2  3  4  5  6  7  8  9 10 11 
#>  3 16  8 19 14 18 11  4  5  1  1 
table(make_strata(x1))
#> 
#>  [1,3]  (3,5]  (5,6] (6,11] 
#>     27     33     18     22 

set.seed(554)
x2 <- rpois(100, lambda = 1)
table(x2)
#> x2
#>  0  1  2  3  4 
#> 36 34 19  6  5 
table(make_strata(x2))
#> 
#>  0  1  2 
#> 38 40 22 

# small groups are randomly assigned
x3 <- factor(x2)
table(x3)
#> x3
#>  0  1  2  3  4 
#> 36 34 19  6  5 
table(make_strata(x3))
#> 
#>  0  1  2 
#> 41 35 24 

# `oilType` data from `caret`
x4 <- rep(LETTERS[1:7], c(37, 26, 3, 7, 11, 10, 2))
table(x4)
#> x4
#>  A  B  C  D  E  F  G 
#> 37 26  3  7 11 10  2 
table(make_strata(x4))
#> 
#>  A  B  E  F 
#> 40 27 14 15 
table(make_strata(x4, pool = 0.1))
#> 
#>  A  B  E  F 
#> 38 29 12 17 
table(make_strata(x4, pool = 0.0))
#> Warning: Stratifying groups that make up 0% of the data may be statistically risky.
#> • Consider increasing `pool` to at least 0.1
#> 
#>  A  B  C  D  E  F  G 
#> 37 26  3  7 11 10  2 

# not enough data to stratify
x5 <- rnorm(20)
table(make_strata(x5))
#> Warning: The number of observations in each quantile is below the recommended threshold of 20.
#> • Stratification will use 1 breaks instead.
#> Warning: Too little data to stratify.
#> • Resampling will be unstratified.
#> 
#> strata1 
#>      20 

set.seed(483)
x6 <- rnorm(200)
quantile(x6, probs = (0:10) / 10)
#>         0%        10%        20%        30%        40%        50% 
#> -2.9114060 -1.4508635 -0.9513821 -0.6257852 -0.3286468 -0.0364388 
#>        60%        70%        80%        90%       100% 
#>  0.2027140  0.4278573  0.7050643  1.2471852  2.6792505 
table(make_strata(x6, breaks = 10))
#> 
#>    [-2.91,-1.45]   (-1.45,-0.951]  (-0.951,-0.626]  (-0.626,-0.329] 
#>               20               20               20               20 
#> (-0.329,-0.0364]  (-0.0364,0.203]    (0.203,0.428]    (0.428,0.705] 
#>               20               20               20               20 
#>     (0.705,1.25]      (1.25,2.68] 
#>               20               20

Source code: R/make_strata.R

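Note: This article was compiled by 纯净天空 from the original English work "Create or Modify Stratification Variables" by Hannah Frick and others. Unless otherwise stated, copyright in the original code belongs to the original authors. Do not reproduce or copy this translation without permission or authorization.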