R dplyr sample_n: Sample n rows from a table

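sample_n() and sample_frac() have been superseded in favour of slice_sample(). While they will not be deprecated in the near future, their superseded status means that we will only perform critical bug fixes, so we recommend moving to the newer alternative.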

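These functions were superseded because we realised it was more convenient to have two mutually exclusive arguments to one function, rather than two separate functions. This also made it possible to clean up a few other smaller design issues with sample_n()/sample_frac():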

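The connection to slice() was not obvious.
The name of the first argument, tbl, is inconsistent with other single-table verbs, which use .data.
The size argument uses tidy evaluation, which is surprising and undocumented.
It was easier to remove the deprecated .env argument.
... was in a suboptimal position.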
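The "two mutually exclusive arguments" design can be sketched as follows. This is a minimal illustration, assuming dplyr 1.0.0 or later is installed; the variable names (by_count, by_share, both) are chosen here for illustration and are not part of the original page.

```r
library(dplyr)

df <- tibble(x = 1:5, w = c(0.1, 0.1, 0.1, 2, 2))

# One function, two ways to state the sample size:
by_count <- slice_sample(df, n = 2)       # a fixed number of rows
by_share <- slice_sample(df, prop = 0.5)  # a proportion of the rows

nrow(by_count)
nrow(by_share)

# Supplying both n and prop is rejected; capture the error instead of
# stopping the script so the condition can be inspected.
both <- tryCatch(
  slice_sample(df, n = 2, prop = 0.5),
  error = function(e) e
)
inherits(both, "error")
```

Because n and prop live on the same function, switching between a count and a proportion only changes the argument name, not the verb.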

Usage

sample_n(tbl, size, replace = FALSE, weight = NULL, .env = NULL, ...)

sample_frac(tbl, size = 1, replace = FALSE, weight = NULL, .env = NULL, ...)

Arguments

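tbl: A data.frame.
size: <tidy-select> For sample_n(), the number of rows to select. For sample_frac(), the fraction of rows to select. If tbl is grouped, size applies to each group.
replace: Sample with or without replacement?
weight: <tidy-select> Sampling weights. This must evaluate to a vector of non-negative numbers the same length as the input. Weights are automatically standardised to sum to 1.
.env: Deprecated.
...: Ignored.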
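The note that size applies to each group when tbl is grouped is not covered by the examples on this page, so here is a small sketch. The data frame and the names by_g, per_group and half_group are made up for illustration; the sketch assumes a dplyr version that still ships sample_n() and sample_frac().

```r
library(dplyr)

# Hypothetical data: group "a" has 4 rows, group "b" has 6 rows.
df <- tibble(g = rep(c("a", "b"), times = c(4, 6)), x = 1:10)
by_g <- group_by(df, g)

# size = 2 is applied within each group, so 2 rows come from "a"
# and 2 rows come from "b".
per_group <- sample_n(by_g, 2)
count(per_group, g)

# With sample_frac(), the fraction is likewise taken per group.
half_group <- sample_frac(by_g, 0.5)
count(half_group, g)

# The recommended replacement behaves the same way on grouped data.
count(slice_sample(by_g, n = 2), g)
```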

Examples

library(dplyr)

df <- tibble(x = 1:5, w = c(0.1, 0.1, 0.1, 2, 2))

# sample_n() -> slice_sample() ----------------------------------------------
# Was:
sample_n(df, 3)
#> # A tibble: 3 × 2
#>       x     w
#>   <int> <dbl>
#> 1     2   0.1
#> 2     5   2  
#> 3     4   2  
sample_n(df, 10, replace = TRUE)
#> # A tibble: 10 × 2
#>        x     w
#>    <int> <dbl>
#>  1     2   0.1
#>  2     2   0.1
#>  3     2   0.1
#>  4     2   0.1
#>  5     1   0.1
#>  6     3   0.1
#>  7     5   2  
#>  8     2   0.1
#>  9     5   2  
#> 10     2   0.1
sample_n(df, 3, weight = w)
#> # A tibble: 3 × 2
#>       x     w
#>   <int> <dbl>
#> 1     4   2  
#> 2     5   2  
#> 3     2   0.1

# Now:
slice_sample(df, n = 3)
#> # A tibble: 3 × 2
#>       x     w
#>   <int> <dbl>
#> 1     1   0.1
#> 2     2   0.1
#> 3     4   2  
slice_sample(df, n = 10, replace = TRUE)
#> # A tibble: 10 × 2
#>        x     w
#>    <int> <dbl>
#>  1     4   2  
#>  2     1   0.1
#>  3     1   0.1
#>  4     5   2  
#>  5     2   0.1
#>  6     3   0.1
#>  7     2   0.1
#>  8     5   2  
#>  9     2   0.1
#> 10     3   0.1
slice_sample(df, n = 3, weight_by = w)
#> # A tibble: 3 × 2
#>       x     w
#>   <int> <dbl>
#> 1     4   2  
#> 2     5   2  
#> 3     1   0.1

# Note that sample_n() would error if n was bigger than the group size
# slice_sample() will just use the available rows for consistency with
# the other slice helpers like slice_head()
try(sample_n(df, 10))
#> Error in sample_n(df, 10) : Can't compute indices.
#> Caused by error:
#> ! `size` must be less than or equal to 5 (size of data).
#> ℹ set `replace = TRUE` to use sampling with replacement.
slice_sample(df, n = 10)
#> # A tibble: 5 × 2
#>       x     w
#>   <int> <dbl>
#> 1     1   0.1
#> 2     3   0.1
#> 3     5   2  
#> 4     4   2  
#> 5     2   0.1

# sample_frac() -> slice_sample() -------------------------------------------
# Was:
sample_frac(df, 0.25)
#> # A tibble: 1 × 2
#>       x     w
#>   <int> <dbl>
#> 1     3   0.1
sample_frac(df, 2, replace = TRUE)
#> # A tibble: 10 × 2
#>        x     w
#>    <int> <dbl>
#>  1     5   2  
#>  2     5   2  
#>  3     2   0.1
#>  4     1   0.1
#>  5     3   0.1
#>  6     3   0.1
#>  7     1   0.1
#>  8     1   0.1
#>  9     5   2  
#> 10     5   2  

# Now:
slice_sample(df, prop = 0.25)
#> # A tibble: 1 × 2
#>       x     w
#>   <int> <dbl>
#> 1     4     2
slice_sample(df, prop = 2, replace = TRUE)
#> # A tibble: 10 × 2
#>        x     w
#>    <int> <dbl>
#>  1     4   2  
#>  2     1   0.1
#>  3     3   0.1
#>  4     3   0.1
#>  5     5   2  
#>  6     5   2  
#>  7     5   2  
#>  8     2   0.1
#>  9     2   0.1
#> 10     4   2
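
The outputs above differ from run to run because rows are drawn at random. The following sketch fixes the random seed with base R's set.seed() to make a draw repeatable, and uses that to illustrate the statement that weights are standardised to sum to 1: doubling every weight should leave the draw unchanged. The seed value 42 and the names draw_w and draw_2w are arbitrary choices for illustration, not part of the original examples.

```r
library(dplyr)

df <- tibble(x = 1:5, w = c(0.1, 0.1, 0.1, 2, 2))

# Same seed, original weights.
set.seed(42)
draw_w <- slice_sample(df, n = 3, weight_by = w)

# Same seed, every weight doubled. Only the relative sizes of the
# weights matter, so the same rows are expected to be selected.
set.seed(42)
draw_2w <- slice_sample(df, n = 3, weight_by = w * 2)

identical(draw_w$x, draw_2w$x)
```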

Source: R/sample.R


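Note: This article was selected and compiled by 纯净天空 from the original English work "Sample n rows from a table" by Hadley Wickham and others. Unless otherwise stated, copyright in the original code belongs to the original authors; do not reproduce or copy this translation without permission or authorization.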