R dplyr sample_n: Sample n rows from a table

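sample_n() and sample_frac() have been superseded in favour of slice_sample(). While they will not be deprecated in the near future, retirement means that we will only perform critical bug fixes, so we recommend moving to the newer alternative.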

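These functions were superseded because we realised it was more convenient to have two mutually exclusive arguments to one function, rather than two separate functions. This also made it possible to clean up a few other smaller design issues with sample_n()/sample_frac():

The connection to slice() was not obvious.
The name of the first argument, tbl, is inconsistent with other single-table verbs, which use .data.
The size argument uses tidy evaluation, which is surprising and undocumented.
It was easier to remove the deprecated .env argument.
... was in a suboptimal position.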
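
As a rough illustration of the "one function, two mutually exclusive arguments" design described above, the sketch below calls slice_sample() with either n or prop. The table `demo` is a made-up example, and the exact wording of the error raised when both arguments are supplied may differ between dplyr versions.

```r
library(dplyr)

# Hypothetical 10-row table used only for this illustration.
demo <- tibble(x = 1:10)

# `n` plays the role of sample_n()'s size: an absolute number of rows.
by_n <- slice_sample(demo, n = 3)

# `prop` plays the role of sample_frac()'s size: a fraction of the rows.
by_prop <- slice_sample(demo, prop = 0.5)

nrow(by_n)     # 3 rows
nrow(by_prop)  # 5 rows

# The two arguments are mutually exclusive, so supplying both is an error.
both <- try(slice_sample(demo, n = 3, prop = 0.5), silent = TRUE)
inherits(both, "try-error")
```

Keeping both sampling modes in a single function means a call site only has to swap one argument name to move between a fixed count and a fraction.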

Usage

sample_n(tbl, size, replace = FALSE, weight = NULL, .env = NULL, ...)

sample_frac(tbl, size = 1, replace = FALSE, weight = NULL, .env = NULL, ...)

Arguments

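tbl: A data.frame.
size: <tidy-select> For sample_n(), the number of rows to select. For sample_frac(), the fraction of rows to select. If tbl is grouped, size applies to each group.
replace: Sample with or without replacement?
weight: <tidy-select> Sampling weights. This must evaluate to a vector of non-negative numbers the same length as the input. Weights are automatically standardised to sum to 1.
.env: Deprecated.
...: Ignored.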
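
The per-group behaviour of size and the normalisation of weight can be sketched as follows. The tables `grouped` and `weighted` are hypothetical and exist only for this illustration; the row counts follow from the group sizes chosen here.

```r
library(dplyr)

# Hypothetical grouped table: 4 rows in group "a", 6 rows in group "b".
grouped <- tibble(g = rep(c("a", "b"), times = c(4, 6)), x = 1:10) %>%
  group_by(g)

# size = 2 is applied within each group, so each group contributes 2 rows.
per_group_n <- sample_n(grouped, 2)
table(per_group_n$g)

# size = 0.5 is a fraction of each group: 2 of 4 rows and 3 of 6 rows.
per_group_frac <- sample_frac(grouped, 0.5)
table(per_group_frac$g)

# Weights only need to be non-negative; they do not have to sum to 1.
# Rows 1 and 2 have weight 0, so only rows 3 to 5 can be drawn.
weighted <- tibble(x = 1:5, w = c(0, 0, 5, 5, 5))
picked <- sample_n(weighted, 2, weight = w)
all(picked$w > 0)
```

Because the weights are rescaled internally, c(0, 0, 5, 5, 5) describes the same sampling distribution as c(0, 0, 1, 1, 1).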

Examples

library(dplyr)

df <- tibble(x = 1:5, w = c(0.1, 0.1, 0.1, 2, 2))

# sample_n() -> slice_sample() ----------------------------------------------
# Was:
sample_n(df, 3)
#> # A tibble: 3 × 2
#>       x     w
#>   <int> <dbl>
#> 1     2   0.1
#> 2     5   2  
#> 3     4   2  
sample_n(df, 10, replace = TRUE)
#> # A tibble: 10 × 2
#>        x     w
#>    <int> <dbl>
#>  1     2   0.1
#>  2     2   0.1
#>  3     2   0.1
#>  4     2   0.1
#>  5     1   0.1
#>  6     3   0.1
#>  7     5   2  
#>  8     2   0.1
#>  9     5   2  
#> 10     2   0.1
sample_n(df, 3, weight = w)
#> # A tibble: 3 × 2
#>       x     w
#>   <int> <dbl>
#> 1     4   2  
#> 2     5   2  
#> 3     2   0.1

# Now:
slice_sample(df, n = 3)
#> # A tibble: 3 × 2
#>       x     w
#>   <int> <dbl>
#> 1     1   0.1
#> 2     2   0.1
#> 3     4   2  
slice_sample(df, n = 10, replace = TRUE)
#> # A tibble: 10 × 2
#>        x     w
#>    <int> <dbl>
#>  1     4   2  
#>  2     1   0.1
#>  3     1   0.1
#>  4     5   2  
#>  5     2   0.1
#>  6     3   0.1
#>  7     2   0.1
#>  8     5   2  
#>  9     2   0.1
#> 10     3   0.1
slice_sample(df, n = 3, weight_by = w)
#> # A tibble: 3 × 2
#>       x     w
#>   <int> <dbl>
#> 1     4   2  
#> 2     5   2  
#> 3     1   0.1

# Note that sample_n() would error if n was bigger than the group size
# slice_sample() will just use the available rows for consistency with
# the other slice helpers like slice_head()
try(sample_n(df, 10))
#> Error in sample_n(df, 10) : Can't compute indices.
#> Caused by error:
#> ! `size` must be less than or equal to 5 (size of data).
#> ℹ set `replace = TRUE` to use sampling with replacement.
slice_sample(df, n = 10)
#> # A tibble: 5 × 2
#>       x     w
#>   <int> <dbl>
#> 1     1   0.1
#> 2     3   0.1
#> 3     5   2  
#> 4     4   2  
#> 5     2   0.1

# sample_frac() -> slice_sample() -------------------------------------------
# Was:
sample_frac(df, 0.25)
#> # A tibble: 1 × 2
#>       x     w
#>   <int> <dbl>
#> 1     3   0.1
sample_frac(df, 2, replace = TRUE)
#> # A tibble: 10 × 2
#>        x     w
#>    <int> <dbl>
#>  1     5   2  
#>  2     5   2  
#>  3     2   0.1
#>  4     1   0.1
#>  5     3   0.1
#>  6     3   0.1
#>  7     1   0.1
#>  8     1   0.1
#>  9     5   2  
#> 10     5   2  

# Now:
slice_sample(df, prop = 0.25)
#> # A tibble: 1 × 2
#>       x     w
#>   <int> <dbl>
#> 1     4     2
slice_sample(df, prop = 2, replace = TRUE)
#> # A tibble: 10 × 2
#>        x     w
#>    <int> <dbl>
#>  1     4   2  
#>  2     1   0.1
#>  3     3   0.1
#>  4     3   0.1
#>  5     5   2  
#>  6     5   2  
#>  7     5   2  
#>  8     2   0.1
#>  9     2   0.1
#> 10     4   2

Source code: R/sample.R


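Note: This article was compiled by 純淨天空 from the original English work "Sample n rows from a table" by Hadley Wickham and others. Unless otherwise stated, copyright in the original code belongs to the original authors. Do not reproduce or copy this translation without permission or authorisation.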