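Group V-fold cross-validation creates splits of the data based on a grouping variable, which may have more than one row associated with each of its values. The function can create as many splits as there are unique values of the grouping variable, or it can create a smaller set of splits in which more than one group is left out at a time. A common use for this kind of resampling is when you have repeated measures of the same subject.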
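The repeated-measures case can be sketched as follows. This is a minimal illustration, not part of the original page: the `visits` data frame, its column names, and the `leaked` helper vector are hypothetical and exist only for the example. It checks that no subject ends up in both the analysis set and the assessment set of the same fold.

```r
library(rsample)

# Hypothetical repeated-measures data: 6 subjects, 4 visits each
set.seed(1)
visits <- data.frame(
  subject = rep(paste0("s", 1:6), each = 4),
  visit   = rep(1:4, times = 6),
  y       = rnorm(24)
)

# Three folds; whole subjects are held out together
folds <- group_vfold_cv(visits, group = subject, v = 3)

# For each fold, count subjects present on both sides of the split
leaked <- vapply(
  folds$splits,
  function(s) length(intersect(analysis(s)$subject, assessment(s)$subject)),
  integer(1)
)

# Grouped resampling keeps every subject on one side only
stopifnot(all(leaked == 0L))
```

Ordinary `vfold_cv()` on the same data would generally scatter a subject's visits across both sets, which is the leakage that grouping is meant to avoid.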
Usage
group_vfold_cv(
  data,
  group = NULL,
  v = NULL,
  repeats = 1,
  balance = c("groups", "observations"),
  ...,
  strata = NULL,
  pool = 0.1
)
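Arguments
- data: A data frame.
- group: A variable in data (single character or name) used to group observations with the same value into either the analysis or the assessment set within a fold.
- v: The number of partitions of the data set. If left as NULL (the default), v is set to the number of unique values in the grouping variable, creating "leave-one-group-out" splits.
- repeats: The number of times to repeat the V-fold partitioning.
- balance: If v is less than the number of unique groups, how should groups be combined into folds? Must be one of "groups", which assigns roughly the same number of groups to each fold, or "observations", which assigns roughly the same number of observations to each fold.
- ...: These dots are for future extensions and must be empty.
- strata: A variable in data (single character or name) used to conduct stratified sampling. When not NULL, each resample is created within the stratification variable. Numeric strata are binned into quartiles.
- pool: A proportion of the data used to determine whether a particular group is too small and should be pooled into another group. We do not recommend lowering this argument below its default of 0.1 because of the dangers of stratifying groups that are too small.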
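To make the v and balance arguments concrete, here is a small sketch on deliberately unbalanced groups. The `toy` data frame and the `summarise_folds()` helper are hypothetical names introduced for illustration only. The helper reports how many groups and rows land in each assessment set, so the two balance settings can be compared side by side.

```r
library(rsample)

# Hypothetical data: 8 groups with sizes 1, 2, ..., 8 (36 rows in total)
toy <- data.frame(
  g = rep(letters[1:8], times = 1:8),
  x = seq_len(36)
)

set.seed(42)
by_groups <- group_vfold_cv(toy, group = g, v = 4, balance = "groups")
set.seed(42)
by_obs <- group_vfold_cv(toy, group = g, v = 4, balance = "observations")

# Illustrative helper: groups and rows held out in each fold
summarise_folds <- function(rset) {
  data.frame(
    id       = rset$id,
    n_groups = vapply(rset$splits,
                      function(s) length(unique(assessment(s)$g)),
                      integer(1)),
    n_rows   = vapply(rset$splits,
                      function(s) nrow(assessment(s)),
                      integer(1))
  )
}

summarise_folds(by_groups)  # similar group counts, uneven row counts
summarise_folds(by_obs)     # row counts evened out where possible

# Leaving v = NULL gives one fold per group (leave-one-group-out)
logo <- group_vfold_cv(toy, group = g)
stopifnot(nrow(logo) == 8L)
```

Which setting is preferable depends on whether the downstream metric is more sensitive to the number of held-out groups or to the number of held-out rows.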
Examples
data(ames, package = "modeldata")
set.seed(123)
group_vfold_cv(ames, group = Neighborhood, v = 5)
#> # Group 5-fold cross-validation
#> # A tibble: 5 × 2
#> splits id
#> <list> <chr>
#> 1 <split [2449/481]> Resample1
#> 2 <split [2642/288]> Resample2
#> 3 <split [2218/712]> Resample3
#> 4 <split [2367/563]> Resample4
#> 5 <split [2044/886]> Resample5
group_vfold_cv(
  ames,
  group = Neighborhood,
  v = 5,
  balance = "observations"
)
#> # Group 5-fold cross-validation
#> # A tibble: 5 × 2
#> splits id
#> <list> <chr>
#> 1 <split [2366/564]> Resample1
#> 2 <split [2279/651]> Resample2
#> 3 <split [2361/569]> Resample3
#> 4 <split [2361/569]> Resample4
#> 5 <split [2353/577]> Resample5
group_vfold_cv(ames, group = Neighborhood, v = 5, repeats = 2)
#> # Group 5-fold cross-validation
#> # A tibble: 10 × 3
#> splits id id2
#> <list> <chr> <chr>
#> 1 <split [2077/853]> Repeat1 Resample1
#> 2 <split [2215/715]> Repeat1 Resample2
#> 3 <split [2392/538]> Repeat1 Resample3
#> 4 <split [2574/356]> Repeat1 Resample4
#> 5 <split [2462/468]> Repeat1 Resample5
#> 6 <split [2269/661]> Repeat2 Resample1
#> 7 <split [2426/504]> Repeat2 Resample2
#> 8 <split [2354/576]> Repeat2 Resample3
#> 9 <split [2547/383]> Repeat2 Resample4
#> 10 <split [2124/806]> Repeat2 Resample5
# Leave-one-group-out CV
group_vfold_cv(ames, group = Neighborhood)
#> # Group 28-fold cross-validation
#> # A tibble: 28 × 2
#> splits id
#> <list> <chr>
#> 1 <split [2663/267]> Resample01
#> 2 <split [2779/151]> Resample02
#> 3 <split [2691/239]> Resample03
#> 4 <split [2748/182]> Resample04
#> 5 <split [2928/2]> Resample05
#> 6 <split [2920/10]> Resample06
#> 7 <split [2902/28]> Resample07
#> 8 <split [2837/93]> Resample08
#> 9 <split [2859/71]> Resample09
#> 10 <split [2858/72]> Resample10
#> # ℹ 18 more rows
library(dplyr)
data(Sacramento, package = "modeldata")
# Bin cities into strata by quartile of their mean price.
# reframe() replaces the multi-row summarize() call that dplyr 1.1.0 deprecated.
city_strata <- Sacramento %>%
  group_by(city) %>%
  summarize(strata = mean(price)) %>%
  reframe(city = city,
          strata = cut(strata, quantile(strata), include.lowest = TRUE))
sacramento_data <- Sacramento %>%
  full_join(city_strata, by = "city")
group_vfold_cv(sacramento_data, city, strata = strata)
#> Warning: Leaving `v = NULL` while using stratification will set `v` to the number of groups present in the least common stratum.
#> ℹ Set `v` explicitly to override this warning.
#> # Group 14-fold cross-validation
#> # A tibble: 14 × 2
#> splits id
#> <list> <chr>
#> 1 <split [881/51]> Resample01
#> 2 <split [434/498]> Resample02
#> 3 <split [905/27]> Resample03
#> 4 <split [913/19]> Resample04
#> 5 <split [917/15]> Resample05
#> 6 <split [885/47]> Resample06
#> 7 <split [926/6]> Resample07
#> 8 <split [793/139]> Resample08
#> 9 <split [903/29]> Resample09
#> 10 <split [896/36]> Resample10
#> 11 <split [904/28]> Resample11
#> 12 <split [917/15]> Resample12
#> 13 <split [855/77]> Resample13
#> 14 <split [897/35]> Resample14
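Note: This article was selected and compiled by 純淨天空 from the original English work "Group V-Fold Cross-Validation" by Hannah Frick and others. Unless otherwise stated, copyright in the original code belongs to the original authors. Do not reproduce or copy this translation without permission or authorization.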