R rsample group_vfold_cv: Group V-Fold Cross-Validation


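Group V-fold cross-validation creates splits of the data based on some grouping variable (which may have more than a single row associated with it). The function can create as many splits as there are unique values of the grouping variable, or it can create a smaller set of splits where more than one group is left out at a time. A common use of this kind of resampling is when you have repeated measures of the same subject.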
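
As an illustration of the repeated-measures case described above, the following is a minimal sketch. The `measurements` data frame and its columns (`subject`, `visit`, `value`) are hypothetical and not part of the original documentation. It checks that no subject ends up in both the analysis and the assessment set of the same split.

```r
library(rsample)

# Hypothetical repeated-measures data: 6 subjects, 4 measurements each.
set.seed(1)
measurements <- data.frame(
  subject = rep(paste0("s", 1:6), each = 4),
  visit   = rep(1:4, times = 6),
  value   = rnorm(24)
)

# Three folds, each holding out whole subjects.
folds <- group_vfold_cv(measurements, group = subject, v = 3)

# For every split, check that no subject appears in both the
# analysis set and the assessment set.
no_overlap <- vapply(folds$splits, function(split) {
  held_in  <- unique(analysis(split)$subject)
  held_out <- unique(assessment(split)$subject)
  length(intersect(held_in, held_out)) == 0
}, logical(1))

all(no_overlap)
#> [1] TRUE
```

Because whole subjects are held out together, performance estimates from these splits reflect prediction on subjects the model has not seen.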

Usage

group_vfold_cv(
  data,
  group = NULL,
  v = NULL,
  repeats = 1,
  balance = c("groups", "observations"),
  ...,
  strata = NULL,
  pool = 0.1
)

Arguments

data

A data frame.

group

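A variable in data (single character or name) used for grouping observations with the same value to either the analysis or assessment set within a fold.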

v

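The number of partitions of the data set. If left as NULL (the default), v will be set to the number of unique values in the grouping variable, creating "leave-one-group-out" splits.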

repeats

The number of times to repeat the V-fold partitioning.

balance

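If v is less than the number of unique groups, how should groups be combined into folds? Should be one of "groups", which will assign roughly the same number of groups to each fold, or "observations", which will assign roughly the same number of observations to each fold.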

...

These dots are for future extensions and must be empty.

strata

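A variable in data (single character or name) used to conduct stratified sampling. When not NULL, each resample is created within the stratification variable. Numeric strata are binned into quartiles.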

pool

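A proportion of data used to determine whether a particular group is too small and should be pooled into another group. We do not recommend decreasing this argument below its default of 0.1 because of the dangers of stratifying groups that are too small.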
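
The effect of the `v`, `balance`, and `repeats` arguments can be explored with a small sketch. The `toy` data frame, its columns `g` and `y`, and the helper `assessment_sizes()` are hypothetical names introduced only for illustration. The group sizes are deliberately unequal so that the two `balance` options can give different fold sizes.

```r
library(rsample)

# Hypothetical data: 8 groups of deliberately unequal size (36 rows in total).
set.seed(2)
sizes <- c(1, 2, 3, 4, 5, 6, 7, 8)
toy <- data.frame(
  g = rep(letters[1:8], times = sizes),
  y = rnorm(sum(sizes))
)

# v = NULL (the default): one fold per unique group value.
logo <- group_vfold_cv(toy, group = g)
nrow(logo) == length(unique(toy$g))

# v smaller than the number of groups: groups are combined into folds.
by_groups <- group_vfold_cv(toy, group = g, v = 4, balance = "groups")
by_obs    <- group_vfold_cv(toy, group = g, v = 4, balance = "observations")

# Hypothetical helper: number of assessment rows in each fold.
assessment_sizes <- function(rset) {
  vapply(rset$splits, function(s) nrow(assessment(s)), integer(1))
}
assessment_sizes(by_groups)
assessment_sizes(by_obs)

# repeats > 1 adds an `id2` column and yields v * repeats rows.
repeated <- group_vfold_cv(toy, group = g, v = 4, repeats = 2)
nrow(repeated)
names(repeated)
```

In either balancing mode every row is assessed exactly once per repeat, so the assessment sizes always sum to the number of rows in the data. Only the distribution of rows across folds differs.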

Value

A tibble with classes group_vfold_cv, rset, tbl_df, tbl, and data.frame. The results include a column for the data split objects and an identification variable.
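
To show how the returned object can be inspected, here is a brief sketch. The `toy` data frame and its columns are hypothetical. It checks the classes listed above and pulls the analysis and assessment data out of the first split.

```r
library(rsample)

# Hypothetical data: 4 groups of 5 rows each.
set.seed(3)
toy <- data.frame(
  g = rep(c("a", "b", "c", "d"), each = 5),
  y = rnorm(20)
)
folds <- group_vfold_cv(toy, group = g, v = 2)

# The result is a tibble that also carries the resampling classes.
class(folds)

# `splits` is a list column of split objects; `id` labels each resample.
first_split <- folds$splits[[1]]
analysis_rows   <- analysis(first_split)
assessment_rows <- assessment(first_split)

# Together, the two sets cover the original data exactly once.
nrow(analysis_rows) + nrow(assessment_rows) == nrow(toy)
#> [1] TRUE
```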

Examples

data(ames, package = "modeldata")

set.seed(123)
group_vfold_cv(ames, group = Neighborhood, v = 5)
#> # Group 5-fold cross-validation 
#> # A tibble: 5 × 2
#>   splits             id       
#>   <list>             <chr>    
#> 1 <split [2449/481]> Resample1
#> 2 <split [2642/288]> Resample2
#> 3 <split [2218/712]> Resample3
#> 4 <split [2367/563]> Resample4
#> 5 <split [2044/886]> Resample5
group_vfold_cv(
  ames,
  group = Neighborhood,
  v = 5,
  balance = "observations"
)
#> # Group 5-fold cross-validation 
#> # A tibble: 5 × 2
#>   splits             id       
#>   <list>             <chr>    
#> 1 <split [2366/564]> Resample1
#> 2 <split [2279/651]> Resample2
#> 3 <split [2361/569]> Resample3
#> 4 <split [2361/569]> Resample4
#> 5 <split [2353/577]> Resample5
group_vfold_cv(ames, group = Neighborhood, v = 5, repeats = 2)
#> # Group 5-fold cross-validation 
#> # A tibble: 10 × 3
#>    splits             id      id2      
#>    <list>             <chr>   <chr>    
#>  1 <split [2077/853]> Repeat1 Resample1
#>  2 <split [2215/715]> Repeat1 Resample2
#>  3 <split [2392/538]> Repeat1 Resample3
#>  4 <split [2574/356]> Repeat1 Resample4
#>  5 <split [2462/468]> Repeat1 Resample5
#>  6 <split [2269/661]> Repeat2 Resample1
#>  7 <split [2426/504]> Repeat2 Resample2
#>  8 <split [2354/576]> Repeat2 Resample3
#>  9 <split [2547/383]> Repeat2 Resample4
#> 10 <split [2124/806]> Repeat2 Resample5

# Leave-one-group-out CV
group_vfold_cv(ames, group = Neighborhood)
#> # Group 28-fold cross-validation 
#> # A tibble: 28 × 2
#>    splits             id        
#>    <list>             <chr>     
#>  1 <split [2663/267]> Resample01
#>  2 <split [2779/151]> Resample02
#>  3 <split [2691/239]> Resample03
#>  4 <split [2748/182]> Resample04
#>  5 <split [2928/2]>   Resample05
#>  6 <split [2920/10]>  Resample06
#>  7 <split [2902/28]>  Resample07
#>  8 <split [2837/93]>  Resample08
#>  9 <split [2859/71]>  Resample09
#> 10 <split [2858/72]>  Resample10
#> # ℹ 18 more rows

library(dplyr)
data(Sacramento, package = "modeldata")

city_strata <- Sacramento %>%
  group_by(city) %>%
  summarize(strata = mean(price)) %>%
  mutate(strata = cut(strata, quantile(strata), include.lowest = TRUE))

sacramento_data <- Sacramento %>%
  full_join(city_strata, by = "city")

group_vfold_cv(sacramento_data, city, strata = strata)
#> Warning: Leaving `v = NULL` while using stratification will set `v` to the number of groups present in the least common stratum.
#> ℹ Set `v` explicitly to override this warning.
#> # Group 14-fold cross-validation 
#> # A tibble: 14 × 2
#>    splits            id        
#>    <list>            <chr>     
#>  1 <split [881/51]>  Resample01
#>  2 <split [434/498]> Resample02
#>  3 <split [905/27]>  Resample03
#>  4 <split [913/19]>  Resample04
#>  5 <split [917/15]>  Resample05
#>  6 <split [885/47]>  Resample06
#>  7 <split [926/6]>   Resample07
#>  8 <split [793/139]> Resample08
#>  9 <split [903/29]>  Resample09
#> 10 <split [896/36]>  Resample10
#> 11 <split [904/28]>  Resample11
#> 12 <split [917/15]>  Resample12
#> 13 <split [855/77]>  Resample13
#> 14 <split [897/35]>  Resample14

Source: R/vfold.R

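Note: This article was selected and compiled by 纯净天空 from the original English work Group V-Fold Cross-Validation by Hannah Frick and others. Unless otherwise stated, copyright of the original code belongs to the original authors. Please do not reproduce or copy this translation without permission or authorization.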