R rsample slide-resampling Time-based Resampling

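These resampling functions are focused on various forms of time series resampling.

sliding_window() uses the row number when computing the resampling indices. It is independent of any time index, but is useful with completely regular series.
sliding_index() computes resampling indices relative to the index column. This is often a Date or POSIXct column, but doesn't have to be. This is useful when resampling irregular series, or when using irregular lookback periods such as lookback = lubridate::years(1) with daily data (where the number of days in a year may vary).
sliding_period() first breaks up the index into less granular groups based on period, and then uses those groups to construct the resampling indices. This is useful for constructing rolling monthly or yearly windows from daily data.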

Usage

sliding_window(
  data,
  ...,
  lookback = 0L,
  assess_start = 1L,
  assess_stop = 1L,
  complete = TRUE,
  step = 1L,
  skip = 0L
)

sliding_index(
  data,
  index,
  ...,
  lookback = 0L,
  assess_start = 1L,
  assess_stop = 1L,
  complete = TRUE,
  step = 1L,
  skip = 0L
)

sliding_period(
  data,
  index,
  period,
  ...,
  lookback = 0L,
  assess_start = 1L,
  assess_stop = 1L,
  complete = TRUE,
  step = 1L,
  skip = 0L,
  every = 1L,
  origin = NULL
)

Arguments

data

A data frame.

...

These dots are for future extensions and must be empty.

lookback

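The number of elements to look back from the current element when computing the resampling indices of the analysis set. The current row is always included in the analysis set.

For sliding_window(), a single integer defining the number of rows to look back from the current row.
For sliding_index(), a single object that will be subtracted from the index as index - lookback to define the boundary of where to start searching for rows to include in the current resample. This is often an integer value corresponding to the number of days to look back, or a lubridate Period object.
For sliding_period(), a single integer defining the number of groups to look back from the current group, where the groups are defined by breaking up the index according to the period.

In all cases, Inf is also allowed to force an expanding window.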
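
As a minimal sketch of the difference between a fixed and an infinite lookback, the following compares the analysis-set sizes that sliding_window() produces on a toy 10-row data frame. The data frame df and the helper analysis_sizes() are made up for illustration and are not part of rsample.

```r
library(rsample)

# Toy data (an assumption for illustration): ten regularly spaced rows.
df <- data.frame(x = 1:10)

# Hypothetical helper: number of rows in each analysis set of an rset.
analysis_sizes <- function(rset) {
  vapply(rset$splits, function(split) nrow(analysis(split)), integer(1))
}

# Fixed lookback: the current row plus the two rows before it.
fixed <- sliding_window(df, lookback = 2)

# Infinite lookback: every row from the first row up to the current one.
expanding <- sliding_window(df, lookback = Inf)

fixed_sizes <- analysis_sizes(fixed)
expanding_sizes <- analysis_sizes(expanding)
```

Here fixed_sizes should stay constant at 3, while expanding_sizes should grow by one row per slice.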

assess_start, assess_stop

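This combination of arguments determines how far into the future to look when constructing the assessment set. Together they construct a range of [index + assess_start, index + assess_stop] to search for rows to include in the assessment set.

Generally, assess_start will be 1, indicating that the first value that may be included in the assessment set starts one element after the current row. It can be increased to a larger value to create "gaps" between the analysis and assessment sets if you are worried about high levels of correlation in short-term forecasting.

For sliding_window(), these are both single integers defining the number of rows to look forward from the current row.
For sliding_index(), these are single objects that will be added to the index to compute the range to search for rows to include in the assessment set. This is often an integer value corresponding to the number of days to look forward, or a lubridate Period object.
For sliding_period(), these are both single integers defining the number of groups to look forward from the current group, where the groups are defined by breaking up the index according to the period.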
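
To make the "gap" idea concrete, here is a small sketch using sliding_window() on a toy data frame (df is an assumption for illustration). With assess_start = 3, the two rows directly after the current row are skipped before the assessment set begins.

```r
library(rsample)

# Toy data (an assumption for illustration): ten regularly spaced rows.
df <- data.frame(x = 1:10)

# Analysis set: the current row plus the 2 rows before it.
# Assessment set: rows [current + 3, current + 4], leaving a 2-row gap.
gapped <- sliding_window(df, lookback = 2, assess_start = 3, assess_stop = 4)

first <- gapped$splits[[1]]
analysis_x <- analysis(first)$x     # rows used for fitting
assessment_x <- assessment(first)$x # rows held out after the gap
```

For the first slice, analysis_x covers rows 1 to 3 and assessment_x covers rows 6 and 7, so rows 4 and 5 belong to neither set.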

complete

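A single logical. When using lookback to compute the analysis sets, should only complete windows be considered? If set to FALSE, partial windows will be used until it is possible to create a complete window (based on lookback). This is a way to use an expanding window up to a certain point, and then switch to a sliding window.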
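
The following sketch illustrates the switch from expanding to sliding windows when complete = FALSE. The toy data frame df is an assumption for illustration.

```r
library(rsample)

# Toy data (an assumption for illustration): ten regularly spaced rows.
df <- data.frame(x = 1:10)

# Allow partial analysis windows at the start of the series.
partial <- sliding_window(df, lookback = 2, complete = FALSE)

# Number of rows in each analysis set, in slice order.
partial_sizes <- vapply(
  partial$splits,
  function(split) nrow(analysis(split)),
  integer(1)
)
```

The first two analysis sets are partial (1 and 2 rows); from the third slice onward every analysis set has the full 3 rows.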

step

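A single positive integer. After computing the resampling indices, step is used to thin out the results by selecting every step-th result, subsetting the indices with seq(1L, n_indices, by = step). step is applied after skip. Note that step is independent of any time index used.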
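
The thinning expression can be tried on its own in base R. In this sketch, n_indices = 6 is an assumed number of candidate slices, chosen only for illustration.

```r
# Assumed for illustration: six candidate slices were computed.
n_indices <- 6L
step <- 3L

# Positions of the slices that survive thinning.
kept <- seq(1L, n_indices, by = step)
kept
#> [1] 1 4
```

Only the first and fourth candidate slices are kept, so six candidates become two resamples.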

skip

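A single positive integer, or zero. After computing the resampling indices, the first skip results will be dropped by subsetting the indices with seq(skip + 1L, n_indices). This can be especially useful when combined with lookback = Inf, which creates an expanding window starting from the first row. By skipping forward, you can drop the first few windows that have very few data points. skip is applied before step. Note that skip is independent of any time index used.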
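
As a sketch of how skip and step combine with an expanding window, the following drops the three smallest windows and then keeps every second remaining slice. The toy data frame df is an assumption for illustration.

```r
library(rsample)

# Toy data (an assumption for illustration): ten regularly spaced rows.
df <- data.frame(x = 1:10)

# Expanding window; drop the first 3 slices, then keep every 2nd one.
rset <- sliding_window(df, lookback = Inf, skip = 3, step = 2)

# Number of rows in each remaining analysis set.
kept_sizes <- vapply(
  rset$splits,
  function(split) nrow(analysis(split)),
  integer(1)
)
```

Without skip and step the analysis sets would have 1 to 9 rows. After skipping three slices and stepping by two, the remaining analysis sets have 4, 6, and 8 rows.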

index

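The index to compute resampling indices relative to, specified as a bare column name. This must be an existing column in data.

For sliding_index(), this is commonly a date vector, but is not required to be.
For sliding_period(), this is required to be a Date or POSIXct vector.

The index must be an increasing vector, but duplicate values are allowed. Additionally, the index cannot contain any missing values.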
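
A quick pre-flight check of these requirements can be written in base R. The helper index_is_valid() below is hypothetical and not part of rsample; it only mirrors the two stated conditions (no missing values, non-decreasing order).

```r
# Hypothetical helper (not part of rsample): TRUE when `index` has no missing
# values and never decreases. Duplicated values are allowed.
index_is_valid <- function(index) {
  !anyNA(index) && all(diff(as.numeric(index)) >= 0)
}

ok <- as.Date(c("2020-01-01", "2020-01-03", "2020-01-03", "2020-01-07"))
unsorted <- as.Date(c("2020-01-03", "2020-01-01"))
with_na <- as.Date(c("2020-01-01", NA))

ok_valid <- index_is_valid(ok)             # TRUE: increasing, with a duplicate
unsorted_valid <- index_is_valid(unsorted) # FALSE: decreasing
na_valid <- index_is_valid(with_na)        # FALSE: contains a missing value
```

If the check fails only because of ordering, sorting the data frame by the index column before resampling is one way to satisfy the requirement.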

period

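The period to group the index by. This is specified as a single string, such as "year" or "month". See the .period argument of slider::slide_period() for the full list of options and further explanation.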

every

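A single positive integer. The number of periods to group together.

For example, if the period was set to "year" with an every value of 2, then the years 1970 and 1971 would be placed in the same group.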
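
The grouping described above can be sketched with a toy daily series (df is an assumption for illustration). With the default origin, two-year groups are counted from 1970, so four years of data form two groups.

```r
library(rsample)

# Toy daily data (an assumption for illustration) covering 1970 through 1973.
df <- data.frame(
  date = seq(as.Date("1970-01-01"), as.Date("1973-12-31"), by = "day")
)

# Group the index into two-year blocks: {1970, 1971} and {1972, 1973}.
rset <- sliding_period(df, date, "year", every = 2)

split <- rset$splits[[1]]
n_analysis <- nrow(analysis(split))     # days in 1970 and 1971
n_assessment <- nrow(assessment(split)) # days in 1972 and 1973
```

There is a single slice: the analysis set holds the 730 days of 1970 and 1971, and the assessment set holds the 731 days of 1972 (a leap year) and 1973.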

origin

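The reference date time value. The default when left as NULL is the epoch time of 1970-01-01 00:00:00, in the time zone of the index.

This is generally used to define the anchor time to count from, which is relevant when the every value is > 1.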
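
To see why the anchor matters when every > 1, this sketch compares the default origin with an origin of 1971-01-01 on a toy daily series (df is an assumption for illustration).

```r
library(rsample)

# Toy daily data (an assumption for illustration) covering 1971 through 1974.
df <- data.frame(
  date = seq(as.Date("1971-01-01"), as.Date("1974-12-31"), by = "day")
)

# Default origin (1970-01-01): two-year groups are counted from 1970, so the
# data falls into the groups {1971}, {1972, 1973}, {1974}.
default_origin <- sliding_period(df, date, "year", every = 2)

# Anchoring at 1971-01-01 gives the groups {1971, 1972} and {1973, 1974}.
anchored <- sliding_period(
  df, date, "year",
  every = 2,
  origin = as.Date("1971-01-01")
)

n_default <- nrow(default_origin)
n_anchored <- nrow(anchored)
```

With the default origin the three groups yield two slices; with the shifted origin the two groups yield one slice, whose analysis set spans all of 1971 and 1972.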

See also

rolling_origin()

slider::slide(), slider::slide_index(), and slider::slide_period(), which power these resamplers.
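
As a rough illustration of the kind of window computation slider provides (not a reproduction of rsample's internals), the following slides over row positions directly. The vector rows is an assumption for illustration.

```r
library(slider)

# Row positions of a toy five-row data set (an assumption for illustration).
rows <- 1:5

# Analysis-style windows: the current row plus the two rows before it.
# Incomplete windows are returned as NULL.
analysis_rows <- slide(rows, identity, .before = 2, .complete = TRUE)

# Assessment-style windows: the single row after the current one. A negative
# `.before` shifts the start of the window forward.
assessment_rows <- slide(
  rows, identity,
  .before = -1, .after = 1, .complete = TRUE
)
```

Positions where either list holds NULL correspond to rows for which no complete analysis or assessment window exists.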

Examples

library(rsample)
library(vctrs)
#> 
#> Attaching package: ‘vctrs’
#> The following object is masked from ‘package:tibble’:
#> 
#>     data_frame
#> The following object is masked from ‘package:dplyr’:
#> 
#>     data_frame
library(tibble)
library(modeldata)
data("Chicago")

index <- new_date(c(1, 3, 4, 7, 8, 9, 13, 15, 16, 17))
df <- tibble(x = 1:10, index = index)
df
#> # A tibble: 10 × 2
#>        x index     
#>    <int> <date>    
#>  1     1 1970-01-02
#>  2     2 1970-01-04
#>  3     3 1970-01-05
#>  4     4 1970-01-08
#>  5     5 1970-01-09
#>  6     6 1970-01-10
#>  7     7 1970-01-14
#>  8     8 1970-01-16
#>  9     9 1970-01-17
#> 10    10 1970-01-18

# Look back two rows beyond the current row, for a total of three rows
# in each analysis set. Each assessment set is composed of the two rows after
# the current row.
sliding_window(df, lookback = 2, assess_stop = 2)
#> # Sliding window resampling 
#> # A tibble: 6 × 2
#>   splits        id    
#>   <list>        <chr> 
#> 1 <split [3/2]> Slice1
#> 2 <split [3/2]> Slice2
#> 3 <split [3/2]> Slice3
#> 4 <split [3/2]> Slice4
#> 5 <split [3/2]> Slice5
#> 6 <split [3/2]> Slice6

# Same as before, but step forward by 3 rows between each resampling slice,
# rather than just by 1.
rset <- sliding_window(df, lookback = 2, assess_stop = 2, step = 3)
rset
#> # Sliding window resampling 
#> # A tibble: 2 × 2
#>   splits        id    
#>   <list>        <chr> 
#> 1 <split [3/2]> Slice1
#> 2 <split [3/2]> Slice2

analysis(rset$splits[[1]])
#> # A tibble: 3 × 2
#>       x index     
#>   <int> <date>    
#> 1     1 1970-01-02
#> 2     2 1970-01-04
#> 3     3 1970-01-05
analysis(rset$splits[[2]])
#> # A tibble: 3 × 2
#>       x index     
#>   <int> <date>    
#> 1     4 1970-01-08
#> 2     5 1970-01-09
#> 3     6 1970-01-10

# Now slide relative to the `index` column in `df`. This time we look back
# 2 days from the current row's `index` value, and 2 days forward from
# it to construct the assessment set. Note that this series is irregular,
# so it produces different results than `sliding_window()`. Additionally,
# note that it is entirely possible for the assessment set to contain no
# data if you have a highly irregular series and "look forward" into a
# date range where no data points actually exist!
sliding_index(df, index, lookback = 2, assess_stop = 2)
#> # Sliding index resampling 
#> # A tibble: 7 × 2
#>   splits        id    
#>   <list>        <chr> 
#> 1 <split [2/1]> Slice1
#> 2 <split [2/0]> Slice2
#> 3 <split [1/2]> Slice3
#> 4 <split [2/1]> Slice4
#> 5 <split [3/0]> Slice5
#> 6 <split [1/1]> Slice6
#> 7 <split [2/2]> Slice7

# With `sliding_period()`, we can break up our date index into less granular
# chunks, and slide over them instead of the index directly. Here we'll use
# the Chicago data, which contains daily data spanning 16 years, and we'll
# break it up into rolling yearly chunks. Three years' worth of data will
# be used for the analysis set, and one year's worth of data will be held out
# for performance assessment.
sliding_period(
  Chicago,
  date,
  "year",
  lookback = 2,
  assess_stop = 1
)
#> # Sliding period resampling 
#> # A tibble: 13 × 2
#>    splits             id     
#>    <list>             <chr>  
#>  1 <split [1074/366]> Slice01
#>  2 <split [1096/365]> Slice02
#>  3 <split [1096/365]> Slice03
#>  4 <split [1096/365]> Slice04
#>  5 <split [1095/366]> Slice05
#>  6 <split [1096/365]> Slice06
#>  7 <split [1096/365]> Slice07
#>  8 <split [1096/365]> Slice08
#>  9 <split [1095/366]> Slice09
#> 10 <split [1096/365]> Slice10
#> 11 <split [1096/365]> Slice11
#> 12 <split [1096/365]> Slice12
#> 13 <split [1095/241]> Slice13

# Because `lookback = 2`, three years are required to form a "complete"
# window of data. To allow partial windows, set `complete = FALSE`.
# Here that first constructs two expanding windows until a complete three
# year window can be formed, at which point we switch to a sliding window.
sliding_period(
  Chicago,
  date,
  "year",
  lookback = 2,
  assess_stop = 1,
  complete = FALSE
)
#> # Sliding period resampling 
#> # A tibble: 15 × 2
#>    splits             id     
#>    <list>             <chr>  
#>  1 <split [344/365]>  Slice01
#>  2 <split [709/365]>  Slice02
#>  3 <split [1074/366]> Slice03
#>  4 <split [1096/365]> Slice04
#>  5 <split [1096/365]> Slice05
#>  6 <split [1096/365]> Slice06
#>  7 <split [1095/366]> Slice07
#>  8 <split [1096/365]> Slice08
#>  9 <split [1096/365]> Slice09
#> 10 <split [1096/365]> Slice10
#> 11 <split [1095/366]> Slice11
#> 12 <split [1096/365]> Slice12
#> 13 <split [1096/365]> Slice13
#> 14 <split [1096/365]> Slice14
#> 15 <split [1095/241]> Slice15

# Alternatively, you could break the resamples up by month. Here we'll
# use an expanding monthly window by setting `lookback = Inf`, and each
# assessment set will contain two months of data. To ensure that we have
# enough data to fit our models, we'll `skip` the first 4 expanding windows.
# Finally, to thin out the results, we'll `step` forward by 2 between
# each resample.
sliding_period(
  Chicago,
  date,
  "month",
  lookback = Inf,
  assess_stop = 2,
  skip = 4,
  step = 2
)
#> # Sliding period resampling 
#> # A tibble: 91 × 2
#>    splits           id     
#>    <list>           <chr>  
#>  1 <split [130/61]> Slice01
#>  2 <split [191/61]> Slice02
#>  3 <split [252/61]> Slice03
#>  4 <split [313/62]> Slice04
#>  5 <split [375/59]> Slice05
#>  6 <split [434/61]> Slice06
#>  7 <split [495/61]> Slice07
#>  8 <split [556/61]> Slice08
#>  9 <split [617/61]> Slice09
#> 10 <split [678/62]> Slice10
#> # ℹ 81 more rows

Source: R/slide.R

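Note: This article was selected and compiled by 纯净天空 from the original English work Time-based Resampling by Hannah Frick et al. Unless otherwise stated, copyright of the original code belongs to the original authors. Please do not reproduce or copy this translation without permission or authorization.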