R rsample slide-resampling Time-based Resampling

These resampling functions focus on various forms of time series resampling.

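sliding_window() uses the row number when computing the resampling indices. It is independent of any time index, but is useful with completely regular series.
sliding_index() computes resampling indices relative to the index column. This is often a Date or POSIXct column, but it does not have to be. This is useful when resampling irregular series, or when using irregular lookback periods such as lookback = lubridate::years(1) with daily data (where the number of days in a year may vary).
sliding_period() first breaks up the index into less granular groups based on period, and then uses those groups to construct the resampling indices. This is useful for constructing rolling monthly or yearly windows from daily data.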
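To see why a calendar-aware lookback differs from a fixed number of days, the following minimal sketch subtracts lubridate::years(1) from two dates. The dates are arbitrary illustrations and are not taken from the package documentation.

```r
library(lubridate)

# Two arbitrary reference dates chosen for illustration.
d1 <- as.Date("2021-01-01")
d2 <- as.Date("2022-01-01")

# Subtracting a one-year Period lands on the same calendar day a year earlier.
start1 <- d1 - years(1)  # 2020-01-01
start2 <- d2 - years(1)  # 2021-01-01

# The span in days depends on whether a leap day falls inside it.
days1 <- as.numeric(d1 - start1)  # 366, because 2020 is a leap year
days2 <- as.numeric(d2 - start2)  # 365
```

A fixed lookback of 365 rows or days would miss one day in the first case, which is the situation sliding_index() is meant to handle.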

Usage

sliding_window(
  data,
  ...,
  lookback = 0L,
  assess_start = 1L,
  assess_stop = 1L,
  complete = TRUE,
  step = 1L,
  skip = 0L
)

sliding_index(
  data,
  index,
  ...,
  lookback = 0L,
  assess_start = 1L,
  assess_stop = 1L,
  complete = TRUE,
  step = 1L,
  skip = 0L
)

sliding_period(
  data,
  index,
  period,
  ...,
  lookback = 0L,
  assess_start = 1L,
  assess_stop = 1L,
  complete = TRUE,
  step = 1L,
  skip = 0L,
  every = 1L,
  origin = NULL
)

Arguments

data

A data frame.

...

These dots are for future extensions and must be empty.

lookback

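The number of elements to look back from the current element when computing the resampling indices of the analysis set. The current row is always included in the analysis set.

For sliding_window(), a single integer defining the number of rows to look back from the current row.
For sliding_index(), a single object that is subtracted from the index as index - lookback to define the boundary where the search for rows to include in the current resample starts. This is often an integer value corresponding to the number of days to look back, or a lubridate Period object.
For sliding_period(), a single integer defining the number of groups to look back from the current group, where the groups are defined by breaking up the index according to period.

In all cases, Inf is also allowed, which forces an expanding window.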
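As a minimal sketch of the difference between a fixed and an expanding lookback, the code below uses a six-row toy data frame (an assumption for illustration) and counts the rows in each analysis set. The helper n_analysis() is hypothetical and not part of rsample.

```r
library(rsample)

# Toy data frame with six rows (illustrative only).
df <- data.frame(x = 1:6)

# Fixed window: the current row plus the two rows before it.
fixed <- sliding_window(df, lookback = 2)

# Expanding window: every row from the first row up to the current row.
expanding <- sliding_window(df, lookback = Inf)

# Hypothetical helper: number of rows in each analysis set.
n_analysis <- function(rset) {
  vapply(rset$splits, function(s) nrow(analysis(s)), integer(1))
}

n_analysis(fixed)      # every analysis set has the same size
n_analysis(expanding)  # analysis sets grow by one row per slice
```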

assess_start, assess_stop

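Together, these arguments determine how far into the future to look when constructing the assessment set. They define the range [index + assess_start, index + assess_stop], which is searched for rows to include in the assessment set.

Generally, assess_start is 1, indicating that the first value that can be included in the assessment set is one element after the current row. It can be increased to a larger value to create "gaps" between the analysis and assessment sets if you are worried about high levels of correlation in short-term forecasting.

For sliding_window(), these are both single integers defining the number of rows to look forward from the current row.
For sliding_index(), these are single objects that are added to the index to compute the range searched for rows to include in the assessment set. This is often an integer value corresponding to the number of days to look forward, or a lubridate Period object.
For sliding_period(), these are both single integers defining the number of groups to look forward from the current group, where the groups are defined by breaking up the index according to period.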
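The gap described above can be sketched with sliding_window() on a ten-row toy data frame (an assumption for illustration). With assess_start = 3 and assess_stop = 4, the assessment set starts three rows after the current row, so the two rows in between are used by neither set.

```r
library(rsample)

# Toy data frame with ten rows (illustrative only).
df <- data.frame(x = 1:10)

# Analysis set: current row plus two rows back.
# Assessment set: rows 3 to 4 positions ahead of the current row.
rset <- sliding_window(df, lookback = 2, assess_start = 3, assess_stop = 4)

first <- rset$splits[[1]]
analysis(first)$x    # rows 1, 2, 3
assessment(first)$x  # rows 6, 7; rows 4 and 5 form the gap
```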

complete

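A single logical. When using lookback to compute the analysis sets, should only complete windows be considered? If set to FALSE, partial windows are used until a complete window (based on lookback) can be created. This is a way to use an expanding window up to a certain point, and then switch to a sliding window.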
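A minimal sketch of partial windows, assuming a six-row toy data frame: with lookback = 2 and complete = FALSE, the first slices have fewer than three analysis rows before the window reaches its full size.

```r
library(rsample)

# Toy data frame with six rows (illustrative only).
df <- data.frame(x = 1:6)

# Allow partial windows at the start of the series.
partial <- sliding_window(df, lookback = 2, complete = FALSE)

# Number of analysis rows per slice: grows to 3, then stays at 3.
sizes <- vapply(partial$splits, function(s) nrow(analysis(s)), integer(1))
sizes
```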

step

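A single positive integer. After the resampling indices are computed, step is used to thin out the results by selecting every step-th result, subsetting the indices with seq(1L, n_indices, by = step). step is applied after skip. Note that step is independent of any time index used.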

skip

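A single positive integer, or zero. After the resampling indices are computed, the first skip results are dropped by subsetting the indices with seq(skip + 1L, n_indices). This can be especially useful when combined with lookback = Inf, which creates an expanding window starting from the first row. By skipping forward, you can drop the first few windows, which have very few data points. skip is applied before step. Note that skip is independent of any time index used.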
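The order of operations (skip first, then step) can be illustrated with plain base R, using the two seq() expressions quoted in the argument descriptions. The values of n_indices, skip, and step are made up for illustration, and this is a sketch of the arithmetic rather than the package's internal code.

```r
# Suppose 10 resamples were computed before thinning (illustrative value).
n_indices <- 10L
skip <- 4L
step <- 2L

# 1. Drop the first `skip` results.
after_skip <- seq(skip + 1L, n_indices)  # 5 6 7 8 9 10

# 2. Keep every `step`-th result of what remains.
kept <- after_skip[seq(1L, length(after_skip), by = step)]  # 5 7 9
kept
```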

index

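The index to compute resampling indices relative to, specified as a bare column name. This must be an existing column in data.

For sliding_index(), this is commonly a date vector, but that is not required.
For sliding_period(), this must be a Date or POSIXct vector.

The index must be an increasing vector, but duplicate values are allowed. Additionally, the index cannot contain any missing values.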
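Before resampling, it can help to confirm that a candidate index column meets these requirements. The helper below is hypothetical (it is not part of rsample and does not reproduce the package's own validation); it is a base R sketch of the two stated conditions.

```r
# Hypothetical helper, not part of rsample: TRUE when `index` has no missing
# values and never decreases (duplicate values are fine).
index_is_valid <- function(index) {
  !anyNA(index) && all(diff(as.numeric(index)) >= 0)
}

index_is_valid(as.Date(c("2020-01-01", "2020-01-01", "2020-01-03")))  # TRUE
index_is_valid(as.Date(c("2020-01-03", "2020-01-01")))                # FALSE
index_is_valid(as.Date(c("2020-01-01", NA)))                          # FALSE
```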

period

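The period to group the index by. This is specified as a single string, such as "year" or "month". See the .period argument of slider::slide_period() for the full list of options and further explanation.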

every

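A single positive integer. The number of periods to group together.

For example, if period is set to "year" with an every value of 2, then the years 1970 and 1971 are placed in the same group.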

origin

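The reference date time value. The default when left as NULL is the epoch time of 1970-01-01 00:00:00, in the time zone of the index.

This is generally used to define the anchor time to count from, which is relevant when the every value is > 1.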
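How every and origin interact can be sketched with integer arithmetic on calendar years. The helper group_of() is hypothetical and only illustrates the counting; it is an assumption about the grouping logic, not the code that slider or rsample actually runs.

```r
years <- 1970:1975
every <- 2L

# Hypothetical helper for illustration: group number of each year when
# counting `every`-year blocks starting from `origin_year`.
group_of <- function(year, origin_year) (year - origin_year) %/% every

# Anchored at 1970 (the default epoch year): 1970 and 1971 share a group.
group_of(years, 1970L)  # 0 0 1 1 2 2

# Anchored at 1971: the blocks shift, so 1971 and 1972 share a group.
group_of(years, 1971L)  # -1 0 0 1 1 2
```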

See also

rolling_origin()

slider::slide(), slider::slide_index(), and slider::slide_period(), which power these resamplers.

Examples

library(rsample)
library(vctrs)
#> 
#> Attaching package: ‘vctrs’
#> The following object is masked from ‘package:tibble’:
#> 
#>     data_frame
#> The following object is masked from ‘package:dplyr’:
#> 
#>     data_frame
library(tibble)
library(modeldata)
data("Chicago")

index <- new_date(c(1, 3, 4, 7, 8, 9, 13, 15, 16, 17))
df <- tibble(x = 1:10, index = index)
df
#> # A tibble: 10 × 2
#>        x index     
#>    <int> <date>    
#>  1     1 1970-01-02
#>  2     2 1970-01-04
#>  3     3 1970-01-05
#>  4     4 1970-01-08
#>  5     5 1970-01-09
#>  6     6 1970-01-10
#>  7     7 1970-01-14
#>  8     8 1970-01-16
#>  9     9 1970-01-17
#> 10    10 1970-01-18

# Look back two rows beyond the current row, for a total of three rows
# in each analysis set. Each assessment set is composed of the two rows after
# the current row.
sliding_window(df, lookback = 2, assess_stop = 2)
#> # Sliding window resampling 
#> # A tibble: 6 × 2
#>   splits        id    
#>   <list>        <chr> 
#> 1 <split [3/2]> Slice1
#> 2 <split [3/2]> Slice2
#> 3 <split [3/2]> Slice3
#> 4 <split [3/2]> Slice4
#> 5 <split [3/2]> Slice5
#> 6 <split [3/2]> Slice6

# Same as before, but step forward by 3 rows between each resampling slice,
# rather than just by 1.
rset <- sliding_window(df, lookback = 2, assess_stop = 2, step = 3)
rset
#> # Sliding window resampling 
#> # A tibble: 2 × 2
#>   splits        id    
#>   <list>        <chr> 
#> 1 <split [3/2]> Slice1
#> 2 <split [3/2]> Slice2

analysis(rset$splits[[1]])
#> # A tibble: 3 × 2
#>       x index     
#>   <int> <date>    
#> 1     1 1970-01-02
#> 2     2 1970-01-04
#> 3     3 1970-01-05
analysis(rset$splits[[2]])
#> # A tibble: 3 × 2
#>       x index     
#>   <int> <date>    
#> 1     4 1970-01-08
#> 2     5 1970-01-09
#> 3     6 1970-01-10

# Now slide relative to the `index` column in `df`. This time we look back
# 2 days from the current row's `index` value, and 2 days forward from
# it to construct the assessment set. Note that this series is irregular,
# so it produces different results than `sliding_window()`. Additionally,
# note that it is entirely possible for the assessment set to contain no
# data if you have a highly irregular series and "look forward" into a
# date range where no data points actually exist!
sliding_index(df, index, lookback = 2, assess_stop = 2)
#> # Sliding index resampling 
#> # A tibble: 7 × 2
#>   splits        id    
#>   <list>        <chr> 
#> 1 <split [2/1]> Slice1
#> 2 <split [2/0]> Slice2
#> 3 <split [1/2]> Slice3
#> 4 <split [2/1]> Slice4
#> 5 <split [3/0]> Slice5
#> 6 <split [1/1]> Slice6
#> 7 <split [2/2]> Slice7

# With `sliding_period()`, we can break up our date index into less granular
# chunks, and slide over them instead of the index directly. Here we'll use
# the Chicago data, which contains daily data spanning 16 years, and we'll
# break it up into rolling yearly chunks. Three years worth of data will
# be used for the analysis set, and one years worth of data will be held out
# for performance assessment.
sliding_period(
  Chicago,
  date,
  "year",
  lookback = 2,
  assess_stop = 1
)
#> # Sliding period resampling 
#> # A tibble: 13 × 2
#>    splits             id     
#>    <list>             <chr>  
#>  1 <split [1074/366]> Slice01
#>  2 <split [1096/365]> Slice02
#>  3 <split [1096/365]> Slice03
#>  4 <split [1096/365]> Slice04
#>  5 <split [1095/366]> Slice05
#>  6 <split [1096/365]> Slice06
#>  7 <split [1096/365]> Slice07
#>  8 <split [1096/365]> Slice08
#>  9 <split [1095/366]> Slice09
#> 10 <split [1096/365]> Slice10
#> 11 <split [1096/365]> Slice11
#> 12 <split [1096/365]> Slice12
#> 13 <split [1095/241]> Slice13

# Because `lookback = 2`, three years are required to form a "complete"
# window of data. To allow partial windows, set `complete = FALSE`.
# Here that first constructs two expanding windows until a complete three
# year window can be formed, at which point we switch to a sliding window.
sliding_period(
  Chicago,
  date,
  "year",
  lookback = 2,
  assess_stop = 1,
  complete = FALSE
)
#> # Sliding period resampling 
#> # A tibble: 15 × 2
#>    splits             id     
#>    <list>             <chr>  
#>  1 <split [344/365]>  Slice01
#>  2 <split [709/365]>  Slice02
#>  3 <split [1074/366]> Slice03
#>  4 <split [1096/365]> Slice04
#>  5 <split [1096/365]> Slice05
#>  6 <split [1096/365]> Slice06
#>  7 <split [1095/366]> Slice07
#>  8 <split [1096/365]> Slice08
#>  9 <split [1096/365]> Slice09
#> 10 <split [1096/365]> Slice10
#> 11 <split [1095/366]> Slice11
#> 12 <split [1096/365]> Slice12
#> 13 <split [1096/365]> Slice13
#> 14 <split [1096/365]> Slice14
#> 15 <split [1095/241]> Slice15

# Alternatively, you could break the resamples up by month. Here we'll
# use an expanding monthly window by setting `lookback = Inf`, and each
# assessment set will contain two months of data. To ensure that we have
# enough data to fit our models, we'll `skip` the first 4 expanding windows.
# Finally, to thin out the results, we'll `step` forward by 2 between
# each resample.
sliding_period(
  Chicago,
  date,
  "month",
  lookback = Inf,
  assess_stop = 2,
  skip = 4,
  step = 2
)
#> # Sliding period resampling 
#> # A tibble: 91 × 2
#>    splits           id     
#>    <list>           <chr>  
#>  1 <split [130/61]> Slice01
#>  2 <split [191/61]> Slice02
#>  3 <split [252/61]> Slice03
#>  4 <split [313/62]> Slice04
#>  5 <split [375/59]> Slice05
#>  6 <split [434/61]> Slice06
#>  7 <split [495/61]> Slice07
#>  8 <split [556/61]> Slice08
#>  9 <split [617/61]> Slice09
#> 10 <split [678/62]> Slice10
#> # ℹ 81 more rows

Source: R/slide.R


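Note: This article was selected and compiled by 純淨天空 from the original English work Time-based Resampling by Hannah Frick et al. Unless otherwise stated, copyright of the original code belongs to the original authors. Do not reproduce or copy this translation without permission or authorization.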