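R recipes step_window: Moving Window Functions

step_window() creates a specification of a recipe step that will create new columns that are the results of functions that compute statistics across moving windows.

Usage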

step_window(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  size = 3,
  na_rm = TRUE,
  statistic = "mean",
  columns = NULL,
  names = NULL,
  keep_original_cols = TRUE,
  skip = FALSE,
  id = rand_id("window")
)

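Arguments

recipe: A recipe object. The step will be added to the sequence of operations for this recipe.
...: One or more selector functions to choose variables for this step. See selections() for more details.
role: For model terms created by this step, what analysis role should they be assigned? If names is left as NULL, the rolling statistics replace the original columns and the roles are left unchanged. If names is set, the new columns will have a role of NULL unless this argument has a value.
trained: A logical indicating whether the quantities for preprocessing have been estimated.
size: An odd integer >= 3 giving the window size.
na_rm: A logical for whether missing values should be removed from the calculations within each window.
statistic: A character string for the type of statistic that should be calculated for each moving window. Possible values are: 'max', 'mean', 'median', 'min', 'prod', 'sd', 'sum', 'var'.
columns: A character string of the selected variable names. This field is a placeholder and will be populated once prep() is used.
names: An optional character string with the same length as the number of terms selected by the selectors in `...`. If you are not sure which columns will be selected, use the summary function on the recipe. These will be the names of the new columns created by the step.
keep_original_cols: A logical to keep the original variables in the output. Defaults to TRUE.
skip: A logical. Should the step be skipped when the recipe is baked by bake()? While all operations are baked when prep() is run, some operations may not be able to be conducted on new data (e.g. processing the outcome variable(s)). Care should be taken when using skip = TRUE as it may affect the computations for subsequent operations.
id: A character string that is unique to this step to identify it.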
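
Because names must have one entry per selected column, it can help to inspect the recipe's variables before adding the step. The following is a minimal sketch, assuming the recipes package is installed; the toy data frame toy_dat and the name prefix "med_7pt_" are made up for illustration.

```r
library(recipes)

# Hypothetical toy data, used only to illustrate the check
toy_dat <- data.frame(x1 = 1:10, y1 = rnorm(10), y2 = rnorm(10))

rec <- recipe(y1 + y2 ~ x1, data = toy_dat)

# summary() on a recipe returns one row per variable, including its role
info <- summary(rec)

# Columns that a selector such as starts_with("y") would pick up
selected <- info$variable[startsWith(info$variable, "y")]

# Build one new name per selected column
new_names <- paste0("med_7pt_", seq_along(selected))
```

The resulting new_names vector could then be passed as names = new_names to step_window().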

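Value

An updated version of recipe with the new step added to the sequence of any existing operations.

Details

The calculations use a somewhat atypical method for handling the beginning and end parts of the rolling statistics. The process starts with the center-justified window calculations, and the beginning and ending parts of the rolling values are determined using the first and last rolling values, respectively. For example, if a column x with 12 values is smoothed with a 5-point moving median, the first three smoothed values are estimated by median(x[1:5]) and the fourth uses median(x[2:6]).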
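
The edge handling described above can be sketched in base R. This is an illustrative reimplementation under the stated rules, not the code recipes itself runs; the helper roll_centered() and the sample vector are hypothetical.

```r
# Hypothetical helper (not part of recipes): centered rolling statistic
# whose edges are filled with the first and last complete-window values.
roll_centered <- function(x, size = 3, f = mean) {
  stopifnot(size >= 3, size %% 2 == 1, length(x) >= size)
  n <- length(x)
  half <- (size - 1) / 2
  out <- rep(NA_real_, n)

  # Center-justified windows exist only for positions half + 1 .. n - half
  for (i in (half + 1):(n - half)) {
    out[i] <- f(x[(i - half):(i + half)])
  }

  # Fill the leading and trailing positions with the nearest rolling value
  out[seq_len(half)] <- out[half + 1]
  out[(n - half + 1):n] <- out[n - half]
  out
}

# Twelve made-up values, smoothed with a 5-point moving median
x <- c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8)
sm <- roll_centered(x, size = 5, f = median)
```

With size = 5, positions 1 to 3 of sm all equal median(x[1:5]), position 4 equals median(x[2:6]), and positions 10 to 12 all equal median(x[8:12]).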

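keep_original_cols also applies to this step if names is specified.

If a package required by this step is not installed, the step will stop with a note about installing the package.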

Tidying

When you tidy() this step, a tibble is returned with columns terms (the selectors or variables selected), statistic (the summary function name), and size.

Tuning Parameters

This step has 2 tuning parameters:

statistic: Rolling Summary Statistic (type: character, default: mean)
size: Window Size (type: integer, default: 3)

Case weights

The underlying operation does not allow for case weights.

Examples

if (FALSE) { # rlang::is_installed(c("RcppML", "ggplot2"))
library(recipes)
library(dplyr)
library(rlang)
library(ggplot2, quietly = TRUE)

set.seed(5522)
sim_dat <- data.frame(x1 = (20:100) / 10)
n <- nrow(sim_dat)
sim_dat$y1 <- sin(sim_dat$x1) + rnorm(n, sd = 0.1)
sim_dat$y2 <- cos(sim_dat$x1) + rnorm(n, sd = 0.1)
sim_dat$x2 <- runif(n)
sim_dat$x3 <- rnorm(n)

rec <- recipe(y1 + y2 ~ x1 + x2 + x3, data = sim_dat) %>%
  step_window(starts_with("y"),
    size = 7, statistic = "median",
    names = paste0("med_7pt_", 1:2),
    role = "outcome"
  ) %>%
  step_window(starts_with("y"),
    names = paste0("mean_3pt_", 1:2),
    role = "outcome"
  )
rec <- prep(rec, training = sim_dat)

smoothed_dat <- bake(rec, sim_dat, everything())

ggplot(data = sim_dat, aes(x = x1, y = y1)) +
  geom_point() +
  geom_line(data = smoothed_dat, aes(y = med_7pt_1)) +
  geom_line(data = smoothed_dat, aes(y = mean_3pt_1), col = "red") +
  theme_bw()

tidy(rec, number = 1)
tidy(rec, number = 2)

# If you want to replace the selected variables with the rolling statistic
# don't set `names`
sim_dat$original <- sim_dat$y1
rec <- recipe(y1 + y2 + original ~ x1 + x2 + x3, data = sim_dat) %>%
  step_window(starts_with("y"))
rec <- prep(rec, training = sim_dat)
smoothed_dat <- bake(rec, sim_dat, everything())
ggplot(smoothed_dat, aes(x = original, y = y1)) +
  geom_point() +
  theme_bw()
}

Source: R/window.R

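Note: This article was selected and compiled by 纯净天空 from the original English work Moving Window Functions by Max Kuhn et al. Unless otherwise stated, copyright in the original code belongs to the original authors. Do not reproduce or copy this translation without permission or authorization.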