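R recipes recipe: Create a recipe for preprocessing data

A recipe is a description of the steps to be applied to a data set in order to prepare it for data analysis.

Usage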

recipe(x, ...)

# S3 method for default
recipe(x, ...)

# S3 method for data.frame
recipe(x, formula = NULL, ..., vars = NULL, roles = NULL)

# S3 method for formula
recipe(formula, data, ...)

# S3 method for matrix
recipe(x, ...)

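Arguments

x, data: A data frame or tibble of the template data set (see below).
...: Further arguments passed to or from other methods (not currently used).
formula: A model formula. In-line functions (e.g. log(x), x:y) should not be used here, and minus signs are not allowed. These types of transformations should be carried out with the step functions in this package. Dots are allowed, as are simple multivariate outcome terms (i.e. cbind is not needed; see Examples). Because of memory issues, a model formula may not be the best choice for high-dimensional data with many columns.
vars: A character vector of column names corresponding to variables that will be used in any context (see below).
roles: A character vector (the same length as vars) giving the single role that each variable will take. The value can be anything, but common roles are "outcome", "predictor", "case_weight", or "ID".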
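The vars and roles arguments of the data frame method can be illustrated with a minimal sketch. It assumes the built-in mtcars data set as a stand-in for the template data; the choice of columns and roles is purely illustrative.

```r
library(recipes)

# Assumption: mtcars stands in for the template data set.
rec <- recipe(
  mtcars,
  vars  = c("mpg", "wt", "hp"),
  roles = c("outcome", "predictor", "predictor")
)

# summary() lists each variable together with its type and role.
summary(rec)
```

Each entry in roles is matched by position to the corresponding entry in vars, which is why the two vectors must have the same length.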

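Value

An object of class recipe with sub-objects:

var_info: A tibble containing information about the original data set columns.
term_info: A tibble containing the current set of terms in the data set. This initially defaults to the same data contained in var_info.
steps: A list of step or check objects that define the sequence of preprocessing operations that will be applied to the data. The default value is NULL.
template: A tibble of the data. This is initialized to be the same as the data given in the data argument but can be different after the recipe is trained.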
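The sub-objects can be inspected directly on a freshly created recipe. The following is a small sketch, again assuming mtcars as illustrative data rather than anything used on this page.

```r
library(recipes)

# Assumption: mtcars is used only as convenient example data.
rec <- recipe(mpg ~ wt + hp, data = mtcars)

class(rec)          # includes "recipe"
rec$var_info        # one row per original column
rec$term_info       # starts out with the same information as var_info
is.null(rec$steps)  # no steps or checks have been added yet
rec$template        # tibble of the data supplied to recipe()
```

Accessing these elements with `$` is handy for exploration; for regular use, summary() and tidy() are the documented ways to look inside a recipe.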

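Details

Defining recipes

Variables in recipes can have any type of role, including outcome, predictor, observation ID, case weights, stratification variables, etc.

recipe objects can be created in several ways. If an analysis only contains outcomes and predictors, the simplest way to create one is to use a formula (e.g. y ~ x1 + x2) that does not contain inline functions such as log(x3) (see the first example below).

Alternatively, a recipe object can be created by first specifying which variables in a data set should be used and then sequentially defining their roles (see the last example). This alternative is a good choice when the number of variables is very high, because the formula method is memory-inefficient with many variables.

There are two different types of operations that can be sequentially added to a recipe.

Steps can include operations such as scaling a variable, creating dummy variables, or creating interactions. More computationally complex operations such as dimension reduction or imputation can also be specified.
Checks are operations that conduct specific tests of the data. When the test is satisfied, the data are returned without issue or modification. Otherwise, an error is thrown.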
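The difference between a step and a check can be sketched as follows. The example assumes mtcars as training data and a hypothetical new data set with one missing value; check_missing() is just one of the checks the package provides.

```r
library(recipes)

rec <- recipe(mpg ~ wt + hp, data = mtcars) %>%
  step_normalize(all_numeric_predictors()) %>%  # a step: modifies the data
  check_missing(all_predictors())               # a check: only tests the data

prepped <- prep(rec, training = mtcars)

# Complete data passes the check and is returned normalized.
ok <- bake(prepped, new_data = head(mtcars))

# Hypothetical new data with a missing predictor value fails the check.
bad <- head(mtcars)
bad$wt[1] <- NA
result <- tryCatch(
  bake(prepped, new_data = bad),
  error = function(e) conditionMessage(e)
)
result  # an error message instead of a tibble
```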

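If you have defined a recipe and want to see which steps it contains, use the tidy() method on the recipe object.

Note that the data passed to recipe() need not be the complete data that will be used to train the steps (via prep()). The recipe only needs to know the names and types of the data that will be used. For large data sets, head() can be used to pass a smaller data set to save time and memory.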
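A short sketch of this pattern, assuming mtcars plays the role of the large data set: the recipe is defined on a few rows, and the statistics are estimated later from the full data passed to prep().

```r
library(recipes)

# Only column names and types are needed here, so a few rows suffice.
rec <- recipe(mpg ~ ., data = head(mtcars)) %>%
  step_normalize(all_numeric_predictors())

# The statistics are estimated from the data given to prep().
prepped <- prep(rec, training = mtcars)

stats <- tidy(prepped, number = 1)
disp_mean <- stats$value[stats$terms == "disp" & stats$statistic == "mean"]

# Compare with the mean of the full data, not of the first six rows.
all.equal(disp_mean, mean(mtcars$disp))
```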

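Using recipes

Once a recipe is defined, it needs to be estimated before being applied to data. Most recipe steps have specific quantities that must be calculated or estimated. For example, step_normalize() needs to compute the training set means (and standard deviations) of the selected columns, while step_dummy() needs to determine the factor levels of the selected columns in order to create the appropriate indicator columns.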
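To make these estimated quantities concrete, here is a minimal sketch that assumes the built-in iris data, which has one factor column (Species) and several numeric columns.

```r
library(recipes)

rec <- recipe(Sepal.Length ~ ., data = iris) %>%
  step_normalize(all_numeric_predictors()) %>%  # needs means and sds
  step_dummy(all_nominal_predictors())          # needs the factor levels

# Estimation happens here, using iris as the training set.
prepped <- prep(rec, training = iris)

# new_data = NULL returns the processed training set.
baked <- bake(prepped, new_data = NULL)
names(baked)
mean(baked$Sepal.Width)  # approximately zero after normalization
```

The indicator columns in the result are named after the factor levels found during prep(), with the first level treated as the reference.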

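The two most common applications of recipes are modeling and stand-alone preprocessing. How the recipe is estimated depends on how it is being used.

Modeling

The best way to use a recipe for modeling is via the workflows package. It bundles a model and a preprocessor (e.g. a recipe) together and gives the user a fluent way to train the model/recipe and make predictions.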

library(dplyr)
library(workflows)
library(recipes)
library(parsnip)

data(biomass, package = "modeldata")

# split data
biomass_tr <- biomass %>% filter(dataset == "Training")
biomass_te <- biomass %>% filter(dataset == "Testing")

# With only predictors and outcomes, use a formula:
rec <- recipe(HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur,
              data = biomass_tr)

# Now add preprocessing steps to the recipe:
sp_signed <- 
  rec %>%
  step_normalize(all_numeric_predictors()) %>%
  step_spatialsign(all_numeric_predictors())
sp_signed

## 
## -- Recipe ------------------------------------------------------------
## 
## -- Inputs
## Number of variables by role
## outcome:   1
## predictor: 5
## 
## -- Operations
## * Centering and scaling for: all_numeric_predictors()
## * Spatial sign on: all_numeric_predictors()

We can create a parsnip model and then build a workflow from the model and the recipe:

linear_mod <- linear_reg()

linear_sp_sign_wflow <- 
  workflow() %>% 
  add_model(linear_mod) %>% 
  add_recipe(sp_signed)

linear_sp_sign_wflow

## == Workflow ==========================================================
## Preprocessor: Recipe
## Model: linear_reg()
## 
## -- Preprocessor ------------------------------------------------------
## 2 Recipe Steps
## 
## * step_normalize()
## * step_spatialsign()
## 
## -- Model -------------------------------------------------------------
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm

To estimate the preprocessing steps and then fit the linear model, a single call to fit() is used:

linear_sp_sign_fit <- fit(linear_sp_sign_wflow, data = biomass_tr)

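When predicting, there is no need to do anything other than call predict(). This preprocesses the new data in the same manner as the training set and then passes the data to the linear model prediction code: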

predict(linear_sp_sign_fit, new_data = head(biomass_te))

## # A tibble: 6 x 1
##   .pred
##   <dbl>
## 1  18.1
## 2  17.9
## 3  17.2
## 4  18.8
## 5  19.6
## 6  14.6

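Stand-alone use of recipes

When using a recipe to generate data for a visualization or to troubleshoot problems with the recipe, there are functions that can be used to estimate the recipe and apply it to new data manually.

Once a recipe has been defined, the prep() function can be used to estimate the quantities required for the operations using a data set (a.k.a. the training data). prep() returns a recipe.

As an example of using PCA (perhaps to produce a plot):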

# Define the recipe
pca_rec <- 
  rec %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors())

Now estimate the normalization statistics and the PCA loadings:

pca_rec <- prep(pca_rec, training = biomass_tr)
pca_rec

## 
## -- Recipe ------------------------------------------------------------
## 
## -- Inputs
## Number of variables by role
## outcome:   1
## predictor: 5
## 
## -- Training information
## Training data contained 456 data points and no incomplete rows.
## 
## -- Operations
## * Centering and scaling for: carbon, hydrogen, oxygen, ... | Trained
## * PCA extraction with: carbon, hydrogen, oxygen, ... | Trained

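Note that the estimated recipe shows the actual column names captured by the selectors.

You can tidy.recipe() a recipe, whether it is prepped or unprepped, to learn more about its components.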

tidy(pca_rec)

## # A tibble: 2 x 6
##   number operation type      trained skip  id             
##    <int> <chr>     <chr>     <lgl>   <lgl> <chr>          
## 1      1 step      normalize TRUE    FALSE normalize_AeYA4
## 2      2 step      pca       TRUE    FALSE pca_Zn1yz

You can also tidy() individual recipe steps by supplying a number or id argument.
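As a sketch of both forms, the example below assumes mtcars as training data and sets the step ids explicitly ("norm" and "pca" are illustrative names) so that they can be referred to later; without an explicit id, a random suffix is generated as in the output above.

```r
library(recipes)

prepped <- recipe(mpg ~ ., data = mtcars) %>%
  step_normalize(all_numeric_predictors(), id = "norm") %>%
  step_pca(all_numeric_predictors(), num_comp = 2, id = "pca") %>%
  prep(training = mtcars)

# Select a step by its position in the recipe ...
norm_stats <- tidy(prepped, number = 1)  # means and standard deviations

# ... or by its id.
pca_loadings <- tidy(prepped, id = "pca")

norm_stats
pca_loadings
```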

To apply the prepped recipe to a data set, use the bake() function in the same way that predict() is used for models. This applies the estimated steps to any data set.

bake(pca_rec, head(biomass_te))

## # A tibble: 6 x 6
##     HHV    PC1    PC2     PC3     PC4     PC5
##   <dbl>  <dbl>  <dbl>   <dbl>   <dbl>   <dbl>
## 1  18.3 0.730  -0.412 -0.495   0.333   0.253 
## 2  17.6 0.617   1.41   0.118  -0.466   0.815 
## 3  17.2 0.761   1.10  -0.0550 -0.397   0.747 
## 4  18.9 0.0400  0.950  0.158   0.405  -0.143 
## 5  20.5 0.792  -0.732  0.204   0.465  -0.148 
## 6  18.5 0.433  -0.127 -0.354  -0.0168 -0.0888
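Two further uses of bake() are worth a brief sketch: retrieving the processed training set and restricting the returned columns. The example assumes mtcars as training data and two principal components, both chosen only for illustration.

```r
library(recipes)

prepped <- recipe(mpg ~ ., data = mtcars) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors(), num_comp = 2) %>%
  prep(training = mtcars)

# new_data = NULL returns the processed training set retained by prep().
train_pcs <- bake(prepped, new_data = NULL)

# Selectors after new_data restrict which columns are returned.
pcs_only <- bake(prepped, new_data = head(mtcars), all_predictors())
names(pcs_only)
```

Using new_data = NULL relies on prep() having kept the training set, which is its default behavior (retain = TRUE).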

In general, the workflow interface to recipes is recommended for most applications.

Examples


# formula example with single outcome:
library(recipes)
data(biomass, package = "modeldata")

# split data
biomass_tr <- biomass[biomass$dataset == "Training", ]
biomass_te <- biomass[biomass$dataset == "Testing", ]

# With only predictors and outcomes, use a formula
rec <- recipe(
  HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur,
  data = biomass_tr
)

# Now add preprocessing steps to the recipe
sp_signed <- rec %>%
  step_normalize(all_numeric_predictors()) %>%
  step_spatialsign(all_numeric_predictors())
sp_signed
#> 
#> ── Recipe ────────────────────────────────────────────────────────────────
#> 
#> ── Inputs 
#> Number of variables by role
#> outcome:   1
#> predictor: 5
#> 
#> ── Operations 
#> • Centering and scaling for: all_numeric_predictors()
#> • Spatial sign on: all_numeric_predictors()

# ---------------------------------------------------------------------------
# formula multivariate example:
# no need for `cbind(carbon, hydrogen)` for left-hand side

multi_y <- recipe(carbon + hydrogen ~ oxygen + nitrogen + sulfur,
  data = biomass_tr
)
multi_y <- multi_y %>%
  step_center(all_numeric_predictors()) %>%
  step_scale(all_numeric_predictors())

# ---------------------------------------------------------------------------
# example using `update_role` instead of formula:
# best choice for high-dimensional data

rec <- recipe(biomass_tr) %>%
  update_role(carbon, hydrogen, oxygen, nitrogen, sulfur,
    new_role = "predictor"
  ) %>%
  update_role(HHV, new_role = "outcome") %>%
  update_role(sample, new_role = "id variable") %>%
  update_role(dataset, new_role = "splitting indicator")
rec
#> 
#> ── Recipe ────────────────────────────────────────────────────────────────
#> 
#> ── Inputs 
#> Number of variables by role
#> outcome:             1
#> predictor:           5
#> id variable:         1
#> splitting indicator: 1

Source: R/recipe.R

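Note: This article was compiled by 纯净天空 from the original English work "Create a recipe for preprocessing data" by Max Kuhn et al. Unless otherwise stated, copyright in the original code belongs to the original authors; do not reproduce or copy this translation without permission.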