R hardhat default_formula_blueprint: Default formula blueprint

This page holds the details for the formula preprocessing blueprint. This is the blueprint that mold() uses by default when x is a formula.

Usage

default_formula_blueprint(
  intercept = FALSE,
  allow_novel_levels = FALSE,
  indicators = "traditional",
  composition = "tibble"
)

# S3 method for formula
mold(formula, data, ..., blueprint = NULL)

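Arguments

intercept

A logical. Should an intercept be included in the processed data? This information is used by the process function in the mold and forge function list.

allow_novel_levels

A logical. Should novel factor levels be allowed at prediction time? This information is used by the clean function in the forge function list, and is passed on to scream().

indicators

A single character string. Controls how factors are expanded into dummy variable indicator columns. One of:

"traditional" - The default. Creates dummy variables using the traditional model.matrix() infrastructure. Generally this creates K - 1 indicator columns for each factor, where K is the number of levels in that factor.
"none" - Leaves factor variables alone. No expansion is done.
"one_hot" - Creates dummy variables using a one-hot encoding approach that expands unordered factors into all K indicator columns, rather than K - 1.

composition

Either "tibble", "matrix", or "dgCMatrix" for the format of the processed predictors. If "matrix" or "dgCMatrix" is chosen, all of the predictors must be numeric after the preprocessing method has been applied; otherwise an error is thrown.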

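formula

A formula specifying the predictors and the outcomes.

data

A data frame or matrix containing the outcomes and predictors.

...

Not used.

blueprint

A preprocessing blueprint. If left as NULL, then default_formula_blueprint() is used.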

Value

For default_formula_blueprint(), a formula blueprint.

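Details

While not different from base R, the behavior of expanding factors into dummy variables when indicators = "traditional" and an intercept is not present is not always intuitive and should be documented.

When an intercept is present, factors are expanded into K - 1 new columns, where K is the number of levels in the factor.
When an intercept is not present, the first factor is expanded into all K columns (one-hot encoding), and the remaining factors are expanded into K - 1 columns. This behavior ensures that meaningful predictions can be made for the reference level of the first factor, but it is not the exact "no intercept" model that was requested. Without this behavior, predictions for the reference level of the first factor would always be forced to 0 when there is no intercept.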
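
The two cases above, plus full indicator expansion, can be reproduced with base R alone. The sketch below is only an illustration: the data frame `df` and its columns `g` and `h` are made up, and it calls stats::model.matrix() directly rather than mold().

```r
# Hypothetical toy data: `g` has K = 3 levels, `h` has K = 2 levels
df <- data.frame(
  g = factor(c("a", "b", "c", "a", "b", "c")),
  h = factor(c("A", "B", "A", "B", "A", "B"))
)

# Intercept present: every factor is expanded into K - 1 columns
with_intercept <- stats::model.matrix(~ g + h, data = df)
colnames(with_intercept)

# Intercept removed: the first factor gets all K columns,
# later factors still get K - 1 columns
no_intercept <- stats::model.matrix(~ 0 + g + h, data = df)
colnames(no_intercept)

# Full indicators for every factor, using identity ("no contrast") matrices.
# This resembles what `indicators = "one_hot"` is described as producing.
all_levels <- stats::model.matrix(
  ~ 0 + g + h,
  data = df,
  contrasts.arg = list(
    g = stats::contrasts(df$g, contrasts = FALSE),
    h = stats::contrasts(df$h, contrasts = FALSE)
  )
)
colnames(all_levels)
```

Under R's default treatment contrasts, the first call should give an intercept column plus gb, gc, and hB; the second should give ga, gb, gc, and hB; the third should give ga, gb, gc, hA, and hB.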

Offsets can be included in the formula method through the inline function stats::offset(). They are returned as a tibble with 1 column named ".offset" in the $extras$offset slot of the return value.

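Mold

When mold() is used with the default formula blueprint:

Predictors
- The RHS of the formula is isolated and converted to its own 1-sided formula: ~ RHS.
- stats::model.frame() is run on the RHS formula, using data.
- If indicators = "traditional", stats::model.matrix() is then run on the result.
- If indicators = "none", factors are removed before model.matrix() is run and are added back afterwards. Interactions or inline functions involving factors are not allowed.
- If indicators = "one_hot", stats::model.matrix() is then run on the result using a contrast function that creates indicator columns for all levels of all factors.
- If any offsets are present from using offset(), they are extracted with model_offset().
- If intercept = TRUE, an intercept column is added.
- The result of the above steps is coerced to a tibble.
Outcomes
- The LHS of the formula is isolated and converted to its own 1-sided formula: ~ LHS.
- stats::model.frame() is run on the LHS formula, using data.
- The result of the above steps is coerced to a tibble.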
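
The predictor and outcome steps listed above can be approximated in base R. This is a rough sketch under stated assumptions, not hardhat's internal code: the `train` data and the formula are made up, only the `indicators = "traditional"` path is shown, the intercept column from model.matrix() is left in place (comparable to intercept = TRUE), and the final coercion uses as.data.frame() where mold() would return a tibble.

```r
# Hypothetical training data (not part of hardhat)
train <- data.frame(
  y = c(1.2, 2.3, 3.1, 4.8),
  x = c(0.5, 1.5, 2.5, 3.5),
  g = factor(c("a", "b", "c", "a"))
)
f <- y ~ x + g

# Isolate each side of the formula as its own one-sided formula
rhs <- f[-2]  # drops the LHS, leaving ~x + g
lhs <- f[-3]  # drops the RHS, leaving ~y

# Predictors: build the model frame, then expand it with model.matrix()
pred_frame <- stats::model.frame(rhs, data = train)
pred_matrix <- stats::model.matrix(rhs, data = pred_frame)
predictors <- as.data.frame(pred_matrix)

# Outcomes: only a model frame is needed
outcomes <- stats::model.frame(lhs, data = train)

colnames(pred_matrix)
names(outcomes)
```

Splitting the formula first means the predictors and the outcomes are processed independently, which is what allows forge() to rebuild the predictors later without requiring the outcome columns.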

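Forge

When forge() is used with the default formula blueprint:

It calls shrink() to trim new_data to only the required columns and coerce new_data to a tibble.
It calls scream() to validate the structure of the columns of new_data.
Predictors
- It runs stats::model.frame() on new_data using the stored terms object corresponding to the predictors.
- If indicators = "traditional" was set in the original mold() call, it then runs stats::model.matrix() on the result.
- If indicators = "none" was set in the original mold() call, it runs stats::model.matrix() on the result without the factor columns, and then adds them back afterwards.
- If indicators = "one_hot" was set in the original mold() call, it runs stats::model.matrix() on the result with a contrast function that includes indicators for all levels of all factor columns.
- If any offsets are present from using offset() in the original call to mold(), they are extracted with model_offset().
- If intercept = TRUE was set in the original call to mold(), an intercept column is added.
- It coerces the result of the above steps to a tibble.
Outcomes
- It runs stats::model.frame() on new_data using the stored terms object corresponding to the outcomes.
- It coerces the result to a tibble.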
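
The idea of reusing a stored terms object can be sketched in base R: keep the terms and the factor levels from training, then apply them to new data so the resulting columns line up. The data, the object names, and the use of `xlev` are assumptions made for illustration; forge() also runs shrink() and scream(), which are not reproduced here.

```r
# Hypothetical training and new data (not part of hardhat)
train <- data.frame(
  x = c(0.5, 1.5, 2.5, 3.5),
  g = factor(c("a", "b", "c", "a"))
)
new_data <- data.frame(
  x = c(1.0, 2.0),
  g = factor(c("c", "a"))  # level "b" does not appear here
)

# "Mold" time: store the terms object and the training factor levels
train_frame <- stats::model.frame(~ x + g, data = train)
stored_terms <- stats::terms(train_frame)
stored_levels <- stats::.getXlevels(stored_terms, train_frame)

# "Forge" time: rebuild the frame and matrix from the stored pieces,
# so `g` is expanded against the training levels a, b, c
new_frame <- stats::model.frame(
  stored_terms,
  data = new_data,
  xlev = stored_levels
)
new_matrix <- stats::model.matrix(stored_terms, data = new_frame)

colnames(new_matrix)
```

Because the training levels are passed through `xlev`, the new matrix should still contain a gb column even though "b" never occurs in `new_data`.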

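Differences From Base R

There are a number of differences from base R in how mold() processes formulas, and they require some explanation.

Multivariate outcomes can be specified on the LHS using syntax similar to the RHS (i.e. outcome_1 + outcome_2 ~ predictors). If any complex calculations are done on the LHS and they return matrices (like stats::poly()), those matrices are flattened into multiple columns of the tibble after the call to model.frame(). While this is possible, it is not recommended; if a large amount of preprocessing is required on the outcomes, you are better off using recipes::recipe().

Global variables are not allowed in the formula. An error is thrown if they are included. All terms in the formula should come from data. If you need to use inline functions in the formula, the safest way to do so is to prefix them with their package name, like pkg::fn(). This ensures that the function is always available at both mold() (fit) and forge() (prediction) time. That said, if the package is attached (i.e. with library()), you should be able to use inline functions without the prefix.

By default, intercepts are not included in the predictor output from the formula. To include an intercept, set blueprint = default_formula_blueprint(intercept = TRUE). The rationale is that many packages either always require or never allow an intercept (for example, the earth package), and they do a large amount of extra work to keep the user from supplying one or removing it. This interface standardizes all of that flexibility in one place.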

Examples

# ---------------------------------------------------------------------------

library(hardhat)

data("hardhat-example-data")

# ---------------------------------------------------------------------------
# Formula Example

# Call mold() with the training data
processed <- mold(
  log(num_1) ~ num_2 + fac_1,
  example_train,
  blueprint = default_formula_blueprint(intercept = TRUE)
)

# Then, call forge() with the blueprint and the test data
# to have it preprocess the test data in the same way
forge(example_test, processed$blueprint)
#> $predictors
#> # A tibble: 2 × 4
#>   `(Intercept)` num_2 fac_1b fac_1c
#>           <dbl> <dbl>  <dbl>  <dbl>
#> 1             1 0.967      0      0
#> 2             1 0.761      0      1
#> 
#> $outcomes
#> NULL
#> 
#> $extras
#> $extras$offset
#> NULL
#> 
#> 

# Use `outcomes = TRUE` to also extract the preprocessed outcome
forge(example_test, processed$blueprint, outcomes = TRUE)
#> $predictors
#> # A tibble: 2 × 4
#>   `(Intercept)` num_2 fac_1b fac_1c
#>           <dbl> <dbl>  <dbl>  <dbl>
#> 1             1 0.967      0      0
#> 2             1 0.761      0      1
#> 
#> $outcomes
#> # A tibble: 2 × 1
#>   `log(num_1)`
#>          <dbl>
#> 1         3.00
#> 2         3.04
#> 
#> $extras
#> $extras$offset
#> NULL
#> 
#> 

# ---------------------------------------------------------------------------
# Factors without an intercept

# No intercept is added by default
processed <- mold(num_1 ~ fac_1 + fac_2, example_train)

# So, for factor columns, the first factor is completely expanded into all
# `K` columns (the number of levels), and the subsequent factors are expanded
# into `K - 1` columns.
processed$predictors
#> # A tibble: 12 × 4
#>    fac_1a fac_1b fac_1c fac_2B
#>     <dbl>  <dbl>  <dbl>  <dbl>
#>  1      1      0      0      0
#>  2      1      0      0      1
#>  3      1      0      0      0
#>  4      1      0      0      1
#>  5      0      1      0      0
#>  6      0      1      0      1
#>  7      0      1      0      0
#>  8      0      1      0      1
#>  9      0      0      1      0
#> 10      0      0      1      1
#> 11      0      0      1      0
#> 12      0      0      1      1

# In the above example, `fac_1` is expanded into all three columns,
# `fac_2` is not. This behavior comes from `model.matrix()`, and is somewhat
# known in the R community, but can lead to a model that is difficult to
# interpret since the corresponding p-values are testing wildly different
# hypotheses.

# To get all indicators for all columns (irrespective of the intercept),
# use the `indicators = "one_hot"` option
processed <- mold(
  num_1 ~ fac_1 + fac_2,
  example_train,
  blueprint = default_formula_blueprint(indicators = "one_hot")
)

processed$predictors
#> # A tibble: 12 × 5
#>    fac_1a fac_1b fac_1c fac_2A fac_2B
#>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#>  1      1      0      0      1      0
#>  2      1      0      0      0      1
#>  3      1      0      0      1      0
#>  4      1      0      0      0      1
#>  5      0      1      0      1      0
#>  6      0      1      0      0      1
#>  7      0      1      0      1      0
#>  8      0      1      0      0      1
#>  9      0      0      1      1      0
#> 10      0      0      1      0      1
#> 11      0      0      1      1      0
#> 12      0      0      1      0      1

# It is not possible to construct a no-intercept model that expands all
# factors into `K - 1` columns using the formula method. If required, a
# recipe could be used to construct this model.

# ---------------------------------------------------------------------------
# Global variables

y <- rep(1, times = nrow(example_train))

# In base R, global variables are allowed in a model formula
frame <- model.frame(fac_1 ~ y + num_2, example_train)
head(frame)
#>   fac_1 y num_2
#> 1     a 1 0.579
#> 2     a 1 0.338
#> 3     a 1 0.206
#> 4     a 1 0.546
#> 5     b 1 0.964
#> 6     b 1 0.631

# mold() does not allow them, and throws an error
try(mold(fac_1 ~ y + num_2, example_train))
#> Error in get_all_predictors(formula, data) : 
#>   The following predictors were not found in `data`: 'y'.

# ---------------------------------------------------------------------------
# Dummy variables and interactions

# By default, factor columns are expanded
# and interactions are created, both by
# calling `model.matrix()`. Some models (like
# tree based models) can take factors directly
# but still might want to use the formula method.
# In those cases, set `indicators = "none"` to not
# run `model.matrix()` on factor columns. Interactions
# are still allowed and are run on numeric columns.

bp_no_indicators <- default_formula_blueprint(indicators = "none")

processed <- mold(
  ~ fac_1 + num_1:num_2,
  example_train,
  blueprint = bp_no_indicators
)

processed$predictors
#> # A tibble: 12 × 2
#>    `num_1:num_2` fac_1
#>            <dbl> <fct>
#>  1         0.579 a    
#>  2         0.676 a    
#>  3         0.618 a    
#>  4         2.18  a    
#>  5         4.82  b    
#>  6         3.79  b    
#>  7         5.66  b    
#>  8         1.66  b    
#>  9         2.84  c    
#> 10         0.83  c    
#> 11         6.81  c    
#> 12         7.42  c    

# An informative error is thrown when `indicators = "none"` and
# factors are present in interaction terms or in inline functions
try(mold(num_1 ~ num_2:fac_1, example_train, blueprint = bp_no_indicators))
#> Error in mold_formula_default_process_predictors(blueprint = blueprint,  : 
#>   Interaction terms involving factors or characters have been
#> detected on the RHS of `formula`. These are not allowed when `indicators
#> = "none"`.
#> ℹ Interactions terms involving factors were detected for "fac_1" in
#>   `num_2:fac_1`.
try(mold(num_1 ~ paste0(fac_1), example_train, blueprint = bp_no_indicators))
#> Error in mold_formula_default_process_predictors(blueprint = blueprint,  : 
#>   Functions involving factors or characters have been detected on
#> the RHS of `formula`. These are not allowed when `indicators = "none"`.
#> ℹ Functions involving factors were detected for "fac_1" in
#>   `paste0(fac_1)`.

# ---------------------------------------------------------------------------
# Multivariate outcomes

# Multivariate formulas can be specified easily
processed <- mold(num_1 + log(num_2) ~ fac_1, example_train)
processed$outcomes
#> # A tibble: 12 × 2
#>    num_1 `log(num_2)`
#>    <int>        <dbl>
#>  1     1      -0.546 
#>  2     2      -1.08  
#>  3     3      -1.58  
#>  4     4      -0.605 
#>  5     5      -0.0367
#>  6     6      -0.460 
#>  7     7      -0.213 
#>  8     8      -1.57  
#>  9     9      -1.15  
#> 10    10      -2.49  
#> 11    11      -0.480 
#> 12    12      -0.481 

# Inline functions on the LHS are run, but any matrix
# output is flattened (like what happens in `model.matrix()`)
# (essentially this means you don't wind up with columns
# in the tibble that are matrices)
processed <- mold(poly(num_2, degree = 2) ~ fac_1, example_train)
processed$outcomes
#> # A tibble: 12 × 2
#>    `poly(num_2, degree = 2).1` `poly(num_2, degree = 2).2`
#>                          <dbl>                       <dbl>
#>  1                      0.0981                      -0.254
#>  2                     -0.177                       -0.157
#>  3                     -0.327                        0.108
#>  4                      0.0604                      -0.270
#>  5                      0.537                        0.634
#>  6                      0.157                       -0.209
#>  7                      0.359                        0.120
#>  8                     -0.325                        0.103
#>  9                     -0.202                       -0.124
#> 10                     -0.468                        0.492
#> 11                      0.144                       -0.221
#> 12                      0.143                       -0.222

# TRUE
ncol(processed$outcomes) == 2
#> [1] TRUE

# Multivariate formulas specified in mold()
# carry over into forge()
forge(example_test, processed$blueprint, outcomes = TRUE)
#> $predictors
#> # A tibble: 2 × 3
#>   fac_1a fac_1b fac_1c
#>    <dbl>  <dbl>  <dbl>
#> 1      1      0      0
#> 2      0      0      1
#> 
#> $outcomes
#> # A tibble: 2 × 2
#>   `poly(num_2, degree = 2).1` `poly(num_2, degree = 2).2`
#>                         <dbl>                       <dbl>
#> 1                       0.541                     0.646  
#> 2                       0.306                     0.00619
#> 
#> $extras
#> $extras$offset
#> NULL
#> 
#> 

# ---------------------------------------------------------------------------
# Offsets

# Offsets are handled specially in base R, so they deserve special
# treatment here as well. You can add offsets using the inline function
# `offset()`
processed <- mold(num_1 ~ offset(num_2) + fac_1, example_train)

processed$extras$offset
#> # A tibble: 12 × 1
#>    .offset
#>      <dbl>
#>  1   0.579
#>  2   0.338
#>  3   0.206
#>  4   0.546
#>  5   0.964
#>  6   0.631
#>  7   0.808
#>  8   0.208
#>  9   0.316
#> 10   0.083
#> 11   0.619
#> 12   0.618

# Multiple offsets can be included, and they get added together
processed <- mold(
  num_1 ~ offset(num_2) + offset(num_3),
  example_train
)

identical(
  processed$extras$offset$.offset,
  example_train$num_2 + example_train$num_3
)
#> [1] TRUE

# Forging test data will also require
# and include the offset
forge(example_test, processed$blueprint)
#> $predictors
#> # A tibble: 2 × 0
#> 
#> $outcomes
#> NULL
#> 
#> $extras
#> $extras$offset
#> # A tibble: 2 × 1
#>   .offset
#>     <dbl>
#> 1   1.06 
#> 2   0.802
#> 
#> 

# ---------------------------------------------------------------------------
# Intercept only

# Because `1` and `0` are intercept modifying terms, they are
# not allowed in the formula and are instead controlled by the
# `intercept` argument of the blueprint. To use an intercept
# only formula, you should supply `NULL` on the RHS of the formula.
mold(
  ~NULL,
  example_train,
  blueprint = default_formula_blueprint(intercept = TRUE)
)
#> $predictors
#> # A tibble: 12 × 1
#>    `(Intercept)`
#>            <dbl>
#>  1             1
#>  2             1
#>  3             1
#>  4             1
#>  5             1
#>  6             1
#>  7             1
#>  8             1
#>  9             1
#> 10             1
#> 11             1
#> 12             1
#> 
#> $outcomes
#> # A tibble: 12 × 0
#> 
#> $blueprint
#> Formula blueprint: 
#>  
#> # Predictors: 0 
#>   # Outcomes: 0 
#>    Intercept: TRUE 
#> Novel Levels: FALSE 
#>  Composition: tibble 
#>   Indicators: traditional 
#> 
#> $extras
#> $extras$offset
#> NULL
#> 
#> 

# ---------------------------------------------------------------------------
# Matrix output for predictors

# You can change the `composition` of the predictor data set
bp <- default_formula_blueprint(composition = "dgCMatrix")
processed <- mold(log(num_1) ~ num_2 + fac_1, example_train, blueprint = bp)
class(processed$predictors)
#> [1] "dgCMatrix"
#> attr(,"package")
#> [1] "Matrix"

Source code: R/blueprint-formula-default.R, R/mold.R

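Note: This page is adapted from the original English documentation "Default formula blueprint" by Davis Vaughan et al. Unless otherwise stated, copyright of the original code belongs to the original authors.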