R hardhat default_formula_blueprint: Default formula blueprint

This page contains details for the formula preprocessing blueprint. This is the blueprint that mold() uses by default when x is a formula.

Usage

default_formula_blueprint(
  intercept = FALSE,
  allow_novel_levels = FALSE,
  indicators = "traditional",
  composition = "tibble"
)

# S3 method for formula
mold(formula, data, ..., blueprint = NULL)

Arguments

intercept

A logical. Should an intercept be included in the processed data? This information is used by the process function in the mold and forge function list.

allow_novel_levels

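A logical. Should novel factor levels be allowed at prediction time? This information is used by the clean function in the forge function list, and is passed on to scream().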

indicators

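A single character string. Controls how factors are expanded into dummy variable indicator columns. One of:

"traditional" - The default. Create dummy variables using the traditional model.matrix() infrastructure. Generally, this creates K - 1 indicator columns for each factor, where K is the number of levels in that factor.
"none" - Leave factor variables alone. No expansion is done.
"one_hot" - Create dummy variables using a one-hot encoding approach that expands unordered factors into all K indicator columns, rather than K - 1.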

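composition

Either "tibble", "matrix", or "dgCMatrix" for the format of the processed predictors. If "matrix" or "dgCMatrix" is chosen, all of the predictors must be numeric after the preprocessing method has been applied; otherwise an error is thrown.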

formula

A formula specifying the predictors and the outcomes.

data

A data frame or matrix containing the outcomes and predictors.

...

Not used.

blueprint

A preprocessing blueprint. If left as NULL, then default_formula_blueprint() is used.

Value

For default_formula_blueprint(), a formula blueprint.

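Details

Although it is no different from base R, the way factors are expanded into dummy variables when indicators = "traditional" and no intercept is present is not always intuitive, so it is documented here.

When an intercept is present, factors are expanded into K-1 new columns, where K is the number of levels in the factor.
When an intercept is not present, the first factor is expanded into all K columns (one-hot encoding), and the remaining factors are expanded into K-1 columns. This behavior ensures that meaningful predictions can be made for the reference level of the first factor, but it is not the exact "no intercept" model that was requested. Without this behavior, predictions for the reference level of the first factor would always be forced to 0 when there is no intercept.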
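
The two cases above can be reproduced with base R alone. The following is a minimal sketch on a hypothetical toy data frame (the names df, f1, and f2 are made up for illustration); it calls stats::model.matrix() directly rather than going through mold().

```r
# Toy data (hypothetical; not the hardhat example data)
df <- data.frame(
  f1 = factor(c("a", "b", "c", "a")),
  f2 = factor(c("A", "B", "A", "B"))
)

# With an intercept: every factor is expanded into K - 1 columns
with_int <- stats::model.matrix(~ f1 + f2, df)
colnames(with_int)
# columns: (Intercept), f1b, f1c, f2B

# Without an intercept: the first factor gets all K columns,
# while later factors still get K - 1 columns
no_int <- stats::model.matrix(~ 0 + f1 + f2, df)
colnames(no_int)
# columns: f1a, f1b, f1c, f2B
```

The reference level f1a only appears as its own column in the no-intercept matrix, which is why predictions for that level are not forced to 0.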

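Offsets can be included in the formula method through the use of the inline function stats::offset(). They are returned as a tibble with 1 column named ".offset" in the $extras$offset slot of the return value.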
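
As a rough base R illustration of what a single combined offset column means, the sketch below uses stats::model.offset(), which sums every offset() term found in a model frame. The data frame and its columns y, a, and b are hypothetical, and the final data frame only imitates the one-column ".offset" layout; it is not hardhat's own code.

```r
# Toy data (hypothetical)
df <- data.frame(
  y = c(1, 2, 3),
  a = c(0.5, 1.0, 1.5),
  b = c(10, 20, 30)
)

# Two offset terms in one formula
frame <- stats::model.frame(y ~ offset(a) + offset(b), df)

# model.offset() adds all offset terms together
off <- stats::model.offset(frame)

# Imitate the one-column layout described above
offset_df <- data.frame(.offset = off)
offset_df
```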

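Mold

When mold() is used with the default formula blueprint:

Predictors
- The RHS of the formula is isolated and converted to its own 1-sided formula: ~ RHS.
- stats::model.frame() is run on the RHS formula using data.
- If indicators = "traditional", stats::model.matrix() is then run on the result.
- If indicators = "none", factors are removed before model.matrix() is run, and then added back afterwards. Interactions and inline functions involving factors are not allowed.
- If indicators = "one_hot", stats::model.matrix() is then run on the result using a contrast function that creates indicator columns for all levels of all factors.
- If any offsets are present from using offset(), they are extracted with model_offset().
- If intercept = TRUE, an intercept column is added.
- The result of the above steps is coerced to a tibble.
Outcomes
- The LHS of the formula is isolated and converted to its own 1-sided formula: ~ LHS.
- stats::model.frame() is run on the LHS formula using data.
- The result of the above steps is coerced to a tibble.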
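
The steps above can be approximated in base R. This is a simplified sketch under stated assumptions, not hardhat's implementation: the toy data and the names df, rhs, lhs, and full are hypothetical, a plain data frame stands in for a tibble, the formula's implicit intercept is kept (loosely mirroring intercept = TRUE), and the one-hot variant uses full K x K contrasts as one possible way to get an indicator column for every level.

```r
# Toy data (hypothetical; not the hardhat example data)
df <- data.frame(
  y = c(1, 2, 4, 8),
  num = c(0.1, 0.2, 0.3, 0.4),
  fac = factor(c("a", "b", "c", "a"))
)
formula <- log(y) ~ num + fac

# Isolate each side as its own one-sided formula
rhs <- formula[-2]  # ~num + fac
lhs <- formula[-3]  # ~log(y)

# Predictors: model frame first, then model matrix ("traditional" path)
pred_frame <- stats::model.frame(rhs, df)
predictors <- as.data.frame(stats::model.matrix(rhs, pred_frame))
names(predictors)
# columns: (Intercept), num, facb, facc

# One-hot variant: supply a full K x K contrast matrix for the factor
full <- list(fac = stats::contrasts(df$fac, contrasts = FALSE))
one_hot <- stats::model.matrix(rhs, pred_frame, contrasts.arg = full)
colnames(one_hot)
# columns: (Intercept), num, faca, facb, facc

# Outcomes: model frame on the LHS only, so inline functions are evaluated
outcomes <- as.data.frame(stats::model.frame(lhs, df))
names(outcomes)
# column: log(y)
```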

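Forge

When forge() is used with the default formula blueprint:

It calls shrink() to trim new_data to only the required columns and to coerce new_data to a tibble.
It calls scream() to validate the structure of the columns of new_data.
Predictors
- It runs stats::model.frame() on new_data using the stored terms object corresponding to the predictors.
- If indicators = "traditional" was set in the original mold() call, it then runs stats::model.matrix() on the result.
- If indicators = "none" was set in the original mold() call, it runs stats::model.matrix() on the result without the factor columns, and then adds them back afterwards.
- If indicators = "one_hot" was set in the original mold() call, it runs stats::model.matrix() on the result with a contrast function that includes indicators for all levels of all factor columns.
- If any offsets were present from using offset() in the original call to mold(), they are extracted with model_offset().
- If intercept = TRUE was set in the original call to mold(), an intercept column is added.
- It coerces the result of the above steps to a tibble.
Outcomes
- It runs stats::model.frame() on new_data using the stored terms object corresponding to the outcomes.
- It coerces the result to a tibble.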
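
The idea of reusing a stored terms object on new data can be sketched in base R as follows. This is an illustration under assumptions, not hardhat's code: the data and all variable names are hypothetical, column subsetting stands in for shrink(), no scream()-style validation is performed, and the stored factor levels are passed through the xlev argument of stats::model.frame().

```r
# Training data and new data (both hypothetical)
train <- data.frame(
  num = c(0.1, 0.2, 0.3, 0.4),
  fac = factor(c("a", "b", "c", "a"))
)
new_data <- data.frame(
  num = c(0.9, 0.8),
  fac = factor(c("c", "c")),  # only one of the training levels is present
  unused = c(1, 2)            # a column the formula never needed
)

# "Mold" time: keep the terms and factor levels for later
train_frame <- stats::model.frame(~ num + fac, train)
stored_terms <- stats::terms(train_frame)
stored_levels <- stats::.getXlevels(stored_terms, train_frame)

# "Forge" time: trim to the required columns, then reuse the stored terms
needed <- all.vars(stored_terms)
trimmed <- new_data[, needed, drop = FALSE]
new_frame <- stats::model.frame(stored_terms, trimmed, xlev = stored_levels)
new_matrix <- stats::model.matrix(stored_terms, new_frame)
colnames(new_matrix)
# columns: (Intercept), num, facb, facc
```

Because the training levels are restored before model.matrix() runs, the new data produces the same columns as the training data even though it contains only level "c".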

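Differences From Base R

mold() processes formulas differently from base R in several ways that require some explanation.

Multivariate outcomes can be specified on the LHS using syntax similar to that of the RHS (i.e. outcome_1 + outcome_2 ~ predictors). If any complex calculations are done on the LHS and they return matrices (like stats::poly()), then those matrices are flattened into multiple columns of the tibble after the call to model.frame(). While this is possible, it is not recommended; if a large amount of preprocessing is required on the outcomes, you are better off using recipes::recipe().

Global variables are not allowed in the formula. An error is thrown if they are included. All terms in the formula should come from data. If you need to use inline functions in the formula, the safest way to do so is to prefix them with their package name, like pkg::fn(). This ensures that the function is always available at both mold() (fit) time and forge() (prediction) time. That said, if the package is attached (i.e. with library()), then you should be able to use the inline function without the prefix.

By default, intercepts are not included in the predictor output from the formula. To include an intercept, set blueprint = default_formula_blueprint(intercept = TRUE). The rationale is that many packages either always require or never allow an intercept (for example, the earth package), and they do a large amount of extra work to keep the user from supplying one or removing it. This interface standardizes all of that flexibility in one place.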

Examples

# ---------------------------------------------------------------------------

# Attach hardhat so that mold(), forge(), and the example data are available
library(hardhat)

data("hardhat-example-data")

# ---------------------------------------------------------------------------
# Formula Example

# Call mold() with the training data
processed <- mold(
  log(num_1) ~ num_2 + fac_1,
  example_train,
  blueprint = default_formula_blueprint(intercept = TRUE)
)

# Then, call forge() with the blueprint and the test data
# to have it preprocess the test data in the same way
forge(example_test, processed$blueprint)
#> $predictors
#> # A tibble: 2 × 4
#>   `(Intercept)` num_2 fac_1b fac_1c
#>           <dbl> <dbl>  <dbl>  <dbl>
#> 1             1 0.967      0      0
#> 2             1 0.761      0      1
#> 
#> $outcomes
#> NULL
#> 
#> $extras
#> $extras$offset
#> NULL
#> 
#> 

# Use `outcomes = TRUE` to also extract the preprocessed outcome
forge(example_test, processed$blueprint, outcomes = TRUE)
#> $predictors
#> # A tibble: 2 × 4
#>   `(Intercept)` num_2 fac_1b fac_1c
#>           <dbl> <dbl>  <dbl>  <dbl>
#> 1             1 0.967      0      0
#> 2             1 0.761      0      1
#> 
#> $outcomes
#> # A tibble: 2 × 1
#>   `log(num_1)`
#>          <dbl>
#> 1         3.00
#> 2         3.04
#> 
#> $extras
#> $extras$offset
#> NULL
#> 
#> 

# ---------------------------------------------------------------------------
# Factors without an intercept

# No intercept is added by default
processed <- mold(num_1 ~ fac_1 + fac_2, example_train)

# So, for factor columns, the first factor is completely expanded into all
# `K` columns (the number of levels), and the subsequent factors are expanded
# into `K - 1` columns.
processed$predictors
#> # A tibble: 12 × 4
#>    fac_1a fac_1b fac_1c fac_2B
#>     <dbl>  <dbl>  <dbl>  <dbl>
#>  1      1      0      0      0
#>  2      1      0      0      1
#>  3      1      0      0      0
#>  4      1      0      0      1
#>  5      0      1      0      0
#>  6      0      1      0      1
#>  7      0      1      0      0
#>  8      0      1      0      1
#>  9      0      0      1      0
#> 10      0      0      1      1
#> 11      0      0      1      0
#> 12      0      0      1      1

# In the above example, `fac_1` is expanded into all three columns,
# `fac_2` is not. This behavior comes from `model.matrix()`, and is somewhat
# known in the R community, but can lead to a model that is difficult to
# interpret since the corresponding p-values are testing wildly different
# hypotheses.

# To get all indicators for all columns (irrespective of the intercept),
# use the `indicators = "one_hot"` option
processed <- mold(
  num_1 ~ fac_1 + fac_2,
  example_train,
  blueprint = default_formula_blueprint(indicators = "one_hot")
)

processed$predictors
#> # A tibble: 12 × 5
#>    fac_1a fac_1b fac_1c fac_2A fac_2B
#>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#>  1      1      0      0      1      0
#>  2      1      0      0      0      1
#>  3      1      0      0      1      0
#>  4      1      0      0      0      1
#>  5      0      1      0      1      0
#>  6      0      1      0      0      1
#>  7      0      1      0      1      0
#>  8      0      1      0      0      1
#>  9      0      0      1      1      0
#> 10      0      0      1      0      1
#> 11      0      0      1      1      0
#> 12      0      0      1      0      1

# It is not possible to construct a no-intercept model that expands all
# factors into `K - 1` columns using the formula method. If required, a
# recipe could be used to construct this model.

# ---------------------------------------------------------------------------
# Global variables

y <- rep(1, times = nrow(example_train))

# In base R, global variables are allowed in a model formula
frame <- model.frame(fac_1 ~ y + num_2, example_train)
head(frame)
#>   fac_1 y num_2
#> 1     a 1 0.579
#> 2     a 1 0.338
#> 3     a 1 0.206
#> 4     a 1 0.546
#> 5     b 1 0.964
#> 6     b 1 0.631

# mold() does not allow them, and throws an error
try(mold(fac_1 ~ y + num_2, example_train))
#> Error in get_all_predictors(formula, data) : 
#>   The following predictors were not found in `data`: 'y'.

# ---------------------------------------------------------------------------
# Dummy variables and interactions

# By default, factor columns are expanded
# and interactions are created, both by
# calling `model.matrix()`. Some models (like
# tree based models) can take factors directly
# but still might want to use the formula method.
# In those cases, set `indicators = "none"` to not
# run `model.matrix()` on factor columns. Interactions
# are still allowed and are run on numeric columns.

bp_no_indicators <- default_formula_blueprint(indicators = "none")

processed <- mold(
  ~ fac_1 + num_1:num_2,
  example_train,
  blueprint = bp_no_indicators
)

processed$predictors
#> # A tibble: 12 × 2
#>    `num_1:num_2` fac_1
#>            <dbl> <fct>
#>  1         0.579 a    
#>  2         0.676 a    
#>  3         0.618 a    
#>  4         2.18  a    
#>  5         4.82  b    
#>  6         3.79  b    
#>  7         5.66  b    
#>  8         1.66  b    
#>  9         2.84  c    
#> 10         0.83  c    
#> 11         6.81  c    
#> 12         7.42  c    

# An informative error is thrown when `indicators = "none"` and
# factors are present in interaction terms or in inline functions
try(mold(num_1 ~ num_2:fac_1, example_train, blueprint = bp_no_indicators))
#> Error in mold_formula_default_process_predictors(blueprint = blueprint,  : 
#>   Interaction terms involving factors or characters have been
#> detected on the RHS of `formula`. These are not allowed when `indicators
#> = "none"`.
#> ℹ Interactions terms involving factors were detected for "fac_1" in
#>   `num_2:fac_1`.
try(mold(num_1 ~ paste0(fac_1), example_train, blueprint = bp_no_indicators))
#> Error in mold_formula_default_process_predictors(blueprint = blueprint,  : 
#>   Functions involving factors or characters have been detected on
#> the RHS of `formula`. These are not allowed when `indicators = "none"`.
#> ℹ Functions involving factors were detected for "fac_1" in
#>   `paste0(fac_1)`.

# ---------------------------------------------------------------------------
# Multivariate outcomes

# Multivariate formulas can be specified easily
processed <- mold(num_1 + log(num_2) ~ fac_1, example_train)
processed$outcomes
#> # A tibble: 12 × 2
#>    num_1 `log(num_2)`
#>    <int>        <dbl>
#>  1     1      -0.546 
#>  2     2      -1.08  
#>  3     3      -1.58  
#>  4     4      -0.605 
#>  5     5      -0.0367
#>  6     6      -0.460 
#>  7     7      -0.213 
#>  8     8      -1.57  
#>  9     9      -1.15  
#> 10    10      -2.49  
#> 11    11      -0.480 
#> 12    12      -0.481 

# Inline functions on the LHS are run, but any matrix
# output is flattened (like what happens in `model.matrix()`)
# (essentially this means you don't wind up with columns
# in the tibble that are matrices)
processed <- mold(poly(num_2, degree = 2) ~ fac_1, example_train)
processed$outcomes
#> # A tibble: 12 × 2
#>    `poly(num_2, degree = 2).1` `poly(num_2, degree = 2).2`
#>                          <dbl>                       <dbl>
#>  1                      0.0981                      -0.254
#>  2                     -0.177                       -0.157
#>  3                     -0.327                        0.108
#>  4                      0.0604                      -0.270
#>  5                      0.537                        0.634
#>  6                      0.157                       -0.209
#>  7                      0.359                        0.120
#>  8                     -0.325                        0.103
#>  9                     -0.202                       -0.124
#> 10                     -0.468                        0.492
#> 11                      0.144                       -0.221
#> 12                      0.143                       -0.222

# TRUE
ncol(processed$outcomes) == 2
#> [1] TRUE

# Multivariate formulas specified in mold()
# carry over into forge()
forge(example_test, processed$blueprint, outcomes = TRUE)
#> $predictors
#> # A tibble: 2 × 3
#>   fac_1a fac_1b fac_1c
#>    <dbl>  <dbl>  <dbl>
#> 1      1      0      0
#> 2      0      0      1
#> 
#> $outcomes
#> # A tibble: 2 × 2
#>   `poly(num_2, degree = 2).1` `poly(num_2, degree = 2).2`
#>                         <dbl>                       <dbl>
#> 1                       0.541                     0.646  
#> 2                       0.306                     0.00619
#> 
#> $extras
#> $extras$offset
#> NULL
#> 
#> 

# ---------------------------------------------------------------------------
# Offsets

# Offsets are handled specially in base R, so they deserve special
# treatment here as well. You can add offsets using the inline function
# `offset()`
processed <- mold(num_1 ~ offset(num_2) + fac_1, example_train)

processed$extras$offset
#> # A tibble: 12 × 1
#>    .offset
#>      <dbl>
#>  1   0.579
#>  2   0.338
#>  3   0.206
#>  4   0.546
#>  5   0.964
#>  6   0.631
#>  7   0.808
#>  8   0.208
#>  9   0.316
#> 10   0.083
#> 11   0.619
#> 12   0.618

# Multiple offsets can be included, and they get added together
processed <- mold(
  num_1 ~ offset(num_2) + offset(num_3),
  example_train
)

identical(
  processed$extras$offset$.offset,
  example_train$num_2 + example_train$num_3
)
#> [1] TRUE

# Forging test data will also require
# and include the offset
forge(example_test, processed$blueprint)
#> $predictors
#> # A tibble: 2 × 0
#> 
#> $outcomes
#> NULL
#> 
#> $extras
#> $extras$offset
#> # A tibble: 2 × 1
#>   .offset
#>     <dbl>
#> 1   1.06 
#> 2   0.802
#> 
#> 

# ---------------------------------------------------------------------------
# Intercept only

# Because `1` and `0` are intercept modifying terms, they are
# not allowed in the formula and are instead controlled by the
# `intercept` argument of the blueprint. To use an intercept
# only formula, you should supply `NULL` on the RHS of the formula.
mold(
  ~NULL,
  example_train,
  blueprint = default_formula_blueprint(intercept = TRUE)
)
#> $predictors
#> # A tibble: 12 × 1
#>    `(Intercept)`
#>            <dbl>
#>  1             1
#>  2             1
#>  3             1
#>  4             1
#>  5             1
#>  6             1
#>  7             1
#>  8             1
#>  9             1
#> 10             1
#> 11             1
#> 12             1
#> 
#> $outcomes
#> # A tibble: 12 × 0
#> 
#> $blueprint
#> Formula blueprint: 
#>  
#> # Predictors: 0 
#>   # Outcomes: 0 
#>    Intercept: TRUE 
#> Novel Levels: FALSE 
#>  Composition: tibble 
#>   Indicators: traditional 
#> 
#> $extras
#> $extras$offset
#> NULL
#> 
#> 

# ---------------------------------------------------------------------------
# Matrix output for predictors

# You can change the `composition` of the predictor data set
bp <- default_formula_blueprint(composition = "dgCMatrix")
processed <- mold(log(num_1) ~ num_2 + fac_1, example_train, blueprint = bp)
class(processed$predictors)
#> [1] "dgCMatrix"
#> attr(,"package")
#> [1] "Matrix"

Source: R/blueprint-formula-default.R, R/mold.R

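Note: This article was compiled by 純淨天空 from the original English work "Default formula blueprint" by Davis Vaughan and others. Unless otherwise stated, copyright of the original code belongs to the original authors; please do not reproduce or copy this translation without permission or authorization.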