R workflows add_formula: Add formula terms to a workflow

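add_formula() specifies the terms of the model through the use of a formula.
remove_formula() removes the formula, as well as any downstream objects that may have been created after the formula was used for preprocessing, such as terms. Additionally, if the model has already been fit, the fit is removed.
update_formula() first removes the existing formula, then replaces it with the new one. Any model that has already been fit based on the previous formula will need to be refit.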

Usage

add_formula(x, formula, ..., blueprint = NULL)

remove_formula(x)

update_formula(x, formula, ..., blueprint = NULL)

Arguments

x

A workflow.

formula

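A formula specifying the terms of the model. It is advised not to do preprocessing in the formula; use a recipe instead if preprocessing is required.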

...

Not used.

blueprint

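A hardhat blueprint used for fine-tuning the preprocessing.

If NULL, hardhat::default_formula_blueprint() is used and is passed arguments that best align with the model present in the workflow.

Note that the preprocessing done here is separate from any preprocessing that might be done by the underlying model. For example, if a blueprint with indicators = "none" is specified, hardhat will not create dummy variables, but if the underlying model requires a formula interface that internally uses stats::model.matrix(), the model will still expand factors into dummy variables.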
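The effect of the indicators argument can be seen without fitting any model by calling hardhat::mold() directly. The following is a minimal sketch; the data frame dat and its columns are made up for illustration and are not part of the original documentation.

```r
library(hardhat)

# Small illustrative data set with one factor predictor (hypothetical data).
dat <- data.frame(
  y = c(1.2, 2.3, 3.1, 4.8),
  f = factor(c("a", "b", "c", "a")),
  x = c(10, 20, 30, 40)
)

# Default formula blueprint: hardhat expands the factor into indicator columns.
traditional <- mold(y ~ f + x, dat)
names(traditional$predictors)

# indicators = "none": hardhat passes the factor column through unchanged.
untouched <- mold(
  y ~ f + x, dat,
  blueprint = default_formula_blueprint(indicators = "none")
)
names(untouched$predictors)
```

In the second call the predictors still contain the factor column f, so any dummy-variable expansion is left to whatever model receives the data.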

Value

x, updated with either a new or removed formula preprocessor.

Details

To fit a workflow, exactly one of add_formula(), add_recipe(), or add_variables() must be specified.
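The requirement above can be checked with a small sketch. It assumes, based on the behaviour of recent workflows releases, that both a missing preprocessor and a second preprocessor raise an error; the exact error messages are not reproduced here because they may differ between versions. The mtcars data set is used only for illustration.

```r
library(parsnip)
library(workflows)
library(magrittr)

lm_mod <- linear_reg() %>% set_engine("lm")

# A model but no preprocessor: fitting is expected to fail.
no_pre <- workflow() %>% add_model(lm_mod)
no_pre_result <- try(fit(no_pre, data = mtcars), silent = TRUE)
inherits(no_pre_result, "try-error")

# A formula is already present, so adding variables is expected to fail too.
with_formula <- no_pre %>% add_formula(mpg ~ cyl)
two_pre_result <- try(
  add_variables(with_formula, outcomes = mpg, predictors = cyl),
  silent = TRUE
)
inherits(two_pre_result, "try-error")
```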

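Formula Handling

Note that the formula given to add_formula() may be handled differently for different models, depending on the parsnip model being used. For example, a random forest model fit using ranger would not convert any factor predictors to binary indicator variables. This is consistent with what ranger::ranger() would do, but inconsistent with what stats::model.matrix() would do.

The documentation for parsnip models provides details about how the data given in the formula are encoded for the model when they diverge from the standard model.matrix() methodology. Our goal is to be consistent with how the underlying model package works.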
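One way to look up these model-specific encodings is parsnip's get_encoding() helper, which returns the encoding options registered for each engine of a model type. The sketch below assumes that the returned tibble has engine, mode, and predictor_indicators columns, as in current parsnip releases; it does not require ranger to be installed.

```r
library(parsnip)

# Encoding options registered for each engine of the two model types.
rf_enc <- get_encoding("rand_forest")
lm_enc <- get_encoding("linear_reg")

# ranger is registered so that factors are not expanded into indicators ...
rf_enc[rf_enc$engine == "ranger", c("engine", "mode", "predictor_indicators")]

# ... whereas lm uses the traditional model.matrix() style encoding.
lm_enc[lm_enc$engine == "lm", c("engine", "mode", "predictor_indicators")]
```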

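How is this formula used?

To demonstrate, the example below uses lm() to fit a model. The formula given to add_formula() is used to create the model matrix, and that matrix is what gets passed to lm() along with a simple formula of the form outcome ~ . (printed as ..y ~ . in the call below):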

library(parsnip)
library(workflows)
library(magrittr)
library(modeldata)
library(hardhat)

data(penguins)

lm_mod <- linear_reg() %>% 
  set_engine("lm")

lm_wflow <- workflow() %>% 
  add_model(lm_mod)

pre_encoded <- lm_wflow %>% 
  add_formula(body_mass_g ~ species + island + bill_depth_mm) %>% 
  fit(data = penguins)

pre_encoded_parsnip_fit <- pre_encoded %>% 
  extract_fit_parsnip()

pre_encoded_fit <- pre_encoded_parsnip_fit$fit

# The `lm()` formula is *not* the same as the `add_formula()` formula: 
pre_encoded_fit

## 
## Call:
## stats::lm(formula = ..y ~ ., data = data)
## 
## Coefficients:
##      (Intercept)  speciesChinstrap     speciesGentoo  
##        -1009.943             1.328          2236.865  
##      islandDream   islandTorgersen     bill_depth_mm  
##            9.221           -18.433           256.913

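This can affect how the results are analyzed. For example, in the sequential hypothesis tests, each individual term (here, each indicator column) is tested separately: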

anova(pre_encoded_fit)

## Analysis of Variance Table
## 
## Response: ..y
##                   Df    Sum Sq   Mean Sq  F value Pr(>F)    
## speciesChinstrap   1  18642821  18642821 141.1482 <2e-16 ***
## speciesGentoo      1 128221393 128221393 970.7875 <2e-16 ***
## islandDream        1     13399     13399   0.1014 0.7503    
## islandTorgersen    1       255       255   0.0019 0.9650    
## bill_depth_mm      1  28051023  28051023 212.3794 <2e-16 ***
## Residuals        336  44378805    132080                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

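Overriding the default encodings

Users can override the model-specific encodings by using a hardhat blueprint. The blueprint can specify how factors are encoded and whether an intercept is included. For example, if you use a formula and would like the data to be passed to the model untouched: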

minimal <- default_formula_blueprint(indicators = "none", intercept = FALSE)

un_encoded <- lm_wflow %>% 
  add_formula(
    body_mass_g ~ species + island + bill_depth_mm, 
    blueprint = minimal
  ) %>% 
  fit(data = penguins)

un_encoded_parsnip_fit <- un_encoded %>% 
  extract_fit_parsnip()

un_encoded_fit <- un_encoded_parsnip_fit$fit

un_encoded_fit

## 
## Call:
## stats::lm(formula = ..y ~ ., data = data)
## 
## Coefficients:
##      (Intercept)     bill_depth_mm  speciesChinstrap  
##        -1009.943           256.913             1.328  
##    speciesGentoo       islandDream   islandTorgersen  
##         2236.865             9.221           -18.433

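Although the coefficients look the same, the raw columns were given to lm() and that function created the dummy variables. Because of this, the sequential ANOVA tests groups of parameters to get column-level p-values: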

anova(un_encoded_fit)

## Analysis of Variance Table
## 
## Response: ..y
##                Df    Sum Sq  Mean Sq F value Pr(>F)    
## bill_depth_mm   1  48840779 48840779 369.782 <2e-16 ***
## species         2 126067249 63033624 477.239 <2e-16 ***
## island          2     20864    10432   0.079 0.9241    
## Residuals     336  44378805   132080                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
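To confirm which columns a workflow handed to the model under each blueprint, the preprocessed data can be pulled out of a fitted workflow with extract_mold(). This is a minimal, self-contained sketch that uses the built-in iris data instead of penguins; the object names encoded and raw are illustrative.

```r
library(parsnip)
library(workflows)
library(magrittr)
library(hardhat)

lm_wflow <- workflow() %>%
  add_model(linear_reg() %>% set_engine("lm"))

# Default blueprint: workflows picks the encoding that matches the model.
encoded <- lm_wflow %>%
  add_formula(Sepal.Length ~ Species + Sepal.Width) %>%
  fit(data = iris)

# Blueprint that leaves factor columns untouched.
raw <- lm_wflow %>%
  add_formula(
    Sepal.Length ~ Species + Sepal.Width,
    blueprint = default_formula_blueprint(indicators = "none", intercept = FALSE)
  ) %>%
  fit(data = iris)

# Column names of the predictors that workflows prepared for the model.
names(extract_mold(encoded)$predictors)
names(extract_mold(raw)$predictors)
```

With the default blueprint the Species factor shows up as indicator columns, while the second workflow keeps a single Species column and leaves the expansion to lm().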

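Overriding the default model formula

Additionally, the formula passed to the underlying model can also be customized. In this case, the formula argument of add_model() can be used. To demonstrate, a spline function will be used for bill depth: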

library(splines)

custom_formula <- workflow() %>%
  add_model(
    lm_mod, 
    formula = body_mass_g ~ species + island + ns(bill_depth_mm, 3)
  ) %>% 
  add_formula(
    body_mass_g ~ species + island + bill_depth_mm, 
    blueprint = minimal
  ) %>% 
  fit(data = penguins)

custom_parsnip_fit <- custom_formula %>% 
  extract_fit_parsnip()

custom_fit <- custom_parsnip_fit$fit

custom_fit

## 
## Call:
## stats::lm(formula = body_mass_g ~ species + island + ns(bill_depth_mm, 
##     3), data = data)
## 
## Coefficients:
##           (Intercept)       speciesChinstrap          speciesGentoo  
##              1959.090                  8.534               2352.137  
##           islandDream        islandTorgersen  ns(bill_depth_mm, 3)1  
##                 2.425                -12.002               1476.386  
## ns(bill_depth_mm, 3)2  ns(bill_depth_mm, 3)3  
##              3187.839               1686.996

Altering the formula

Finally, when a formula is updated or removed from a fitted workflow, the corresponding model fit is removed.

custom_formula_no_fit <- update_formula(custom_formula, body_mass_g ~ species)

try(extract_fit_parsnip(custom_formula_no_fit))

## Error in extract_fit_parsnip(custom_formula_no_fit) : 
##   Can't extract a model fit from an untrained workflow.
## i Do you need to call `fit()`?
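After such an update the workflow has to be fit again before a model can be extracted. The sketch below illustrates the full cycle on the built-in mtcars data; the object names are illustrative, and is_trained_workflow() and extract_fit_engine() are assumed to be available, as they are in recent workflows releases.

```r
library(parsnip)
library(workflows)
library(magrittr)

wf_fit <- workflow() %>%
  add_model(linear_reg() %>% set_engine("lm")) %>%
  add_formula(mpg ~ cyl) %>%
  fit(data = mtcars)

# The workflow is trained after fit().
is_trained_workflow(wf_fit)

# Updating the formula discards the existing fit.
wf_updated <- update_formula(wf_fit, mpg ~ disp)
is_trained_workflow(wf_updated)

# Refit to obtain a model based on the new formula.
wf_refit <- fit(wf_updated, data = mtcars)
names(coef(extract_fit_engine(wf_refit)))
```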

Examples

workflow <- workflow()
workflow <- add_formula(workflow, mpg ~ cyl)
workflow
#> ══ Workflow ══════════════════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: None
#> 
#> ── Preprocessor ──────────────────────────────────────────────────────────
#> mpg ~ cyl

remove_formula(workflow)
#> ══ Workflow ══════════════════════════════════════════════════════════════
#> Preprocessor: None
#> Model: None

update_formula(workflow, mpg ~ disp)
#> ══ Workflow ══════════════════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: None
#> 
#> ── Preprocessor ──────────────────────────────────────────────────────────
#> mpg ~ disp

Source code: R/pre-action-formula.R

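Note: This article was curated by 純淨天空 from the original English work "Add formula terms to a workflow" by Davis Vaughan et al. Unless otherwise stated, copyright in the original code belongs to the original authors. Please do not reproduce or copy this translation without permission or authorization.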