R workflows workflow: Create a workflow

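A workflow is a container object that aggregates the information required to fit a model and predict from it. This information might be a recipe used in preprocessing, specified through add_recipe(), or the model specification to fit, specified through add_model().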
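As a minimal sketch of what "aggregating" means in practice, the following bundles a formula and a model specification, fits them together, and predicts on new rows. The mtcars data set and the chosen predictors are illustrative assumptions and are not part of the original page.

```r
library(workflows)
library(parsnip)

# Bundle a formula preprocessor and a linear regression specification.
wf <- workflow() %>%
  add_formula(mpg ~ wt + hp) %>%
  add_model(linear_reg() %>% set_engine("lm"))

# Fitting returns a trained workflow; the formula is applied to the data first.
wf_fit <- fit(wf, data = mtcars)

# predict() reuses the stored preprocessing on the new data.
preds <- predict(wf_fit, new_data = head(mtcars))
preds
```

Because the preprocessing travels with the model, the same object can be passed to predict() without repeating the formula.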

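The preprocessor and spec arguments let you add components to a workflow quickly, without going through the add_*() functions such as add_recipe() or add_model(). However, if you need to control any of the optional arguments to those functions, such as the blueprint or the model formula, you should use the add_*() functions directly.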
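The difference between the shortcut and the add_*() functions can be sketched as follows. The iris data set and the one-hot blueprint are illustrative assumptions; hardhat::default_formula_blueprint() is used only to show where an optional argument would go.

```r
library(workflows)
library(parsnip)

spec <- linear_reg() %>% set_engine("lm")

# Shortcut: pass the preprocessor and the model specification directly.
wf_short <- workflow(Sepal.Length ~ Species + Sepal.Width, spec)

# Long form that adds the same two components with the add_*() functions.
wf_long <- workflow() %>%
  add_formula(Sepal.Length ~ Species + Sepal.Width) %>%
  add_model(spec)

# The long form is required to set optional arguments such as `blueprint`.
# The one-hot encoding requested here is only an illustrative choice.
wf_custom <- workflow() %>%
  add_formula(
    Sepal.Length ~ Species + Sepal.Width,
    blueprint = hardhat::default_formula_blueprint(indicators = "one_hot")
  ) %>%
  add_model(spec)

wf_custom
```

The shortcut keeps simple cases short, while anything that needs a non-default blueprint has to go through add_formula(), add_recipe(), or add_variables().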

Usage

workflow(preprocessor = NULL, spec = NULL)

Arguments

preprocessor

An optional preprocessor to add to the workflow. One of:

A formula, passed on to add_formula().
A recipe, passed on to add_recipe().
A workflow_variables() object, passed on to add_variables().

spec

An optional parsnip model specification to add to the workflow. Passed on to add_model().

Value

A new workflow object.

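Indicator Variable Details

Some modeling functions in R create indicator/dummy variables from categorical data when you use a model formula, and some do not. When you specify and fit a model with a workflow(), parsnip and workflows match and reproduce the underlying behavior of the user-specified model's computational engine.

Formula Preprocessor

In the modeldata::Sacramento data set of real estate prices, the type variable has three levels: "Residential", "Condo", and "Multi-Family". This base workflow() contains a formula added via add_formula() to predict property price from property type, square footage, number of beds, and number of baths: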

set.seed(123)

library(parsnip)
library(recipes)
library(workflows)
library(modeldata)

data("Sacramento")

base_wf <- workflow() %>%
  add_formula(price ~ type + sqft + beds + baths)

This first model does create dummy/indicator variables:

lm_spec <- linear_reg() %>%
  set_engine("lm")

base_wf %>%
  add_model(lm_spec) %>%
  fit(Sacramento)

## == Workflow [trained] ================================================
## Preprocessor: Formula
## Model: linear_reg()
## 
## -- Preprocessor ------------------------------------------------------
## price ~ type + sqft + beds + baths
## 
## -- Model -------------------------------------------------------------
## 
## Call:
## stats::lm(formula = ..y ~ ., data = data)
## 
## Coefficients:
##      (Intercept)  typeMulti_Family   typeResidential  
##          32919.4          -21995.8           33688.6  
##             sqft              beds             baths  
##            156.2          -29788.0            8730.0

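There are five independent variables in the fitted model for this OLS linear regression. With this model type and engine, the factor predictor type of the real estate properties was converted to two binary predictors, typeMulti_Family and typeResidential. (The third type, for condos, does not need its own column because it is the baseline level.)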
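One way to check this conversion is to look at the preprocessed training data stored in the fitted workflow. The sketch below rebuilds the same workflow and uses extract_mold() to list the predictor columns derived from type. The object names lm_fit and lm_predictors are hypothetical.

```r
library(parsnip)
library(workflows)
library(modeldata)

data("Sacramento")

lm_fit <- workflow() %>%
  add_formula(price ~ type + sqft + beds + baths) %>%
  add_model(linear_reg() %>% set_engine("lm")) %>%
  fit(Sacramento)

# The "mold" holds the preprocessed training data handed to the engine.
lm_predictors <- extract_mold(lm_fit)$predictors

# Columns derived from `type`; the original factor column is gone.
grep("^type", names(lm_predictors), value = TRUE)
```

The listed columns should correspond to the type coefficients in the printed model.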

This second model does not create dummy/indicator variables:

rf_spec <- rand_forest() %>%
  set_mode("regression") %>%
  set_engine("ranger")

base_wf %>%
  add_model(rf_spec) %>%
  fit(Sacramento)

## == Workflow [trained] ================================================
## Preprocessor: Formula
## Model: rand_forest()
## 
## -- Preprocessor ------------------------------------------------------
## price ~ type + sqft + beds + baths
## 
## -- Model -------------------------------------------------------------
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, num.threads = 1,      verbose = FALSE, seed = sample.int(10^5, 1)) 
## 
## Type:                             Regression 
## Number of trees:                  500 
## Sample size:                      932 
## Number of independent variables:  4 
## Mtry:                             2 
## Target node size:                 5 
## Variable importance mode:         none 
## Splitrule:                        variance 
## OOB prediction error (MSE):       7058847504 
## R squared (OOB):                  0.5894647

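Note that there are four independent variables in the fitted model for this ranger random forest. With this model type and engine, indicator variables were not created for the type of real estate property being sold. Tree-based models such as random forest models can handle factor predictors directly, and don't need any conversion to numeric binary variables.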
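The same inspection can be sketched for the random forest. Here the stored predictors should still contain type as a single factor column. The object names rf_fit and rf_predictors are hypothetical, and the ranger package is assumed to be installed.

```r
library(parsnip)
library(workflows)
library(modeldata)

data("Sacramento")

rf_fit <- workflow() %>%
  add_formula(price ~ type + sqft + beds + baths) %>%
  add_model(
    rand_forest() %>%
      set_mode("regression") %>%
      set_engine("ranger")
  ) %>%
  fit(Sacramento)

# Preprocessed training data handed to ranger.
rf_predictors <- extract_mold(rf_fit)$predictors

names(rf_predictors)

# `type` is passed through as a factor rather than expanded into indicators.
is.factor(rf_predictors$type)
```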

Recipe Preprocessor

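When you specify a model with a workflow() and a recipe preprocessor via add_recipe(), the recipe controls whether dummy variables are created or not; the recipe overrides any underlying behavior from the model's computational engine.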
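As a hedged illustration of that override, the sketch below gives the ranger model a recipe that explicitly calls step_dummy() on type. The recipe and the object names (rec, rf_rec_fit, rec_predictors) are assumptions made for this example. With the recipe in control, the random forest receives indicator columns even though the engine would not have created them from a formula.

```r
library(parsnip)
library(recipes)
library(workflows)
library(modeldata)

data("Sacramento")

# The recipe, not the engine, decides that `type` becomes dummy variables.
rec <- recipe(price ~ type + sqft + beds + baths, data = Sacramento) %>%
  step_dummy(type)

rf_spec <- rand_forest() %>%
  set_mode("regression") %>%
  set_engine("ranger")

rf_rec_fit <- workflow() %>%
  add_recipe(rec) %>%
  add_model(rf_spec) %>%
  fit(Sacramento)

# Predictors handed to ranger after the recipe has been applied.
rec_predictors <- extract_mold(rf_rec_fit)$predictors
names(rec_predictors)

# Number of predictors ranger itself reports for the fitted forest.
extract_fit_engine(rf_rec_fit)$num.independent.variables
```

Leaving step_dummy() out of the recipe would pass type to ranger as a single factor column instead.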

Examples

library(workflows)
library(parsnip)
library(recipes)
library(magrittr)
library(modeldata)

data("attrition")

model <- logistic_reg() %>%
  set_engine("glm")

formula <- Attrition ~ BusinessTravel + YearsSinceLastPromotion + OverTime

wf_formula <- workflow(formula, model)

fit(wf_formula, attrition)
#> ══ Workflow [trained] ════════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: logistic_reg()
#> 
#> ── Preprocessor ──────────────────────────────────────────────────────────
#> Attrition ~ BusinessTravel + YearsSinceLastPromotion + OverTime
#> 
#> ── Model ─────────────────────────────────────────────────────────────────
#> 
#> Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
#> 
#> Coefficients:
#>                     (Intercept)  BusinessTravelTravel_Frequently  
#>                        -2.82571                          1.29473  
#>     BusinessTravelTravel_Rarely          YearsSinceLastPromotion  
#>                         0.64727                         -0.03092  
#>                     OverTimeYes  
#>                         1.31904  
#> 
#> Degrees of Freedom: 1469 Total (i.e. Null);  1465 Residual
#> Null Deviance:	    1299 
#> Residual Deviance: 1194 	AIC: 1204

recipe <- recipe(Attrition ~ ., attrition) %>%
  step_dummy(all_nominal(), -Attrition) %>%
  step_corr(all_predictors(), threshold = 0.8)

wf_recipe <- workflow(recipe, model)

fit(wf_recipe, attrition)
#> ══ Workflow [trained] ════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: logistic_reg()
#> 
#> ── Preprocessor ──────────────────────────────────────────────────────────
#> 2 Recipe Steps
#> 
#> • step_dummy()
#> • step_corr()
#> 
#> ── Model ─────────────────────────────────────────────────────────────────
#> 
#> Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
#> 
#> Coefficients:
#>                      (Intercept)                               Age  
#>                       -2.535e+00                        -3.131e-02  
#>                        DailyRate                  DistanceFromHome  
#>                       -3.126e-04                         4.927e-02  
#>                       HourlyRate                     MonthlyIncome  
#>                        2.762e-03                         1.127e-06  
#>                      MonthlyRate                NumCompaniesWorked  
#>                        1.663e-06                         1.956e-01  
#>                PercentSalaryHike                  StockOptionLevel  
#>                       -2.495e-02                        -1.968e-01  
#>                TotalWorkingYears             TrainingTimesLastYear  
#>                       -6.820e-02                        -1.863e-01  
#>                   YearsAtCompany                YearsInCurrentRole  
#>                        8.916e-02                        -1.371e-01  
#>          YearsSinceLastPromotion              YearsWithCurrManager  
#>                        1.849e-01                        -1.516e-01  
#> BusinessTravel_Travel_Frequently      BusinessTravel_Travel_Rarely  
#>                        1.940e+00                         1.080e+00  
#>                      Education_1                       Education_2  
#>                       -1.391e-01                        -2.753e-01  
#>                      Education_3                       Education_4  
#>                       -7.324e-02                         3.858e-02  
#>     EducationField_Life_Sciences          EducationField_Marketing  
#>                       -6.939e-01                        -2.212e-01  
#>           EducationField_Medical              EducationField_Other  
#>                       -7.210e-01                        -6.755e-01  
#>  EducationField_Technical_Degree         EnvironmentSatisfaction_1  
#>                        2.936e-01                        -9.501e-01  
#>        EnvironmentSatisfaction_2         EnvironmentSatisfaction_3  
#>                        4.383e-01                        -2.491e-01  
#>                      Gender_Male                  JobInvolvement_1  
#>                        4.243e-01                        -1.474e+00  
#>                 JobInvolvement_2                  JobInvolvement_3  
#>                        2.297e-01                        -2.855e-01  
#>          JobRole_Human_Resources     JobRole_Laboratory_Technician  
#>                        1.441e+00                         1.549e+00  
#>                  JobRole_Manager    JobRole_Manufacturing_Director  
#>                        1.900e-01                         3.726e-01  
#>        JobRole_Research_Director        JobRole_Research_Scientist  
#>                       -9.581e-01                         6.055e-01  
#>          JobRole_Sales_Executive      JobRole_Sales_Representative  
#>                        1.056e+00                         2.149e+00  
#>                JobSatisfaction_1                 JobSatisfaction_2  
#>                       -9.446e-01                        -8.929e-03  
#>                JobSatisfaction_3             MaritalStatus_Married  
#>                       -2.860e-01                         3.135e-01  
#> 
#> ...
#> and 14 more lines.

variables <- workflow_variables(
  Attrition,
  c(BusinessTravel, YearsSinceLastPromotion, OverTime)
)

wf_variables <- workflow(variables, model)

fit(wf_variables, attrition)
#> ══ Workflow [trained] ════════════════════════════════════════════════════
#> Preprocessor: Variables
#> Model: logistic_reg()
#> 
#> ── Preprocessor ──────────────────────────────────────────────────────────
#> Outcomes: Attrition
#> Predictors: c(BusinessTravel, YearsSinceLastPromotion, OverTime)
#> 
#> ── Model ─────────────────────────────────────────────────────────────────
#> 
#> Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
#> 
#> Coefficients:
#>                     (Intercept)  BusinessTravelTravel_Frequently  
#>                        -2.82571                          1.29473  
#>     BusinessTravelTravel_Rarely          YearsSinceLastPromotion  
#>                         0.64727                         -0.03092  
#>                     OverTimeYes  
#>                         1.31904  
#> 
#> Degrees of Freedom: 1469 Total (i.e. Null);  1465 Residual
#> Null Deviance:	    1299 
#> Residual Deviance: 1194 	AIC: 1204

Source code: R/workflow.R

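Note: This article was compiled by 純淨天空 from the original English work "Create a workflow" by Davis Vaughan et al. Unless otherwise stated, copyright in the original code belongs to the original authors; do not reproduce or copy this translation without permission or authorization.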