R workflows workflow: Create a workflow

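A workflow is a container object that aggregates the information required to fit a model and predict from it. This information might be a recipe used in preprocessing, specified through add_recipe(), or the model specification to fit, specified through add_model().

The preprocessor and spec arguments let you add components to a workflow quickly, without having to go through the add_*() functions such as add_recipe() or add_model(). However, if you need to control any of the optional arguments to those functions, such as the blueprint or the model formula, you should use the add_*() functions directly.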
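As a minimal sketch of the two styles, the code below builds the same workflow once through the preprocessor and spec arguments and once through add_formula() and add_model(). It then shows a case where add_model() is needed because its optional formula argument is used. The mtcars data and the names wf_short, wf_long, and wf_custom are illustrative choices, not part of the original documentation.

```r
library(parsnip)
library(workflows)

lm_spec <- linear_reg() %>%
  set_engine("lm")

# Illustrative formula on the built-in mtcars data (an assumption for this sketch).
f <- mpg ~ wt + hp

# Shorthand: pass the preprocessor and the model specification directly.
wf_short <- workflow(preprocessor = f, spec = lm_spec)

# Equivalent long form built with the add_*() functions.
wf_long <- workflow() %>%
  add_formula(f) %>%
  add_model(lm_spec)

# add_model() is called directly here because its optional `formula`
# argument gives the model a formula separate from the preprocessor.
wf_custom <- workflow() %>%
  add_variables(outcomes = mpg, predictors = c(wt, hp)) %>%
  add_model(lm_spec, formula = mpg ~ wt + I(hp^2))

fit_short <- fit(wf_short, mtcars)
fit_long <- fit(wf_long, mtcars)
fit_custom <- fit(wf_custom, mtcars)

# The shorthand and the long form should give the same coefficients.
all.equal(
  coef(extract_fit_engine(fit_short)),
  coef(extract_fit_engine(fit_long))
)
```

The shorthand is convenient for the common case; the long form is the one to reach for as soon as a blueprint or a model formula has to be supplied.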

Usage

workflow(preprocessor = NULL, spec = NULL)

Arguments

preprocessor

An optional preprocessor to add to the workflow. One of:

A formula, passed on to add_formula().
A recipe, passed on to add_recipe().
A workflow_variables() object, passed on to add_variables().

spec

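An optional parsnip model specification to add to the workflow. Passed on to add_model().

Value

A new workflow object.

Indicator Variable Details

When you use a model formula, some modeling functions in R create indicator/dummy variables from categorical data and some do not. When you specify and fit a model with a workflow(), parsnip and workflows match and reproduce the underlying behavior of the computational engine of the model you specified.

Formula Preprocessor

In the modeldata::Sacramento data set of real estate prices, the type variable has three levels: "Residential", "Condo", and "Multi-Family". This base workflow() contains a formula added via add_formula() to predict property price from property type, square footage, number of bedrooms, and number of bathrooms: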

set.seed(123)

library(parsnip)
library(recipes)
library(workflows)
library(modeldata)

data("Sacramento")

base_wf <- workflow() %>%
  add_formula(price ~ type + sqft + beds + baths)

The first model does create dummy/indicator variables:

lm_spec <- linear_reg() %>%
  set_engine("lm")

base_wf %>%
  add_model(lm_spec) %>%
  fit(Sacramento)

## == Workflow [trained] ================================================
## Preprocessor: Formula
## Model: linear_reg()
## 
## -- Preprocessor ------------------------------------------------------
## price ~ type + sqft + beds + baths
## 
## -- Model -------------------------------------------------------------
## 
## Call:
## stats::lm(formula = ..y ~ ., data = data)
## 
## Coefficients:
##      (Intercept)  typeMulti_Family   typeResidential  
##          32919.4          -21995.8           33688.6  
##             sqft              beds             baths  
##            156.2          -29788.0            8730.0

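There are five independent variables in the fitted model for this OLS linear regression. With this model type and engine, the factor predictor type of the real estate properties was converted to two binary predictors, typeMulti_Family and typeResidential. (The third type, for condos, does not need its own column because it is the baseline level.)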
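The encoding described here can be reproduced in base R. The sketch below uses a small made-up data frame (the values are illustrative, not taken from Sacramento) and calls model.matrix(), which applies the same traditional indicator encoding that lm() relies on.

```r
# Toy data with the three property types named in the text (illustrative values).
homes <- data.frame(
  type = factor(c("Condo", "Multi_Family", "Residential", "Residential")),
  sqft = c(800, 1500, 1200, 2000)
)

# Build the design matrix that a formula-based linear model would use.
X <- model.matrix(~ type + sqft, data = homes)

colnames(X)
#> [1] "(Intercept)"      "typeMulti_Family" "typeResidential"  "sqft"
```

The first factor level, "Condo", is absorbed into the intercept, which is why a three-level factor yields only two indicator columns.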

The second model does not create dummy/indicator variables:

rf_spec <- rand_forest() %>%
  set_mode("regression") %>%
  set_engine("ranger")

base_wf %>%
  add_model(rf_spec) %>%
  fit(Sacramento)

## == Workflow [trained] ================================================
## Preprocessor: Formula
## Model: rand_forest()
## 
## -- Preprocessor ------------------------------------------------------
## price ~ type + sqft + beds + baths
## 
## -- Model -------------------------------------------------------------
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, num.threads = 1,      verbose = FALSE, seed = sample.int(10^5, 1)) 
## 
## Type:                             Regression 
## Number of trees:                  500 
## Sample size:                      932 
## Number of independent variables:  4 
## Mtry:                             2 
## Target node size:                 5 
## Variable importance mode:         none 
## Splitrule:                        variance 
## OOB prediction error (MSE):       7058847504 
## R squared (OOB):                  0.5894647

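Note that there are four independent variables in the fitted model for this ranger random forest. With this model type and engine, indicator variables were not created for the type of real estate property being sold. Tree-based models such as random forests can handle factor predictors directly and do not need any conversion to numeric binary variables.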
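One way to check this difference programmatically, rather than by reading the printed output, is to pull the underlying engine fits out of both workflows and compare the predictor names each engine received. This is a sketch that assumes the ranger package is installed and that the fitted ranger object keeps its default forest component; lm_fit and rf_fit are names introduced here for illustration.

```r
library(parsnip)
library(workflows)
library(modeldata)

data("Sacramento")

base_wf <- workflow() %>%
  add_formula(price ~ type + sqft + beds + baths)

lm_fit <- base_wf %>%
  add_model(linear_reg() %>% set_engine("lm")) %>%
  fit(Sacramento)

rf_fit <- base_wf %>%
  add_model(rand_forest() %>% set_mode("regression") %>% set_engine("ranger")) %>%
  fit(Sacramento)

# Coefficient names of the lm fit: `type` appears as indicator columns.
lm_terms <- names(coef(extract_fit_engine(lm_fit)))
lm_terms

# Predictor names stored by ranger: `type` is kept as a single factor column.
rf_terms <- extract_fit_engine(rf_fit)$forest$independent.variable.names
rf_terms
```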

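Recipe Preprocessor

When you specify a model with a workflow() and a recipe preprocessor via add_recipe(), the recipe controls whether dummy variables are created or not; the recipe overrides any underlying behavior of the model's computational engine.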
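To illustrate, the sketch below pairs the ranger random forest with a recipe that explicitly calls step_dummy() on type. Under this setup the engine should receive indicator columns even though ranger would not create them on its own. The names dummy_rec and rf_rec_fit are hypothetical, and the ranger package is assumed to be installed.

```r
library(parsnip)
library(recipes)
library(workflows)
library(modeldata)

data("Sacramento")

rf_spec <- rand_forest() %>%
  set_mode("regression") %>%
  set_engine("ranger")

# The recipe, not the engine, decides that `type` becomes indicator columns.
dummy_rec <- recipe(price ~ type + sqft + beds + baths, data = Sacramento) %>%
  step_dummy(type)

rf_rec_fit <- workflow() %>%
  add_recipe(dummy_rec) %>%
  add_model(rf_spec) %>%
  fit(Sacramento)

# Predictor names as seen by ranger after the recipe has been applied.
rec_terms <- extract_fit_engine(rf_rec_fit)$forest$independent.variable.names
rec_terms
```

Leaving step_dummy() out of the recipe would pass type through as a factor, which ranger also accepts, so the choice is entirely in the recipe author's hands.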

Examples

library(parsnip)
library(recipes)
library(workflows)
library(magrittr)
library(modeldata)

data("attrition")

model <- logistic_reg() %>%
  set_engine("glm")

formula <- Attrition ~ BusinessTravel + YearsSinceLastPromotion + OverTime

wf_formula <- workflow(formula, model)

fit(wf_formula, attrition)
#> ══ Workflow [trained] ════════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: logistic_reg()
#> 
#> ── Preprocessor ──────────────────────────────────────────────────────────
#> Attrition ~ BusinessTravel + YearsSinceLastPromotion + OverTime
#> 
#> ── Model ─────────────────────────────────────────────────────────────────
#> 
#> Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
#> 
#> Coefficients:
#>                     (Intercept)  BusinessTravelTravel_Frequently  
#>                        -2.82571                          1.29473  
#>     BusinessTravelTravel_Rarely          YearsSinceLastPromotion  
#>                         0.64727                         -0.03092  
#>                     OverTimeYes  
#>                         1.31904  
#> 
#> Degrees of Freedom: 1469 Total (i.e. Null);  1465 Residual
#> Null Deviance:	    1299 
#> Residual Deviance: 1194 	AIC: 1204

recipe <- recipe(Attrition ~ ., attrition) %>%
  step_dummy(all_nominal(), -Attrition) %>%
  step_corr(all_predictors(), threshold = 0.8)

wf_recipe <- workflow(recipe, model)

fit(wf_recipe, attrition)
#> ══ Workflow [trained] ════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: logistic_reg()
#> 
#> ── Preprocessor ──────────────────────────────────────────────────────────
#> 2 Recipe Steps
#> 
#> • step_dummy()
#> • step_corr()
#> 
#> ── Model ─────────────────────────────────────────────────────────────────
#> 
#> Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
#> 
#> Coefficients:
#>                      (Intercept)                               Age  
#>                       -2.535e+00                        -3.131e-02  
#>                        DailyRate                  DistanceFromHome  
#>                       -3.126e-04                         4.927e-02  
#>                       HourlyRate                     MonthlyIncome  
#>                        2.762e-03                         1.127e-06  
#>                      MonthlyRate                NumCompaniesWorked  
#>                        1.663e-06                         1.956e-01  
#>                PercentSalaryHike                  StockOptionLevel  
#>                       -2.495e-02                        -1.968e-01  
#>                TotalWorkingYears             TrainingTimesLastYear  
#>                       -6.820e-02                        -1.863e-01  
#>                   YearsAtCompany                YearsInCurrentRole  
#>                        8.916e-02                        -1.371e-01  
#>          YearsSinceLastPromotion              YearsWithCurrManager  
#>                        1.849e-01                        -1.516e-01  
#> BusinessTravel_Travel_Frequently      BusinessTravel_Travel_Rarely  
#>                        1.940e+00                         1.080e+00  
#>                      Education_1                       Education_2  
#>                       -1.391e-01                        -2.753e-01  
#>                      Education_3                       Education_4  
#>                       -7.324e-02                         3.858e-02  
#>     EducationField_Life_Sciences          EducationField_Marketing  
#>                       -6.939e-01                        -2.212e-01  
#>           EducationField_Medical              EducationField_Other  
#>                       -7.210e-01                        -6.755e-01  
#>  EducationField_Technical_Degree         EnvironmentSatisfaction_1  
#>                        2.936e-01                        -9.501e-01  
#>        EnvironmentSatisfaction_2         EnvironmentSatisfaction_3  
#>                        4.383e-01                        -2.491e-01  
#>                      Gender_Male                  JobInvolvement_1  
#>                        4.243e-01                        -1.474e+00  
#>                 JobInvolvement_2                  JobInvolvement_3  
#>                        2.297e-01                        -2.855e-01  
#>          JobRole_Human_Resources     JobRole_Laboratory_Technician  
#>                        1.441e+00                         1.549e+00  
#>                  JobRole_Manager    JobRole_Manufacturing_Director  
#>                        1.900e-01                         3.726e-01  
#>        JobRole_Research_Director        JobRole_Research_Scientist  
#>                       -9.581e-01                         6.055e-01  
#>          JobRole_Sales_Executive      JobRole_Sales_Representative  
#>                        1.056e+00                         2.149e+00  
#>                JobSatisfaction_1                 JobSatisfaction_2  
#>                       -9.446e-01                        -8.929e-03  
#>                JobSatisfaction_3             MaritalStatus_Married  
#>                       -2.860e-01                         3.135e-01  
#> 
#> ...
#> and 14 more lines.

variables <- workflow_variables(
  Attrition,
  c(BusinessTravel, YearsSinceLastPromotion, OverTime)
)

wf_variables <- workflow(variables, model)

fit(wf_variables, attrition)
#> ══ Workflow [trained] ════════════════════════════════════════════════════
#> Preprocessor: Variables
#> Model: logistic_reg()
#> 
#> ── Preprocessor ──────────────────────────────────────────────────────────
#> Outcomes: Attrition
#> Predictors: c(BusinessTravel, YearsSinceLastPromotion, OverTime)
#> 
#> ── Model ─────────────────────────────────────────────────────────────────
#> 
#> Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
#> 
#> Coefficients:
#>                     (Intercept)  BusinessTravelTravel_Frequently  
#>                        -2.82571                          1.29473  
#>     BusinessTravelTravel_Rarely          YearsSinceLastPromotion  
#>                         0.64727                         -0.03092  
#>                     OverTimeYes  
#>                         1.31904  
#> 
#> Degrees of Freedom: 1469 Total (i.e. Null);  1465 Residual
#> Null Deviance:	    1299 
#> Residual Deviance: 1194 	AIC: 1204
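
Because a workflow bundles preprocessing with the model, a fitted workflow can be passed straight to predict() with raw new data. The following sketch refits the formula-based workflow from above and predicts for the first five rows of attrition; new_rows and wf_fit are names introduced here for illustration, and no output is shown because it was not part of the original page.

```r
library(parsnip)
library(workflows)
library(modeldata)

data("attrition")

wf_fit <- workflow(
  Attrition ~ BusinessTravel + YearsSinceLastPromotion + OverTime,
  logistic_reg() %>% set_engine("glm")
) %>%
  fit(attrition)

# Raw rows go in; the workflow applies the formula preprocessing itself.
new_rows <- head(attrition, 5)

pred_class <- predict(wf_fit, new_data = new_rows)
pred_prob <- predict(wf_fit, new_data = new_rows, type = "prob")

pred_class
pred_prob
```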

Source code: R/workflow.R


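Note: This article was selected and compiled by 纯净天空 from the original English work "Create a workflow" by Davis Vaughan and others. Unless otherwise stated, copyright in the original code belongs to the original authors; please do not reproduce or copy this translation without permission or authorization.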