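R recipes prep: Estimate a preprocessing recipe

For a recipe with at least one preprocessing operation, estimate the required parameters from a training set so that they can later be applied to other data sets.

Usage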

prep(x, ...)

# S3 method for recipe
prep(
  x,
  training = NULL,
  fresh = FALSE,
  verbose = FALSE,
  retain = TRUE,
  log_changes = FALSE,
  strings_as_factors = TRUE,
  ...
)

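Arguments

x: An object. For the method shown here, a recipe.
...: Further arguments passed to or from other methods (not currently used).
training: A data frame or tibble that will be used to estimate parameters for preprocessing.
fresh: A logical indicating whether already trained operations should be re-trained. If TRUE, you should pass a data set to the argument training.
verbose: A logical that controls whether progress is reported as operations are executed.
retain: A logical: should the preprocessed training set be saved into the template slot of the recipe after training? This is a good idea if you want to add more steps later but want to avoid re-training the existing steps. Also, it is advisable to use retain = TRUE if any steps use the option skip = FALSE. Note that this can make the final recipe size large. When verbose = TRUE, a message is written with the approximate object size in memory, but it may be an underestimate since it does not take environments into account.
log_changes: A logical for printing a summary for each step regarding which (if any) columns were added or removed during training.
strings_as_factors: A logical: should character columns be converted to factors? This affects the preprocessed training set (when retain = TRUE) as well as the results of bake.recipe.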
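
As a minimal sketch of what retain = TRUE makes possible, the example below trains a small recipe and then extracts the preprocessed training set with bake(new_data = NULL). The mtcars data and the chosen columns are assumptions made only for illustration; the sketch also assumes a reasonably recent version of recipes is installed.

```r
library(recipes)

# Illustrative recipe on the built-in mtcars data (an assumption, not from the page above)
rec <- recipe(mpg ~ disp + hp, data = mtcars) %>%
  step_normalize(all_numeric_predictors())

# retain = TRUE (the default) keeps the processed training set inside the recipe
trained <- prep(rec, training = mtcars, retain = TRUE)

# new_data = NULL returns the retained, already preprocessed training set
processed <- bake(trained, new_data = NULL)
head(processed)
```

Because the processed data is stored in the recipe, no second pass over the raw training set is needed to obtain it.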

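Value

A recipe whose step objects have been updated with the required quantities (e.g. parameter estimates, model objects, etc.). Also, the term_info object is likely to be modified as the operations are executed.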

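Details

Given a data set, this function estimates the quantities and statistics required by any operations. prep() returns an updated recipe containing the estimates. If you are using a recipe as a preprocessor for modeling, we highly recommend that you use a workflow() instead of manually estimating a recipe (see the example in recipe()).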
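
A rough sketch of the workflow() route follows. It assumes the workflows and parsnip packages are installed; the mtcars data and the linear regression model are illustrative choices, not part of this page.

```r
library(recipes)
library(parsnip)
library(workflows)

# Untrained recipe; note that prep() is never called by hand
rec <- recipe(mpg ~ disp + hp, data = mtcars) %>%
  step_normalize(all_numeric_predictors())

# Bundle the recipe with a model specification
wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(linear_reg())

# fit() estimates the recipe and then the model on the same data
wf_fit <- fit(wf, data = mtcars)

# predict() applies the trained recipe to new data before predicting
preds <- predict(wf_fit, new_data = head(mtcars, 5))
preds
```

The workflow handles the estimation and application of the recipe, so the same preprocessing is used at fit time and at prediction time.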

Note that missing data is handled in the steps; there is no global na.rm option at the recipe level or in prep().
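
For instance, missing values can be addressed by adding an imputation step to the recipe. The sketch below uses step_impute_mean() on a small made-up data frame; the data and column names are hypothetical, and the step name assumes a version of recipes recent enough to provide step_impute_mean().

```r
library(recipes)

# Hypothetical toy data with missing predictor values
dat <- data.frame(
  y = c(1, 2, 3, 4, 5),
  x = c(10, NA, 30, NA, 50)
)

# Missingness is handled by a step, not by an argument to prep()
rec <- recipe(y ~ x, data = dat) %>%
  step_impute_mean(x)

trained <- prep(rec, training = dat)

# The NA entries in x are replaced by the mean of the observed values
baked <- bake(trained, new_data = NULL)
baked
```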

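Also, if a recipe has been trained using prep() and then steps are added, prep() will only update the new operations. If fresh = TRUE, all of the operations will be (re)estimated.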
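
The difference can be sketched as follows. The recipe is first trained on half of mtcars (an arbitrary split chosen for illustration), a step is added, and the recipe is prepped again with and without fresh = TRUE. The tidy() method is used here only to inspect the estimated centering means.

```r
library(recipes)

rec <- recipe(mpg ~ disp + hp, data = mtcars) %>%
  step_center(disp, hp)

# First training pass on the first 16 rows only (illustrative split)
trained <- prep(rec, training = mtcars[1:16, ])

# Add a new, untrained step to the already trained recipe
extended <- trained %>%
  step_scale(disp, hp)

# Without fresh = TRUE, only the new scaling step is estimated,
# using the retained training set from the first pass
partial <- prep(extended)

# With fresh = TRUE, every step is re-estimated on the supplied data
refit <- prep(extended, training = mtcars, fresh = TRUE)

# Centering means: from the first 16 rows vs. from all rows
means_partial <- tidy(partial, number = 1)$value
means_refit <- tidy(refit, number = 1)$value
means_partial
means_refit
```

Note that the incremental call relies on the training set retained by the first prep(), which is why retain = TRUE matters when steps are added later.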

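As the steps are executed, the training set is updated. For example, if the first step is to center the data and the second is to scale the data, the scaling step is given the centered data.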
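
A small sketch can make this sequential behavior visible. Since a standard deviation is unchanged by centering, the example swaps in a log step followed by a centering step, so that the effect of the first step shows up in the estimate of the second. The mtcars data is an illustrative assumption.

```r
library(recipes)

rec <- recipe(mpg ~ disp, data = mtcars) %>%
  step_log(disp) %>%
  step_center(disp)

trained <- prep(rec, training = mtcars)

# Mean estimated by the centering step (the second operation)
center_mean <- unname(tidy(trained, number = 2)$value)

# It matches the mean of the logged column, not of the raw column
all.equal(center_mean, mean(log(mtcars$disp)))
```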

Examples

data(ames, package = "modeldata")

library(recipes)
library(dplyr)

ames <- mutate(ames, Sale_Price = log10(Sale_Price))

ames_rec <-
  recipe(
    Sale_Price ~ Longitude + Latitude + Neighborhood + Year_Built + Central_Air,
    data = ames
  ) %>%
  step_other(Neighborhood, threshold = 0.05) %>%
  step_dummy(all_nominal()) %>%
  step_interact(~ starts_with("Central_Air"):Year_Built) %>%
  step_ns(Longitude, Latitude, deg_free = 5)

prep(ames_rec, verbose = TRUE)
#> oper 1 step other [training] 
#> oper 2 step dummy [training] 
#> oper 3 step interact [training] 
#> oper 4 step ns [training] 
#> The retained training set is ~ 0.48 Mb  in memory.
#> 
#> 
#> ── Recipe ────────────────────────────────────────────────────────────────
#> 
#> ── Inputs 
#> Number of variables by role
#> outcome:   1
#> predictor: 5
#> 
#> ── Training information 
#> Training data contained 2930 data points and no incomplete rows.
#> 
#> ── Operations 
#> • Collapsing factor levels for: Neighborhood | Trained
#> • Dummy variables from: Neighborhood, Central_Air | Trained
#> • Interactions with: Central_Air_Y:Year_Built | Trained
#> • Natural splines on: Longitude, Latitude | Trained

prep(ames_rec, log_changes = TRUE)
#> step_other (other_AAhJh): same number of columns
#> 
#> step_dummy (dummy_kTnm6): 
#>  new (9): Neighborhood_College_Creek, Neighborhood_Old_Town, ...
#>  removed (2): Neighborhood, Central_Air
#> 
#> step_interact (interact_95l5E): 
#>  new (1): Central_Air_Y_x_Year_Built
#> 
#> step_ns (ns_FzcMd): 
#>  new (10): Longitude_ns_1, Longitude_ns_2, Longitude_ns_3, ...
#>  removed (2): Longitude, Latitude
#> 
#> 
#> ── Recipe ────────────────────────────────────────────────────────────────
#> 
#> ── Inputs 
#> Number of variables by role
#> outcome:   1
#> predictor: 5
#> 
#> ── Training information 
#> Training data contained 2930 data points and no incomplete rows.
#> 
#> ── Operations 
#> • Collapsing factor levels for: Neighborhood | Trained
#> • Dummy variables from: Neighborhood, Central_Air | Trained
#> • Interactions with: Central_Air_Y:Year_Built | Trained
#> • Natural splines on: Longitude, Latitude | Trained

Source: R/recipe.R

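Note: This article was selected and compiled by 純淨天空 from the original English work "Estimate a preprocessing recipe" by Max Kuhn and others. Unless otherwise stated, copyright in the original code belongs to the original authors; do not reproduce or copy this translation without permission or authorization.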