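R hardhat default_xy_blueprint: Default XY blueprint

This page holds the details of the XY preprocessing blueprint. It is the blueprint mold() uses by default when x and y are provided separately (i.e. when the XY interface is used).

Usage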

default_xy_blueprint(
  intercept = FALSE,
  allow_novel_levels = FALSE,
  composition = "tibble"
)

# S3 method for data.frame
mold(x, y, ..., blueprint = NULL)

# S3 method for matrix
mold(x, y, ..., blueprint = NULL)

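Arguments

intercept: A logical. Should an intercept be included in the processed data? This information is used by the process function in the mold and forge function lists.
allow_novel_levels: A logical. Should novel factor levels be allowed at prediction time? This information is used by the clean function in the forge function list, and is passed on to scream().
composition: Either "tibble", "matrix", or "dgCMatrix", giving the format of the processed predictors. If "matrix" or "dgCMatrix" is chosen, all of the predictors must be numeric after the preprocessing method has been applied; otherwise an error is thrown.
x: A data frame or matrix containing the predictors.
y: A data frame, matrix, or vector containing the outcomes.
...: Not used.
blueprint: A preprocessing blueprint. If left as NULL, a default_xy_blueprint() is used.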
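The following is a minimal sketch of how the composition and allow_novel_levels arguments affect the result. The small data frames train_f and new_f and the other variable names are made up for illustration; the behavior noted in the comments follows the argument descriptions above and the scream() documentation.

```r
library(hardhat)

train_x <- iris[1:100, "Sepal.Length", drop = FALSE]
train_y <- iris[1:100, "Species", drop = FALSE]

# composition = "matrix": the processed predictors come back as a matrix
bp_matrix <- default_xy_blueprint(composition = "matrix")
processed <- mold(train_x, train_y, blueprint = bp_matrix)
is.matrix(processed$predictors)

# The XY blueprint leaves a factor predictor as a factor, so a matrix
# composition cannot be built and mold() signals an error
bad_x <- iris[1:100, c("Sepal.Length", "Species")]
bad_y <- iris$Sepal.Width[1:100]
result <- try(mold(bad_x, bad_y, blueprint = bp_matrix), silent = TRUE)
inherits(result, "try-error")

# allow_novel_levels: illustrative data with a level "c" unseen in training
train_f <- data.frame(f = factor(c("a", "b", "a", "b")))
new_f <- data.frame(f = factor(c("a", "c")))
y_num <- c(1, 2, 3, 4)

strict <- mold(train_f, y_num)
lenient <- mold(
  train_f, y_num,
  blueprint = default_xy_blueprint(allow_novel_levels = TRUE)
)

# Default: the novel level triggers a warning and is coerced to NA
strict_out <- suppressWarnings(forge(new_f, strict$blueprint))

# allow_novel_levels = TRUE: the novel level is kept
lenient_out <- forge(new_f, lenient$blueprint)
```

Note that composition only affects the predictors; the outcomes are returned as a tibble either way.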

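Value

For default_xy_blueprint(), an XY blueprint.

Details

As documented in standardize(), if y is a vector, then the returned outcomes tibble has 1 column with the standardized name ".outcome".

The one special thing about the XY method's forge function is the behavior of outcomes = TRUE when a vector y value was provided in the original call to mold(). In that case, mold() converts y into a tibble with the default name .outcome. This is the column that forge() will look for in new_data to preprocess. See the examples section for a demonstration of this.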
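The naming rule described above can be checked by calling standardize() directly. This is a small sketch; the names y_vec and y_df are illustrative.

```r
library(hardhat)

# A vector outcome has no column name of its own,
# so standardize() supplies ".outcome"
y_vec <- iris$Species
std_vec <- standardize(y_vec)
names(std_vec)

# A data frame outcome keeps its existing column name
y_df <- iris["Species"]
std_df <- standardize(y_df)
names(std_df)
```

Because mold() runs standardize() on y, the same ".outcome" name ends up in the blueprint and is what forge(outcomes = TRUE) later looks for in new_data.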

Mold

When mold() is used with the default xy blueprint:

It converts x to a tibble.
It adds an intercept column to x if intercept = TRUE.
It runs standardize() on y.
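The three steps above can be observed on the object returned by mold(). This is a sketch only; the variable names are illustrative, and the tibble package is assumed to be installed (it is a dependency of hardhat).

```r
library(hardhat)

x <- iris[1:100, "Sepal.Length", drop = FALSE]
y <- iris[1:100, "Species", drop = FALSE]

processed <- mold(x, y, blueprint = default_xy_blueprint(intercept = TRUE))

# Step 1: x, supplied as a data frame, is returned as a tibble
tibble::is_tibble(processed$predictors)

# Step 2: an "(Intercept)" column has been added in front of the predictors
names(processed$predictors)

# Step 3: y has been passed through standardize(), which returns a tibble
tibble::is_tibble(processed$outcomes)
names(processed$outcomes)
```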

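Forge

When forge() is used with the default xy blueprint:

It calls shrink() to trim new_data to only the required columns and coerce new_data to a tibble.
It calls scream() to validate the structure of the columns of new_data.
It adds an intercept column onto new_data if intercept = TRUE.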
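As a rough illustration, the steps above can be reproduced by hand for the predictors using the exported helpers shrink(), scream(), and add_intercept_column(). This sketch assumes the predictor prototype is stored in blueprint$ptypes$predictors; it is not how forge() is implemented internally, only a way to see what each step contributes.

```r
library(hardhat)

train <- iris[1:100, ]
test <- iris[101:150, ]

processed <- mold(
  train["Sepal.Length"], train["Species"],
  blueprint = default_xy_blueprint(intercept = TRUE)
)

# Assumption: the blueprint keeps the predictor prototype here
ptype <- processed$blueprint$ptypes$predictors

# Step 1: keep only the required columns and coerce to a tibble
shrunk <- shrink(test, ptype)

# Step 2: validate the column structure against the prototype
validated <- scream(shrunk, ptype)

# Step 3: add the intercept column, since intercept = TRUE
manual <- add_intercept_column(validated)

# Compare with what forge() returns for the same data
forged <- forge(test, processed$blueprint)
isTRUE(all.equal(as.data.frame(manual), as.data.frame(forged$predictors)))
```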

Examples

# ---------------------------------------------------------------------------
# Setup

library(hardhat)

train <- iris[1:100, ]
test <- iris[101:150, ]

train_x <- train["Sepal.Length"]
train_y <- train["Species"]

test_x <- test["Sepal.Length"]
test_y <- test["Species"]

# ---------------------------------------------------------------------------
# XY Example

# First, call mold() with the training data
processed <- mold(train_x, train_y)

# Then, call forge() with the blueprint and the test data
# to have it preprocess the test data in the same way
forge(test_x, processed$blueprint)
#> $predictors
#> # A tibble: 50 × 1
#>    Sepal.Length
#>           <dbl>
#>  1          6.3
#>  2          5.8
#>  3          7.1
#>  4          6.3
#>  5          6.5
#>  6          7.6
#>  7          4.9
#>  8          7.3
#>  9          6.7
#> 10          7.2
#> # ℹ 40 more rows
#> 
#> $outcomes
#> NULL
#> 
#> $extras
#> NULL
#> 

# ---------------------------------------------------------------------------
# Intercept

processed <- mold(train_x, train_y, blueprint = default_xy_blueprint(intercept = TRUE))

forge(test_x, processed$blueprint)
#> $predictors
#> # A tibble: 50 × 2
#>    `(Intercept)` Sepal.Length
#>            <int>        <dbl>
#>  1             1          6.3
#>  2             1          5.8
#>  3             1          7.1
#>  4             1          6.3
#>  5             1          6.5
#>  6             1          7.6
#>  7             1          4.9
#>  8             1          7.3
#>  9             1          6.7
#> 10             1          7.2
#> # ℹ 40 more rows
#> 
#> $outcomes
#> NULL
#> 
#> $extras
#> NULL
#> 

# ---------------------------------------------------------------------------
# XY Method and forge(outcomes = TRUE)

# You can request that the new outcome columns are preprocessed as well, but
# they have to be present in `new_data`!

processed <- mold(train_x, train_y)

# Can't do this!
try(forge(test_x, processed$blueprint, outcomes = TRUE))
#> Error in validate_column_names(data, cols) : 
#>   The following required columns are missing: 'Species'.

# Need to use the full test set, including `y`
forge(test, processed$blueprint, outcomes = TRUE)
#> $predictors
#> # A tibble: 50 × 1
#>    Sepal.Length
#>           <dbl>
#>  1          6.3
#>  2          5.8
#>  3          7.1
#>  4          6.3
#>  5          6.5
#>  6          7.6
#>  7          4.9
#>  8          7.3
#>  9          6.7
#> 10          7.2
#> # ℹ 40 more rows
#> 
#> $outcomes
#> # A tibble: 50 × 1
#>    Species  
#>    <fct>    
#>  1 virginica
#>  2 virginica
#>  3 virginica
#>  4 virginica
#>  5 virginica
#>  6 virginica
#>  7 virginica
#>  8 virginica
#>  9 virginica
#> 10 virginica
#> # ℹ 40 more rows
#> 
#> $extras
#> NULL
#> 

# With the XY method, if the Y value used in `mold()` is a vector,
# then a column name of `.outcome` is automatically generated.
# This name is what forge() looks for in `new_data`.

# Y is a vector!
y_vec <- train_y$Species

processed_vec <- mold(train_x, y_vec)

# This throws an informative error that tells you
# to include an `".outcome"` column in `new_data`.
try(forge(iris, processed_vec$blueprint, outcomes = TRUE))
#> Error in validate_missing_name_isnt_.outcome(check$missing_names) : 
#>   The following required columns are missing: '.outcome'.
#> 
#> (This indicates that `mold()` was called with a vector for `y`. When this is the case, and the outcome columns are requested in `forge()`, `new_data` must include a column with the automatically generated name, '.outcome', containing the outcome.)

test2 <- test
test2$.outcome <- test2$Species
test2$Species <- NULL

# This works, and returns a tibble in the $outcomes slot
forge(test2, processed_vec$blueprint, outcomes = TRUE)
#> $predictors
#> # A tibble: 50 × 1
#>    Sepal.Length
#>           <dbl>
#>  1          6.3
#>  2          5.8
#>  3          7.1
#>  4          6.3
#>  5          6.5
#>  6          7.6
#>  7          4.9
#>  8          7.3
#>  9          6.7
#> 10          7.2
#> # ℹ 40 more rows
#> 
#> $outcomes
#> # A tibble: 50 × 1
#>    .outcome 
#>    <fct>    
#>  1 virginica
#>  2 virginica
#>  3 virginica
#>  4 virginica
#>  5 virginica
#>  6 virginica
#>  7 virginica
#>  8 virginica
#>  9 virginica
#> 10 virginica
#> # ℹ 40 more rows
#> 
#> $extras
#> NULL
#> 

# ---------------------------------------------------------------------------
# Matrix output for predictors

# You can change the `composition` of the predictor data set
bp <- default_xy_blueprint(composition = "dgCMatrix")
processed <- mold(train_x, train_y, blueprint = bp)
class(processed$predictors)
#> [1] "dgCMatrix"
#> attr(,"package")
#> [1] "Matrix"

Source code: R/blueprint-xy-default.R, R/mold.R

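Note: This page was compiled by 純淨天空 from the original English documentation "Default XY blueprint" by Davis Vaughan and others. Unless otherwise stated, copyright in the original code belongs to the original authors.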