R hardhat default_xy_blueprint: Default XY blueprint

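This page contains details about the XY preprocessing blueprint. This is the blueprint that mold() uses by default when x and y are supplied separately (i.e. when the XY interface is used).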

Usage

default_xy_blueprint(
  intercept = FALSE,
  allow_novel_levels = FALSE,
  composition = "tibble"
)

# S3 method for data.frame
mold(x, y, ..., blueprint = NULL)

# S3 method for matrix
mold(x, y, ..., blueprint = NULL)

Arguments

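intercept: A logical. Should an intercept be included in the processed data? This information is used by the process function in the mold and forge function lists.
allow_novel_levels: A logical. Should novel factor levels be allowed at prediction time? This information is used by the clean function in the forge function list, and is passed on to scream().
composition: Either "tibble", "matrix", or "dgCMatrix", giving the format of the processed predictors. If "matrix" or "dgCMatrix" is chosen, all of the predictors must be numeric after the preprocessing method has been applied; otherwise an error is thrown.
x: A data frame or matrix containing the predictors.
y: A data frame, matrix, or vector containing the outcomes.
...: Not used.
blueprint: A preprocessing blueprint. If left as NULL, then default_xy_blueprint() is used.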
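
To make the composition argument concrete, here is a minimal sketch using the built-in iris data. The object names (bp, x_num, processed, result) are chosen for illustration only, and the second call is wrapped in try() because, per the description above, a non-numeric predictor is expected to raise an error when a matrix is requested.

```r
library(hardhat)

# Request a base R matrix for the processed predictors
bp <- default_xy_blueprint(composition = "matrix")

# All predictors are numeric, so the conversion succeeds
x_num <- iris[c("Sepal.Length", "Sepal.Width")]
processed <- mold(x_num, iris$Species, blueprint = bp)
is.matrix(processed$predictors)

# A factor predictor cannot be stored in a numeric matrix, so per the
# `composition` description this call is expected to raise an error
result <- try(
  mold(iris["Species"], iris$Sepal.Length, blueprint = bp),
  silent = TRUE
)
inherits(result, "try-error")
```

The default, composition = "tibble", places no such restriction on column types, so it is the safer choice when predictors include factors or characters.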

Value

For default_xy_blueprint(), an XY blueprint.

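Details

As documented in standardize(), if y is a vector, then the returned outcomes tibble has 1 column with a standardized name of ".outcome".

The one special thing about the XY method's forge function is the behavior of outcomes = TRUE when a vector y value was provided to the original call to mold(). In that case, mold() converts y into a tibble with a default name of .outcome. This is the column that forge() will look for in new_data to preprocess. See the Examples section for a demonstration of this.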

Mold

When mold() is used with the default xy blueprint:

It converts x to a tibble.
It adds an intercept column to x if intercept = TRUE.
It runs standardize() on y.
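
The three steps above can be observed directly on the result of mold(). This is a small sketch under the assumption that x is supplied as a matrix and y as a numeric vector; the variable names are illustrative.

```r
library(hardhat)

# A matrix of predictors and a bare numeric vector as the outcome
x <- as.matrix(iris[1:5, c("Sepal.Length", "Sepal.Width")])
y <- iris$Petal.Length[1:5]

bp <- default_xy_blueprint(intercept = TRUE)
processed <- mold(x, y, blueprint = bp)

# Step 1: the matrix was converted to a tibble
inherits(processed$predictors, "tbl_df")

# Step 2: an intercept column was added to the predictors
names(processed$predictors)

# Step 3: standardize() turned the vector y into a one-column tibble
names(processed$outcomes)
```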

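Forge

When forge() is used with the default xy blueprint:

It calls shrink() to trim new_data to only the required columns and coerce new_data to a tibble.
It calls scream() to perform validation on the structure of the columns of new_data.
It adds an intercept column onto new_data if intercept = TRUE.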
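
As a rough illustration of these steps, the sketch below passes forge() a new_data frame that carries more columns than the blueprint needs, then one that lacks a required predictor. The split of iris into training and new rows is arbitrary and only for demonstration.

```r
library(hardhat)

train_x <- iris[1:100, "Sepal.Length", drop = FALSE]
train_y <- iris[1:100, "Species", drop = FALSE]

bp <- default_xy_blueprint(intercept = TRUE)
processed <- mold(train_x, train_y, blueprint = bp)

# new_data has all five iris columns, but only Sepal.Length is required;
# the extra columns are trimmed and the intercept is added
forged <- forge(iris[101:150, ], processed$blueprint)
names(forged$predictors)

# Dropping the required predictor makes forge() fail, so wrap it in try()
missing_col <- try(
  forge(iris[101:150, "Sepal.Width", drop = FALSE], processed$blueprint),
  silent = TRUE
)
inherits(missing_col, "try-error")
```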

Examples

# ---------------------------------------------------------------------------
# Setup

library(hardhat)

train <- iris[1:100, ]
test <- iris[101:150, ]

train_x <- train["Sepal.Length"]
train_y <- train["Species"]

test_x <- test["Sepal.Length"]
test_y <- test["Species"]

# ---------------------------------------------------------------------------
# XY Example

# First, call mold() with the training data
processed <- mold(train_x, train_y)

# Then, call forge() with the blueprint and the test data
# to have it preprocess the test data in the same way
forge(test_x, processed$blueprint)
#> $predictors
#> # A tibble: 50 × 1
#>    Sepal.Length
#>           <dbl>
#>  1          6.3
#>  2          5.8
#>  3          7.1
#>  4          6.3
#>  5          6.5
#>  6          7.6
#>  7          4.9
#>  8          7.3
#>  9          6.7
#> 10          7.2
#> # ℹ 40 more rows
#> 
#> $outcomes
#> NULL
#> 
#> $extras
#> NULL
#> 

# ---------------------------------------------------------------------------
# Intercept

processed <- mold(train_x, train_y, blueprint = default_xy_blueprint(intercept = TRUE))

forge(test_x, processed$blueprint)
#> $predictors
#> # A tibble: 50 × 2
#>    `(Intercept)` Sepal.Length
#>            <int>        <dbl>
#>  1             1          6.3
#>  2             1          5.8
#>  3             1          7.1
#>  4             1          6.3
#>  5             1          6.5
#>  6             1          7.6
#>  7             1          4.9
#>  8             1          7.3
#>  9             1          6.7
#> 10             1          7.2
#> # ℹ 40 more rows
#> 
#> $outcomes
#> NULL
#> 
#> $extras
#> NULL
#> 

# ---------------------------------------------------------------------------
# XY Method and forge(outcomes = TRUE)

# You can request that the new outcome columns are preprocessed as well, but
# they have to be present in `new_data`!

processed <- mold(train_x, train_y)

# Can't do this!
try(forge(test_x, processed$blueprint, outcomes = TRUE))
#> Error in validate_column_names(data, cols) : 
#>   The following required columns are missing: 'Species'.

# Need to use the full test set, including `y`
forge(test, processed$blueprint, outcomes = TRUE)
#> $predictors
#> # A tibble: 50 × 1
#>    Sepal.Length
#>           <dbl>
#>  1          6.3
#>  2          5.8
#>  3          7.1
#>  4          6.3
#>  5          6.5
#>  6          7.6
#>  7          4.9
#>  8          7.3
#>  9          6.7
#> 10          7.2
#> # ℹ 40 more rows
#> 
#> $outcomes
#> # A tibble: 50 × 1
#>    Species  
#>    <fct>    
#>  1 virginica
#>  2 virginica
#>  3 virginica
#>  4 virginica
#>  5 virginica
#>  6 virginica
#>  7 virginica
#>  8 virginica
#>  9 virginica
#> 10 virginica
#> # ℹ 40 more rows
#> 
#> $extras
#> NULL
#> 

# With the XY method, if the Y value used in `mold()` is a vector,
# then a column name of `.outcome` is automatically generated.
# This name is what forge() looks for in `new_data`.

# Y is a vector!
y_vec <- train_y$Species

processed_vec <- mold(train_x, y_vec)

# This throws an informative error that tells you
# to include an `".outcome"` column in `new_data`.
try(forge(iris, processed_vec$blueprint, outcomes = TRUE))
#> Error in validate_missing_name_isnt_.outcome(check$missing_names) : 
#>   The following required columns are missing: '.outcome'.
#> 
#> (This indicates that `mold()` was called with a vector for `y`. When this is the case, and the outcome columns are requested in `forge()`, `new_data` must include a column with the automatically generated name, '.outcome', containing the outcome.)

test2 <- test
test2$.outcome <- test2$Species
test2$Species <- NULL

# This works, and returns a tibble in the $outcomes slot
forge(test2, processed_vec$blueprint, outcomes = TRUE)
#> $predictors
#> # A tibble: 50 × 1
#>    Sepal.Length
#>           <dbl>
#>  1          6.3
#>  2          5.8
#>  3          7.1
#>  4          6.3
#>  5          6.5
#>  6          7.6
#>  7          4.9
#>  8          7.3
#>  9          6.7
#> 10          7.2
#> # ℹ 40 more rows
#> 
#> $outcomes
#> # A tibble: 50 × 1
#>    .outcome 
#>    <fct>    
#>  1 virginica
#>  2 virginica
#>  3 virginica
#>  4 virginica
#>  5 virginica
#>  6 virginica
#>  7 virginica
#>  8 virginica
#>  9 virginica
#> 10 virginica
#> # ℹ 40 more rows
#> 
#> $extras
#> NULL
#> 

# ---------------------------------------------------------------------------
# Matrix output for predictors

# You can change the `composition` of the predictor data set
bp <- default_xy_blueprint(composition = "dgCMatrix")
processed <- mold(train_x, train_y, blueprint = bp)
class(processed$predictors)
#> [1] "dgCMatrix"
#> attr(,"package")
#> [1] "Matrix"

Source: R/blueprint-xy-default.R, R/mold.R

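Note: This article was compiled by 纯净天空 from the original English work "Default XY blueprint" by Davis Vaughan et al. Unless otherwise stated, copyright in the original code belongs to the original authors. Please do not reproduce or copy this translation without permission or authorization.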