R hardhat validate_column_names: Ensure that data contains required column names

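Validate - asserts the following:

The column names of data must contain all of original_names.

Check - returns the following:

ok: A logical. Does the check pass?
missing_names: A character vector. The missing column names.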

Usage

validate_column_names(data, original_names)

check_column_names(data, original_names)

Arguments

data: A data frame to check.
original_names: A character vector. The original column names.

Value

validate_column_names() returns data invisibly.

check_column_names() returns a named list of two components, ok and missing_names.
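The difference between the two return values can be illustrated with a short sketch. The `required` vector and the choice of `mtcars` are assumptions made for illustration only; the two function calls use the signatures shown under Usage.

```r
library(hardhat)

# Illustrative set of required columns (an assumption, not from the docs)
required <- c("mpg", "cyl")

# validate_column_names() returns `data` invisibly: nothing is printed at
# the console, but the value can still be assigned or passed along.
checked <- validate_column_names(mtcars, required)
identical(checked, mtcars)

# check_column_names() never errors on missing columns; it returns a list
# with `ok` and `missing_names` for the caller to inspect.
result <- check_column_names(mtcars[, "mpg", drop = FALSE], required)
result$ok
result$missing_names
```

Because the validated data is returned, the call can sit in the middle of a preprocessing function without interrupting the flow of values.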

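Details

If a missing column is named ".outcome", a special error is raised. This can only happen when mold() is called with the xy-method and a vector y is supplied rather than a data frame or matrix. In that case, y is coerced to a data frame and given the automatic name ".outcome", which is the name forge() then looks for. If the user requests outcomes with forge(..., outcomes = TRUE) but the supplied new_data does not contain the required ".outcome" column, a special error is thrown that tells them what to do. See the examples!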

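Validation

hardhat provides validation functions at two levels.

check_*(): Check a condition, and return a list. The list always contains at least one element, ok, a logical that specifies whether the check passed. Each check also returns check-specific elements in the list that can be used to construct meaningful error messages.
validate_*(): Check a condition, and error if it does not pass. These functions call their corresponding check function, and then provide a default error message. If you, as a developer, want a different error message, call the check_*() function yourself and provide your own validation function.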
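As a minimal sketch of the last point, a developer could wrap check_column_names() in a custom validator along the following lines. The function name `validate_has_columns` and its error wording are hypothetical and not part of hardhat; only check_column_names() and its `ok` and `missing_names` elements come from the package.

```r
library(hardhat)

# Hypothetical custom validator (not part of hardhat): it reuses
# check_column_names() but raises its own error message.
validate_has_columns <- function(data, original_names) {
  check <- check_column_names(data, original_names)

  if (!check$ok) {
    # Build a message from the check-specific element `missing_names`
    missing <- paste0("`", check$missing_names, "`", collapse = ", ")
    stop("`new_data` is missing these columns: ", missing, call. = FALSE)
  }

  # Mirror the validate_*() convention of returning the input invisibly
  invisible(data)
}

# Passes silently
validate_has_columns(mtcars, c("mpg", "cyl"))

# Errors with the custom message because `mpg` was dropped
try(validate_has_columns(mtcars[, -1], c("mpg", "cyl")))
```

Keeping the check and the error separate in this way means the same check result can back several differently worded errors.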

See also

Other validation functions: validate_no_formula_duplication(), validate_outcomes_are_binary(), validate_outcomes_are_factors(), validate_outcomes_are_numeric(), validate_outcomes_are_univariate(), validate_prediction_size(), validate_predictors_are_numeric()

Examples

library(hardhat)

# ---------------------------------------------------------------------------

original_names <- colnames(mtcars)

test <- mtcars
bad_test <- test[, -c(3, 4)]

# All good
check_column_names(test, original_names)
#> $ok
#> [1] TRUE
#> 
#> $missing_names
#> character(0)
#> 

# Missing 2 columns
check_column_names(bad_test, original_names)
#> $ok
#> [1] FALSE
#> 
#> $missing_names
#> [1] "disp" "hp"  
#> 

# Will error
try(validate_column_names(bad_test, original_names))
#> Error in validate_column_names(bad_test, original_names) : 
#>   The following required columns are missing: 'disp', 'hp'.

# ---------------------------------------------------------------------------
# Special error when `.outcome` is missing

train <- iris[1:100, ]
test <- iris[101:150, ]

train_x <- subset(train, select = -Species)
train_y <- train$Species

# Here, y is a vector
processed <- mold(train_x, train_y)

# So the default column name is `".outcome"`
processed$outcomes
#> # A tibble: 100 × 1
#>    .outcome
#>    <fct>   
#>  1 setosa  
#>  2 setosa  
#>  3 setosa  
#>  4 setosa  
#>  5 setosa  
#>  6 setosa  
#>  7 setosa  
#>  8 setosa  
#>  9 setosa  
#> 10 setosa  
#> # ℹ 90 more rows

# It doesn't affect forge() normally
forge(test, processed$blueprint)
#> $predictors
#> # A tibble: 50 × 4
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width
#>           <dbl>       <dbl>        <dbl>       <dbl>
#>  1          6.3         3.3          6           2.5
#>  2          5.8         2.7          5.1         1.9
#>  3          7.1         3            5.9         2.1
#>  4          6.3         2.9          5.6         1.8
#>  5          6.5         3            5.8         2.2
#>  6          7.6         3            6.6         2.1
#>  7          4.9         2.5          4.5         1.7
#>  8          7.3         2.9          6.3         1.8
#>  9          6.7         2.5          5.8         1.8
#> 10          7.2         3.6          6.1         2.5
#> # ℹ 40 more rows
#> 
#> $outcomes
#> NULL
#> 
#> $extras
#> NULL
#> 

# But if the outcome is requested, and `".outcome"`
# is not present in `new_data`, an error is thrown
# with very specific instructions
try(forge(test, processed$blueprint, outcomes = TRUE))
#> Error in validate_missing_name_isnt_.outcome(check$missing_names) : 
#>   The following required columns are missing: '.outcome'.
#> 
#> (This indicates that `mold()` was called with a vector for `y`. When this is the case, and the outcome columns are requested in `forge()`, `new_data` must include a column with the automatically generated name, '.outcome', containing the outcome.)

# To get this to work, just create an .outcome column in new_data
test$.outcome <- test$Species

forge(test, processed$blueprint, outcomes = TRUE)
#> $predictors
#> # A tibble: 50 × 4
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width
#>           <dbl>       <dbl>        <dbl>       <dbl>
#>  1          6.3         3.3          6           2.5
#>  2          5.8         2.7          5.1         1.9
#>  3          7.1         3            5.9         2.1
#>  4          6.3         2.9          5.6         1.8
#>  5          6.5         3            5.8         2.2
#>  6          7.6         3            6.6         2.1
#>  7          4.9         2.5          4.5         1.7
#>  8          7.3         2.9          6.3         1.8
#>  9          6.7         2.5          5.8         1.8
#> 10          7.2         3.6          6.1         2.5
#> # ℹ 40 more rows
#> 
#> $outcomes
#> # A tibble: 50 × 1
#>    .outcome 
#>    <fct>    
#>  1 virginica
#>  2 virginica
#>  3 virginica
#>  4 virginica
#>  5 virginica
#>  6 virginica
#>  7 virginica
#>  8 virginica
#>  9 virginica
#> 10 virginica
#> # ℹ 40 more rows
#> 
#> $extras
#> NULL
#>

Source: R/validation.R

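Note: This page is adapted from the original English documentation "Ensure that data contains required column names" by Davis Vaughan et al. Unless otherwise stated, copyright in the original code belongs to the original authors.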