R hardhat validate_column_names: Ensure that data contains required column names

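validate - asserts the following:

The column names of data must contain all original_names.

check - returns the following:

ok: A logical. Does the check pass?
missing_names: A character vector. The missing column names.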

Usage

validate_column_names(data, original_names)

check_column_names(data, original_names)

Arguments

data: A data frame to check.
original_names: A character vector. The original column names.

Value

validate_column_names() returns data invisibly.

check_column_names() returns a named list with two components: ok and missing_names.
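
Conceptually, the check amounts to a set difference between the required names and the names actually present. The following is a minimal base R sketch of that idea, not hardhat's actual implementation; `sketch_check_column_names()` is a hypothetical name used only for illustration.

```r
# Hypothetical helper for illustration only; it is not part of hardhat.
sketch_check_column_names <- function(data, original_names) {
  # Names that are required but absent from `data`
  missing_names <- setdiff(original_names, colnames(data))

  list(
    ok = length(missing_names) == 0L,
    missing_names = missing_names
  )
}

# Drop two columns, then compare against the full set of names
result <- sketch_check_column_names(mtcars[, -c(3, 4)], colnames(mtcars))
result$ok
#> [1] FALSE
result$missing_names
#> [1] "disp" "hp"
```

The sketch mirrors the shape of the documented return value (a named list with ok and missing_names), which is what makes the result easy to reuse in a custom error message.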

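Details

If a missing column is named ".outcome", a special error is raised. This can only happen when mold() is called through its XY method with a vector y rather than a data frame or matrix. In that case, y is coerced to a data frame and given the automatic name ".outcome", which is the name forge() then looks for. If the user later requests outcomes with forge(..., outcomes = TRUE) but the supplied new_data does not contain the required ".outcome" column, a special error is thrown that explains what to do. See the examples!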
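
Because the automatic name is only assigned when y is a vector, one way to sidestep it is to pass y as a one-column data frame so the outcome keeps its own name. The sketch below assumes the hardhat package is installed and that the outcome column in new_data carries the same name as in the training data.

```r
library(hardhat)

train <- iris[1:100, ]
test <- iris[101:150, ]

train_x <- subset(train, select = -Species)
# A one-column data frame, so the outcome keeps the name "Species"
train_y <- train["Species"]

processed <- mold(train_x, train_y)
colnames(processed$outcomes)

# `test` already has a Species column, so no `.outcome` column is needed
forged <- forge(test, processed$blueprint, outcomes = TRUE)
colnames(forged$outcomes)
```

With this approach, forge() looks for the original outcome name rather than ".outcome" when outcomes are requested.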

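Validation

hardhat provides validation functions at two levels.

check_*(): checks a condition and returns a list. The list always contains at least one element, ok, a logical that specifies whether the check passed. Each check also returns check-specific elements in the list, which can be used to construct meaningful error messages.
validate_*(): checks a condition and errors if it does not pass. These functions call the corresponding check function and then supply a default error message. If you, as a developer, want a different error message, call the check_*() function yourself and provide your own validation function.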
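
As an illustration of the last point, a developer-defined validator could wrap check_column_names() and raise its own message. This is a minimal sketch that assumes hardhat is installed; the function name `validate_has_columns()` and the error wording are hypothetical and not part of hardhat.

```r
library(hardhat)

# Hypothetical custom validator built on top of check_column_names().
# The function name and the error wording are illustrative assumptions.
validate_has_columns <- function(data, original_names) {
  check <- check_column_names(data, original_names)

  if (!check$ok) {
    stop(
      "new data is missing ", length(check$missing_names), " column(s): ",
      paste(check$missing_names, collapse = ", "),
      call. = FALSE
    )
  }

  # Follow the validate_*() convention of returning the input invisibly
  invisible(data)
}

# Errors with the custom message
try(validate_has_columns(mtcars[, -c(3, 4)], colnames(mtcars)))
```

Keeping the condition in check_*() and the message in the wrapper means the same check can back several differently worded errors.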

See also

Other validation functions: validate_no_formula_duplication(), validate_outcomes_are_binary(), validate_outcomes_are_factors(), validate_outcomes_are_numeric(), validate_outcomes_are_univariate(), validate_prediction_size(), validate_predictors_are_numeric()

Examples

# ---------------------------------------------------------------------------

library(hardhat)

original_names <- colnames(mtcars)

test <- mtcars
bad_test <- test[, -c(3, 4)]

# All good
check_column_names(test, original_names)
#> $ok
#> [1] TRUE
#> 
#> $missing_names
#> character(0)
#> 

# Missing 2 columns
check_column_names(bad_test, original_names)
#> $ok
#> [1] FALSE
#> 
#> $missing_names
#> [1] "disp" "hp"  
#> 

# Will error
try(validate_column_names(bad_test, original_names))
#> Error in validate_column_names(bad_test, original_names) : 
#>   The following required columns are missing: 'disp', 'hp'.

# ---------------------------------------------------------------------------
# Special error when `.outcome` is missing

train <- iris[1:100, ]
test <- iris[101:150, ]

train_x <- subset(train, select = -Species)
train_y <- train$Species

# Here, y is a vector
processed <- mold(train_x, train_y)

# So the default column name is `".outcome"`
processed$outcomes
#> # A tibble: 100 × 1
#>    .outcome
#>    <fct>   
#>  1 setosa  
#>  2 setosa  
#>  3 setosa  
#>  4 setosa  
#>  5 setosa  
#>  6 setosa  
#>  7 setosa  
#>  8 setosa  
#>  9 setosa  
#> 10 setosa  
#> # ℹ 90 more rows

# It doesn't affect forge() normally
forge(test, processed$blueprint)
#> $predictors
#> # A tibble: 50 × 4
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width
#>           <dbl>       <dbl>        <dbl>       <dbl>
#>  1          6.3         3.3          6           2.5
#>  2          5.8         2.7          5.1         1.9
#>  3          7.1         3            5.9         2.1
#>  4          6.3         2.9          5.6         1.8
#>  5          6.5         3            5.8         2.2
#>  6          7.6         3            6.6         2.1
#>  7          4.9         2.5          4.5         1.7
#>  8          7.3         2.9          6.3         1.8
#>  9          6.7         2.5          5.8         1.8
#> 10          7.2         3.6          6.1         2.5
#> # ℹ 40 more rows
#> 
#> $outcomes
#> NULL
#> 
#> $extras
#> NULL
#> 

# But if the outcome is requested, and `".outcome"`
# is not present in `new_data`, an error is thrown
# with very specific instructions
try(forge(test, processed$blueprint, outcomes = TRUE))
#> Error in validate_missing_name_isnt_.outcome(check$missing_names) : 
#>   The following required columns are missing: '.outcome'.
#> 
#> (This indicates that `mold()` was called with a vector for `y`. When this is the case, and the outcome columns are requested in `forge()`, `new_data` must include a column with the automatically generated name, '.outcome', containing the outcome.)

# To get this to work, just create an .outcome column in new_data
test$.outcome <- test$Species

forge(test, processed$blueprint, outcomes = TRUE)
#> $predictors
#> # A tibble: 50 × 4
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width
#>           <dbl>       <dbl>        <dbl>       <dbl>
#>  1          6.3         3.3          6           2.5
#>  2          5.8         2.7          5.1         1.9
#>  3          7.1         3            5.9         2.1
#>  4          6.3         2.9          5.6         1.8
#>  5          6.5         3            5.8         2.2
#>  6          7.6         3            6.6         2.1
#>  7          4.9         2.5          4.5         1.7
#>  8          7.3         2.9          6.3         1.8
#>  9          6.7         2.5          5.8         1.8
#> 10          7.2         3.6          6.1         2.5
#> # ℹ 40 more rows
#> 
#> $outcomes
#> # A tibble: 50 × 1
#>    .outcome 
#>    <fct>    
#>  1 virginica
#>  2 virginica
#>  3 virginica
#>  4 virginica
#>  5 virginica
#>  6 virginica
#>  7 virginica
#>  8 virginica
#>  9 virginica
#> 10 virginica
#> # ℹ 40 more rows
#> 
#> $extras
#> NULL
#>

Source code: R/validation.R


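Note: This article was selected and compiled by 純淨天空 from the original English work "Ensure that data contains required column names" by Davis Vaughan and others. Unless otherwise stated, copyright of the original code belongs to the original authors; do not reproduce or copy this translation without permission or authorization.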