R recipes discretize: Discretize Numeric Variables

discretize() converts a numeric vector into a factor whose bins contain approximately the same number of data points (based on a training set).

Usage

discretize(x, ...)

# S3 method for default
discretize(x, ...)

# S3 method for numeric
discretize(
  x,
  cuts = 4,
  labels = NULL,
  prefix = "bin",
  keep_na = TRUE,
  infs = TRUE,
  min_unique = 10,
  ...
)

# S3 method for discretize
predict(object, new_data, ...)

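Arguments

x: A numeric vector.
...: Options passed to stats::quantile(); these should not include x or probs.
cuts: An integer defining how many cuts to make of the data.
labels: A character vector defining the factor levels of the new factor (from smallest to largest). It should have length cuts+1 and should not include a level for missing values (see keep_na below).
prefix: A single character string used as a prefix for the factor levels (e.g. bin1, bin2, and so on). If the string is not a valid R name, it is coerced to one. If prefix = NULL, the factor levels are labelled according to the output of cut().
keep_na: A logical indicating whether a factor level should be created to identify missing values in x. If keep_na is set to TRUE, na.rm = TRUE is used when calling stats::quantile().
infs: A logical indicating whether the smallest and largest cut points should be infinite.
min_unique: An integer defining the minimum sample size required for binning. If (the number of unique values)/(cuts+1) is less than min_unique, no discretization takes place.
object: An object of class discretize.
new_data: A new numeric object to be binned.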
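
The min_unique rule can be checked by hand before binning. The following is a minimal base-R sketch of the documented condition only; enough_unique is a hypothetical helper written for illustration and is not a function exported by recipes.

```r
# Hypothetical helper (not part of recipes): applies the documented rule
# "discretize only if (number of unique values) / (cuts + 1) >= min_unique".
enough_unique <- function(x, cuts = 4, min_unique = 10) {
  length(unique(x)) / (cuts + 1) >= min_unique
}

# 100 unique values: 100 / 5 = 20, which is not less than 10.
enough_unique(1:100)

# Only 5 unique values: 5 / 5 = 1, which is less than 10.
enough_unique(rep(1:5, 20))
```

Under this rule the first vector would be binned and the second would be left as it is.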

Value

discretize returns an object of class discretize, and predict.discretize returns a factor vector.

Details

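discretize estimates the cut points from x using percentiles. For example, with the default cuts = 4, the function estimates the quartiles of x and uses them as the cut points, which gives the four bins shown in the examples. If cuts = 2, the bins are defined as being above or below the median of x.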
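
The quantile-based idea can be sketched in base R as follows. This is an illustration under stated assumptions, not the recipes implementation: n_bins is a hypothetical variable name, the data are simulated, and only stats::quantile() and cut() are used.

```r
# Base-R sketch of quantile-based binning (not the recipes source code).
set.seed(1)
x <- rnorm(200)  # simulated training data, assumed for illustration

n_bins <- 4  # hypothetical name: the number of bins wanted

# Equally spaced probabilities from 0 to 1 give n_bins + 1 break points.
breaks <- stats::quantile(x, probs = seq(0, 1, length.out = n_bins + 1))

# Make the outer breaks infinite so that any new value falls into a bin.
breaks[1] <- -Inf
breaks[length(breaks)] <- Inf

binned <- cut(
  x,
  breaks = breaks,
  labels = paste0("bin", seq_len(n_bins)),
  include.lowest = TRUE
)

# Each bin holds 50 of the 200 values because the breaks are quartiles.
table(binned)
```

Because the breaks come from the training data, applying the same breaks to new data will generally give bins of unequal size.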

The predict method can then be used to turn numeric vectors into factor vectors.

If keep_na = TRUE, a suffix of "_missing" is used as a factor level (see the examples below).
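
One way to picture a dedicated level for missing values is the base-R sketch below. It is not the recipes code; the level name "bin_missing" (the prefix "bin" followed by "_missing") and the level order are assumptions made for illustration.

```r
# Base-R sketch: give missing values their own factor level.
x_new <- c(1, NA, 9)

# Two bins split at 5; NA inputs stay NA after cut().
binned <- cut(x_new, breaks = c(-Inf, 5, Inf), labels = c("bin1", "bin2"))

# Replace the NA results with an explicit label.
# "bin_missing" is an assumed name, not confirmed output of recipes.
out <- as.character(binned)
out[is.na(x_new)] <- "bin_missing"
out <- factor(out, levels = c("bin_missing", "bin1", "bin2"))
out
```

Keeping the missing values as a level means downstream models can treat missingness as a category instead of dropping those rows.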

If infs = FALSE and a new value is greater than the largest value of x, a missing value results.
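
The effect of finite versus infinite outer cut points can be reproduced with cut() alone. The sketch below uses made-up numbers and base R only; it illustrates the behaviour and is not taken from the recipes source.

```r
# Base-R sketch: finite outer breaks versus infinite outer breaks.
train <- c(10, 20, 30, 40, 50)  # made-up training values

# Minimum, median and maximum of the training data: 10, 30, 50.
finite_breaks <- stats::quantile(train, probs = c(0, 0.5, 1))

# Same interior cut point, but open-ended outer bins.
infinite_breaks <- c(-Inf, 30, Inf)

new_values <- c(25, 60, 5)

# 60 and 5 fall outside [10, 50], so they become NA.
with_finite <- cut(new_values, breaks = finite_breaks,
                   labels = c("bin1", "bin2"), include.lowest = TRUE)

# With infinite outer breaks every value lands in a bin.
with_infinite <- cut(new_values, breaks = infinite_breaks,
                     labels = c("bin1", "bin2"), include.lowest = TRUE)

with_finite
with_infinite
```

The same reasoning applies to new values below the training minimum, which also fall outside the finite range.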

Examples

data(biomass, package = "modeldata")

biomass_tr <- biomass[biomass$dataset == "Training", ]
biomass_te <- biomass[biomass$dataset == "Testing", ]

median(biomass_tr$carbon)
#> [1] 47.1
discretize(biomass_tr$carbon, cuts = 2)
#> Bins: 3 (includes missing category)
#> Breaks: -Inf, 47.1, Inf
discretize(biomass_tr$carbon, cuts = 2, infs = FALSE)
#> Bins: 3 (includes missing category)
#> Breaks: 14.61, 47.1, 97.18
discretize(biomass_tr$carbon, cuts = 2, infs = FALSE, keep_na = FALSE)
#> Bins: 2
#> Breaks: 14.61, 47.1, 97.18
discretize(biomass_tr$carbon, cuts = 2, prefix = "maybe a bad idea to bin")
#> Warning: The prefix 'maybe a bad idea to bin' is not a valid R name. It has been changed to 'maybe.a.bad.idea.to.bin'.
#> Bins: 3 (includes missing category)
#> Breaks: -Inf, 47.1, Inf

carbon_binned <- discretize(biomass_tr$carbon)
table(predict(carbon_binned, biomass_tr$carbon))
#> 
#> bin1 bin2 bin3 bin4 
#>  114  115  113  114 

carbon_no_infs <- discretize(biomass_tr$carbon, infs = FALSE)
predict(carbon_no_infs, c(50, 100))
#> [1] bin4 <NA>
#> Levels: bin1 bin2 bin3 bin4

Source: R/discretize.R

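Note: This article was selected and compiled by 純淨天空 from the original English work "Discretize Numeric Variables" by Max Kuhn and others. Unless otherwise stated, copyright in the original code belongs to the original authors; do not reproduce or copy this translation without permission or authorization.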