R textrecipes step_sequence_onehot Positional One-Hot Encoding of Tokens

step_sequence_onehot() creates a specification of a recipe step that will take a string and one-hot encode each character by position.

Usage

step_sequence_onehot(
  recipe,
  ...,
  role = "predictor",
  trained = FALSE,
  columns = NULL,
  sequence_length = 100,
  padding = "pre",
  truncating = "pre",
  vocabulary = NULL,
  prefix = "seq1hot",
  keep_original_cols = FALSE,
  skip = FALSE,
  id = rand_id("sequence_onehot")
)

Source

https://papers.nips.cc/paper/5782-character-level-convolutional-networks-for-text-classification.pdf

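Arguments

recipe: A recipe object. The step will be added to the sequence of operations for this recipe.
...: One or more selector functions to choose which variables are affected by the step. See recipes::selections() for more details.
role: For model terms created by this step, what analysis role should they be assigned? By default, the function assumes that the new columns created from the original variables will be used as predictors in a model.
trained: A logical indicating whether the quantities for preprocessing have been estimated.
columns: A character vector of variable names that will (eventually) be populated by the terms argument. This is NULL until the step is trained by recipes::prep.recipe().
sequence_length: A number, the count of characters to keep before discarding the rest. Defaults to 100.
padding: 'pre' or 'post', pad either before or after each sequence. Defaults to 'pre'.
truncating: 'pre' or 'post', remove values from sequences longer than sequence_length either at the beginning or at the end of the sequence. Also defaults to 'pre'.
vocabulary: A character vector of tokens to be mapped to integers. Tokens not in the vocabulary are encoded as 0. The usage above lists the default as NULL; in that case, as the example below shows, the vocabulary is taken from the tokens found in the training data.
prefix: A prefix for the generated column names. Defaults to "seq1hot".
keep_original_cols: A logical to keep the original variables in the output. Defaults to FALSE.
skip: A logical. Should the step be skipped when the recipe is baked by recipes::bake.recipe()? While all operations are baked when recipes::prep.recipe() is run, some operations may not be able to be conducted on new data (e.g. processing the outcome variable(s)). Care should be taken when using skip = TRUE.
id: A character string that is unique to this step to identify it.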
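
The padding and truncating options can be illustrated with a small base R sketch. The helper pad_or_truncate() below is hypothetical and is not part of textrecipes; it only mirrors the documented behaviour on an integer vector, and it assumes padded positions are filled with 0, as the example output further down suggests.

```r
# Hypothetical helper, not the textrecipes implementation: bring an
# integer vector to exactly `sequence_length` elements.
pad_or_truncate <- function(x, sequence_length,
                            padding = "pre", truncating = "pre") {
  n <- length(x)
  if (n > sequence_length) {
    # "pre" drops values from the beginning, "post" drops them from the end
    x <- if (truncating == "pre") {
      tail(x, sequence_length)
    } else {
      head(x, sequence_length)
    }
  } else if (n < sequence_length) {
    zeros <- integer(sequence_length - n)
    # "pre" places the zeros before the values, "post" places them after
    x <- if (padding == "pre") c(zeros, x) else c(x, zeros)
  }
  x
}

short <- pad_or_truncate(1:3, 5, padding = "pre")     # zeros first
long  <- pad_or_truncate(1:7, 5, truncating = "post") # tail dropped
```

With the defaults ("pre" for both), short sequences are left-padded and long sequences keep their last sequence_length values.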

Value

An updated version of recipe with the new step added to the sequence of existing steps (if any).

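Details

The string is capped at the length given by the sequence_length argument; strings shorter than sequence_length are padded with empty characters. The encoding assigns an integer to each character in the vocabulary and encodes the string accordingly. Characters not in the vocabulary are encoded as 0.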
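
The vocabulary lookup described here can be sketched in base R. The helper encode_tokens() is hypothetical and only illustrates the mapping rule (position in the vocabulary, or 0 when absent); it is not the function textrecipes uses internally.

```r
# Hypothetical helper: map each token to its position in `vocabulary`,
# using 0 for tokens that do not appear in it.
encode_tokens <- function(tokens, vocabulary) {
  idx <- match(tokens, vocabulary)  # NA for out-of-vocabulary tokens
  idx[is.na(idx)] <- 0L
  idx
}

# Split a string into single characters and encode against `letters`;
# "!" is not a letter, so it maps to 0.
tokens  <- strsplit("cab!", "")[[1]]
encoded <- encode_tokens(tokens, vocabulary = letters)
```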

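Tidying

When you tidy() this step, a tibble is returned with columns terms (the selectors or variables selected), vocabulary (the index), token (the text corresponding to the index), and id.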

Case weights

The underlying operation does not allow for case weights.

See also

Other steps for numeric variables from characters: step_dummy_hash(), step_textfeature()

Examples

library(recipes)
library(textrecipes) # provides step_tokenize(), step_tokenfilter(), step_sequence_onehot()
library(modeldata)
data(tate_text)

tate_rec <- recipe(~medium, data = tate_text) %>%
  step_tokenize(medium) %>%
  step_tokenfilter(medium) %>%
  step_sequence_onehot(medium)

tate_obj <- tate_rec %>%
  prep()

bake(tate_obj, new_data = NULL)
#> # A tibble: 4,284 × 100
#>    seq1hot_medium_1 seq1hot_medium_2 seq1hot_medium_3 seq1hot_medium_4
#>               <int>            <int>            <int>            <int>
#>  1                0                0                0                0
#>  2                0                0                0                0
#>  3                0                0                0                0
#>  4                0                0                0                0
#>  5                0                0                0                0
#>  6                0                0                0                0
#>  7                0                0                0                0
#>  8                0                0                0                0
#>  9                0                0                0                0
#> 10                0                0                0                0
#> # ℹ 4,274 more rows
#> # ℹ 96 more variables: seq1hot_medium_5 <int>, seq1hot_medium_6 <int>,
#> #   seq1hot_medium_7 <int>, seq1hot_medium_8 <int>,
#> #   seq1hot_medium_9 <int>, seq1hot_medium_10 <int>,
#> #   seq1hot_medium_11 <int>, seq1hot_medium_12 <int>,
#> #   seq1hot_medium_13 <int>, seq1hot_medium_14 <int>,
#> #   seq1hot_medium_15 <int>, seq1hot_medium_16 <int>, …

tidy(tate_rec, number = 3)
#> # A tibble: 1 × 4
#>   terms  vocabulary token id                   
#>   <chr>  <chr>      <int> <chr>                
#> 1 medium NA            NA sequence_onehot_bH08q
tidy(tate_obj, number = 3)
#> # A tibble: 100 × 4
#>    terms  vocabulary token     id                   
#>    <chr>       <int> <chr>     <chr>                
#>  1 medium          1 16        sequence_onehot_bH08q
#>  2 medium          2 2         sequence_onehot_bH08q
#>  3 medium          3 3         sequence_onehot_bH08q
#>  4 medium          4 35        sequence_onehot_bH08q
#>  5 medium          5 4         sequence_onehot_bH08q
#>  6 medium          6 5         sequence_onehot_bH08q
#>  7 medium          7 6         sequence_onehot_bH08q
#>  8 medium          8 8         sequence_onehot_bH08q
#>  9 medium          9 acrylic   sequence_onehot_bH08q
#> 10 medium         10 aluminium sequence_onehot_bH08q
#> # ℹ 90 more rows

Source code: R/sequence_onehot.R
