step_lda()
Create a specification of a recipe step that will return the lda dimension estimates of a text variable.
Usage
step_lda(
recipe,
...,
role = "predictor",
trained = FALSE,
columns = NULL,
lda_models = NULL,
num_topics = 10L,
prefix = "lda",
keep_original_cols = FALSE,
skip = FALSE,
id = rand_id("lda")
)
Arguments
- recipe: A recipe object. The step will be added to the sequence of operations for this recipe.
- ...: One or more selector functions to choose which variables are affected by the step. See recipes::selections() for more details.
- role: For model terms created by this step, what analysis role should they be assigned? By default, the function assumes that the new columns created from the original variables will be used as predictors in a model.
- trained: A logical to indicate whether the quantities for preprocessing have been estimated.
- columns: A character string of variable names that will be populated (eventually) by the terms argument. This is NULL until the step is trained by recipes::prep.recipe().
- lda_models: A WarpLDA model object from the text2vec package. If left as NULL (the default), the step will train its own model on the training data. See the examples for how to fit a WarpLDA model.
- num_topics: An integer giving the desired number of latent topics.
- prefix: A prefix for the generated column names; defaults to "lda".
- keep_original_cols: A logical to keep the original variables in the output. Defaults to FALSE (see the short sketch after this argument list).
- skip: A logical. Should this step be skipped when the recipe is baked by recipes::bake.recipe()? While all operations are baked when recipes::prep.recipe() is run, some operations may not be able to be conducted on new data (e.g. processing the outcome variable(s)). Care should be taken when using skip = TRUE, as it may affect the computations for subsequent operations.
- id: A character string that is unique to this step, used to identify it.
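As a quick, hedged illustration of the prefix and keep_original_cols arguments, the sketch below is not taken from the package documentation; it assumes the tate_text data from modeldata used in the examples further down, and that text2vec is installed. It renames the topic columns and keeps the original medium column (as tokenized by the preceding step) in the output:
library(recipes)
library(textrecipes)
library(modeldata)
data(tate_text)
# Use a "topic" prefix instead of the default "lda", and keep the original column.
recipe(~., data = tate_text) %>%
  step_tokenize(medium) %>%
  step_lda(medium, num_topics = 5, prefix = "topic", keep_original_cols = TRUE) %>%
  prep() %>%
  bake(new_data = NULL)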
Tidying
When you tidy() this step, a tibble is returned with columns terms (the selectors or variables selected) and num_topics (the number of topics).
See also
Other steps for numeric variables from tokens: step_texthash(), step_tfidf(), step_tf(), step_word_embeddings()
Examples
library(recipes)
library(modeldata)
data(tate_text)
tate_rec <- recipe(~., data = tate_text) %>%
step_tokenize(medium) %>%
step_lda(medium)
tate_obj <- tate_rec %>%
prep()
#> 'as(<dgTMatrix>, "dgCMatrix")' is deprecated.
#> Use 'as(., "CsparseMatrix")' instead.
#> See help("Deprecated") and help("Matrix-deprecated").
bake(tate_obj, new_data = NULL) %>%
slice(1:2)
#> # A tibble: 2 × 14
#> id artist title year lda_medium_1 lda_medium_2 lda_medium_3
#> <dbl> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 21926 Absalon Prop… 1990 0.7 0.0143 0.0143
#> 2 20472 Auerbach, Frank Mich… 1990 0 0 0
#> # ℹ 7 more variables: lda_medium_4 <dbl>, lda_medium_5 <dbl>,
#> # lda_medium_6 <dbl>, lda_medium_7 <dbl>, lda_medium_8 <dbl>,
#> # lda_medium_9 <dbl>, lda_medium_10 <dbl>
tidy(tate_rec, number = 2)
#> # A tibble: 1 × 3
#> terms num_topics id
#> <chr> <int> <chr>
#> 1 medium 10 lda_UfL6S
tidy(tate_obj, number = 2)
#> # A tibble: 1 × 3
#> terms num_topics id
#> <chr> <int> <chr>
#> 1 medium 10 lda_UfL6S
# Changing the number of topics.
recipe(~., data = tate_text) %>%
step_tokenize(medium, artist) %>%
step_lda(medium, artist, num_topics = 20) %>%
prep() %>%
bake(new_data = NULL) %>%
slice(1:2)
#> # A tibble: 2 × 43
#> id title year lda_medium_1 lda_medium_2 lda_medium_3 lda_medium_4
#> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21926 Proposa… 1990 0.0286 0.0286 0 0
#> 2 20472 Michael 1990 0 0 0 0
#> # ℹ 36 more variables: lda_medium_5 <dbl>, lda_medium_6 <dbl>,
#> # lda_medium_7 <dbl>, lda_medium_8 <dbl>, lda_medium_9 <dbl>,
#> # lda_medium_10 <dbl>, lda_medium_11 <dbl>, lda_medium_12 <dbl>,
#> # lda_medium_13 <dbl>, lda_medium_14 <dbl>, lda_medium_15 <dbl>,
#> # lda_medium_16 <dbl>, lda_medium_17 <dbl>, lda_medium_18 <dbl>,
#> # lda_medium_19 <dbl>, lda_medium_20 <dbl>, lda_artist_1 <dbl>,
#> # lda_artist_2 <dbl>, lda_artist_3 <dbl>, lda_artist_4 <dbl>, …
# Supplying a pre-trained LDA model trained using text2vec
library(text2vec)
tokens <- word_tokenizer(tolower(tate_text$medium))
it <- itoken(tokens, ids = seq_along(tate_text$medium))
v <- create_vocabulary(it)
dtm <- create_dtm(it, vocab_vectorizer(v))
lda_model <- LDA$new(n_topics = 15)
recipe(~., data = tate_text) %>%
step_tokenize(medium, artist) %>%
step_lda(medium, artist, lda_models = lda_model) %>%
prep() %>%
bake(new_data = NULL) %>%
slice(1:2)
#> # A tibble: 2 × 33
#> id title year lda_medium_1 lda_medium_2 lda_medium_3 lda_medium_4
#> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21926 Proposa… 1990 0.0143 0.129 0 0.0143
#> 2 20472 Michael 1990 0 0 0 0
#> # ℹ 26 more variables: lda_medium_5 <dbl>, lda_medium_6 <dbl>,
#> # lda_medium_7 <dbl>, lda_medium_8 <dbl>, lda_medium_9 <dbl>,
#> # lda_medium_10 <dbl>, lda_medium_11 <dbl>, lda_medium_12 <dbl>,
#> # lda_medium_13 <dbl>, lda_medium_14 <dbl>, lda_medium_15 <dbl>,
#> # lda_artist_1 <dbl>, lda_artist_2 <dbl>, lda_artist_3 <dbl>,
#> # lda_artist_4 <dbl>, lda_artist_5 <dbl>, lda_artist_6 <dbl>,
#> # lda_artist_7 <dbl>, lda_artist_8 <dbl>, lda_artist_9 <dbl>, …
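If you would rather fit the WarpLDA model yourself before handing it to step_lda(), one possible sketch, building on the it/dtm objects created above, is shown here; the fit_transform() call and its n_iter value are illustrative and not part of the original example:
# Fit the WarpLDA model on the document-term matrix first,
# then pass the fitted model to step_lda().
fitted_lda <- LDA$new(n_topics = 15)
fitted_lda$fit_transform(dtm, n_iter = 100)
recipe(~., data = tate_text) %>%
  step_tokenize(medium) %>%
  step_lda(medium, lda_models = fitted_lda) %>%
  prep() %>%
  bake(new_data = NULL) %>%
  slice(1:2)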