Python sklearn LatentDirichletAllocation: usage and code examples

This article briefly introduces the usage of sklearn.decomposition.LatentDirichletAllocation in Python.

Usage: class sklearn.decomposition.LatentDirichletAllocation(n_components=10, *, doc_topic_prior=None, topic_word_prior=None, learning_method='batch', learning_decay=0.7, learning_offset=10.0, max_iter=10, batch_size=128, evaluate_every=-1, total_samples=1000000.0, perp_tol=0.1, mean_change_tol=0.001, max_doc_update_iter=100, n_jobs=None, verbose=0, random_state=None)

Latent Dirichlet Allocation with an online variational Bayes algorithm.

The implementation is based on [1] and [2].

Read more in the User Guide.

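Parameters:

n_components: int, default=10

Number of topics.

doc_topic_prior: float, default=None

Prior of the document-topic distribution theta. If the value is None, it defaults to 1 / n_components. In [1], this is called alpha.

topic_word_prior: float, default=None

Prior of the topic-word distribution beta. If the value is None, it defaults to 1 / n_components. In [1], this is called eta.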
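
As a minimal sketch of how the two priors above behave, the snippet below fits one model with the priors left at None and one with explicit values, then reads back the fitted doc_topic_prior_ and topic_word_prior_ attributes. The synthetic count matrix, the topic count, and the explicit prior values (0.05 and 0.01) are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.decomposition import LatentDirichletAllocation

# Synthetic count matrix standing in for a document-term matrix.
X, _ = make_multilabel_classification(random_state=0)

n_topics = 5  # illustrative choice

# Priors left at None: both resolve to 1 / n_components during fit.
default_lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(X)
print(default_lda.doc_topic_prior_, default_lda.topic_word_prior_)

# Explicit priors (assumed values, for illustration only).
custom_lda = LatentDirichletAllocation(
    n_components=n_topics,
    doc_topic_prior=0.05,
    topic_word_prior=0.01,
    random_state=0,
).fit(X)
print(custom_lda.doc_topic_prior_, custom_lda.topic_word_prior_)
```

For a Dirichlet prior, values below 1 generally favour sparser distributions, so lowering doc_topic_prior tends to concentrate each document on fewer topics.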

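learning_method: {'batch', 'online'}, default='batch'

Method used to update components_. Only used in the fit method. In general, if the data size is large, the online update will be much faster than the batch update.

Valid options: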

'batch': Batch variational Bayes method. Use all training data in
    each EM update.
    Old `components_` will be overwritten in each iteration.
'online': Online variational Bayes method. In each EM update, use
    mini-batch of training data to update the ``components_``
    variable incrementally. The learning rate is controlled by the
    ``learning_decay`` and the ``learning_offset`` parameters.

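learning_decay: float, default=0.7

A parameter that controls the learning rate in the online learning method. The value should be set in the interval (0.5, 1.0] to guarantee asymptotic convergence. When the value is 0.0 and batch_size is n_samples, the update method is the same as batch learning. In the literature, this is called kappa.

learning_offset: float, default=10.0

A (positive) parameter that downweights early iterations in online learning. It should be greater than 1.0. In the literature, this is called tau_0.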
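
To make the roles of kappa and tau_0 concrete, here is a small sketch of the step-size schedule given in [1], rho_t = (tau_0 + t) ** (-kappa), where t counts mini-batch updates. The helper name learning_rate is hypothetical and the code is a standalone illustration of the formula, not a call into scikit-learn.

```python
def learning_rate(t, learning_decay=0.7, learning_offset=10.0):
    """Step size for the t-th mini-batch update: (tau_0 + t) ** (-kappa)."""
    return (learning_offset + t) ** (-learning_decay)


# A larger learning_offset shrinks the earliest steps;
# a larger learning_decay makes the step size fall off faster.
for t in (0, 1, 10, 100, 1000):
    print(t, round(learning_rate(t), 4))

# With learning_decay=0.0 the step size is always 1, so each update
# fully replaces the previous estimate, as in batch learning.
print(learning_rate(5, learning_decay=0.0))
```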

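max_iter: int, default=10

The maximum number of passes over the training data (aka epochs). It only affects the behavior of the fit method, not the partial_fit method.

batch_size: int, default=128

Number of documents to use in each EM iteration. Only used in online learning.

evaluate_every: int, default=-1

How often to evaluate perplexity. Only used in the fit method. Set it to 0 or a negative number to skip perplexity evaluation during training entirely. Evaluating perplexity can help you check convergence during training, but it also increases total training time. Evaluating perplexity in every iteration might increase training time up to two-fold.

total_samples: int, default=1e6

Total number of documents. Only used in the partial_fit method.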
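
The following sketch shows one way partial_fit and total_samples might be used together when documents arrive in chunks. The synthetic data, the chunk size of 50, and setting total_samples to the known corpus size are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.decomposition import LatentDirichletAllocation

# Synthetic count matrix standing in for a streamed corpus.
X, _ = make_multilabel_classification(n_samples=200, random_state=0)

lda = LatentDirichletAllocation(
    n_components=5,
    total_samples=X.shape[0],  # expected size of the whole corpus
    random_state=0,
)

# Feed the data chunk by chunk; each call performs an online update.
chunk_size = 50  # illustrative value
for start in range(0, X.shape[0], chunk_size):
    lda.partial_fit(X[start:start + chunk_size])

# Document-topic distributions for the first three documents.
doc_topics = lda.transform(X[:3])
print(lda.n_batch_iter_, doc_topics.shape)
```

Because max_iter is ignored by partial_fit, the number of passes over the data is controlled by how many times the caller loops over the chunks.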

perp_tol: float, default=1e-1

Perplexity tolerance in batch learning. Only used when evaluate_every is greater than 0.
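
As a rough illustration of how evaluate_every and perp_tol interact, the sketch below asks fit to check perplexity every second pass and then reads the results back. The values evaluate_every=2 and max_iter=20 are arbitrary choices for the example.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.decomposition import LatentDirichletAllocation

X, _ = make_multilabel_classification(random_state=0)

lda = LatentDirichletAllocation(
    n_components=5,
    learning_method="batch",
    max_iter=20,
    evaluate_every=2,  # evaluate perplexity every 2 passes (illustrative)
    perp_tol=0.1,      # tolerance on the change in perplexity between checks
    random_state=0,
)
lda.fit(X)

# n_iter_ may be smaller than max_iter if the tolerance was reached.
print("passes over the data:", lda.n_iter_)
print("final training perplexity (bound_):", lda.bound_)
print("perplexity recomputed on X:", lda.perplexity(X))
```

Lower perplexity indicates a better fit to the data it is computed on; evaluating it on held-out documents gives a less optimistic estimate than the training value.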

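mean_change_tol: float, default=1e-3

Stopping tolerance for updating the document-topic distribution in the E-step.

max_doc_update_iter: int, default=100

Maximum number of iterations for updating the document-topic distribution in the E-step.

n_jobs: int, default=None

The number of jobs to use in the E-step. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See the Glossary for more details.

verbose: int, default=0

Verbosity level.

random_state: int, RandomState instance or None, default=None

Pass an int for reproducible results across multiple function calls. See the Glossary.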

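Attributes:

components_: ndarray of shape (n_components, n_features): Variational parameters for the topic-word distribution. Since the complete conditional for the topic-word distribution is a Dirichlet, components_[i, j] can be viewed as a pseudocount that represents the number of times word j was assigned to topic i. It can also be viewed as the distribution over words for each topic after normalization: model.components_ / model.components_.sum(axis=1)[:, np.newaxis].
exp_dirichlet_component_: ndarray of shape (n_components, n_features): Exponential value of the expectation of the log topic-word distribution. In the literature, this is exp(E[log(beta)]).
n_batch_iter_: int: Number of iterations of the EM step.
n_features_in_: int: Number of features seen during fit.
feature_names_in_: ndarray of shape (n_features_in_,): Names of features seen during fit. Defined only when X has feature names that are all strings.
n_iter_: int: Number of passes over the dataset.
bound_: float: Final perplexity score on the training set.
doc_topic_prior_: float: Prior of the document-topic distribution theta. If the value is None, it is 1 / n_components.
random_state_: RandomState instance: RandomState instance that is generated either from a seed, the random number generator, or by np.random.
topic_word_prior_: float: Prior of the topic-word distribution beta. If the value is None, it is 1 / n_components.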
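
The normalization mentioned for components_ can be sketched as follows: each row of pseudocounts is divided by its sum to obtain a per-topic distribution over features. The synthetic data and the choice of listing the three heaviest features per topic are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.decomposition import LatentDirichletAllocation

X, _ = make_multilabel_classification(random_state=0)
lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(X)

# Turn pseudocounts into per-topic distributions over the features.
topic_word = lda.components_ / lda.components_.sum(axis=1)[:, np.newaxis]

# Indices of the three most probable features for each topic.
top_features = np.argsort(topic_word, axis=1)[:, ::-1][:, :3]

for topic_idx, feature_ids in enumerate(top_features):
    print(topic_idx, feature_ids, topic_word[topic_idx, feature_ids].round(3))
```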

References:

[1] "Online Learning for Latent Dirichlet Allocation", Matthew D. Hoffman, David M. Blei, Francis Bach, 2010. https://github.com/blei-lab/onlineldavb
[2] "Stochastic Variational Inference", Matthew D. Hoffman, David M. Blei, Chong Wang, John Paisley, 2013.

Examples:

>>> from sklearn.decomposition import LatentDirichletAllocation
>>> from sklearn.datasets import make_multilabel_classification
>>> # This produces a feature matrix of token counts, similar to what
>>> # CountVectorizer would produce on text.
>>> X, _ = make_multilabel_classification(random_state=0)
>>> lda = LatentDirichletAllocation(n_components=5,
...     random_state=0)
>>> lda.fit(X)
LatentDirichletAllocation(...)
>>> # get topics for some given samples:
>>> lda.transform(X[-2:])
array([[0.00360392, 0.25499205, 0.0036211 , 0.64236448, 0.09541846],
       [0.15297572, 0.00362644, 0.44412786, 0.39568399, 0.003586  ]])


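Note: This article was selected and compiled by 純淨天空 from the original English documentation on scikit-learn.org, sklearn.decomposition.LatentDirichletAllocation. Unless otherwise stated, copyright of the original code belongs to the original authors. Do not reproduce or copy this translation without permission or authorization.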