Python sklearn LatentDirichletAllocation: Usage and Code Examples

This article briefly describes how to use sklearn.decomposition.LatentDirichletAllocation in Python.

Usage: class sklearn.decomposition.LatentDirichletAllocation(n_components=10, *, doc_topic_prior=None, topic_word_prior=None, learning_method='batch', learning_decay=0.7, learning_offset=10.0, max_iter=10, batch_size=128, evaluate_every=-1, total_samples=1000000.0, perp_tol=0.1, mean_change_tol=0.001, max_doc_update_iter=100, n_jobs=None, verbose=0, random_state=None)

Latent Dirichlet Allocation with an online variational Bayes algorithm.

The implementation is based on [1] and [2].

Read more in the User Guide.

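Parameters:

n_components: int, default=10

Number of topics.

doc_topic_prior: float, default=None

Prior of the document topic distribution theta. If the value is None, it defaults to 1 / n_components. In [1], this is called alpha.

topic_word_prior: float, default=None

Prior of the topic word distribution beta. If the value is None, it defaults to 1 / n_components. In [1], this is called eta.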

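learning_method: {'batch', 'online'}, default='batch'

Method used to update components_. Only used in the fit method. In general, if the data size is large, the online update will be much faster than the batch update.

Valid options: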

'batch': Batch variational Bayes method. Use all training data in
    each EM update.
    Old `components_` will be overwritten in each iteration.
'online': Online variational Bayes method. In each EM update, use
    mini-batch of training data to update the ``components_``
    variable incrementally. The learning rate is controlled by the
    ``learning_decay`` and the ``learning_offset`` parameters.
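
As a rough illustration of the two options above, the sketch below fits the same synthetic count matrix once with each learning method. The dataset, the choice of batch_size=32, and the variable names are assumptions made for this example only, not recommendations.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.decomposition import LatentDirichletAllocation

# Synthetic non-negative count matrix, standing in for CountVectorizer output.
X, _ = make_multilabel_classification(random_state=0)

# Batch variational Bayes: every EM update uses the full training set.
batch_lda = LatentDirichletAllocation(
    n_components=5, learning_method="batch", random_state=0
).fit(X)

# Online variational Bayes: every EM update uses one mini-batch of documents.
online_lda = LatentDirichletAllocation(
    n_components=5, learning_method="online", batch_size=32, random_state=0
).fit(X)

print(batch_lda.components_.shape, online_lda.components_.shape)
```

Both models expose components_ with the same shape, so downstream code does not need to know which method was used.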

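learning_decay: float, default=0.7

A parameter that controls the learning rate in the online learning method. The value should lie in the interval (0.5, 1.0] to guarantee asymptotic convergence. When the value is 0.0 and batch_size is n_samples, the update method is the same as batch learning. In the literature, this is called kappa.

learning_offset: float, default=10.0

A (positive) parameter that downweights early iterations in online learning. It should be greater than 1.0. In the literature, this is called tau_0.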
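
To see how kappa and tau_0 interact, recall that [1] uses the step size rho_t = (tau_0 + t)^(-kappa) for the t-th mini-batch update. The helper below, online_step_size, is a hypothetical name introduced only for this sketch; it is not part of scikit-learn's public API.

```python
import numpy as np


def online_step_size(t, learning_decay=0.7, learning_offset=10.0):
    """Step size rho_t = (tau_0 + t) ** (-kappa) from Hoffman et al. (2010).

    `t` is the index of the mini-batch update. This function is an
    illustrative assumption, not a scikit-learn function.
    """
    return np.power(learning_offset + t, -learning_decay)


# The weight given to each new mini-batch shrinks as t grows.
weights = [online_step_size(t) for t in range(5)]
print(weights)

# With learning_decay=0.0 the weight is always 1, so each update fully
# replaces the previous estimate, as in batch learning.
print(online_step_size(3, learning_decay=0.0))
```

A larger learning_offset flattens the early part of the schedule, which is why it downweights the first few iterations.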

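max_iter: int, default=10

The maximum number of passes over the training data (also known as epochs). It only affects the behavior of the fit method, not the partial_fit method.

batch_size: int, default=128

Number of documents to use in each EM iteration. Only used in online learning.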

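evaluate_every: int, default=-1

How often to evaluate perplexity. Only used in the fit method. Set it to 0 or a negative number to skip perplexity evaluation during training entirely. Evaluating perplexity can help you check convergence during training, but it also increases total training time. Evaluating perplexity in every iteration might increase training time up to two-fold.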

total_samples: int, default=1e6

Total number of documents. Only used in the partial_fit method.
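
A minimal sketch of incremental training with partial_fit, assuming the corpus arrives in chunks and its total size is known in advance. The Poisson-generated counts, the chunk size of 50, and the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Synthetic document-term counts: 200 documents, 30 vocabulary terms.
rng = np.random.RandomState(0)
X = rng.poisson(1.0, size=(200, 30))

# total_samples tells the online update how large the full corpus is.
lda = LatentDirichletAllocation(
    n_components=4, total_samples=X.shape[0], random_state=0
)

# Feed the corpus in chunks; max_iter is not used by partial_fit.
for start in range(0, X.shape[0], 50):
    lda.partial_fit(X[start:start + 50])

print(lda.components_.shape)
```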

perp_tol: float, default=1e-1

Perplexity tolerance in batch learning. Only used when evaluate_every is greater than 0.
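
The sketch below shows one way evaluate_every and perp_tol might be combined so that fit can stop before max_iter is reached. The specific values (max_iter=20, evaluate_every=2) are arbitrary assumptions chosen for illustration.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.decomposition import LatentDirichletAllocation

X, _ = make_multilabel_classification(random_state=0)

lda = LatentDirichletAllocation(
    n_components=5,
    learning_method="batch",
    max_iter=20,
    evaluate_every=2,  # compute perplexity on every second pass
    perp_tol=1e-1,     # stop once the perplexity changes by less than this
    random_state=0,
)
lda.fit(X)

# n_iter_ is the number of passes actually run; bound_ is the final
# perplexity on the training set.
print(lda.n_iter_, lda.bound_)
```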

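mean_change_tol: float, default=1e-3

Stopping tolerance for updating the document topic distribution in the E-step.

max_doc_update_iter: int, default=100

Maximum number of iterations for updating the document topic distribution in the E-step.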

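n_jobs: int, default=None

The number of jobs to use in the E-step. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See the Glossary for more details.

verbose: int, default=0

Verbosity level.

random_state: int, RandomState instance or None, default=None

Pass an int for reproducible results across multiple function calls. See the Glossary.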

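Attributes:

components_: ndarray of shape (n_components, n_features): Variational parameters for the topic word distribution. Since the complete conditional for the topic word distribution is a Dirichlet, components_[i, j] can be viewed as a pseudocount that represents the number of times word j was assigned to topic i. After normalization, it can also be viewed as a distribution over the words for each topic: model.components_ / model.components_.sum(axis=1)[:, np.newaxis].
exp_dirichlet_component_: ndarray of shape (n_components, n_features): Exponential value of the expectation of the log topic word distribution. In the literature, this is exp(E[log(beta)]).
n_batch_iter_: int: Number of iterations of the EM step.
n_features_in_: int: Number of features seen during fit.
feature_names_in_: ndarray of shape (n_features_in_,): Names of features seen during fit. Defined only when X has feature names that are all strings.
n_iter_: int: Number of passes over the dataset.
bound_: float: Final perplexity score on the training set.
doc_topic_prior_: float: Prior of the document topic distribution theta. If the value is None, it is 1 / n_components.
random_state_: RandomState instance: RandomState instance that is generated either from a seed, the random number generator, or by np.random.
topic_word_prior_: float: Prior of the topic word distribution beta. If the value is None, it is 1 / n_components.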
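
The normalization described for components_ can be sketched as follows. The top-3 selection and the variable names topic_word and top_features are assumptions added for illustration.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.decomposition import LatentDirichletAllocation

X, _ = make_multilabel_classification(random_state=0)
lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(X)

# Turn pseudocounts into one probability distribution over words per topic.
topic_word = lda.components_ / lda.components_.sum(axis=1)[:, np.newaxis]

# Column indices of the 3 highest-probability features for each topic.
top_features = np.argsort(topic_word, axis=1)[:, ::-1][:, :3]
print(top_features)
```

Each row of topic_word sums to 1, so rows can be compared across topics even when the raw pseudocounts differ in scale.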

References:

[1] "Online Learning for Latent Dirichlet Allocation", Matthew D. Hoffman, David M. Blei, Francis Bach, 2010. https://github.com/blei-lab/onlineldavb
[2] "Stochastic Variational Inference", Matthew D. Hoffman, David M. Blei, Chong Wang, John Paisley, 2013.

Examples:

>>> from sklearn.decomposition import LatentDirichletAllocation
>>> from sklearn.datasets import make_multilabel_classification
>>> # This produces a feature matrix of token counts, similar to what
>>> # CountVectorizer would produce on text.
>>> X, _ = make_multilabel_classification(random_state=0)
>>> lda = LatentDirichletAllocation(n_components=5,
...     random_state=0)
>>> lda.fit(X)
LatentDirichletAllocation(...)
>>> # get topics for some given samples:
>>> lda.transform(X[-2:])
array([[0.00360392, 0.25499205, 0.0036211 , 0.64236448, 0.09541846],
       [0.15297572, 0.00362644, 0.44412786, 0.39568399, 0.003586  ]])
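
The example above uses synthetic counts. As a complementary sketch, the snippet below runs the same estimator on a tiny text corpus vectorized with CountVectorizer. The four documents and the choice of two topics are made-up assumptions for illustration, and a corpus this small will not yield meaningful topics.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only to show the pipeline end to end.
docs = [
    "the cat sat on the mat",
    "dogs and cats are common pets",
    "stock markets fell sharply today",
    "investors sold shares as markets dropped",
]

# Convert raw text into a sparse document-term count matrix.
vectorizer = CountVectorizer()
X_text = vectorizer.fit_transform(docs)

lda_text = LatentDirichletAllocation(n_components=2, random_state=0)

# Each row is the estimated topic distribution of one document.
doc_topic = lda_text.fit_transform(X_text)
print(doc_topic.shape)
```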

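Note: This article was selected and compiled by 纯净天空 from the original English documentation for sklearn.decomposition.LatentDirichletAllocation on scikit-learn.org. Unless otherwise stated, copyright in the original code belongs to the original authors. Please do not reproduce or copy this translation without permission or authorization.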