This article briefly introduces the usage of pyspark.mllib.clustering.LDAModel.

Usage:
class pyspark.mllib.clustering.LDAModel(java_model)
A clustering model derived from the LDA method.
Latent Dirichlet Allocation (LDA), a topic model designed for text documents. Terminology (illustrated by the sketch that follows):
- “word” = “term”: an element of the vocabulary
- “token”: an instance of a term appearing in a document
- “topic”: a multinomial distribution over terms representing some concept
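As a minimal, Spark-free sketch of how this terminology maps onto the input that LDA.train consumes, namely (document id, term-count vector) pairs, consider the following; the vocabulary, documents, and term_counts helper here are illustrative assumptions and not part of the original documentation.

from pyspark.mllib.linalg import Vectors

# Each vocabulary entry is a "word"/"term"; each element of a document's token
# list is a "token" (one occurrence of a term in that document).
vocabulary = ["spark", "python"]
documents = [["spark", "spark", "python"],
             ["python"]]

def term_counts(tokens):
    # Hypothetical helper: count how often each vocabulary term occurs in one document.
    counts = [0.0] * len(vocabulary)
    for token in tokens:
        counts[vocabulary.index(token)] += 1.0
    return Vectors.dense(counts)

# (document id, term-count vector) pairs, the format expected by LDA.train.
corpus = [[doc_id, term_counts(tokens)]
          for doc_id, tokens in enumerate(documents, start=1)]
# corpus -> [[1, DenseVector([2.0, 1.0])], [2, DenseVector([0.0, 1.0])]]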
New in version 1.5.0.
Notes:

See the original LDA paper (journal version) [1].

[1] Blei, D., et al. “Latent Dirichlet Allocation.” J. Mach. Learn. Res. 3 (2003): 993-1022. https://www.jmlr.org/papers/v3/blei03a
Examples:
>>> from pyspark.mllib.clustering import LDA, LDAModel
>>> from pyspark.mllib.linalg import Vectors, SparseVector
>>> from numpy import array
>>> from numpy.testing import assert_almost_equal, assert_equal
>>> data = [
...     [1, Vectors.dense([0.0, 1.0])],
...     [2, SparseVector(2, {0: 1.0})],
... ]
>>> rdd = sc.parallelize(data)
>>> model = LDA.train(rdd, k=2, seed=1)
>>> model.vocabSize()
2
>>> model.describeTopics()
[([1, 0], [0.5..., 0.49...]), ([0, 1], [0.5..., 0.49...])]
>>> model.describeTopics(1)
[([1], [0.5...]), ([0], [0.5...])]

>>> topics = model.topicsMatrix()
>>> topics_expect = array([[0.5, 0.5], [0.5, 0.5]])
>>> assert_almost_equal(topics, topics_expect, 1)

>>> import os, tempfile
>>> from shutil import rmtree
>>> path = tempfile.mkdtemp()
>>> model.save(sc, path)
>>> sameModel = LDAModel.load(sc, path)
>>> assert_equal(sameModel.topicsMatrix(), model.topicsMatrix())
>>> sameModel.vocabSize() == model.vocabSize()
True
>>> try:
...     rmtree(path)
... except OSError:
...     pass
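Building on the doctest above, the following is a hedged usage sketch rather than part of the original documentation: it assumes an existing SparkContext named sc (as in the examples) and an illustrative three-term vocabulary, and shows how the term indices returned by describeTopics() can be mapped back to readable terms.

from pyspark.mllib.clustering import LDA
from pyspark.mllib.linalg import Vectors

# Illustrative vocabulary; index i in a count vector corresponds to vocabulary[i].
vocabulary = ["spark", "python", "scala"]

# (document id, term-count vector) pairs over that vocabulary.
corpus = sc.parallelize([
    [1, Vectors.dense([2.0, 1.0, 0.0])],
    [2, Vectors.dense([0.0, 1.0, 3.0])],
])
model = LDA.train(corpus, k=2, seed=1)

# describeTopics() returns, per topic, term indices sorted by descending weight
# together with the weights; mapping indices back to the vocabulary makes the
# topics readable.
for topic_id, (term_indices, term_weights) in enumerate(model.describeTopics(maxTermsPerTopic=2)):
    top_terms = [(vocabulary[i], weight) for i, weight in zip(term_indices, term_weights)]
    print("topic %d:" % topic_id, top_terms)

With only two tiny documents the learned weights are not meaningful; the point of the sketch is only the index-to-term mapping.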
Related usage
- Python pyspark LDA.setLearningDecay usage and code examples
- Python pyspark LDA.setDocConcentration usage and code examples
- Python pyspark LDA usage and code examples
- Python pyspark LDA.setOptimizer usage and code examples
- Python pyspark LDA.setK usage and code examples
- Python pyspark LDA.setLearningOffset usage and code examples
- Python pyspark LDA.setTopicDistributionCol usage and code examples
- Python pyspark LDA.setKeepLastCheckpoint usage and code examples
- Python pyspark LDA.setSubsamplingRate usage and code examples
- Python pyspark LDA.setTopicConcentration usage and code examples
- Python pyspark LDA.setOptimizeDocConcentration usage and code examples
- Python pyspark LogisticRegressionWithLBFGS.train usage and code examples
- Python pyspark LinearRegressionModel usage and code examples
- Python pyspark LinearSVC usage and code examples
- Python pyspark LinearRegression usage and code examples
- Python pyspark LassoModel usage and code examples
- Python pyspark LogisticRegressionModel usage and code examples
- Python pyspark LogisticRegression usage and code examples
- Python pyspark create_map usage and code examples
- Python pyspark date_add usage and code examples
- Python pyspark DataFrame.to_latex usage and code examples
- Python pyspark DataStreamReader.schema usage and code examples
- Python pyspark MultiIndex.size usage and code examples
- Python pyspark arrays_overlap usage and code examples
- Python pyspark Series.asof usage and code examples
Note: This article was selected and compiled by 純淨天空 from the original English work pyspark.mllib.clustering.LDAModel on spark.apache.org. Unless otherwise stated, copyright of the original code belongs to the original author; do not reproduce or copy this translation without permission or authorization.