Python pyspark LDA用法及代碼示例

本文簡要介紹 pyspark.ml.clustering.LDA 的用法。

用法: class pyspark.ml.clustering.LDA(*, featuresCol='features', maxIter=20, seed=None, checkpointInterval=10, k=10, optimizer='online', learningOffset=1024.0, learningDecay=0.51, subsamplingRate=0.05, optimizeDocConcentration=True, docConcentration=None, topicConcentration=None, topicDistributionCol='topicDistribution', keepLastCheckpoint=True)

潛在狄利克雷分配 (LDA)，一種專為文本文檔設計的主題模型。

術語：

“term” = “word”：詞匯表的一個元素
“token”：文檔中出現的術語實例
“topic”：代表某些概念的術語的多項分布
“document”：一段文字，對應輸入數據中的一行

原始 LDA 論文(期刊版)：：

布萊、吳和喬丹。 “潛在狄利克雷分配。” JMLR，2003 年。

輸入數據(featuresCol)：通過 featuresCol 參數給 LDA 一個文檔集合作為輸入數據。每個文檔都指定為長度為 vocabSize 的Vector，其中每個條目是文檔中相應術語(單詞)的計數。 pyspark.ml.feature.Tokenizer 和 pyspark.ml.feature.CountVectorizer 等特征轉換器可用於將文本轉換為字數向量。

2.0.0 版中的新函數。

例子：

>>> from pyspark.ml.linalg import Vectors, SparseVector
>>> from pyspark.ml.clustering import LDA
>>> df = spark.createDataFrame([[1, Vectors.dense([0.0, 1.0])],
...      [2, SparseVector(2, {0: 1.0})],], ["id", "features"])
>>> lda = LDA(k=2, seed=1, optimizer="em")
>>> lda.setMaxIter(10)
LDA...
>>> lda.getMaxIter()
10
>>> lda.clear(lda.maxIter)
>>> model = lda.fit(df)
>>> model.setSeed(1)
DistributedLDAModel...
>>> model.getTopicDistributionCol()
'topicDistribution'
>>> model.isDistributed()
True
>>> localModel = model.toLocal()
>>> localModel.isDistributed()
False
>>> model.vocabSize()
2
>>> model.describeTopics().show()
+-----+-----------+--------------------+
|topic|termIndices|         termWeights|
+-----+-----------+--------------------+
|    0|     [1, 0]|[0.50401530077160...|
|    1|     [0, 1]|[0.50401530077160...|
+-----+-----------+--------------------+
...
>>> model.topicsMatrix()
DenseMatrix(2, 2, [0.496, 0.504, 0.504, 0.496], 0)
>>> lda_path = temp_path + "/lda"
>>> lda.save(lda_path)
>>> sameLDA = LDA.load(lda_path)
>>> distributed_model_path = temp_path + "/lda_distributed_model"
>>> model.save(distributed_model_path)
>>> sameModel = DistributedLDAModel.load(distributed_model_path)
>>> local_model_path = temp_path + "/lda_local_model"
>>> localModel.save(local_model_path)
>>> sameLocalModel = LocalLDAModel.load(local_model_path)
>>> model.transform(df).take(1) == sameLocalModel.transform(df).take(1)
True

相關用法

注：本文由純淨天空篩選整理自spark.apache.org大神的英文原創作品 pyspark.ml.clustering.LDA。非經特殊聲明，原始代碼版權歸原作者所有，本譯文未經允許或授權，請勿轉載或複製。