Python pyspark LDA usage and code examples

This article briefly introduces the usage of pyspark.ml.clustering.LDA.

Usage: class pyspark.ml.clustering.LDA(*, featuresCol='features', maxIter=20, seed=None, checkpointInterval=10, k=10, optimizer='online', learningOffset=1024.0, learningDecay=0.51, subsamplingRate=0.05, optimizeDocConcentration=True, docConcentration=None, topicConcentration=None, topicDistributionCol='topicDistribution', keepLastCheckpoint=True)

Latent Dirichlet Allocation (LDA), a topic model designed for text documents.

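Terminology:

"term" = "word": an element of the vocabulary
"token": an instance of a term appearing in a document
"topic": a multinomial distribution over terms that represents some concept
"document": one piece of text, corresponding to one row in the input data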
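
The distinction between these terms can be illustrated with a small pure-Python sketch. It is not part of the Spark API: the toy document, the whitespace split standing in for a real tokenizer, and the uniform topic are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy document; splitting on whitespace stands in for a real tokenizer.
document = "spark makes big data simple and spark makes it fast"

tokens = document.split()        # every occurrence of a word is one token
terms = sorted(set(tokens))      # distinct words are the terms of the vocabulary
token_counts = Counter(tokens)   # how many tokens each term contributes

# A topic is a probability distribution over terms.
# A uniform distribution is used here purely as a placeholder.
uniform_topic = {term: 1.0 / len(terms) for term in terms}

print(len(tokens), len(terms), token_counts["spark"])
```

Here the document contains more tokens than terms because "spark" and "makes" each occur twice.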

Original LDA paper (journal version):

Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003.

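Input data (featuresCol): LDA takes a collection of documents as input through the featuresCol parameter. Each document is specified as a Vector of length vocabSize, where each entry is the count of the corresponding term (word) in the document. Feature transformers such as pyspark.ml.feature.Tokenizer and pyspark.ml.feature.CountVectorizer can be used to convert text into word count vectors.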
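
To make the expected input layout concrete, here is a minimal pure-Python sketch of how a count vector of length vocabSize could be built by hand. The helper `count_vector`, the vocabulary, and the two-document corpus are hypothetical; in Spark this work would normally be left to transformers such as Tokenizer and CountVectorizer rather than done manually.

```python
def count_vector(tokens, vocabulary):
    """Return a list of length len(vocabulary) holding per-term counts."""
    index = {term: i for i, term in enumerate(vocabulary)}
    vector = [0.0] * len(vocabulary)
    for token in tokens:
        if token in index:  # tokens outside the vocabulary are ignored
            vector[index[token]] += 1.0
    return vector

# Hypothetical vocabulary (vocabSize = 3) and corpus of two documents.
vocabulary = ["data", "spark", "topic"]
corpus = ["spark spark data", "topic model data"]

# One count vector per document, i.e. one per input row.
vectors = [count_vector(doc.split(), vocabulary) for doc in corpus]
print(vectors)
```

Each resulting list could then be wrapped with Vectors.dense and placed in the features column of a DataFrame, as the example on this page does with its own toy data.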

New in version 2.0.0.

Examples:

>>> from pyspark.ml.linalg import Vectors, SparseVector
>>> from pyspark.ml.clustering import LDA, DistributedLDAModel, LocalLDAModel
>>> df = spark.createDataFrame([[1, Vectors.dense([0.0, 1.0])],
...      [2, SparseVector(2, {0: 1.0})],], ["id", "features"])
>>> lda = LDA(k=2, seed=1, optimizer="em")
>>> lda.setMaxIter(10)
LDA...
>>> lda.getMaxIter()
10
>>> lda.clear(lda.maxIter)
>>> model = lda.fit(df)
>>> model.setSeed(1)
DistributedLDAModel...
>>> model.getTopicDistributionCol()
'topicDistribution'
>>> model.isDistributed()
True
>>> localModel = model.toLocal()
>>> localModel.isDistributed()
False
>>> model.vocabSize()
2
>>> model.describeTopics().show()
+-----+-----------+--------------------+
|topic|termIndices|         termWeights|
+-----+-----------+--------------------+
|    0|     [1, 0]|[0.50401530077160...|
|    1|     [0, 1]|[0.50401530077160...|
+-----+-----------+--------------------+
...
>>> model.topicsMatrix()
DenseMatrix(2, 2, [0.496, 0.504, 0.504, 0.496], 0)
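>>> import tempfile
>>> temp_path = tempfile.mkdtemp()  # scratch directory for the save/load round trips
>>> lda_path = temp_path + "/lda"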
>>> lda.save(lda_path)
>>> sameLDA = LDA.load(lda_path)
>>> distributed_model_path = temp_path + "/lda_distributed_model"
>>> model.save(distributed_model_path)
>>> sameModel = DistributedLDAModel.load(distributed_model_path)
>>> local_model_path = temp_path + "/lda_local_model"
>>> localModel.save(local_model_path)
>>> sameLocalModel = LocalLDAModel.load(local_model_path)
>>> model.transform(df).take(1) == sameLocalModel.transform(df).take(1)
True

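Note: This article was selected and compiled by 纯净天空 from the original English documentation for pyspark.ml.clustering.LDA at spark.apache.org. Unless otherwise stated, copyright of the original code belongs to the original authors. Do not reproduce or copy this translation without permission or authorization.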