This article briefly introduces the usage of pyspark.ml.clustering.LDA.
Usage:
class pyspark.ml.clustering.LDA(*, featuresCol='features', maxIter=20, seed=None, checkpointInterval=10, k=10, optimizer='online', learningOffset=1024.0, learningDecay=0.51, subsamplingRate=0.05, optimizeDocConcentration=True, docConcentration=None, topicConcentration=None, topicDistributionCol='topicDistribution', keepLastCheckpoint=True)
Latent Dirichlet Allocation (LDA), a topic model designed for text documents.
Terminology:
"term" = "word": an element of the vocabulary
"token": an instance of a term appearing in a document
"topic": a multinomial distribution over terms representing some concept
"document": a piece of text, corresponding to one row in the input data
Original LDA paper (journal version): Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003.
Input data (featuresCol): LDA is given a collection of documents as input data, via the featuresCol parameter. Each document is specified as a Vector of length vocabSize, where each entry is the count for the corresponding term (word) in the document. Feature transformers such as pyspark.ml.feature.Tokenizer and pyspark.ml.feature.CountVectorizer can be useful for converting text to word count vectors.
New in version 2.0.0.
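As a brief sketch of that preprocessing step (the toy corpus text_df and the variable names below are illustrative, not part of the original page):
>>> from pyspark.ml.feature import Tokenizer, CountVectorizer
>>> text_df = spark.createDataFrame([(0, "spark spark streaming"),
...     (1, "topic model for text")], ["id", "text"])
>>> tokens = Tokenizer(inputCol="text", outputCol="words").transform(text_df)
>>> cvModel = CountVectorizer(inputCol="words", outputCol="features").fit(tokens)
>>> countsDF = cvModel.transform(tokens)  # 'features' holds the count vectors LDA reads
>>> sorted(cvModel.vocabulary)  # vocabulary maps describeTopics() termIndices back to words
['for', 'model', 'spark', 'streaming', 'text', 'topic']
>>> LDA(k=2, seed=1).fit(countsDF).vocabSize()
6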
Examples:
>>> from pyspark.ml.linalg import Vectors, SparseVector
>>> from pyspark.ml.clustering import LDA
>>> df = spark.createDataFrame([[1, Vectors.dense([0.0, 1.0])],
...     [2, SparseVector(2, {0: 1.0})],], ["id", "features"])
>>> lda = LDA(k=2, seed=1, optimizer="em")
>>> lda.setMaxIter(10)
LDA...
>>> lda.getMaxIter()
10
>>> lda.clear(lda.maxIter)
>>> model = lda.fit(df)
>>> model.setSeed(1)
DistributedLDAModel...
>>> model.getTopicDistributionCol()
'topicDistribution'
>>> model.isDistributed()
True
>>> localModel = model.toLocal()
>>> localModel.isDistributed()
False
>>> model.vocabSize()
2
>>> model.describeTopics().show()
+-----+-----------+--------------------+
|topic|termIndices|         termWeights|
+-----+-----------+--------------------+
|    0|     [1, 0]|[0.50401530077160...|
|    1|     [0, 1]|[0.50401530077160...|
+-----+-----------+--------------------+
...
>>> model.topicsMatrix()
DenseMatrix(2, 2, [0.496, 0.504, 0.504, 0.496], 0)
>>> lda_path = temp_path + "/lda"
>>> lda.save(lda_path)
>>> sameLDA = LDA.load(lda_path)
>>> distributed_model_path = temp_path + "/lda_distributed_model"
>>> model.save(distributed_model_path)
>>> sameModel = DistributedLDAModel.load(distributed_model_path)
>>> local_model_path = temp_path + "/lda_local_model"
>>> localModel.save(local_model_path)
>>> sameLocalModel = LocalLDAModel.load(local_model_path)
>>> model.transform(df).take(1) == sameLocalModel.transform(df).take(1)
True
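The example above uses the EM optimizer, which yields a DistributedLDAModel. A minimal sketch of the default online optimizer, reusing the df defined above (note that learningOffset, learningDecay, and subsamplingRate only take effect when optimizer='online'):
>>> onlineLDA = LDA(k=2, seed=1, optimizer="online",
...     learningOffset=1024.0, learningDecay=0.51, subsamplingRate=0.05)
>>> onlineModel = onlineLDA.fit(df)
>>> onlineModel.isDistributed()  # online LDA always returns a LocalLDAModel
False
>>> onlineModel.transform(df).columns  # topic mixtures are appended per document
['id', 'features', 'topicDistribution']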
Related usage
- Python pyspark LDA.setLearningDecay usage and code examples
- Python pyspark LDA.setDocConcentration usage and code examples
- Python pyspark LDAModel usage and code examples
- Python pyspark LDA.setOptimizer usage and code examples
- Python pyspark LDA.setK usage and code examples
- Python pyspark LDA.setLearningOffset usage and code examples
- Python pyspark LDA.setTopicDistributionCol usage and code examples
- Python pyspark LDA.setKeepLastCheckpoint usage and code examples
- Python pyspark LDA.setSubsamplingRate usage and code examples
- Python pyspark LDA.setTopicConcentration usage and code examples
- Python pyspark LDA.setOptimizeDocConcentration usage and code examples
- Python pyspark LogisticRegressionWithLBFGS.train usage and code examples
- Python pyspark LinearRegressionModel usage and code examples
- Python pyspark LinearSVC usage and code examples
- Python pyspark LinearRegression usage and code examples
- Python pyspark LassoModel usage and code examples
- Python pyspark LogisticRegressionModel usage and code examples
- Python pyspark LogisticRegression usage and code examples
- Python pyspark create_map usage and code examples
- Python pyspark date_add usage and code examples
- Python pyspark DataFrame.to_latex usage and code examples
- Python pyspark DataStreamReader.schema usage and code examples
- Python pyspark MultiIndex.size usage and code examples
- Python pyspark arrays_overlap usage and code examples
- Python pyspark Series.asof usage and code examples
Note: This article was compiled by 纯净天空 from the original English work pyspark.ml.clustering.LDA on spark.apache.org. Unless otherwise stated, copyright of the original code belongs to the original author; please do not reproduce or copy this translation without permission or authorization.