This article briefly introduces the usage of pyspark.ml.clustering.LDA.

Usage:
class pyspark.ml.clustering.LDA(*, featuresCol='features', maxIter=20, seed=None, checkpointInterval=10, k=10, optimizer='online', learningOffset=1024.0, learningDecay=0.51, subsamplingRate=0.05, optimizeDocConcentration=True, docConcentration=None, topicConcentration=None, topicDistributionCol='topicDistribution', keepLastCheckpoint=True)
Latent Dirichlet Allocation (LDA), a topic model designed for text documents.
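As a minimal construction sketch (the parameter values here are illustrative, not tuned recommendations): every keyword in the signature above can be passed to the constructor or set later through the generated setters, and explainParams() lists each parameter with its documentation and current value.

from pyspark.ml.clustering import LDA

# Illustrative settings; unspecified parameters keep the defaults shown in the signature.
lda = LDA(k=5, optimizer="online", maxIter=20, seed=42)
print(lda.explainParams())  # prints the documentation and current value of every parameter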
Terminology:
"term" = "word": an element of the vocabulary
"token": an instance of a term appearing in a document
"topic": a multinomial distribution over terms that represents some concept
"document": a piece of text, corresponding to one row in the input data
Original LDA paper (journal version):
Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003.
Input data (featuresCol): LDA is given a collection of documents as input data, via the featuresCol parameter. Each document is specified as a Vector of length vocabSize, where each entry is the count for the corresponding term (word) in that document. Feature transformers such as pyspark.ml.feature.Tokenizer and pyspark.ml.feature.CountVectorizer can be used to convert text into word-count vectors, as in the sketch below.

New in version 2.0.0.
Examples:
>>> from pyspark.ml.linalg import Vectors, SparseVector
>>> from pyspark.ml.clustering import LDA
>>> df = spark.createDataFrame([[1, Vectors.dense([0.0, 1.0])],
...     [2, SparseVector(2, {0: 1.0})],], ["id", "features"])
>>> lda = LDA(k=2, seed=1, optimizer="em")
>>> lda.setMaxIter(10)
LDA...
>>> lda.getMaxIter()
10
>>> lda.clear(lda.maxIter)
>>> model = lda.fit(df)
>>> model.setSeed(1)
DistributedLDAModel...
>>> model.getTopicDistributionCol()
'topicDistribution'
>>> model.isDistributed()
True
>>> localModel = model.toLocal()
>>> localModel.isDistributed()
False
>>> model.vocabSize()
2
>>> model.describeTopics().show()
+-----+-----------+--------------------+
|topic|termIndices|         termWeights|
+-----+-----------+--------------------+
|    0|     [1, 0]|[0.50401530077160...|
|    1|     [0, 1]|[0.50401530077160...|
+-----+-----------+--------------------+
...
>>> model.topicsMatrix()
DenseMatrix(2, 2, [0.496, 0.504, 0.504, 0.496], 0)
>>> lda_path = temp_path + "/lda"
>>> lda.save(lda_path)
>>> sameLDA = LDA.load(lda_path)
>>> distributed_model_path = temp_path + "/lda_distributed_model"
>>> model.save(distributed_model_path)
>>> sameModel = DistributedLDAModel.load(distributed_model_path)
>>> local_model_path = temp_path + "/lda_local_model"
>>> localModel.save(local_model_path)
>>> sameLocalModel = LocalLDAModel.load(local_model_path)
>>> model.transform(df).take(1) == sameLocalModel.transform(df).take(1)
True
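Continuing the example above, the fitted model can also be scored on a dataset: logLikelihood and logPerplexity are part of the LDAModel API, and a lower perplexity on held-out documents generally indicates a better fit. (Both return a float; the exact values depend on the fitted model, so none are shown here.)

>>> ll = model.logLikelihood(df)
>>> lp = model.logPerplexity(df)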
Related usage
- Python pyspark LDA.setLearningDecay Usage and Code Examples
- Python pyspark LDA.setDocConcentration Usage and Code Examples
- Python pyspark LDAModel Usage and Code Examples
- Python pyspark LDA.setOptimizer Usage and Code Examples
- Python pyspark LDA.setK Usage and Code Examples
- Python pyspark LDA.setLearningOffset Usage and Code Examples
- Python pyspark LDA.setTopicDistributionCol Usage and Code Examples
- Python pyspark LDA.setKeepLastCheckpoint Usage and Code Examples
- Python pyspark LDA.setSubsamplingRate Usage and Code Examples
- Python pyspark LDA.setTopicConcentration Usage and Code Examples
- Python pyspark LDA.setOptimizeDocConcentration Usage and Code Examples
- Python pyspark LogisticRegressionWithLBFGS.train Usage and Code Examples
- Python pyspark LinearRegressionModel Usage and Code Examples
- Python pyspark LinearSVC Usage and Code Examples
- Python pyspark LinearRegression Usage and Code Examples
- Python pyspark LassoModel Usage and Code Examples
- Python pyspark LogisticRegressionModel Usage and Code Examples
- Python pyspark LogisticRegression Usage and Code Examples
- Python pyspark create_map Usage and Code Examples
- Python pyspark date_add Usage and Code Examples
- Python pyspark DataFrame.to_latex Usage and Code Examples
- Python pyspark DataStreamReader.schema Usage and Code Examples
- Python pyspark MultiIndex.size Usage and Code Examples
- Python pyspark arrays_overlap Usage and Code Examples
- Python pyspark Series.asof Usage and Code Examples