Introduction to Topic Extraction with NMF and LDA
Non-negative Matrix Factorization is abbreviated as NMF.
Latent Dirichlet Allocation is abbreviated as LDA.
This article is an example of applying sklearn.decomposition.NMF
and sklearn.decomposition.LatentDirichletAllocation to a corpus of documents
and extracting additive models of the corpus's topic structure. The output is a list of topics, each represented as a list of words.
Non-negative Matrix Factorization is applied with two different objective functions: the Frobenius norm and the generalized Kullback-Leibler divergence. The latter is equivalent to Probabilistic Latent Semantic Indexing (PLSI).
With the default parameters (n_samples / n_features / n_components), the example runs in a few tens of seconds. You can try increasing the dimensions, but keep the time complexity in mind to avoid overly long runs: NMF's time complexity is polynomial, while LDA's is proportional to n_samples * iterations (number of samples times number of iterations).
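The two objective functions correspond to NMF's beta_loss parameter. A minimal standalone sketch (toy random matrix, not the newsgroups data) contrasting them; reconstruction_err_ reports the value of whichever loss was minimized:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)
X = rng.rand(20, 10)  # small non-negative matrix standing in for a tf-idf matrix

# Frobenius norm (the default objective).
nmf_fro = NMF(n_components=3, init='nndsvda', random_state=0,
              max_iter=500).fit(X)

# Generalized Kullback-Leibler divergence; requires the
# multiplicative-update ('mu') solver.
nmf_kl = NMF(n_components=3, init='nndsvda', solver='mu',
             beta_loss='kullback-leibler', random_state=0,
             max_iter=500).fit(X)

print(nmf_fro.reconstruction_err_)  # Frobenius norm of X - WH
print(nmf_kl.reconstruction_err_)   # generalized KL divergence of X from WH
```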
Code implementation [Python]
# -*- coding: utf-8 -*-
# Author: Olivier Grisel
# Lars Buitinck
# Chyi-Kwei Yau
# License: BSD 3 clause
from time import time
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups
n_samples = 2000
n_features = 1000
n_components = 10
n_top_words = 20
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()
# Load the 20 newsgroups dataset and vectorize it. We use a few heuristics
# to filter out useless terms early on: the posts are stripped of headers,
# footers and quoted replies, and common English words, words occurring in
# only one document, and words occurring in at least 95% of the documents
# are removed.
print("Loading dataset...")
t0 = time()
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))
data_samples = dataset.data[:n_samples]
print("done in %0.3fs." % (time() - t0))
# Use tf-idf features for NMF.
print("Extracting tf-idf features for NMF...")
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=n_features,
                                   stop_words='english')
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))
# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')
t0 = time()
tf = tf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))
print()
# Fit the NMF model (Frobenius norm)
print("Fitting the NMF model (Frobenius norm) with tf-idf features, "
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
t0 = time()
# Note: in scikit-learn >= 1.2 the alpha parameter was replaced by
# alpha_W / alpha_H.
nmf = NMF(n_components=n_components, random_state=1,
          alpha=.1, l1_ratio=.5).fit(tfidf)
print("done in %0.3fs." % (time() - t0))
print("\nTopics in NMF model (Frobenius norm):")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words)
# Fit the NMF model with the generalized Kullback-Leibler divergence
print("Fitting the NMF model (generalized Kullback-Leibler divergence) with "
      "tf-idf features, n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
t0 = time()
nmf = NMF(n_components=n_components, random_state=1,
          beta_loss='kullback-leibler', solver='mu', max_iter=1000, alpha=.1,
          l1_ratio=.5).fit(tfidf)
print("done in %0.3fs." % (time() - t0))
print("\nTopics in NMF model (generalized Kullback-Leibler divergence):")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words)
print("Fitting LDA models with tf features, "
      "n_samples=%d and n_features=%d..."
      % (n_samples, n_features))
lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
t0 = time()
lda.fit(tf)
print("done in %0.3fs." % (time() - t0))
print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)
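The fitted models can also be applied to unseen documents: vectorizing new text with the same vectorizer and calling transform yields each document's topic distribution. A hedged, standalone sketch using a toy corpus (hypothetical data, not part of the original example):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus standing in for the 20 newsgroups samples.
docs = ["the cat sat on the mat",
        "dogs and cats are friendly pets",
        "stock markets fell sharply today",
        "investors sold their shares today"]
tf_vec = CountVectorizer(stop_words='english')
tf_small = tf_vec.fit_transform(docs)

lda_small = LatentDirichletAllocation(n_components=2,
                                      random_state=0).fit(tf_small)

# Each row of the result is a per-document topic distribution summing to 1.
doc_topics = lda_small.transform(tf_vec.transform(["cats and dogs"]))
print(doc_topics.shape)  # (1, 2)
```

The same pattern works for NMF: nmf.transform(tfidf_vectorizer.transform(new_docs)) returns the per-document topic weights, though for NMF the rows are not normalized to sum to 1.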
Code execution
Total running time: approximately 0 minutes 13.781 seconds.
The text output of the run is shown below. Note: the output lists the words for each topic but omits each word's weight.
Loading dataset...
done in 7.911s.
Extracting tf-idf features for NMF...
done in 0.268s.
Extracting tf features for LDA...
done in 0.254s.

Fitting the NMF model (Frobenius norm) with tf-idf features, n_samples=2000 and n_features=1000...
done in 0.406s.

Topics in NMF model (Frobenius norm):
Topic #0: just people don think like know time good make way really say right ve want did ll new use years
Topic #1: windows use dos using window program os drivers application help software pc running ms screen files version card code work
Topic #2: god jesus bible faith christian christ christians does heaven sin believe lord life church mary atheism belief human love religion
Topic #3: thanks know does mail advance hi info interested email anybody looking card help like appreciated information send list video need
Topic #4: car cars tires miles 00 new engine insurance price condition oil power speed good 000 brake year models used bought
Topic #5: edu soon com send university internet mit ftp mail cc pub article information hope program mac email home contact blood
Topic #6: file problem files format win sound ftp pub read save site help image available create copy running memory self version
Topic #7: game team games year win play season players nhl runs goal hockey toronto division flyers player defense leafs bad teams
Topic #8: drive drives hard disk floppy software card mac computer power scsi controller apple mb 00 pc rom sale problem internal
Topic #9: key chip clipper keys encryption government public use secure enforcement phone nsa communications law encrypted security clinton used legal standard

Fitting the NMF model (generalized Kullback-Leibler divergence) with tf-idf features, n_samples=2000 and n_features=1000...
done in 1.769s.

Topics in NMF model (generalized Kullback-Leibler divergence):
Topic #0: just people don like did know make really right think say things time look way didn ve course probably good
Topic #1: help thanks windows know hi need using does looking anybody appreciated card mail software use info email ftp available pc
Topic #2: does god believe know mean true christians read point jesus christian church come people fact says religion say agree bible
Topic #3: know thanks mail interested like new just bike email edu advance want contact really list heard com post hear information
Topic #4: 10 new 30 12 20 50 11 sale 16 15 time 14 old power ago good 100 great offer cost
Topic #5: number 1993 data subject government new numbers provide information space following com research include large note group major time talk
Topic #6: edu problem file com remember try soon article mike files code program sun free send think cases manager little called
Topic #7: game year team games world fact second case won said win division play best clearly claim allow example used doesn
Topic #8: think don drive hard need bit mac make sure read apple going comes disk computer case pretty drives software ve
Topic #9: good just use like doesn got way don ll going does chip better doing bad key want sure bit car

Fitting LDA models with tf features, n_samples=2000 and n_features=1000...
done in 3.167s.

Topics in LDA model:
Topic #0: edu com mail send graphics ftp pub available contact university list faq ca information cs 1993 program sun uk mit
Topic #1: don like just know think ve way use right good going make sure ll point got need really time doesn
Topic #2: christian think atheism faith pittsburgh new bible radio games alt lot just religion like book read play time subject believe
Topic #3: drive disk windows thanks use card drives hard version pc software file using scsi help does new dos controller 16
Topic #4: hiv health aids disease april medical care research 1993 light information study national service test led 10 page new drug
Topic #5: god people does just good don jesus say israel way life know true fact time law want believe make think
Topic #6: 55 10 11 18 15 team game 19 period play 23 12 13 flyers 20 25 22 17 24 16
Topic #7: car year just cars new engine like bike good oil insurance better tires 000 thing speed model brake driving performance
Topic #8: people said did just didn know time like went think children came come don took years say dead told started
Topic #9: key space law government public use encryption earth section security moon probe enforcement keys states lunar military crime surface technology
Source code downloads
- Python source file: plot_topics_extraction_with_nmf_lda.py
- Jupyter Notebook source file: plot_topics_extraction_with_nmf_lda.ipynb