Python dask_ml.feature_extraction.text.CountVectorizer: usage and code examples

Usage: class dask_ml.feature_extraction.text.CountVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)

Convert a collection of text documents to a matrix of token counts.

Notes:

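When no vocabulary is provided, fit_transform requires two passes over the dataset: one to learn the vocabulary and a second to transform the data. If you are not providing a vocabulary, consider persisting the data in (distributed) memory before calling fit or transform.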
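The two passes can be illustrated with a minimal plain-Python sketch. This is not the Dask-ML implementation; the helper names tokenize, learn_vocabulary, and count_tokens are hypothetical and exist only for this illustration. The sketch assumes the default lowercase=True and the default token_pattern from the signature above.

```python
import re
from collections import Counter

# Same regular expression as the default token_pattern:
# tokens are runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(doc):
    """Lowercase a document and split it into tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(doc.lower())


def learn_vocabulary(docs):
    """Pass 1: scan every document and map each distinct token to a column."""
    tokens = set()
    for doc in docs:
        tokens.update(tokenize(doc))
    return {token: index for index, token in enumerate(sorted(tokens))}


def count_tokens(docs, vocabulary):
    """Pass 2: scan every document again and emit one row of counts each.

    Tokens that are not in the vocabulary are ignored.
    """
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(token, 0) for token in vocabulary])
    return rows


docs = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# No vocabulary supplied: the data is read twice.
vocabulary = learn_vocabulary(docs)      # first pass
rows = count_tokens(docs, vocabulary)    # second pass

# Vocabulary supplied up front: only the counting pass is needed.
fixed_vocabulary = {"document": 0, "first": 1}
fixed_rows = count_tokens(docs, fixed_vocabulary)
```

Because the documents are read once per pass, data that is expensive to load or compute gets loaded twice unless it is already held in memory, which is the reason for the advice to persist it first. Supplying a vocabulary removes the first pass entirely.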

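Additionally, this implementation benefits from having an active dask.distributed.Client, even on a single machine. When a client is present, the learned vocabulary is persisted in distributed memory, which saves some recomputation and redundant communication.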

Examples:

The Dask-ML implementation currently requires that raw_documents is a dask.bag.Bag of documents (lists of strings).

>>> from dask_ml.feature_extraction.text import CountVectorizer
>>> import dask.bag as db
>>> from distributed import Client
>>> client = Client()
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> corpus = db.from_sequence(corpus, npartitions=2)
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> X
dask.array<concatenate, shape=(nan, 9), dtype=int64, chunksize=(nan, 9), ...
           chunktype=scipy.csr_matrix>
>>> X.compute().toarray()
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 1],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]])
>>> vectorizer.get_feature_names()
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

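Note: This article was selected and compiled by 純淨天空 from the original English documentation by dask.org, dask_ml.feature_extraction.text.CountVectorizer. Unless otherwise stated, copyright of the original code belongs to the original authors. Do not reproduce or copy this translation without permission or authorization.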