Python dask_ml.feature_extraction.text.CountVectorizer: usage and code examples

Usage: class dask_ml.feature_extraction.text.CountVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)

Convert a collection of text documents to a matrix of token counts.

Notes:

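When a vocabulary isn't provided, fit_transform requires two passes over the dataset: one to learn the vocabulary and a second to transform the data. If the data fits in (distributed) memory, consider persisting it before calling fit or transform without a vocabulary.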
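The two passes can be illustrated with a minimal pure-Python sketch. This is not the Dask-ML implementation; the helper names `tokenize`, `learn_vocabulary`, and `count_tokens` are hypothetical, and the sketch only assumes the defaults from the signature (lowercase=True and the default token_pattern, shown in the signature with escaped backslashes).

```python
import re
from collections import Counter

# Default token pattern: runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(doc):
    # Hypothetical helper: lowercase the text, then apply the token pattern.
    return TOKEN_PATTERN.findall(doc.lower())


def learn_vocabulary(docs):
    # Pass 1: read every document to collect the distinct tokens,
    # then assign column indices in sorted order.
    tokens = set()
    for doc in docs:
        tokens.update(tokenize(doc))
    return {token: index for index, token in enumerate(sorted(tokens))}


def count_tokens(docs, vocabulary):
    # Pass 2: read every document again and count tokens
    # against the now-fixed vocabulary.
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for token, n in Counter(tokenize(doc)).items():
            if token in vocabulary:
                row[vocabulary[token]] = n
        rows.append(row)
    return rows


docs = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]
vocabulary = learn_vocabulary(docs)      # first pass over docs
matrix = count_tokens(docs, vocabulary)  # second pass over docs
```

Because the column layout is unknown until every document has been seen, counting cannot start before the first pass finishes. When a vocabulary is supplied up front, only the second pass is needed, which is why persisting the data matters mainly in the no-vocabulary case.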

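Additionally, this implementation benefits from having an active dask.distributed.Client, even on a single machine. When a client is present, the learned vocabulary is persisted in distributed memory, which saves some recomputation and redundant communication.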

Examples:

The Dask-ML implementation currently requires that raw_documents is a dask.bag.Bag of documents (lists of strings).

>>> from dask_ml.feature_extraction.text import CountVectorizer
>>> import dask.bag as db
>>> from distributed import Client
>>> client = Client()
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> corpus = db.from_sequence(corpus, npartitions=2)
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> X
dask.array<concatenate, shape=(nan, 9), dtype=int64, chunksize=(nan, 9), ...
           chunktype=scipy.csr_matrix>
>>> X.compute().toarray()
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 1],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]])
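>>> # scikit-learn 1.2 removed get_feature_names; with newer versions, get_feature_names_out() is the replacement.
>>> vectorizer.get_feature_names()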
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
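To read the count matrix, pair each column with the feature name at the same position. The sketch below copies the printed results from the example into plain Python lists so it runs without Dask; the helper `nonzero_counts` is a hypothetical name introduced only for illustration.

```python
# Values copied from the example output above.
feature_names = ['and', 'document', 'first', 'is', 'one',
                 'second', 'the', 'third', 'this']
counts = [
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 2, 0, 1, 0, 1, 1, 0, 1],
    [1, 0, 0, 1, 1, 0, 1, 1, 1],
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
]


def nonzero_counts(row, names):
    # Hypothetical helper: keep only the tokens that occur in this document.
    return {name: n for name, n in zip(names, row) if n}


per_document = [nonzero_counts(row, feature_names) for row in counts]
print(per_document[1])
```

The second row corresponds to 'This document is the second document.', so 'document' is the only token with a count of 2.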

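Note: This article was selected and compiled by 纯净天空 from the original English documentation for dask_ml.feature_extraction.text.CountVectorizer by the authors at dask.org. Unless otherwise stated, copyright of the original code belongs to the original authors. Please do not reproduce or copy this translation without permission or authorization.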