用法:
class dask_ml.feature_extraction.text.CountVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)
将文本文档集合转换为令牌计数矩阵
注意:
如果未提供词汇表,
fit_transform
需要两次遍历数据集:一次用于学习词汇表,第二次用于转换数据。考虑在不提供vocabulary
时调用fit
或transform
之前在(分布式)内存中保存数据。此外,即使在单台机器上,此实现也受益于具有活动的
dask.distributed.Client
。当客户端存在时,学习到的vocabulary
会持久保存在分布式内存中,这样可以节省一些重新计算和冗余通信。例子:
Dask-ML 实现当前要求
raw_documents
是文档(字符串列表)的dask.bag.Bag
。>>> from dask_ml.feature_extraction.text import CountVectorizer >>> import dask.bag as db >>> from distributed import Client >>> client = Client() >>> corpus = [ ... 'This is the first document.', ... 'This document is the second document.', ... 'And this is the third one.', ... 'Is this the first document?', ... ] >>> corpus = db.from_sequence(corpus, npartitions=2) >>> vectorizer = CountVectorizer() >>> X = vectorizer.fit_transform(corpus) dask.array<concatenate, shape=(nan, 9), dtype=int64, chunksize=(nan, 9), ... chunktype=scipy.csr_matrix> >>> X.compute().toarray() array([[0, 1, 1, 1, 0, 0, 1, 0, 1], [0, 2, 0, 1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 0, 1, 1, 1], [0, 1, 1, 1, 0, 0, 1, 0, 1]]) >>> vectorizer.get_feature_names() ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
相关用法
- Python dask_ml.feature_extraction.text.FeatureHasher用法及代码示例
- Python dask_ml.feature_extraction.text.HashingVectorizer用法及代码示例
- Python dask_ml.wrappers.ParallelPostFit用法及代码示例
- Python dask_ml.preprocessing.MinMaxScaler用法及代码示例
- Python dask_ml.preprocessing.Categorizer用法及代码示例
- Python dask_ml.linear_model.LinearRegression用法及代码示例
- Python dask_ml.wrappers.Incremental用法及代码示例
- Python dask_ml.metrics.mean_squared_log_error用法及代码示例
- Python dask_ml.model_selection.GridSearchCV用法及代码示例
- Python dask_ml.preprocessing.OrdinalEncoder用法及代码示例
- Python dask_ml.preprocessing.LabelEncoder用法及代码示例
- Python dask_ml.ensemble.BlockwiseVotingClassifier用法及代码示例
- Python dask_ml.model_selection.train_test_split用法及代码示例
- Python dask_ml.decomposition.PCA用法及代码示例
- Python dask_ml.preprocessing.PolynomialFeatures用法及代码示例
- Python dask_ml.linear_model.LogisticRegression用法及代码示例
- Python dask_ml.xgboost.train用法及代码示例
- Python dask_ml.linear_model.PoissonRegression用法及代码示例
- Python dask_ml.preprocessing.StandardScaler用法及代码示例
- Python dask_ml.preprocessing.QuantileTransformer用法及代码示例
注:本文由纯净天空筛选整理自dask.org大神的英文原创作品 dask_ml.feature_extraction.text.CountVectorizer。非经特殊声明,原始代码版权归原作者所有,本译文未经允许或授权,请勿转载或复制。