用法:
class dask_ml.feature_extraction.text.CountVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)
將文本文檔集合轉換為令牌計數矩陣
注意:
如果未提供詞匯表,
fit_transform
需要兩次遍曆數據集:一次用於學習詞匯表,第二次用於轉換數據。考慮在不提供vocabulary
時調用fit
或transform
之前在(分布式)內存中保存數據。此外,即使在單台機器上,此實現也受益於具有活動的
dask.distributed.Client
。當客戶端存在時,學習到的vocabulary
會持久保存在分布式內存中,這樣可以節省一些重新計算和冗餘通信。例子:
Dask-ML 實現當前要求
raw_documents
是文檔(字符串列表)的dask.bag.Bag
。>>> from dask_ml.feature_extraction.text import CountVectorizer >>> import dask.bag as db >>> from distributed import Client >>> client = Client() >>> corpus = [ ... 'This is the first document.', ... 'This document is the second document.', ... 'And this is the third one.', ... 'Is this the first document?', ... ] >>> corpus = db.from_sequence(corpus, npartitions=2) >>> vectorizer = CountVectorizer() >>> X = vectorizer.fit_transform(corpus) dask.array<concatenate, shape=(nan, 9), dtype=int64, chunksize=(nan, 9), ... chunktype=scipy.csr_matrix> >>> X.compute().toarray() array([[0, 1, 1, 1, 0, 0, 1, 0, 1], [0, 2, 0, 1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 0, 1, 1, 1], [0, 1, 1, 1, 0, 0, 1, 0, 1]]) >>> vectorizer.get_feature_names() ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
相關用法
- Python dask_ml.feature_extraction.text.FeatureHasher用法及代碼示例
- Python dask_ml.feature_extraction.text.HashingVectorizer用法及代碼示例
- Python dask_ml.wrappers.ParallelPostFit用法及代碼示例
- Python dask_ml.preprocessing.MinMaxScaler用法及代碼示例
- Python dask_ml.preprocessing.Categorizer用法及代碼示例
- Python dask_ml.linear_model.LinearRegression用法及代碼示例
- Python dask_ml.wrappers.Incremental用法及代碼示例
- Python dask_ml.metrics.mean_squared_log_error用法及代碼示例
- Python dask_ml.model_selection.GridSearchCV用法及代碼示例
- Python dask_ml.preprocessing.OrdinalEncoder用法及代碼示例
- Python dask_ml.preprocessing.LabelEncoder用法及代碼示例
- Python dask_ml.ensemble.BlockwiseVotingClassifier用法及代碼示例
- Python dask_ml.model_selection.train_test_split用法及代碼示例
- Python dask_ml.decomposition.PCA用法及代碼示例
- Python dask_ml.preprocessing.PolynomialFeatures用法及代碼示例
- Python dask_ml.linear_model.LogisticRegression用法及代碼示例
- Python dask_ml.xgboost.train用法及代碼示例
- Python dask_ml.linear_model.PoissonRegression用法及代碼示例
- Python dask_ml.preprocessing.StandardScaler用法及代碼示例
- Python dask_ml.preprocessing.QuantileTransformer用法及代碼示例
注:本文由純淨天空篩選整理自dask.org大神的英文原創作品 dask_ml.feature_extraction.text.CountVectorizer。非經特殊聲明,原始代碼版權歸原作者所有,本譯文未經允許或授權,請勿轉載或複製。