本文简要介绍
pyspark.ml.feature.CountVectorizer
的用法。用法:
class pyspark.ml.feature.CountVectorizer(*, minTF=1.0, minDF=1.0, maxDF=9223372036854775807, vocabSize=262144, binary=False, inputCol=None, outputCol=None)
从文档集合中提取词汇并生成
CountVectorizerModel
。版本 1.6.0 中的新函数。
例子:
>>> df = spark.createDataFrame( ... [(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])], ... ["label", "raw"]) >>> cv = CountVectorizer() >>> cv.setInputCol("raw") CountVectorizer... >>> cv.setOutputCol("vectors") CountVectorizer... >>> model = cv.fit(df) >>> model.setInputCol("raw") CountVectorizerModel... >>> model.transform(df).show(truncate=False) +-----+---------------+-------------------------+ |label|raw |vectors | +-----+---------------+-------------------------+ |0 |[a, b, c] |(3,[0,1,2],[1.0,1.0,1.0])| |1 |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])| +-----+---------------+-------------------------+ ... >>> sorted(model.vocabulary) == ['a', 'b', 'c'] True >>> countVectorizerPath = temp_path + "/count-vectorizer" >>> cv.save(countVectorizerPath) >>> loadedCv = CountVectorizer.load(countVectorizerPath) >>> loadedCv.getMinDF() == cv.getMinDF() True >>> loadedCv.getMinTF() == cv.getMinTF() True >>> loadedCv.getVocabSize() == cv.getVocabSize() True >>> modelPath = temp_path + "/count-vectorizer-model" >>> model.save(modelPath) >>> loadedModel = CountVectorizerModel.load(modelPath) >>> loadedModel.vocabulary == model.vocabulary True >>> loadedModel.transform(df).take(1) == model.transform(df).take(1) True >>> fromVocabModel = CountVectorizerModel.from_vocabulary(["a", "b", "c"], ... inputCol="raw", outputCol="vectors") >>> fromVocabModel.transform(df).show(truncate=False) +-----+---------------+-------------------------+ |label|raw |vectors | +-----+---------------+-------------------------+ |0 |[a, b, c] |(3,[0,1,2],[1.0,1.0,1.0])| |1 |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])| +-----+---------------+-------------------------+ ...
相关用法
- Python pyspark Column.withField用法及代码示例
- Python pyspark Column.eqNullSafe用法及代码示例
- Python pyspark Column.desc_nulls_first用法及代码示例
- Python pyspark Column.rlike用法及代码示例
- Python pyspark Column.substr用法及代码示例
- Python pyspark Column.when用法及代码示例
- Python pyspark Column.isNotNull用法及代码示例
- Python pyspark CoordinateMatrix.entries用法及代码示例
- Python pyspark Column.bitwiseAND用法及代码示例
- Python pyspark Column.isNull用法及代码示例
- Python pyspark CoordinateMatrix.numCols用法及代码示例
- Python pyspark Column.between用法及代码示例
- Python pyspark Column.contains用法及代码示例
- Python pyspark CoordinateMatrix.toRowMatrix用法及代码示例
- Python pyspark Column.cast用法及代码示例
- Python pyspark Column.like用法及代码示例
- Python pyspark Column.endswith用法及代码示例
- Python pyspark Column.asc_nulls_first用法及代码示例
- Python pyspark Column.getField用法及代码示例
- Python pyspark Column.isin用法及代码示例
- Python pyspark CoordinateMatrix.toIndexedRowMatrix用法及代码示例
- Python pyspark Column.dropFields用法及代码示例
- Python pyspark Correlation.corr用法及代码示例
- Python pyspark Column.bitwiseOR用法及代码示例
- Python pyspark Column.desc_nulls_last用法及代码示例
注:本文由纯净天空筛选整理自spark.apache.org大神的英文原创作品 pyspark.ml.feature.CountVectorizer。非经特殊声明,原始代码版权归原作者所有,本译文未经允许或授权,请勿转载或复制。