This article briefly describes the usage of pyspark.ml.feature.IDF.
Usage:
class pyspark.ml.feature.IDF(*, minDocFreq=0, inputCol=None, outputCol=None)
Computes the inverse document frequency (IDF) for a given collection of documents.
New in version 1.4.0.
Examples:
>>> # `spark` (a SparkSession) and `temp_path` are supplied by the doctest environment.
>>> from pyspark.ml.feature import IDF, IDFModel
>>> from pyspark.ml.linalg import DenseVector
>>> df = spark.createDataFrame([(DenseVector([1.0, 2.0]),),
...     (DenseVector([0.0, 1.0]),), (DenseVector([3.0, 0.2]),)], ["tf"])
>>> idf = IDF(minDocFreq=3)
>>> idf.setInputCol("tf")
IDF...
>>> idf.setOutputCol("idf")
IDF...
>>> model = idf.fit(df)
>>> model.setOutputCol("idf")
IDFModel...
>>> model.getMinDocFreq()
3
>>> model.idf
DenseVector([0.0, 0.0])
>>> model.docFreq
[0, 3]
>>> model.numDocs == df.count()
True
>>> model.transform(df).head().idf
DenseVector([0.0, 0.0])
>>> idf.setParams(outputCol="freqs").fit(df).transform(df).collect()[1].freqs
DenseVector([0.0, 0.0])
>>> params = {idf.minDocFreq: 1, idf.outputCol: "vector"}
>>> idf.fit(df, params).transform(df).head().vector
DenseVector([0.2877, 0.0])
>>> idfPath = temp_path + "/idf"
>>> idf.save(idfPath)
>>> loadedIdf = IDF.load(idfPath)
>>> loadedIdf.getMinDocFreq() == idf.getMinDocFreq()
True
>>> modelPath = temp_path + "/idf-model"
>>> model.save(modelPath)
>>> loadedModel = IDFModel.load(modelPath)
>>> loadedModel.transform(df).head().idf == model.transform(df).head().idf
True
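The numbers in the example can be checked by hand. The sketch below is a pure-Python illustration, not Spark's implementation. It assumes the smoothed formula from the Spark MLlib guide, idf(t) = log((m + 1) / (df(t) + 1)), where m is the number of documents and df(t) is the number of documents with a non-zero entry for term t, and it assumes that terms appearing in fewer than minDocFreq documents receive an IDF of 0. The helper name `idf_vector` is hypothetical.

```python
import math


def idf_vector(tf_rows, min_doc_freq=0):
    """Illustrative smoothed IDF (hypothetical helper, not part of PySpark)."""
    num_docs = len(tf_rows)
    num_terms = len(tf_rows[0])
    # Document frequency: how many rows have a non-zero entry for each term.
    doc_freq = [sum(1 for row in tf_rows if row[j] > 0) for j in range(num_terms)]
    # Terms below the minDocFreq threshold are assigned an IDF of 0.
    return [
        math.log((num_docs + 1) / (d + 1)) if d >= min_doc_freq else 0.0
        for d in doc_freq
    ]


# Same term-frequency vectors as the "tf" column in the example.
tf_rows = [[1.0, 2.0], [0.0, 1.0], [3.0, 0.2]]

idf_min1 = idf_vector(tf_rows, min_doc_freq=1)
# Transforming a row scales each term frequency by its IDF weight.
first_row_scaled = [round(t * w, 4) for t, w in zip(tf_rows[0], idf_min1)]
print(first_row_scaled)                      # [0.2877, 0.0]

print(idf_vector(tf_rows, min_doc_freq=3))   # [0.0, 0.0]
```

With minDocFreq=1 the first term appears in 2 of 3 documents, giving log(4/3) ≈ 0.2877, while the second term appears in all 3 and gets log(4/4) = 0. With minDocFreq=3 the first term falls below the threshold, so both weights are 0, which agrees with the `model.idf` value shown in the example.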
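Note: this article was compiled by 純淨天空 from the original English documentation at spark.apache.org, pyspark.ml.feature.IDF. Unless otherwise stated, copyright of the original code belongs to the original author; do not reproduce or copy this translation without permission.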