Python pyspark HashingTF usage and code examples

This article briefly introduces the usage of pyspark.ml.feature.HashingTF.

Usage: class pyspark.ml.feature.HashingTF(*, numFeatures=262144, binary=False, inputCol=None, outputCol=None)

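Maps a sequence of terms to their term frequencies using the hashing trick. Currently, Austin Appleby's MurmurHash 3 algorithm (MurmurHash3_x86_32) is used to compute the hash code of each term object. Because a simple modulo converts the hash value into a column index, it is advisable to use a power of two for the numFeatures parameter; otherwise the features will not be mapped evenly to the columns.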
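The hashing trick described above can be sketched in plain Python. This is only an illustration, not Spark's implementation: `hashing_tf` is a hypothetical helper, and `zlib.crc32` stands in for MurmurHash3_x86_32, so the column indices it produces will not match those returned by HashingTF.

```python
import zlib
from collections import Counter


def hashing_tf(terms, num_features=16, binary=False):
    """Hypothetical helper: map terms to {column index: term frequency}."""
    counts = Counter()
    for term in terms:
        # Stand-in hash (CRC32); Spark itself uses MurmurHash3_x86_32.
        h = zlib.crc32(term.encode("utf-8"))
        # A simple modulo turns the hash value into a column index.
        index = h % num_features
        counts[index] += 1.0
    if binary:
        # Binary mode: every non-zero count is reported as 1.0.
        return {index: 1.0 for index in counts}
    return dict(counts)


# "a" occurs twice, so its column holds 2.0 (unless another term collides with it).
vec = hashing_tf(["a", "b", "a", "c"], num_features=16)
print(vec)
```

Two different terms can land in the same column, in which case their counts are added together; a larger `num_features` makes such collisions less likely at the cost of a wider vector.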

New in version 1.3.0.

Examples:

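>>> from pyspark.ml.feature import HashingTF
>>> # `spark` is assumed to be an existing SparkSession
>>> df = spark.createDataFrame([(["a", "b", "c"],)], ["words"])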
>>> hashingTF = HashingTF(inputCol="words", outputCol="features")
>>> hashingTF.setNumFeatures(10)
HashingTF...
>>> hashingTF.transform(df).head().features
SparseVector(10, {5: 1.0, 7: 1.0, 8: 1.0})
>>> hashingTF.setParams(outputCol="freqs").transform(df).head().freqs
SparseVector(10, {5: 1.0, 7: 1.0, 8: 1.0})
>>> params = {hashingTF.numFeatures: 5, hashingTF.outputCol: "vector"}
>>> hashingTF.transform(df, params).head().vector
SparseVector(5, {0: 1.0, 2: 1.0, 3: 1.0})
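>>> import tempfile
>>> temp_path = tempfile.mkdtemp()  # assumed scratch directory; the original doctest supplies temp_path
>>> hashingTFPath = temp_path + "/hashing-tf"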
>>> hashingTF.save(hashingTFPath)
>>> loadedHashingTF = HashingTF.load(hashingTFPath)
>>> loadedHashingTF.getNumFeatures() == hashingTF.getNumFeatures()
True
>>> loadedHashingTF.transform(df).take(1) == hashingTF.transform(df).take(1)
True
>>> hashingTF.indexOf("b")
5

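Note: This article was selected and compiled by 純淨天空 from the original English work pyspark.ml.feature.HashingTF at spark.apache.org. Unless otherwise stated, copyright of the original code belongs to the original author; please do not reprint or copy this translation without permission or authorization.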