Python pyspark HashingTF usage and code examples

This article briefly introduces the usage of pyspark.ml.feature.HashingTF.

Usage: class pyspark.ml.feature.HashingTF(*, numFeatures=262144, binary=False, inputCol=None, outputCol=None)

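Maps a sequence of terms to their term frequencies using the hashing trick. Currently, Austin Appleby's MurmurHash 3 algorithm (MurmurHash3_x86_32) is used to calculate the hash code value of each term object. Because a simple modulo is used to turn the hash value into a column index, it is advisable to use a power of two as the numFeatures parameter; otherwise the features will not be mapped evenly to the columns.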
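The hash-then-modulo step described above can be sketched in plain Python, without Spark. This is only an illustration under stated assumptions: the helper names (`term_index`, `hashing_tf`, `divides_hash_range_evenly`) are hypothetical, and `zlib.crc32` stands in for MurmurHash3_x86_32, so the indices it produces will not match those returned by HashingTF. The last helper illustrates one way to read the power-of-two advice: a 32-bit hash has 2^32 possible values, and they split evenly across the columns only when numFeatures divides 2^32.

```python
import zlib
from collections import Counter


def term_index(term: str, num_features: int) -> int:
    """Map a term to a column index: hash it, then take a simple modulo."""
    # Assumption: CRC32 is a stand-in hash, NOT the MurmurHash3_x86_32 used
    # by Spark, so these indices differ from HashingTF.indexOf.
    return zlib.crc32(term.encode("utf-8")) % num_features


def hashing_tf(terms, num_features=262144, binary=False):
    """Return {column index: term frequency} for a sequence of terms."""
    counts = Counter(term_index(t, num_features) for t in terms)
    if binary:
        # Presence/absence only: every nonzero count becomes 1.0.
        return {i: 1.0 for i in counts}
    return {i: float(c) for i, c in counts.items()}


def divides_hash_range_evenly(num_features: int, hash_bits: int = 32) -> bool:
    """True when each column receives the same number of possible hash values."""
    return (2 ** hash_bits) % num_features == 0


doc = ["a", "b", "c", "a"]
tf = hashing_tf(doc, num_features=16)
print(tf)  # indices depend on the stand-in hash; the counts sum to len(doc)
print(divides_hash_range_evenly(16), divides_hash_range_evenly(10))  # True False
```

Distinct terms can land in the same column (a collision), in which case their counts are added together; a larger numFeatures makes this less likely at the cost of a wider vector.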

New in version 1.3.0.

Examples:

>>> from pyspark.ml.feature import HashingTF
>>> df = spark.createDataFrame([(["a", "b", "c"],)], ["words"])
>>> hashingTF = HashingTF(inputCol="words", outputCol="features")
>>> hashingTF.setNumFeatures(10)
HashingTF...
>>> hashingTF.transform(df).head().features
SparseVector(10, {5: 1.0, 7: 1.0, 8: 1.0})
>>> hashingTF.setParams(outputCol="freqs").transform(df).head().freqs
SparseVector(10, {5: 1.0, 7: 1.0, 8: 1.0})
>>> params = {hashingTF.numFeatures: 5, hashingTF.outputCol: "vector"}
>>> hashingTF.transform(df, params).head().vector
SparseVector(5, {0: 1.0, 2: 1.0, 3: 1.0})
>>> hashingTFPath = temp_path + "/hashing-tf"
>>> hashingTF.save(hashingTFPath)
>>> loadedHashingTF = HashingTF.load(hashingTFPath)
>>> loadedHashingTF.getNumFeatures() == hashingTF.getNumFeatures()
True
>>> loadedHashingTF.transform(df).take(1) == hashingTF.transform(df).take(1)
True
>>> hashingTF.indexOf("b")
5

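Note: This article was selected and compiled by 纯净天空 from the original English documentation for pyspark.ml.feature.HashingTF on spark.apache.org. Unless otherwise stated, copyright of the original code belongs to the original author. Do not reprint or copy this translation without permission or authorization.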