This article briefly introduces the usage of pyspark.ml.feature.Bucketizer.

Usage:

class pyspark.ml.feature.Bucketizer(*, splits=None, inputCol=None, outputCol=None, handleInvalid='error', splitsArray=None, inputCols=None, outputCols=None)
Maps a column of continuous features to a column of feature buckets. Since 3.0.0, Bucketizer can map multiple columns at once by setting the inputCols parameter. Note that an exception is thrown when both the inputCol and inputCols parameters are set. The splits parameter is used only for single-column usage, and splitsArray for multi-column usage.

New in version 1.4.0.
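The single- versus multi-column rule can be made concrete with a short sketch (not part of the original docstring). It assumes an existing SparkSession bound to spark, as in the examples below; as far as the PySpark implementation goes, the conflict between inputCol and inputCols typically surfaces when transform() validates the parameters, not when the Bucketizer is constructed.

from pyspark.ml.feature import Bucketizer

df = spark.createDataFrame([(0.1,), (0.7,)], ["values1"])  # assumes an existing `spark`

conflicting = Bucketizer(
    splits=[-float("inf"), 0.5, float("inf")],
    inputCol="values1",
    inputCols=["values1"],  # conflicts with inputCol above
    outputCol="buckets",
)
try:
    conflicting.transform(df)
except Exception as exc:  # parameter validation rejects the mixed single/multi usage
    print(type(exc).__name__)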
Examples:
>>> values = [(0.1, 0.0), (0.4, 1.0), (1.2, 1.3), (1.5, float("nan")),
...     (float("nan"), 1.0), (float("nan"), 0.0)]
>>> df = spark.createDataFrame(values, ["values1", "values2"])
>>> bucketizer = Bucketizer()
>>> bucketizer.setSplits([-float("inf"), 0.5, 1.4, float("inf")])
Bucketizer...
>>> bucketizer.setInputCol("values1")
Bucketizer...
>>> bucketizer.setOutputCol("buckets")
Bucketizer...
>>> bucketed = bucketizer.setHandleInvalid("keep").transform(df).collect()
>>> bucketed = bucketizer.setHandleInvalid("keep").transform(df.select("values1"))
>>> bucketed.show(truncate=False)
+-------+-------+
|values1|buckets|
+-------+-------+
|0.1    |0.0    |
|0.4    |0.0    |
|1.2    |1.0    |
|1.5    |2.0    |
|NaN    |3.0    |
|NaN    |3.0    |
+-------+-------+
...
>>> bucketizer.setParams(outputCol="b").transform(df).head().b
0.0
>>> bucketizerPath = temp_path + "/bucketizer"
>>> bucketizer.save(bucketizerPath)
>>> loadedBucketizer = Bucketizer.load(bucketizerPath)
>>> loadedBucketizer.getSplits() == bucketizer.getSplits()
True
>>> loadedBucketizer.transform(df).take(1) == bucketizer.transform(df).take(1)
True
>>> bucketed = bucketizer.setHandleInvalid("skip").transform(df).collect()
>>> len(bucketed)
4
>>> bucketizer2 = Bucketizer(splitsArray=
...     [[-float("inf"), 0.5, 1.4, float("inf")], [-float("inf"), 0.5, float("inf")]],
...     inputCols=["values1", "values2"], outputCols=["buckets1", "buckets2"])
>>> bucketed2 = bucketizer2.setHandleInvalid("keep").transform(df)
>>> bucketed2.show(truncate=False)
+-------+-------+--------+--------+
|values1|values2|buckets1|buckets2|
+-------+-------+--------+--------+
|0.1    |0.0    |0.0     |0.0     |
|0.4    |1.0    |0.0     |1.0     |
|1.2    |1.3    |1.0     |1.0     |
|1.5    |NaN    |2.0     |2.0     |
|NaN    |1.0    |3.0     |1.0     |
|NaN    |0.0    |3.0     |0.0     |
+-------+-------+--------+--------+
...
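For readers who want to run the example outside the doctest environment (which assumes a pre-existing spark session and a temp_path variable), here is a minimal self-contained sketch. The splits [-inf, 0.5, 1.4, inf] define three buckets [-inf, 0.5), [0.5, 1.4) and [1.4, inf]; with handleInvalid="keep", NaN values are routed to one extra bucket (index 3.0 above), while "skip" drops those rows, which is why only 4 of the 6 rows survive in the doctest.

from pyspark.sql import SparkSession
from pyspark.ml.feature import Bucketizer

# Local session just for this demo; in the doctest, `spark` already exists.
spark = SparkSession.builder.master("local[1]").appName("bucketizer-demo").getOrCreate()

df = spark.createDataFrame(
    [(0.1,), (0.4,), (1.2,), (1.5,), (float("nan"),)], ["values1"]
)

bucketizer = Bucketizer(
    splits=[-float("inf"), 0.5, 1.4, float("inf")],  # 3 buckets: [-inf,0.5), [0.5,1.4), [1.4,inf]
    inputCol="values1",
    outputCol="buckets",
    handleInvalid="keep",  # NaN goes to the extra bucket 3.0 instead of raising an error
)
bucketizer.transform(df).show(truncate=False)

spark.stop()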