Python pyspark RobustScaler: Usage and Code Examples

This article briefly introduces the usage of pyspark.ml.feature.RobustScaler.

Usage: class pyspark.ml.feature.RobustScaler(*, lower=0.25, upper=0.75, withCentering=False, withScaling=True, inputCol=None, outputCol=None, relativeError=0.001)

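RobustScaler removes the median and scales the data according to a quantile range. By default the quantile range is the IQR (interquartile range: the range between the first quartile, i.e. the 25th percentile, and the third quartile, i.e. the 75th percentile), but it can be configured through lower and upper. Centering and scaling are applied to each feature independently, using statistics computed from the samples in the training set. The median and quantile range are then stored so that the transform method can apply them to later data. Note that centering is off by default (withCentering=False), so the median is subtracted only when it is enabled. NaN values are ignored when the median and range are computed.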
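The arithmetic described above can be sketched in plain Python. This is only an illustration of the formula, not Spark's implementation: Spark computes approximate quantiles (controlled by relativeError), while this sketch uses an exact nearest-rank quantile. The helper names nearest_rank_quantile, fit_robust and transform_robust are hypothetical, and mapping a zero range to 0.0 is a choice made for this sketch.

```python
import math


def nearest_rank_quantile(values, q):
    """Exact nearest-rank q-quantile of values, skipping NaN (sketch only)."""
    clean = sorted(v for v in values if not math.isnan(v))
    idx = max(math.ceil(q * len(clean)) - 1, 0)
    return clean[idx]


def fit_robust(rows, lower=0.25, upper=0.75):
    """Compute the per-feature median and quantile range from training rows."""
    columns = list(zip(*rows))
    medians = [nearest_rank_quantile(c, 0.5) for c in columns]
    ranges = [
        nearest_rank_quantile(c, upper) - nearest_rank_quantile(c, lower)
        for c in columns
    ]
    return medians, ranges


def transform_robust(row, medians, ranges, with_centering=False, with_scaling=True):
    """Apply the stored statistics to one row, feature by feature."""
    out = []
    for x, med, rng in zip(row, medians, ranges):
        if with_centering:
            x = x - med
        if with_scaling:
            # Assumption for this sketch: a zero range yields 0.0.
            x = x / rng if rng != 0 else 0.0
        out.append(x)
    return out


# Same five rows as the doctest example in this article.
rows = [[0.0, 0.0], [1.0, -1.0], [2.0, -2.0], [3.0, -3.0], [4.0, -4.0]]
medians, ranges = fit_robust(rows)
scaled = transform_robust(rows[1], medians, ranges)
centered = transform_robust(rows[1], medians, ranges, with_centering=True)
print(medians, ranges, scaled, centered)
```

With the default settings only scaling is applied, so the second row becomes [0.5, -0.5]; turning centering on subtracts the median first and gives [-0.5, 0.5].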

New in version 3.0.0.

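Examples (the doctest assumes that spark, an active SparkSession, and temp_path, a writable directory, are already defined):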

>>> from pyspark.ml.feature import RobustScaler, RobustScalerModel
>>> from pyspark.ml.linalg import Vectors
>>> data = [(0, Vectors.dense([0.0, 0.0]),),
...         (1, Vectors.dense([1.0, -1.0]),),
...         (2, Vectors.dense([2.0, -2.0]),),
...         (3, Vectors.dense([3.0, -3.0]),),
...         (4, Vectors.dense([4.0, -4.0]),),]
>>> df = spark.createDataFrame(data, ["id", "features"])
>>> scaler = RobustScaler()
>>> scaler.setInputCol("features")
RobustScaler...
>>> scaler.setOutputCol("scaled")
RobustScaler...
>>> model = scaler.fit(df)
>>> model.setOutputCol("output")
RobustScalerModel...
>>> model.median
DenseVector([2.0, -2.0])
>>> model.range
DenseVector([2.0, 2.0])
>>> model.transform(df).collect()[1].output
DenseVector([0.5, -0.5])
>>> scalerPath = temp_path + "/robust-scaler"
>>> scaler.save(scalerPath)
>>> loadedScaler = RobustScaler.load(scalerPath)
>>> loadedScaler.getWithCentering() == scaler.getWithCentering()
True
>>> loadedScaler.getWithScaling() == scaler.getWithScaling()
True
>>> modelPath = temp_path + "/robust-scaler-model"
>>> model.save(modelPath)
>>> loadedModel = RobustScalerModel.load(modelPath)
>>> loadedModel.median == model.median
True
>>> loadedModel.range == model.range
True
>>> loadedModel.transform(df).take(1) == model.transform(df).take(1)
True

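Note: This article was selected and compiled by 純淨天空 from the original English documentation for pyspark.ml.feature.RobustScaler on spark.apache.org. Unless otherwise stated, copyright in the original code belongs to the original authors. Do not reproduce or copy this translation without permission or authorization.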