This article briefly introduces the usage of pyspark.ml.feature.Imputer.
Usage:
class pyspark.ml.feature.Imputer(*, strategy='mean', missingValue=nan, inputCols=None, outputCols=None, inputCol=None, outputCol=None, relativeError=0.001)
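Imputation estimator for completing missing values, using the mean, median, or mode of the column in which the missing values are located. The input columns should be of numeric type. Currently Imputer does not support categorical features and may create incorrect values for a categorical feature.
Note that the mean/median/mode is computed after filtering out missing values. All Null values in the input columns are treated as missing, and so are also imputed. To compute the median, pyspark.sql.DataFrame.approxQuantile() is used with a relative error of 0.001, which is the default of the relativeError parameter shown in the signature above.
New in version 2.2.0.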
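The filter-then-compute behavior described above can be sketched in plain Python without a Spark session. This is only an illustration of the semantics for the default missingValue of NaN, not Spark's implementation: the helper names is_missing and impute_column are hypothetical, the median here is exact rather than approximate, and tie-breaking for the mode simply follows Python's statistics.mode.

```python
import math
import statistics


def is_missing(value):
    """Treat None (Null) and NaN as missing, mirroring the default missingValue=nan."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def impute_column(values, strategy="mean"):
    """Return (surrogate, imputed_values) for one numeric column."""
    # Step 1: drop missing entries before computing the statistic.
    observed = [v for v in values if not is_missing(v)]
    if not observed:
        raise ValueError("column has no observed values to compute a surrogate from")

    # Step 2: compute the surrogate from the observed values only.
    if strategy == "mean":
        surrogate = sum(observed) / len(observed)
    elif strategy == "median":
        # Exact median (an assumption of this sketch); Spark uses approxQuantile.
        surrogate = statistics.median(observed)
    elif strategy == "mode":
        # Tie-breaking follows statistics.mode, which is an assumption here.
        surrogate = statistics.mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")

    # Step 3: replace each missing entry with the surrogate.
    imputed = [surrogate if is_missing(v) else v for v in values]
    return surrogate, imputed


col_a = [1.0, 2.0, float("nan"), 4.0, 5.0]
col_b = [None, float("nan"), 3.0, 4.0, 5.0]
print(impute_column(col_a))
print(impute_column(col_b))
```

With the default mean strategy, the surrogate for col_a is 3.0 because the NaN entry is excluded before averaging; in col_b both the None and the NaN entries are replaced by 4.0.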
Example:
>>> # Assumes `spark` is an active SparkSession and `temp_path` is a writable directory path.
>>> from pyspark.ml.feature import Imputer, ImputerModel
>>> df = spark.createDataFrame([(1.0, float("nan")), (2.0, float("nan")), (float("nan"), 3.0),
...                             (4.0, 4.0), (5.0, 5.0)], ["a", "b"])
>>> imputer = Imputer()
>>> imputer.setInputCols(["a", "b"])
Imputer...
>>> imputer.setOutputCols(["out_a", "out_b"])
Imputer...
>>> imputer.getRelativeError()
0.001
>>> model = imputer.fit(df)
>>> model.setInputCols(["a", "b"])
ImputerModel...
>>> model.getStrategy()
'mean'
>>> model.surrogateDF.show()
+---+---+
|  a|  b|
+---+---+
|3.0|4.0|
+---+---+
...
>>> model.transform(df).show()
+---+---+-----+-----+
|  a|  b|out_a|out_b|
+---+---+-----+-----+
|1.0|NaN|  1.0|  4.0|
|2.0|NaN|  2.0|  4.0|
|NaN|3.0|  3.0|  3.0|
...
>>> imputer.setStrategy("median").setMissingValue(1.0).fit(df).transform(df).show()
+---+---+-----+-----+
|  a|  b|out_a|out_b|
+---+---+-----+-----+
|1.0|NaN|  4.0|  NaN|
...
>>> df1 = spark.createDataFrame([(1.0,), (2.0,), (float("nan"),), (4.0,), (5.0,)], ["a"])
>>> imputer1 = Imputer(inputCol="a", outputCol="out_a")
>>> model1 = imputer1.fit(df1)
>>> model1.surrogateDF.show()
+---+
|  a|
+---+
|3.0|
+---+
...
>>> model1.transform(df1).show()
+---+-----+
|  a|out_a|
+---+-----+
|1.0|  1.0|
|2.0|  2.0|
|NaN|  3.0|
...
>>> imputer1.setStrategy("median").setMissingValue(1.0).fit(df1).transform(df1).show()
+---+-----+
|  a|out_a|
+---+-----+
|1.0|  4.0|
...
>>> df2 = spark.createDataFrame([(float("nan"),), (float("nan"),), (3.0,), (4.0,), (5.0,)],
...                             ["b"])
>>> imputer2 = Imputer(inputCol="b", outputCol="out_b")
>>> model2 = imputer2.fit(df2)
>>> model2.surrogateDF.show()
+---+
|  b|
+---+
|4.0|
+---+
...
>>> model2.transform(df2).show()
+---+-----+
|  b|out_b|
+---+-----+
|NaN|  4.0|
|NaN|  4.0|
|3.0|  3.0|
...
>>> imputer2.setStrategy("median").setMissingValue(1.0).fit(df2).transform(df2).show()
+---+-----+
|  b|out_b|
+---+-----+
|NaN|  NaN|
...
>>> imputerPath = temp_path + "/imputer"
>>> imputer.save(imputerPath)
>>> loadedImputer = Imputer.load(imputerPath)
>>> loadedImputer.getStrategy() == imputer.getStrategy()
True
>>> loadedImputer.getMissingValue()
1.0
>>> modelPath = temp_path + "/imputer-model"
>>> model.save(modelPath)
>>> loadedModel = ImputerModel.load(modelPath)
>>> loadedModel.transform(df).head().out_a == model.transform(df).head().out_a
True
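Note: This article was selected and compiled by 纯净天空 from the original English work pyspark.ml.feature.Imputer by spark.apache.org. Unless otherwise stated, copyright in the original code belongs to the original author; please do not reproduce or copy this translation without permission or authorization.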