本文简要介绍
pyspark.ml.feature.VectorAssembler
的用法。用法:
class pyspark.ml.feature.VectorAssembler(*, inputCols=None, outputCol=None, handleInvalid='error')
将多列合并为向量列的特征转换器。
1.4.0 版中的新函数。
例子:
>>> df = spark.createDataFrame([(1, 0, 3)], ["a", "b", "c"]) >>> vecAssembler = VectorAssembler(outputCol="features") >>> vecAssembler.setInputCols(["a", "b", "c"]) VectorAssembler... >>> vecAssembler.transform(df).head().features DenseVector([1.0, 0.0, 3.0]) >>> vecAssembler.setParams(outputCol="freqs").transform(df).head().freqs DenseVector([1.0, 0.0, 3.0]) >>> params = {vecAssembler.inputCols: ["b", "a"], vecAssembler.outputCol: "vector"} >>> vecAssembler.transform(df, params).head().vector DenseVector([0.0, 1.0]) >>> vectorAssemblerPath = temp_path + "/vector-assembler" >>> vecAssembler.save(vectorAssemblerPath) >>> loadedAssembler = VectorAssembler.load(vectorAssemblerPath) >>> loadedAssembler.transform(df).head().freqs == vecAssembler.transform(df).head().freqs True >>> dfWithNullsAndNaNs = spark.createDataFrame( ... [(1.0, 2.0, None), (3.0, float("nan"), 4.0), (5.0, 6.0, 7.0)], ["a", "b", "c"]) >>> vecAssembler2 = VectorAssembler(inputCols=["a", "b", "c"], outputCol="features", ... handleInvalid="keep") >>> vecAssembler2.transform(dfWithNullsAndNaNs).show() +---+---+----+-------------+ | a| b| c| features| +---+---+----+-------------+ |1.0|2.0|null|[1.0,2.0,NaN]| |3.0|NaN| 4.0|[3.0,NaN,4.0]| |5.0|6.0| 7.0|[5.0,6.0,7.0]| +---+---+----+-------------+ ... >>> vecAssembler2.setParams(handleInvalid="skip").transform(dfWithNullsAndNaNs).show() +---+---+---+-------------+ | a| b| c| features| +---+---+---+-------------+ |5.0|6.0|7.0|[5.0,6.0,7.0]| +---+---+---+-------------+ ...
相关用法
- Python pyspark VectorSlicer用法及代码示例
- Python pyspark VectorSizeHint用法及代码示例
- Python pyspark Vectors.stringify用法及代码示例
- Python pyspark Vectors.squared_distance用法及代码示例
- Python pyspark Vectors.parse用法及代码示例
- Python pyspark Vectors.dense用法及代码示例
- Python pyspark VectorIndexer用法及代码示例
- Python pyspark Vectors.sparse用法及代码示例
- Python pyspark VersionUtils.majorMinorVersion用法及代码示例
- Python pyspark VarianceThresholdSelector用法及代码示例
- Python pyspark create_map用法及代码示例
- Python pyspark date_add用法及代码示例
- Python pyspark DataFrame.to_latex用法及代码示例
- Python pyspark DataStreamReader.schema用法及代码示例
- Python pyspark MultiIndex.size用法及代码示例
- Python pyspark arrays_overlap用法及代码示例
- Python pyspark Series.asof用法及代码示例
- Python pyspark DataFrame.align用法及代码示例
- Python pyspark Index.is_monotonic_decreasing用法及代码示例
- Python pyspark IsotonicRegression用法及代码示例
- Python pyspark DataFrame.plot.bar用法及代码示例
- Python pyspark DataFrame.to_delta用法及代码示例
- Python pyspark element_at用法及代码示例
- Python pyspark explode用法及代码示例
- Python pyspark MultiIndex.hasnans用法及代码示例
注:本文由纯净天空筛选整理自spark.apache.org大神的英文原创作品 pyspark.ml.feature.VectorAssembler。非经特殊声明,原始代码版权归原作者所有,本译文未经允许或授权,请勿转载或复制。