Python pyspark VectorAssembler usage and code examples


This article briefly introduces the usage of pyspark.ml.feature.VectorAssembler.

Usage:

class pyspark.ml.feature.VectorAssembler(*, inputCols=None, outputCol=None, handleInvalid='error')

A feature transformer that merges multiple columns into a single vector column.

New in version 1.4.0.

Examples

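>>> # Assumes an active SparkSession named `spark` and a writable directory path `temp_path`.
>>> from pyspark.ml.feature import VectorAssembler
>>> df = spark.createDataFrame([(1, 0, 3)], ["a", "b", "c"])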
>>> vecAssembler = VectorAssembler(outputCol="features")
>>> vecAssembler.setInputCols(["a", "b", "c"])
VectorAssembler...
>>> vecAssembler.transform(df).head().features
DenseVector([1.0, 0.0, 3.0])
>>> vecAssembler.setParams(outputCol="freqs").transform(df).head().freqs
DenseVector([1.0, 0.0, 3.0])
>>> params = {vecAssembler.inputCols: ["b", "a"], vecAssembler.outputCol: "vector"}
>>> vecAssembler.transform(df, params).head().vector
DenseVector([0.0, 1.0])
>>> vectorAssemblerPath = temp_path + "/vector-assembler"
>>> vecAssembler.save(vectorAssemblerPath)
>>> loadedAssembler = VectorAssembler.load(vectorAssemblerPath)
>>> loadedAssembler.transform(df).head().freqs == vecAssembler.transform(df).head().freqs
True
>>> dfWithNullsAndNaNs = spark.createDataFrame(
...    [(1.0, 2.0, None), (3.0, float("nan"), 4.0), (5.0, 6.0, 7.0)], ["a", "b", "c"])
>>> vecAssembler2 = VectorAssembler(inputCols=["a", "b", "c"], outputCol="features",
...    handleInvalid="keep")
>>> vecAssembler2.transform(dfWithNullsAndNaNs).show()
+---+---+----+-------------+
|  a|  b|   c|     features|
+---+---+----+-------------+
|1.0|2.0|null|[1.0,2.0,NaN]|
|3.0|NaN| 4.0|[3.0,NaN,4.0]|
|5.0|6.0| 7.0|[5.0,6.0,7.0]|
+---+---+----+-------------+
...
>>> vecAssembler2.setParams(handleInvalid="skip").transform(dfWithNullsAndNaNs).show()
+---+---+---+-------------+
|  a|  b|  c|     features|
+---+---+---+-------------+
|5.0|6.0|7.0|[5.0,6.0,7.0]|
+---+---+---+-------------+
...
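
The examples above cover handleInvalid="keep" and handleInvalid="skip"; the default, "error", raises an exception when a row contains a null or NaN. The following is a plain-Python sketch of those three modes, included only to make the row-level semantics concrete. It is not Spark's implementation, and the helper name assemble_rows and its handle_invalid argument are hypothetical names chosen for this illustration.

```python
import math


def assemble_rows(rows, handle_invalid="error"):
    """Illustrative stand-in for the three handleInvalid modes (not Spark code)."""
    if handle_invalid not in ("error", "skip", "keep"):
        raise ValueError("handle_invalid must be 'error', 'skip' or 'keep'")

    def is_invalid(value):
        # Treat both None (null) and NaN as invalid entries.
        return value is None or (isinstance(value, float) and math.isnan(value))

    assembled = []
    for row in rows:
        if any(is_invalid(v) for v in row):
            if handle_invalid == "error":
                # Default mode: refuse to assemble a row with invalid data.
                raise ValueError(f"invalid value in row {row!r}")
            if handle_invalid == "skip":
                # Drop the whole row.
                continue
        # "keep" mode (or a fully valid row): invalid entries become NaN.
        assembled.append([float("nan") if is_invalid(v) else float(v) for v in row])
    return assembled


rows = [(1.0, 2.0, None), (3.0, float("nan"), 4.0), (5.0, 6.0, 7.0)]
skipped = assemble_rows(rows, "skip")  # only the fully valid row survives
kept = assemble_rows(rows, "keep")     # all rows survive; null becomes NaN
```

With the real VectorAssembler, the same choice is made through the handleInvalid parameter shown in the class signature, either in the constructor or via setParams.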

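Note: This article was selected and compiled by 纯净天空 from the original English documentation for pyspark.ml.feature.VectorAssembler at spark.apache.org. Unless otherwise stated, copyright in the original code belongs to the original authors. Do not reproduce or copy this translation without permission or authorization.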