Python pyspark VectorIndexer usage and code examples

This article briefly introduces the usage of pyspark.ml.feature.VectorIndexer.

Usage: class pyspark.ml.feature.VectorIndexer(*, maxCategories=20, inputCol=None, outputCol=None, handleInvalid='error')

Class for indexing categorical feature columns in a dataset of Vector.

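This has two usage modes:

Automatically identify categorical features (default behavior)
- This helps process a dataset of unknown vectors into a dataset with some continuous features and some categorical features. The choice between continuous and categorical is based on the maxCategories parameter.
- Set maxCategories to the maximum number of categories any categorical feature should have.
- Example: feature 0 has unique values {-1.0, 0.0} and feature 1 has unique values {1.0, 3.0, 5.0}. If maxCategories = 2, feature 0 is declared categorical and uses indices {0, 1}, while feature 1 is declared continuous.
Index all features, if all features are categorical
- If maxCategories is set very large, this builds an index of unique values for all features.
- Warning: this can cause problems if features are continuous, because all unique values are collected to the driver.
- Example: feature 0 has unique values {-1.0, 0.0} and feature 1 has unique values {1.0, 3.0, 5.0}. If maxCategories >= 3, both features are declared categorical.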
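
The decision rule described in the list above can be illustrated with a small plain-Python sketch. It is not Spark's implementation: the function name `classify_features` and the list-of-lists input are hypothetical, chosen only to reproduce the two worked examples (maxCategories = 2 and maxCategories = 3).

```python
from typing import Dict, List, Set


def classify_features(rows: List[List[float]], max_categories: int) -> Dict[int, str]:
    """Label each feature index as 'categorical' or 'continuous'.

    Hypothetical helper: a feature is treated as categorical when its number
    of distinct values does not exceed max_categories.
    """
    num_features = len(rows[0])
    distinct: List[Set[float]] = [set() for _ in range(num_features)]
    for row in rows:
        for i, value in enumerate(row):
            distinct[i].add(value)
    return {
        i: "categorical" if len(values) <= max_categories else "continuous"
        for i, values in enumerate(distinct)
    }


# Feature 0 takes values {-1.0, 0.0}; feature 1 takes values {1.0, 3.0, 5.0}.
rows = [[-1.0, 1.0], [0.0, 3.0], [-1.0, 5.0]]
print(classify_features(rows, max_categories=2))  # {0: 'categorical', 1: 'continuous'}
print(classify_features(rows, max_categories=3))  # {0: 'categorical', 1: 'categorical'}
```

Note that the sketch has to see every distinct value of every feature, which is why a very large maxCategories is risky for continuous features.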

Fitting returns a model that can transform categorical features to use 0-based indices.

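Index stability:

- This is not guaranteed to choose the same category index across multiple runs.
- If a categorical feature includes the value 0, then value 0 is guaranteed to map to index 0. This maintains vector sparsity.
- More stability may be added in the future.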
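
One way to satisfy the zero-to-index-0 guarantee is sketched below in plain Python. The helper `build_category_map` is hypothetical, and sorting the remaining values is an assumption made only to keep the sketch deterministic; as stated above, the actual index assignment is not guaranteed to be the same across runs.

```python
from typing import Dict, Iterable


def build_category_map(values: Iterable[float]) -> Dict[float, int]:
    """Map distinct feature values to 0-based indices, placing 0.0 first if present."""
    distinct = sorted(set(values))
    if 0.0 in distinct:
        # Keep value 0 at index 0 so that sparse vectors stay sparse after indexing.
        distinct = [0.0] + [v for v in distinct if v != 0.0]
    return {value: index for index, value in enumerate(distinct)}


print(build_category_map([-1.0, 0.0, 0.0]))  # {0.0: 0, -1.0: 1}
```

Because zero entries map to zero, the implicit zeros of a sparse vector need no rewriting when it is indexed.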

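TODO: Future extensions. The following functionality is planned for the future:

- Preserve metadata in transform; if a feature's metadata is already present, do not recompute it.
- Specify certain features not to index, either via a parameter or via existing metadata.
- Add a warning if a categorical feature has only 1 category.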

New in version 1.4.0.

Examples:

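>>> # Assumes an active SparkSession bound to the name `spark`.
>>> from pyspark.ml.feature import VectorIndexer, VectorIndexerModel
>>> from pyspark.ml.linalg import Vectors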
>>> df = spark.createDataFrame([(Vectors.dense([-1.0, 0.0]),),
...     (Vectors.dense([0.0, 1.0]),), (Vectors.dense([0.0, 2.0]),)], ["a"])
>>> indexer = VectorIndexer(maxCategories=2, inputCol="a")
>>> indexer.setOutputCol("indexed")
VectorIndexer...
>>> model = indexer.fit(df)
>>> indexer.getHandleInvalid()
'error'
>>> model.setOutputCol("output")
VectorIndexerModel...
>>> model.transform(df).head().output
DenseVector([1.0, 0.0])
>>> model.numFeatures
2
>>> model.categoryMaps
{0: {0.0: 0, -1.0: 1}}
>>> indexer.setParams(outputCol="test").fit(df).transform(df).collect()[1].test
DenseVector([0.0, 1.0])
>>> params = {indexer.maxCategories: 3, indexer.outputCol: "vector"}
>>> model2 = indexer.fit(df, params)
>>> model2.transform(df).head().vector
DenseVector([1.0, 0.0])
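>>> import tempfile
>>> temp_path = tempfile.mkdtemp()  # assumption: any writable scratch directory works
>>> vectorIndexerPath = temp_path + "/vector-indexer"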
>>> indexer.save(vectorIndexerPath)
>>> loadedIndexer = VectorIndexer.load(vectorIndexerPath)
>>> loadedIndexer.getMaxCategories() == indexer.getMaxCategories()
True
>>> modelPath = temp_path + "/vector-indexer-model"
>>> model.save(modelPath)
>>> loadedModel = VectorIndexerModel.load(modelPath)
>>> loadedModel.numFeatures == model.numFeatures
True
>>> loadedModel.categoryMaps == model.categoryMaps
True
>>> loadedModel.transform(df).take(1) == model.transform(df).take(1)
True
>>> dfWithInvalid = spark.createDataFrame([(Vectors.dense([3.0, 1.0]),)], ["a"])
>>> indexer.getHandleInvalid()
'error'
>>> model3 = indexer.setHandleInvalid("skip").fit(df)
>>> model3.transform(dfWithInvalid).count()
0
>>> model4 = indexer.setParams(handleInvalid="keep", outputCol="indexed").fit(df)
>>> model4.transform(dfWithInvalid).head().indexed
DenseVector([2.0, 1.0])
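
The handleInvalid behavior at the end of the example can be mimicked in plain Python. This is a minimal sketch inferred from the outputs shown above, not Spark's code: `index_vector` is a hypothetical helper, returning `None` stands in for dropping a row under 'skip', and the extra bucket index under 'keep' is assumed to equal the number of known categories, which matches the `DenseVector([2.0, 1.0])` result.

```python
from typing import Dict, List, Optional

CategoryMaps = Dict[int, Dict[float, int]]


def index_vector(vector: List[float], category_maps: CategoryMaps,
                 handle_invalid: str = "error") -> Optional[List[float]]:
    """Replace categorical values with their indices; return None to drop the row."""
    result = list(vector)  # continuous features are copied through unchanged
    for feature, mapping in category_maps.items():
        value = vector[feature]
        if value in mapping:
            result[feature] = float(mapping[value])
        elif handle_invalid == "keep":
            # Unseen values share one extra bucket after the known categories.
            result[feature] = float(len(mapping))
        elif handle_invalid == "skip":
            return None
        else:
            raise ValueError(f"unseen value {value} in categorical feature {feature}")
    return result


maps = {0: {0.0: 0, -1.0: 1}}  # the categoryMaps value shown in the example
print(index_vector([-1.0, 0.0], maps))                        # [1.0, 0.0]
print(index_vector([3.0, 1.0], maps, handle_invalid="skip"))  # None
print(index_vector([3.0, 1.0], maps, handle_invalid="keep"))  # [2.0, 1.0]
```

With the default 'error' setting, the same unseen value 3.0 raises an exception instead.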


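Note: This article was selected and compiled by 純淨天空 from the original English documentation for pyspark.ml.feature.VectorIndexer at spark.apache.org. Unless otherwise stated, copyright in the original code belongs to the original authors; do not reproduce or copy this translation without permission or authorization.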