pyspark.ml.feature.UnivariateFeatureSelector
Usage:
class pyspark.ml.feature.UnivariateFeatureSelector(*, featuresCol='features', outputCol=None, labelCol='label', selectionMode='numTopFeatures')
Feature selector based on univariate statistical tests against the label. Currently, Spark supports three univariate feature selectors: chi-squared, ANOVA F-test and F-value. The user can choose a univariate feature selector by setting featureType and labelType, and Spark will pick the score function based on the specified featureType and labelType. The following combinations of featureType and labelType are supported:

- featureType categorical and labelType categorical: Spark uses chi-squared, i.e. chi2 in sklearn.
- featureType continuous and labelType categorical: Spark uses ANOVA F-test, i.e. f_classif in sklearn.
- featureType continuous and labelType continuous: Spark uses F-value, i.e. f_regression in sklearn.
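The supported combinations can be sketched as a small lookup table. This is a minimal pure-Python illustration of the documented mapping; the helper name is hypothetical and not part of the PySpark API:

```python
# Hypothetical helper illustrating the documented featureType/labelType
# combinations; not part of the PySpark API.
SCORE_FUNCTIONS = {
    ("categorical", "categorical"): "chi-squared (sklearn chi2)",
    ("continuous", "categorical"): "ANOVA F-test (sklearn f_classif)",
    ("continuous", "continuous"): "F-value (sklearn f_regression)",
}

def score_function(feature_type: str, label_type: str) -> str:
    """Return the score function Spark would pick for this combination."""
    try:
        return SCORE_FUNCTIONS[(feature_type, label_type)]
    except KeyError:
        raise ValueError(
            f"Unsupported combination: featureType={feature_type!r}, "
            f"labelType={label_type!r}")

print(score_function("continuous", "categorical"))
# ANOVA F-test (sklearn f_classif)
```

Note that categorical features with a continuous label is not a supported combination; Spark raises an error in that case, which the lookup above mirrors.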
UnivariateFeatureSelector supports different selection modes: numTopFeatures, percentile, fpr, fdr, fwe.

- numTopFeatures chooses a fixed number of top features according to a hypothesis.
- percentile is similar but chooses a fraction of all features instead of a fixed number.
- fpr chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection.
- fdr uses the Benjamini-Hochberg procedure to choose all features whose false discovery rate is below a threshold.
- fwe chooses all features whose p-value is below a threshold, with the threshold scaled by 1/numFeatures, thus controlling the family-wise error rate of selection.

By default, the selection mode is numTopFeatures.

New in version 3.1.1.
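The selection rules above can be illustrated on plain p-values. The sketch below is a pure-Python approximation of the documented behavior, not Spark's implementation (e.g. Spark ranks numTopFeatures by test score; ranking by p-value here is an equivalent stand-in for a univariate test, and percentile rounding may differ):

```python
def select(p_values, mode, threshold):
    """Selected feature indices for one of the documented selection modes.

    p_values:  per-feature p-values from the chosen univariate test.
    threshold: an int for numTopFeatures, a fraction or alpha otherwise.
    """
    n = len(p_values)
    # Feature indices from most to least significant (smallest p-value first).
    ranked = sorted(range(n), key=lambda i: p_values[i])
    if mode == "numTopFeatures":
        return sorted(ranked[:threshold])
    if mode == "percentile":
        return sorted(ranked[:int(threshold * n)])
    if mode == "fpr":
        return [i for i in range(n) if p_values[i] < threshold]
    if mode == "fwe":
        # Bonferroni-style 1/numFeatures scaling controls the family-wise rate.
        return [i for i in range(n) if p_values[i] < threshold / n]
    if mode == "fdr":
        # Benjamini-Hochberg: find the largest k with p_(k) <= (k/n) * alpha,
        # then keep the k most significant features.
        k = max((j + 1 for j in range(n)
                 if p_values[ranked[j]] <= (j + 1) / n * threshold),
                default=0)
        return sorted(ranked[:k])
    raise ValueError(f"unknown mode: {mode}")

pvals = [0.001, 0.20, 0.03, 0.6, 0.04]
print(select(pvals, "numTopFeatures", 2))  # [0, 2]
print(select(pvals, "fpr", 0.05))          # [0, 2, 4]
print(select(pvals, "fwe", 0.05))          # [0]  (threshold becomes 0.05/5)
```

Note how fwe and fdr are stricter than fpr on the same alpha: fpr keeps every feature with p < 0.05, while the multiple-testing corrections discard the borderline ones.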
Examples:
>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame(
...     [(Vectors.dense([1.7, 4.4, 7.6, 5.8, 9.6, 2.3]), 3.0),
...      (Vectors.dense([8.8, 7.3, 5.7, 7.3, 2.2, 4.1]), 2.0),
...      (Vectors.dense([1.2, 9.5, 2.5, 3.1, 8.7, 2.5]), 1.0),
...      (Vectors.dense([3.7, 9.2, 6.1, 4.1, 7.5, 3.8]), 2.0),
...      (Vectors.dense([8.9, 5.2, 7.8, 8.3, 5.2, 3.0]), 4.0),
...      (Vectors.dense([7.9, 8.5, 9.2, 4.0, 9.4, 2.1]), 4.0)],
...     ["features", "label"])
>>> selector = UnivariateFeatureSelector(outputCol="selectedFeatures")
>>> selector.setFeatureType("continuous").setLabelType("categorical").setSelectionThreshold(1)
UnivariateFeatureSelector...
>>> model = selector.fit(df)
>>> model.getFeaturesCol()
'features'
>>> model.setFeaturesCol("features")
UnivariateFeatureSelectorModel...
>>> model.transform(df).head().selectedFeatures
DenseVector([7.6])
>>> model.selectedFeatures
[2]
>>> selectorPath = temp_path + "/selector"
>>> selector.save(selectorPath)
>>> loadedSelector = UnivariateFeatureSelector.load(selectorPath)
>>> loadedSelector.getSelectionThreshold() == selector.getSelectionThreshold()
True
>>> modelPath = temp_path + "/selector-model"
>>> model.save(modelPath)
>>> loadedModel = UnivariateFeatureSelectorModel.load(modelPath)
>>> loadedModel.selectedFeatures == model.selectedFeatures
True
>>> loadedModel.transform(df).take(1) == model.transform(df).take(1)
True
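Why the example selects feature index 2 can be checked by hand: with featureType continuous and labelType categorical, the score function is the ANOVA F-test, and feature 2 has by far the largest F-value on this toy data. Below is a minimal pure-Python recomputation (no Spark; it mirrors what sklearn's f_classif would compute for each column):

```python
# Recompute the one-way ANOVA F-value of each feature for the toy data above,
# grouping feature values by class label (plain Python, no Spark).
rows = [([1.7, 4.4, 7.6, 5.8, 9.6, 2.3], 3.0),
        ([8.8, 7.3, 5.7, 7.3, 2.2, 4.1], 2.0),
        ([1.2, 9.5, 2.5, 3.1, 8.7, 2.5], 1.0),
        ([3.7, 9.2, 6.1, 4.1, 7.5, 3.8], 2.0),
        ([8.9, 5.2, 7.8, 8.3, 5.2, 3.0], 4.0),
        ([7.9, 8.5, 9.2, 4.0, 9.4, 2.1], 4.0)]

def anova_f(values, labels):
    """One-way ANOVA F statistic: between-group vs. within-group variance."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    grand = sum(values) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2
                     for g in groups.values())
    ss_within = sum((v - sum(g) / len(g)) ** 2
                    for g in groups.values() for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

labels = [y for _, y in rows]
f_values = [anova_f([x[j] for x, _ in rows], labels) for j in range(6)]
best = max(range(6), key=lambda j: f_values[j])
print(best)  # 2 -- matches model.selectedFeatures in the doctest
```

With setSelectionThreshold(1) under the default numTopFeatures mode, the single highest-scoring feature (index 2) is kept, which is why transform yields DenseVector([7.6]) for the first row.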
Note: This article was compiled by 純淨天空 from the original English work pyspark.ml.feature.UnivariateFeatureSelector on spark.apache.org. Unless otherwise stated, copyright of the original code belongs to the original authors; please do not reprint or copy this translation without permission.