This article briefly introduces the usage of pyspark.ml.feature.RFormula.
Usage:
class pyspark.ml.feature.RFormula(*, formula=None, featuresCol='features', labelCol='label', forceIndexLabel=False, stringIndexerOrderType='frequencyDesc', handleInvalid='error')
Implements the transforms required for fitting a dataset against an R model formula. Currently we support a limited subset of the R operators, including '~', '.', ':', '+', '-', '*' and '^'.
New in version 1.5.0.
Notes:
See also the R formula docs.
Examples:
>>> df = spark.createDataFrame([
...     (1.0, 1.0, "a"),
...     (0.0, 2.0, "b"),
...     (0.0, 0.0, "a")
... ], ["y", "x", "s"])
>>> rf = RFormula(formula="y ~ x + s")
>>> model = rf.fit(df)
>>> model.getLabelCol()
'label'
>>> model.transform(df).show()
+---+---+---+---------+-----+
|  y|  x|  s| features|label|
+---+---+---+---------+-----+
|1.0|1.0|  a|[1.0,1.0]|  1.0|
|0.0|2.0|  b|[2.0,0.0]|  0.0|
|0.0|0.0|  a|[0.0,1.0]|  0.0|
+---+---+---+---------+-----+
...
>>> rf.fit(df, {rf.formula: "y ~ . - s"}).transform(df).show()
+---+---+---+--------+-----+
|  y|  x|  s|features|label|
+---+---+---+--------+-----+
|1.0|1.0|  a|   [1.0]|  1.0|
|0.0|2.0|  b|   [2.0]|  0.0|
|0.0|0.0|  a|   [0.0]|  0.0|
+---+---+---+--------+-----+
...
>>> rFormulaPath = temp_path + "/rFormula"
>>> rf.save(rFormulaPath)
>>> loadedRF = RFormula.load(rFormulaPath)
>>> loadedRF.getFormula() == rf.getFormula()
True
>>> loadedRF.getFeaturesCol() == rf.getFeaturesCol()
True
>>> loadedRF.getLabelCol() == rf.getLabelCol()
True
>>> loadedRF.getHandleInvalid() == rf.getHandleInvalid()
True
>>> str(loadedRF)
'RFormula(y ~ x + s) (uid=...)'
>>> modelPath = temp_path + "/rFormulaModel"
>>> model.save(modelPath)
>>> loadedModel = RFormulaModel.load(modelPath)
>>> loadedModel.uid == model.uid
True
>>> loadedModel.transform(df).show()
+---+---+---+---------+-----+
|  y|  x|  s| features|label|
+---+---+---+---------+-----+
|1.0|1.0|  a|[1.0,1.0]|  1.0|
|0.0|2.0|  b|[2.0,0.0]|  0.0|
|0.0|0.0|  a|[0.0,1.0]|  0.0|
+---+---+---+---------+-----+
...
>>> str(loadedModel)
'RFormulaModel(ResolvedRFormula(label=y, terms=[x,s], hasIntercept=true)) (uid=...)'
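The example above only exercises the '~', '+', '.' and '-' operators. Below is a minimal, hedged sketch of the interaction operators ':' and '*'; it is not part of the original documentation, and the local SparkSession setup and the column names x1 and x2 are illustrative assumptions.

from pyspark.sql import SparkSession
from pyspark.ml.feature import RFormula

spark = SparkSession.builder.master("local[1]").appName("rformula-operators").getOrCreate()

df = spark.createDataFrame(
    [(1.0, 2.0, 3.0), (0.0, 4.0, 5.0), (1.0, 6.0, 7.0)],
    ["y", "x1", "x2"],
)

# "x1:x2" keeps only the interaction (elementwise product) of x1 and x2,
# while "x1 * x2" is shorthand for "x1 + x2 + x1:x2".
interaction_only = RFormula(formula="y ~ x1:x2")
crossed = RFormula(formula="y ~ x1 * x2")

# features is a single value per row: x1 * x2
interaction_only.fit(df).transform(df).select("features", "label").show(truncate=False)

# features has three values per row: x1, x2 and x1 * x2
crossed.fit(df).transform(df).select("features", "label").show(truncate=False)

spark.stop()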
Related usage
- Python pyspark RDD.saveAsTextFile usage and code examples
- Python pyspark RDD.keyBy usage and code examples
- Python pyspark RDD.sumApprox usage and code examples
- Python pyspark RowMatrix.numCols usage and code examples
- Python pyspark RowMatrix.computePrincipalComponents usage and code examples
- Python pyspark RDD.lookup usage and code examples
- Python pyspark RDD.zipWithIndex usage and code examples
- Python pyspark RDD.sampleByKey usage and code examples
- Python pyspark Rolling.mean usage and code examples
- Python pyspark Rolling.max usage and code examples
- Python pyspark RDD.coalesce usage and code examples
- Python pyspark RDD.subtract usage and code examples
- Python pyspark RDD.count usage and code examples
- Python pyspark RankingEvaluator usage and code examples
- Python pyspark RandomRDDs.uniformRDD usage and code examples
- Python pyspark RDD.groupWith usage and code examples
- Python pyspark RDD.distinct usage and code examples
- Python pyspark RDD.treeAggregate usage and code examples
- Python pyspark RowMatrix.computeSVD usage and code examples
- Python pyspark RowMatrix.multiply usage and code examples
- Python pyspark RandomForest.trainRegressor usage and code examples
- Python pyspark RandomRDDs.exponentialRDD usage and code examples
- Python pyspark RDD.mapPartitionsWithIndex usage and code examples
- Python pyspark Row.asDict usage and code examples
- Python pyspark RandomRDDs.gammaRDD usage and code examples
Note: This article is curated by 纯净天空 from the original English work pyspark.ml.feature.RFormula on spark.apache.org. Unless otherwise stated, the copyright of the original code belongs to the original author; please do not reproduce or copy this translation without permission or authorization.