Python pyspark RandomForestClassifier用法及代码示例

本文简要介绍 pyspark.ml.classification.RandomForestClassifier 的用法。

用法: class pyspark.ml.classification.RandomForestClassifier(*, featuresCol='features', labelCol='label', predictionCol='prediction', probabilityCol='probability', rawPredictionCol='rawPrediction', maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0, maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10, impurity='gini', numTrees=20, featureSubsetStrategy='auto', seed=None, subsamplingRate=1.0, leafCol='', minWeightFractionPerNode=0.0, weightCol=None, bootstrap=True)

Random Forest 学习分类算法。它支持二进制和多类标签，以及连续和分类特征。

1.4.0 版中的新函数。

例子：

>>> import numpy
>>> from numpy import allclose
>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.ml.feature import StringIndexer
>>> df = spark.createDataFrame([
...     (1.0, Vectors.dense(1.0)),
...     (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
>>> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
>>> si_model = stringIndexer.fit(df)
>>> td = si_model.transform(df)
>>> rf = RandomForestClassifier(numTrees=3, maxDepth=2, labelCol="indexed", seed=42,
...     leafCol="leafId")
>>> rf.getMinWeightFractionPerNode()
0.0
>>> model = rf.fit(td)
>>> model.getLabelCol()
'indexed'
>>> model.setFeaturesCol("features")
RandomForestClassificationModel...
>>> model.setRawPredictionCol("newRawPrediction")
RandomForestClassificationModel...
>>> model.getBootstrap()
True
>>> model.getRawPredictionCol()
'newRawPrediction'
>>> model.featureImportances
SparseVector(1, {0: 1.0})
>>> allclose(model.treeWeights, [1.0, 1.0, 1.0])
True
>>> test0 = spark.createDataFrame([(Vectors.dense(-1.0),)], ["features"])
>>> model.predict(test0.head().features)
0.0
>>> model.predictRaw(test0.head().features)
DenseVector([2.0, 0.0])
>>> model.predictProbability(test0.head().features)
DenseVector([1.0, 0.0])
>>> result = model.transform(test0).head()
>>> result.prediction
0.0
>>> numpy.argmax(result.probability)
0
>>> numpy.argmax(result.newRawPrediction)
0
>>> result.leafId
DenseVector([0.0, 0.0, 0.0])
>>> test1 = spark.createDataFrame([(Vectors.sparse(1, [0], [1.0]),)], ["features"])
>>> model.transform(test1).head().prediction
1.0
>>> model.trees
[DecisionTreeClassificationModel...depth=..., DecisionTreeClassificationModel...]
>>> rfc_path = temp_path + "/rfc"
>>> rf.save(rfc_path)
>>> rf2 = RandomForestClassifier.load(rfc_path)
>>> rf2.getNumTrees()
3
>>> model_path = temp_path + "/rfc_model"
>>> model.save(model_path)
>>> model2 = RandomForestClassificationModel.load(model_path)
>>> model.featureImportances == model2.featureImportances
True
>>> model.transform(test0).take(1) == model2.transform(test0).take(1)
True

相关用法

注：本文由纯净天空筛选整理自spark.apache.org大神的英文原创作品 pyspark.ml.classification.RandomForestClassifier。非经特殊声明，原始代码版权归原作者所有，本译文未经允许或授权，请勿转载或复制。