Python pyspark NaiveBayes用法及代码示例

本文简要介绍 pyspark.ml.classification.NaiveBayes 的用法。

用法: class pyspark.ml.classification.NaiveBayes(*, featuresCol='features', labelCol='label', predictionCol='prediction', probabilityCol='probability', rawPredictionCol='rawPrediction', smoothing=1.0, modelType='multinomial', thresholds=None, weightCol=None)

朴素贝叶斯分类器。它同时支持多项式和伯努利 NB。 Multinomial NB 可以处理有限支持的离散数据。例如，通过将文档转换为TF-IDF 向量，它可以用于文档分类。通过使每个向量成为二进制 (0/1) 数据，它也可以用作 Bernoulli NB 。

多项式 NB 和伯努利 NB 的输入特征值必须为非负数。从 3.0.0 开始，它支持 Complement NB，这是对 Multinomial NB 的改编。具体来说，Complement NB 使用来自每个类的补集的统计数据来计算模型的系数。 Complement NB 的发明者凭经验表明 CNB 的参数估计比多项式 NB 的参数估计更稳定。与多项式 NB 一样，Complement NB 的输入特征值必须是非负的。从 3.0.0 开始，它还支持 Gaussian NB 。可以处理连续数据。

1.5.0 版中的新函数。

例子：

>>> from pyspark.sql import Row
>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([
...     Row(label=0.0, weight=0.1, features=Vectors.dense([0.0, 0.0])),
...     Row(label=0.0, weight=0.5, features=Vectors.dense([0.0, 1.0])),
...     Row(label=1.0, weight=1.0, features=Vectors.dense([1.0, 0.0]))])
>>> nb = NaiveBayes(smoothing=1.0, modelType="multinomial", weightCol="weight")
>>> model = nb.fit(df)
>>> model.setFeaturesCol("features")
NaiveBayesModel...
>>> model.getSmoothing()
1.0
>>> model.pi
DenseVector([-0.81..., -0.58...])
>>> model.theta
DenseMatrix(2, 2, [-0.91..., -0.51..., -0.40..., -1.09...], 1)
>>> model.sigma
DenseMatrix(0, 0, [...], ...)
>>> test0 = sc.parallelize([Row(features=Vectors.dense([1.0, 0.0]))]).toDF()
>>> model.predict(test0.head().features)
1.0
>>> model.predictRaw(test0.head().features)
DenseVector([-1.72..., -0.99...])
>>> model.predictProbability(test0.head().features)
DenseVector([0.32..., 0.67...])
>>> result = model.transform(test0).head()
>>> result.prediction
1.0
>>> result.probability
DenseVector([0.32..., 0.67...])
>>> result.rawPrediction
DenseVector([-1.72..., -0.99...])
>>> test1 = sc.parallelize([Row(features=Vectors.sparse(2, [0], [1.0]))]).toDF()
>>> model.transform(test1).head().prediction
1.0
>>> nb_path = temp_path + "/nb"
>>> nb.save(nb_path)
>>> nb2 = NaiveBayes.load(nb_path)
>>> nb2.getSmoothing()
1.0
>>> model_path = temp_path + "/nb_model"
>>> model.save(model_path)
>>> model2 = NaiveBayesModel.load(model_path)
>>> model.pi == model2.pi
True
>>> model.theta == model2.theta
True
>>> model.transform(test0).take(1) == model2.transform(test0).take(1)
True
>>> nb = nb.setThresholds([0.01, 10.00])
>>> model3 = nb.fit(df)
>>> result = model3.transform(test0).head()
>>> result.prediction
0.0
>>> nb3 = NaiveBayes().setModelType("gaussian")
>>> model4 = nb3.fit(df)
>>> model4.getModelType()
'gaussian'
>>> model4.sigma
DenseMatrix(2, 2, [0.0, 0.25, 0.0, 0.0], 1)
>>> nb5 = NaiveBayes(smoothing=1.0, modelType="complement", weightCol="weight")
>>> model5 = nb5.fit(df)
>>> model5.getModelType()
'complement'
>>> model5.theta
DenseMatrix(2, 2, [...], 1)
>>> model5.sigma
DenseMatrix(0, 0, [...], ...)

相关用法

注：本文由纯净天空筛选整理自spark.apache.org大神的英文原创作品 pyspark.ml.classification.NaiveBayes。非经特殊声明，原始代码版权归原作者所有，本译文未经允许或授权，请勿转载或复制。