

Python pyspark NaiveBayes Usage and Code Examples

This article briefly introduces the usage of pyspark.ml.classification.NaiveBayes.

Usage:

class pyspark.ml.classification.NaiveBayes(*, featuresCol='features', labelCol='label', predictionCol='prediction', probabilityCol='probability', rawPredictionCol='rawPrediction', smoothing=1.0, modelType='multinomial', thresholds=None, weightCol=None)

Naive Bayes classifier. It supports both Multinomial and Bernoulli NB. Multinomial NB can handle finitely supported discrete data; for example, by converting documents into TF-IDF vectors, it can be used for document classification. By making every vector a binary (0/1) vector, it can also be used as Bernoulli NB (see the sketch below).
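As a minimal, hypothetical sketch (not part of the original examples), the Bernoulli variant is selected with modelType="bernoulli". The tiny DataFrame below is made up, its feature vectors are assumed to already be 0/1, and the spark session is assumed to exist just as in the examples further down.

>>> from pyspark.sql import Row
>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.ml.classification import NaiveBayes
>>> bin_df = spark.createDataFrame([
...     Row(label=0.0, features=Vectors.dense([0.0, 1.0])),
...     Row(label=1.0, features=Vectors.dense([1.0, 0.0]))])
>>> bnb = NaiveBayes(smoothing=1.0, modelType="bernoulli")
>>> bmodel = bnb.fit(bin_df)
>>> bmodel.getModelType()
'bernoulli'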

The input feature values for Multinomial NB and Bernoulli NB must be non-negative. Since 3.0.0, it supports Complement NB, an adaptation of Multinomial NB. Specifically, Complement NB uses statistics from the complement of each class to compute the model's coefficients. The inventors of Complement NB showed empirically that CNB's parameter estimates are more stable than those of Multinomial NB. Like Multinomial NB, the input feature values for Complement NB must be non-negative. Since 3.0.0, it also supports Gaussian NB, which can handle continuous data.
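As an illustrative sketch (not from the original; the data below is made up and the spark session is assumed to exist), Gaussian NB can be fit on continuous features, which, unlike the other model types, may take negative values. For a Gaussian model, theta holds the per-class feature means and sigma the per-class feature variances.

>>> from pyspark.sql import Row
>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.ml.classification import NaiveBayes
>>> cont_df = spark.createDataFrame([
...     Row(label=0.0, features=Vectors.dense([-1.2, 0.3])),
...     Row(label=0.0, features=Vectors.dense([-0.8, 0.5])),
...     Row(label=1.0, features=Vectors.dense([2.5, -0.7])),
...     Row(label=1.0, features=Vectors.dense([2.1, -0.4]))])
>>> gnb = NaiveBayes(modelType="gaussian")
>>> gmodel = gnb.fit(cont_df)
>>> gmodel.getModelType()
'gaussian'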

New in version 1.5.0.

Examples

>>> from pyspark.sql import Row
>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([
...     Row(label=0.0, weight=0.1, features=Vectors.dense([0.0, 0.0])),
...     Row(label=0.0, weight=0.5, features=Vectors.dense([0.0, 1.0])),
...     Row(label=1.0, weight=1.0, features=Vectors.dense([1.0, 0.0]))])
>>> nb = NaiveBayes(smoothing=1.0, modelType="multinomial", weightCol="weight")
>>> model = nb.fit(df)
>>> model.setFeaturesCol("features")
NaiveBayesModel...
>>> model.getSmoothing()
1.0
>>> model.pi
DenseVector([-0.81..., -0.58...])
>>> model.theta
DenseMatrix(2, 2, [-0.91..., -0.51..., -0.40..., -1.09...], 1)
>>> model.sigma
DenseMatrix(0, 0, [...], ...)
>>> test0 = sc.parallelize([Row(features=Vectors.dense([1.0, 0.0]))]).toDF()
>>> model.predict(test0.head().features)
1.0
>>> model.predictRaw(test0.head().features)
DenseVector([-1.72..., -0.99...])
>>> model.predictProbability(test0.head().features)
DenseVector([0.32..., 0.67...])
>>> result = model.transform(test0).head()
>>> result.prediction
1.0
>>> result.probability
DenseVector([0.32..., 0.67...])
>>> result.rawPrediction
DenseVector([-1.72..., -0.99...])
>>> test1 = sc.parallelize([Row(features=Vectors.sparse(2, [0], [1.0]))]).toDF()
>>> model.transform(test1).head().prediction
1.0
>>> nb_path = temp_path + "/nb"
>>> nb.save(nb_path)
>>> nb2 = NaiveBayes.load(nb_path)
>>> nb2.getSmoothing()
1.0
>>> model_path = temp_path + "/nb_model"
>>> model.save(model_path)
>>> model2 = NaiveBayesModel.load(model_path)
>>> model.pi == model2.pi
True
>>> model.theta == model2.theta
True
>>> model.transform(test0).take(1) == model2.transform(test0).take(1)
True
>>> nb = nb.setThresholds([0.01, 10.00])
>>> model3 = nb.fit(df)
>>> result = model3.transform(test0).head()
>>> result.prediction
0.0
>>> nb3 = NaiveBayes().setModelType("gaussian")
>>> model4 = nb3.fit(df)
>>> model4.getModelType()
'gaussian'
>>> model4.sigma
DenseMatrix(2, 2, [0.0, 0.25, 0.0, 0.0], 1)
>>> nb5 = NaiveBayes(smoothing=1.0, modelType="complement", weightCol="weight")
>>> model5 = nb5.fit(df)
>>> model5.getModelType()
'complement'
>>> model5.theta
DenseMatrix(2, 2, [...], 1)
>>> model5.sigma
DenseMatrix(0, 0, [...], ...)

Note: This article was compiled by 純淨天空 from the original English documentation of pyspark.ml.classification.NaiveBayes on spark.apache.org. Unless otherwise stated, the copyright of the original code belongs to the original author(s); please do not reproduce or copy this translation without permission or authorization.