Python pyspark Statistics.chiSqTest: usage and code examples

This article briefly describes how to use pyspark.mllib.stat.Statistics.chiSqTest.

Usage: static chiSqTest(observed, expected=None)

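If observed is a vector, conducts Pearson's chi-squared goodness-of-fit test of the observed data against the expected distribution. If expected is omitted, the test uses the uniform distribution, in which each category has an expected relative frequency of 1 / len(observed).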
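
The goodness-of-fit statistic for the uniform case can be reproduced by hand. The following is a minimal sketch in plain Python, not Spark's implementation; the helper name `goodness_of_fit_uniform` is hypothetical. The p-value shortcut used here only holds for exactly 2 degrees of freedom, where the chi-squared survival function reduces to exp(-x / 2).

```python
import math


def goodness_of_fit_uniform(observed):
    """Pearson chi-squared statistic against a uniform expected distribution.

    Hypothetical helper for illustration; not part of pyspark.
    """
    total = sum(observed)
    # A relative frequency of 1 / len(observed) per category is the same
    # as an expected count of total / len(observed).
    expected = total / len(observed)
    statistic = sum((o - expected) ** 2 / expected for o in observed)
    degrees_of_freedom = len(observed) - 1
    return statistic, degrees_of_freedom


statistic, dof = goodness_of_fit_uniform([4, 6, 5])
# Shortcut valid only when dof == 2: p-value = exp(-statistic / 2).
p_value = math.exp(-statistic / 2) if dof == 2 else None
print(round(statistic, 4), dof, round(p_value, 4))
```

For the vector [4, 6, 5] this gives a statistic of 0.4 with 2 degrees of freedom and a p-value of about 0.8187, which agrees with the first example in the examples section.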

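If observed is a matrix, conducts Pearson's independence test on the input contingency matrix, which cannot contain negative entries or any row or column that sums to 0.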
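
To make the matrix case concrete, here is a rough sketch of the independence statistic in plain Python. It is an illustration under stated assumptions rather than Spark's code: `independence_statistic` is a hypothetical helper, and the input values are taken from the matrix example on this page, laid out column-major as `Matrices.dense` expects.

```python
def independence_statistic(table):
    """Pearson chi-squared statistic for a contingency table (list of rows).

    Hypothetical helper for illustration; not part of pyspark.
    """
    if any(value < 0 for row in table for value in row):
        raise ValueError("contingency table cannot contain negative entries")
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    # A zero row or column sum would produce a zero expected count.
    if min(row_sums) == 0 or min(col_sums) == 0:
        raise ValueError("rows and columns must not sum to 0")
    total = sum(row_sums)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_sums) - 1) * (len(col_sums) - 1)
    return statistic, dof


# Column-major values, as passed to Matrices.dense(3, 4, data).
data = [40.0, 24.0, 29.0, 56.0, 32.0, 42.0, 31.0, 10.0, 0.0, 30.0, 15.0, 12.0]
num_rows, num_cols = 3, 4
table = [[data[j * num_rows + i] for j in range(num_cols)]
         for i in range(num_rows)]
statistic, dof = independence_statistic(table)
print(round(statistic, 4), dof)
```

The explicit checks mirror the two input restrictions stated above: no negative entries, and no row or column whose sum is 0.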

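If observed is an RDD of LabeledPoint, conducts Pearson's independence test for every feature against the label across the input RDD. For each feature, the (feature, label) pairs are converted into a contingency matrix, for which the chi-squared statistic is computed. All label and feature values must be categorical.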
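
The conversion from (feature, label) pairs to a contingency matrix can be sketched without Spark. This is an assumption-laden illustration, not the library's implementation: plain tuples stand in for LabeledPoint, and `contingency_table` and `chi_squared` are hypothetical helpers. The data is the labeled dataset from the last example on this page.

```python
from collections import Counter


def contingency_table(pairs):
    """Count (feature value, label) pairs; rows are feature values."""
    counts = Counter(pairs)
    feature_values = sorted({feature for feature, _ in pairs})
    labels = sorted({label for _, label in pairs})
    return [[counts[(f, l)] for l in labels] for f in feature_values]


def chi_squared(table):
    """Pearson chi-squared statistic of a contingency table."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    return sum(
        (table[i][j] - row_sums[i] * col_sums[j] / total) ** 2
        / (row_sums[i] * col_sums[j] / total)
        for i in range(len(row_sums))
        for j in range(len(col_sums))
    )


# (label, features) tuples standing in for LabeledPoint objects.
points = [
    (0.0, [0.5, 10.0]), (0.0, [1.5, 20.0]), (1.0, [1.5, 30.0]),
    (0.0, [3.5, 30.0]), (0.0, [3.5, 40.0]), (1.0, [3.5, 40.0]),
]
num_features = len(points[0][1])
# One test per feature: pair each feature value with its label.
statistics = [
    chi_squared(contingency_table([(feats[k], label) for label, feats in points]))
    for k in range(num_features)
]
print([round(s, 4) for s in statistics])
```

Each distinct feature value becomes one row of the table, which is why real-valued features are effectively treated as categorical.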

Parameters:

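observed: pyspark.mllib.linalg.Vector or pyspark.mllib.linalg.Matrix: a vector containing the observed categorical counts or relative frequencies, or a contingency matrix (containing either counts or relative frequencies), or an RDD of LabeledPoint containing a labeled dataset with categorical features. Real-valued features are treated as categorical, with one category per distinct value.
expected: pyspark.mllib.linalg.Vector: a vector containing the expected categorical counts or relative frequencies. expected is rescaled if its sum differs from the sum of observed.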
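
The rescaling of expected can be illustrated with a short plain-Python sketch. It assumes the rescaling multiplies expected by sum(observed) / sum(expected) so that both vectors have the same total; `goodness_of_fit` is a hypothetical helper, not a pyspark function. The vectors come from the second example on this page.

```python
def goodness_of_fit(observed, expected):
    """Pearson chi-squared statistic with expected rescaled to observed's total.

    Hypothetical helper for illustration; not part of pyspark.
    """
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    # Assumed rescaling: bring expected to the same total as observed.
    scale = sum(observed) / sum(expected)
    rescaled = [e * scale for e in expected]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, rescaled))
    return statistic, rescaled


observed = [21, 38, 43, 80]
expected = [3, 5, 7, 20]
statistic, rescaled = goodness_of_fit(observed, expected)
print(round(statistic, 4), [round(e, 1) for e in rescaled])
```

Because of the rescaling, expected may be given either as counts or as relative frequencies; only the proportions between its entries matter.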

Returns:

pyspark.mllib.stat.ChiSqTestResult: an object containing the test statistic, degrees of freedom, p-value, the method used, and the null hypothesis.

Notes:

observed cannot contain negative values.

Examples:

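>>> from pyspark.mllib.linalg import Vectors, Matrices
>>> from pyspark.mllib.regression import LabeledPoint
>>> from pyspark.mllib.stat import Statistics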
>>> observed = Vectors.dense([4, 6, 5])
>>> pearson = Statistics.chiSqTest(observed)
>>> print(pearson.statistic)
0.4
>>> pearson.degreesOfFreedom
2
>>> print(round(pearson.pValue, 4))
0.8187
>>> pearson.method
'pearson'
>>> pearson.nullHypothesis
'observed follows the same distribution as expected.'

>>> observed = Vectors.dense([21, 38, 43, 80])
>>> expected = Vectors.dense([3, 5, 7, 20])
>>> pearson = Statistics.chiSqTest(observed, expected)
>>> print(round(pearson.pValue, 4))
0.0027

>>> data = [40.0, 24.0, 29.0, 56.0, 32.0, 42.0, 31.0, 10.0, 0.0, 30.0, 15.0, 12.0]
>>> chi = Statistics.chiSqTest(Matrices.dense(3, 4, data))
>>> print(round(chi.statistic, 4))
21.9958

>>> data = [LabeledPoint(0.0, Vectors.dense([0.5, 10.0])),
...         LabeledPoint(0.0, Vectors.dense([1.5, 20.0])),
...         LabeledPoint(1.0, Vectors.dense([1.5, 30.0])),
...         LabeledPoint(0.0, Vectors.dense([3.5, 30.0])),
...         LabeledPoint(0.0, Vectors.dense([3.5, 40.0])),
...         LabeledPoint(1.0, Vectors.dense([3.5, 40.0])),]
>>> rdd = sc.parallelize(data, 4)  # sc is an existing SparkContext
>>> chi = Statistics.chiSqTest(rdd)
>>> print(chi[0].statistic)
0.75
>>> print(chi[1].statistic)
1.5

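Note: this article was selected and compiled by 純淨天空 from the original English documentation for pyspark.mllib.stat.Statistics.chiSqTest on spark.apache.org. Unless otherwise stated, copyright of the original code belongs to the original authors; please do not reproduce or copy this translation without permission or authorization.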
