Python pyspark DataFrame.describe usage and code examples

This article briefly introduces the usage of pyspark.pandas.DataFrame.describe.

Usage: DataFrame.describe(percentiles: Optional[List[float]] = None) → pyspark.pandas.frame.DataFrame

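Generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution, excluding NaN values.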

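Analyzes both numeric and object Series, as well as DataFrame column sets of mixed data types. The output varies depending on what is provided. See the notes below for details.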

Parameters:

percentiles: list of float in the range [0.0, 1.0], default [0.25, 0.5, 0.75]. The list of percentiles to compute.

Returns:

DataFrame: Summary statistics of the DataFrame provided.

Notes:

For numeric data, the result's index includes count, mean, std, min, 25%, 50%, 75%, and max.

Currently only numeric data is supported.
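To make the meaning of each index row concrete, here is a minimal pure-Python sketch that reproduces the rows for a small list of numbers. It is not how pandas-on-Spark computes them: `describe_sketch` is a hypothetical name, and the nearest-rank percentile rule is an assumption chosen because it returns actual data values, which matches the percentile rows in the example outputs on this page (for instance 1.0 and 3.0 for 25% and 75% of [1, 2, 3]).

```python
import math
import statistics


def describe_sketch(values, percentiles=(0.25, 0.5, 0.75)):
    """Illustrative only: mimic the rows of describe() for a small numeric list."""
    # NaN values are excluded, as the description above states.
    data = sorted(x for x in map(float, values) if not math.isnan(x))
    n = len(data)

    def nearest_rank(p):
        # Assumption: nearest-rank percentile, which always returns a value
        # that is present in the data (no interpolation).
        rank = max(1, math.ceil(p * n))
        return data[rank - 1]

    summary = {
        "count": float(n),
        "mean": statistics.mean(data),
        "std": statistics.stdev(data),  # sample standard deviation (n - 1)
        "min": data[0],
    }
    for p in sorted(percentiles):
        summary[f"{p:.0%}"] = nearest_rank(p)
    summary["max"] = data[-1]
    return summary


print(describe_sketch([1, 2, 3]))
```

The std row here uses the sample standard deviation, which is why [1, 2, 3] yields 1.0 rather than the population value of about 0.816.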

Examples:

Describing a numeric Series.

>>> import pyspark.pandas as ps
>>> s = ps.Series([1, 2, 3])
>>> s.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.0
50%      2.0
75%      3.0
max      3.0
dtype: float64

Describing a DataFrame. Only numeric fields are returned.

>>> df = ps.DataFrame({'numeric1': [1, 2, 3],
...                    'numeric2': [4.0, 5.0, 6.0],
...                    'object': ['a', 'b', 'c']
...                   },
...                   columns=['numeric1', 'numeric2', 'object'])
>>> df.describe()
       numeric1  numeric2
count       3.0       3.0
mean        2.0       5.0
std         1.0       1.0
min         1.0       4.0
25%         1.0       4.0
50%         2.0       5.0
75%         3.0       6.0
max         3.0       6.0

For multi-index columns:

>>> df.columns = [('num', 'a'), ('num', 'b'), ('obj', 'c')]
>>> df.describe()
       num
         a    b
count  3.0  3.0
mean   2.0  5.0
std    1.0  1.0
min    1.0  4.0
25%    1.0  4.0
50%    2.0  5.0
75%    3.0  6.0
max    3.0  6.0

>>> df[('num', 'b')].describe()
count    3.0
mean     5.0
std      1.0
min      4.0
25%      4.0
50%      5.0
75%      6.0
max      6.0
Name: (num, b), dtype: float64

Describing a DataFrame and selecting custom percentiles.

>>> df = ps.DataFrame({'numeric1': [1, 2, 3],
...                    'numeric2': [4.0, 5.0, 6.0]
...                   },
...                   columns=['numeric1', 'numeric2'])
>>> df.describe(percentiles=[0.85, 0.15])
       numeric1  numeric2
count       3.0       3.0
mean        2.0       5.0
std         1.0       1.0
min         1.0       4.0
15%         1.0       4.0
50%         2.0       5.0
85%         3.0       6.0
max         3.0       6.0
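Note that the output above lists the percentile rows in ascending order and includes a 50% row even though only 0.85 and 0.15 were requested. The following sketch shows one way such row labels could be derived. `percentile_labels` is a hypothetical helper written for illustration, and its error message is made up; it is not part of pyspark's API.

```python
def percentile_labels(percentiles=None):
    """Hypothetical helper: derive the percentile row labels seen in the output."""
    if percentiles is None:
        percentiles = [0.25, 0.5, 0.75]  # documented default
    if any(p < 0.0 or p > 1.0 for p in percentiles):
        # The parameter is documented as a list of floats in [0.0, 1.0].
        raise ValueError("percentiles must lie in [0.0, 1.0]")
    # Assumption drawn from the output above: the median is always reported.
    if 0.5 not in percentiles:
        percentiles = list(percentiles) + [0.5]
    # Labels are sorted ascending and formatted as whole percentages.
    return [f"{p:.0%}" for p in sorted(percentiles)]


print(percentile_labels([0.85, 0.15]))
```

The order in which percentiles are passed does not affect the order of the rows; in this sketch that follows from sorting before formatting.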

Describing a column from a DataFrame by accessing it as an attribute.

>>> df.numeric1.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.0
50%      2.0
75%      3.0
max      3.0
Name: numeric1, dtype: float64

Describing a column from a DataFrame by accessing it as an attribute and selecting custom percentiles.

>>> df.numeric1.describe(percentiles=[0.85, 0.15])
count    3.0
mean     2.0
std      1.0
min      1.0
15%      1.0
50%      2.0
85%      3.0
max      3.0
Name: numeric1, dtype: float64

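Note: This article was selected and compiled by 純淨天空 from the original English documentation for pyspark.pandas.DataFrame.describe on spark.apache.org. Unless otherwise stated, copyright in the original code belongs to the original authors. Please do not reproduce or copy this translation without permission.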