Python pyspark DataFrame.describe usage and code examples

This article briefly introduces the usage of pyspark.pandas.DataFrame.describe.

Usage: DataFrame.describe(percentiles: Optional[List[float]] = None) → pyspark.pandas.frame.DataFrame

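Generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution, excluding NaN values.

Analyzes both numeric and object Series, as well as DataFrame column sets of mixed data types. The output varies depending on what is provided. Refer to the notes below for more detail.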

Parameters:

percentiles: list of float in the range [0.0, 1.0], default [0.25, 0.5, 0.75]. The list of percentiles to compute.

Returns:

DataFrame: Summary statistics of the DataFrame provided.

Notes:

For numeric data, the result's index will include count, mean, std, min, 25%, 50%, 75%, max.

Currently only numeric data is supported.

Examples:

Describing a numeric Series.

>>> import pyspark.pandas as ps
>>> s = ps.Series([1, 2, 3])
>>> s.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.0
50%      2.0
75%      3.0
max      3.0
dtype: float64
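
To make the meaning of each row concrete, the following plain-Python sketch recomputes the same summary without Spark. It is an illustration only, not the pandas-on-Spark implementation: the helper name `describe_values` is hypothetical, `std` is taken to be the sample standard deviation, and the percentiles use an assumed nearest-rank rule (each percentile is an actual element of the data), chosen because it reproduces the values shown on this page.

```python
import math
import statistics


def describe_values(values, percentiles=(0.25, 0.5, 0.75)):
    """Summarize numeric values, skipping NaN (hypothetical helper)."""
    # Drop NaN entries, since the summary excludes them.
    data = sorted(float(v) for v in values if not math.isnan(v))
    n = len(data)
    if n == 0:
        raise ValueError("need at least one non-NaN value")

    summary = {
        "count": float(n),
        "mean": statistics.fmean(data),
        # Sample standard deviation (n - 1 denominator); NaN for one value.
        "std": statistics.stdev(data) if n > 1 else math.nan,
        "min": data[0],
    }
    # Assumed nearest-rank rule: take the element at rank ceil(p * n).
    for p in sorted(percentiles):
        rank = max(1, math.ceil(p * n))
        summary[f"{p * 100:g}%"] = data[rank - 1]
    summary["max"] = data[-1]
    return summary


print(describe_values([1, 2, 3]))
```

For `[1, 2, 3]` this yields count 3.0, mean 2.0, std 1.0, min 1.0, 25% 1.0, 50% 2.0, 75% 3.0, and max 3.0, matching the Series output above. An interpolating percentile would instead give 1.5 and 2.5 for the 25% and 75% rows.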

Describing a DataFrame. Only numeric fields are returned.

>>> df = ps.DataFrame({'numeric1': [1, 2, 3],
...                    'numeric2': [4.0, 5.0, 6.0],
...                    'object': ['a', 'b', 'c']
...                   },
...                   columns=['numeric1', 'numeric2', 'object'])
>>> df.describe()
       numeric1  numeric2
count       3.0       3.0
mean        2.0       5.0
std         1.0       1.0
min         1.0       4.0
25%         1.0       4.0
50%         2.0       5.0
75%         3.0       6.0
max         3.0       6.0

For multi-index columns:

>>> df.columns = [('num', 'a'), ('num', 'b'), ('obj', 'c')]
>>> df.describe()
       num
         a    b
count  3.0  3.0
mean   2.0  5.0
std    1.0  1.0
min    1.0  4.0
25%    1.0  4.0
50%    2.0  5.0
75%    3.0  6.0
max    3.0  6.0

>>> df[('num', 'b')].describe()
count    3.0
mean     5.0
std      1.0
min      4.0
25%      4.0
50%      5.0
75%      6.0
max      6.0
Name: (num, b), dtype: float64

Describing a DataFrame and selecting custom percentiles.

>>> df = ps.DataFrame({'numeric1': [1, 2, 3],
...                    'numeric2': [4.0, 5.0, 6.0]
...                   },
...                   columns=['numeric1', 'numeric2'])
>>> df.describe(percentiles=[0.85, 0.15])
       numeric1  numeric2
count       3.0       3.0
mean        2.0       5.0
std         1.0       1.0
min         1.0       4.0
15%         1.0       4.0
50%         2.0       5.0
85%         3.0       6.0
max         3.0       6.0
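
In the output above, the requested percentiles appear in ascending order and a 50% row is present even though only 0.85 and 0.15 were passed. The sketch below shows one way such row labels could be derived. The function name `percentile_labels` and the `ValueError` raised for out-of-range input are assumptions made for illustration; only the [0.0, 1.0] range and the default list come from the parameter description.

```python
def percentile_labels(percentiles=None):
    """Return index labels for the percentile rows (hypothetical helper)."""
    if percentiles is None:
        percentiles = [0.25, 0.5, 0.75]  # documented default

    # The parameter is documented as floats in the range [0.0, 1.0].
    for p in percentiles:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"percentile {p!r} is outside [0.0, 1.0]")

    # Sort ascending and include the median, as seen in the output above.
    ordered = sorted(set(percentiles) | {0.5})
    return [f"{p * 100:g}%" for p in ordered]


print(percentile_labels([0.85, 0.15]))
```

Calling `percentile_labels([0.85, 0.15])` returns `['15%', '50%', '85%']`, the same labels that sit between `min` and `max` in the table.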

Describing a column of a DataFrame by accessing it as an attribute.

>>> df.numeric1.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.0
50%      2.0
75%      3.0
max      3.0
Name: numeric1, dtype: float64

Describing a column of a DataFrame by accessing it as an attribute and selecting custom percentiles.

>>> df.numeric1.describe(percentiles=[0.85, 0.15])
count    3.0
mean     2.0
std      1.0
min      1.0
15%      1.0
50%      2.0
85%      3.0
max      3.0
Name: numeric1, dtype: float64

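Note: This article was selected and compiled by 纯净天空 from the original English work pyspark.pandas.DataFrame.describe on spark.apache.org. Unless otherwise stated, copyright of the original code belongs to the original author. Do not reproduce or copy this translation without permission or authorization.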