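Python cudf.Series.describe: usage and code examples

Usage: Series.describe(percentiles=None, include=None, exclude=None, datetime_is_numeric=False)

Generate descriptive statistics.

Descriptive statistics include those that summarize the central tendency, dispersion, and shape of a dataset's distribution, excluding NaN values.

Analyzes both numeric and object Series, as well as DataFrame column sets of mixed data types. The output varies depending on what is provided. Refer to the notes below for more detail.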

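Parameters:

percentiles: list-like of numbers, optional

The percentiles to include in the output. All should fall between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.

include: 'all', list-like of dtypes or None (default), optional

A list of data types to include in the result. Ignored for Series. Here are the options:

'all': All columns of the input will be included in the output.
A list-like of dtypes: Limits the results to the provided data types. To limit the result to numeric types, submit numpy.number. To limit it instead to object columns, submit the numpy.object data type (or the built-in object in NumPy versions where the numpy.object alias has been removed). Strings can also be used in the style of select_dtypes (e.g. df.describe(include=['O'])). To select pandas categorical columns, use 'category'.
None (default): The result will include all numeric columns.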

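exclude: list-like of dtypes or None (default), optional

A list of data types to omit from the result. Ignored for Series. Here are the options:

A list-like of dtypes: Excludes the provided data types from the result. To exclude numeric types, submit numpy.number. To exclude object columns, submit the numpy.object data type (or the built-in object in NumPy versions where the numpy.object alias has been removed). Strings can also be used in the style of select_dtypes (e.g. df.describe(exclude=['O'])). To exclude pandas categorical columns, use 'category'.
None (default): The result will exclude nothing.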

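datetime_is_numeric: bool, default False

Whether to treat datetime dtypes as numeric. For DataFrame input, this also controls whether datetime columns are included by default.

Returns:

output_frame: Series or DataFrame

Summary statistics of the Series or DataFrame provided.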

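Notes:

For numeric data, the result's index will include count, mean, std, min, and max, as well as the lower, 50, and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median.

For string dtype or datetime dtype, the result's index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value's frequency. Timestamps also include the first and last items.

If multiple object values have the highest count, then the count and top results will be arbitrarily chosen from among those with the highest count.

For mixed data types provided via a DataFrame, the default is to return only an analysis of numeric columns. If the DataFrame consists only of object and categorical data without any numeric columns, the default is to return an analysis of both the object and categorical columns. If include='all' is provided as an option, the result will include a union of attributes of each type.

The include and exclude parameters can be used to limit which columns in a DataFrame are analyzed for the output. These parameters are ignored when analyzing a Series.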

Examples:

Describing a numeric Series.

>>> import cudf
>>> s = cudf.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>>> s
0     1
1     2
2     3
3     4
4     5
5     6
6     7
7     8
8     9
9    10
dtype: int64
>>> s.describe()
count    10.00000
mean      5.50000
std       3.02765
min       1.00000
25%       3.25000
50%       5.50000
75%       7.75000
max      10.00000
dtype: float64
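
As a cross-check on the output above, the following sketch recomputes the same summary using only the Python standard library. It is an illustration, not cudf's implementation: the helper name describe_numeric is hypothetical, and it assumes a sample standard deviation (n - 1 in the denominator) and linearly interpolated quartiles, which is what the figures shown above are consistent with.

```python
import statistics


def describe_numeric(values):
    """Hypothetical helper: CPU-side summary mirroring the describe() fields."""
    data = sorted(float(v) for v in values)
    # method="inclusive" interpolates linearly between the sorted data points.
    q25, q50, q75 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "count": float(len(data)),
        "mean": statistics.fmean(data),
        "std": statistics.stdev(data),  # sample standard deviation (n - 1)
        "min": data[0],
        "25%": q25,
        "50%": q50,
        "75%": q75,
        "max": data[-1],
    }


summary = describe_numeric(range(1, 11))
for name, value in summary.items():
    print(f"{name:<6}{value:>10.5f}")
```

For the values 1 through 10 this yields quartiles of 3.25, 5.5, and 7.75 and a standard deviation of about 3.02765, matching the cudf output shown above.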

Describing a categorical Series.

>>> s = cudf.Series(['a', 'b', 'a', 'b', 'c', 'a'], dtype='category')
>>> s
0    a
1    b
2    a
3    b
4    c
5    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> s.describe()
count     6
unique    3
top       a
freq      3
dtype: object
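
The four fields reported for categorical data can be sketched in plain Python with collections.Counter. The helper name describe_categorical is hypothetical and this is not how cudf computes the result on the GPU; it only shows what count, unique, top, and freq mean for the same input.

```python
from collections import Counter


def describe_categorical(values):
    """Hypothetical helper: count/unique/top/freq over the non-null values."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    # most_common(1) returns the value with the highest count. On ties,
    # Counter prefers the first value encountered, whereas the notes say
    # describe() makes an arbitrary choice.
    top, freq = counts.most_common(1)[0]
    return {
        "count": len(present),
        "unique": len(counts),
        "top": top,
        "freq": freq,
    }


result = describe_categorical(["a", "b", "a", "b", "c", "a"])
print(result)  # {'count': 6, 'unique': 3, 'top': 'a', 'freq': 3}
```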

Describing a timestamp Series.

>>> import numpy as np
>>> s = cudf.Series([
...   np.datetime64("2000-01-01"),
...   np.datetime64("2010-01-01"),
...   np.datetime64("2010-01-01")
... ])
>>> s
0   2000-01-01
1   2010-01-01
2   2010-01-01
dtype: datetime64[s]
>>> s.describe()
count                     3
mean    2006-09-01 08:00:00
min     2000-01-01 00:00:00
25%     2004-12-31 12:00:00
50%     2010-01-01 00:00:00
75%     2010-01-01 00:00:00
max     2010-01-01 00:00:00
dtype: object
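
The mean and percentile rows in this output can be reproduced by treating each timestamp as an offset from the earliest value. The sketch below uses only the standard library; the helper name interpolate is hypothetical, and linear interpolation between neighbouring values is an assumption that happens to agree with the figures shown above.

```python
from datetime import datetime, timedelta


def interpolate(sorted_times, q):
    """Hypothetical helper: linearly interpolated quantile q (0..1)."""
    position = q * (len(sorted_times) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_times) - 1)
    gap = sorted_times[upper] - sorted_times[lower]
    return sorted_times[lower] + gap * (position - lower)


times = sorted([datetime(2000, 1, 1), datetime(2010, 1, 1), datetime(2010, 1, 1)])
earliest = times[0]
# Average the offsets from the earliest value, then add the result back.
mean = earliest + sum((t - earliest for t in times), timedelta()) / len(times)
quartiles = {f"{int(q * 100)}%": interpolate(times, q) for q in (0.25, 0.5, 0.75)}

print("mean", mean)
for label, value in quartiles.items():
    print(label, value)
```

The mean lands at 2006-09-01 08:00:00 and the 25% value at 2004-12-31 12:00:00, halfway between the first two timestamps.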

Describing a DataFrame. By default only numeric fields are returned.

>>> df = cudf.DataFrame({"categorical": cudf.Series(['d', 'e', 'f'],
...                         dtype='category'),
...                      "numeric": [1, 2, 3],
...                      "object": ['a', 'b', 'c']
... })
>>> df
  categorical  numeric object
0           d        1      a
1           e        2      b
2           f        3      c
>>> df.describe()
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Describing all columns of a DataFrame regardless of data type.

>>> df.describe(include='all')
       categorical numeric object
count            3     3.0      3
unique           3    <NA>      3
top              d    <NA>      a
freq             1    <NA>      1
mean          <NA>     2.0   <NA>
std           <NA>     1.0   <NA>
min           <NA>     1.0   <NA>
25%           <NA>     1.5   <NA>
50%           <NA>     2.0   <NA>
75%           <NA>     2.5   <NA>
max           <NA>     3.0   <NA>

Describing a column from a DataFrame by accessing it as an attribute.

>>> df.numeric.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
Name: numeric, dtype: float64

Including only numeric columns in a DataFrame description.

>>> df.describe(include=[np.number])
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Including only string columns in a DataFrame description.

>>> df.describe(include=[object])
       object
count       3
unique      3
top         a
freq        1

Including only categorical columns in a DataFrame description.

>>> df.describe(include=['category'])
       categorical
count            3
unique           3
top              d
freq             1

Excluding numeric columns from a DataFrame description.

>>> df.describe(exclude=[np.number])
       categorical object
count            3      3
unique           3      3
top              d      a
freq             1      1

Excluding object columns from a DataFrame description.

>>> df.describe(exclude=[object])
       categorical numeric
count            3     3.0
unique           3    <NA>
top              d    <NA>
freq             1    <NA>
mean          <NA>     2.0
std           <NA>     1.0
min           <NA>     1.0
25%           <NA>     1.5
50%           <NA>     2.0
75%           <NA>     2.5
max           <NA>     3.0

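Note: This article was selected and compiled by 纯净天空 from the original English work cudf.Series.describe by rapids.ai. Unless otherwise stated, the copyright of the original code belongs to the original author; please do not reprint or copy this translation without permission or authorization.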
