Python pyspark Series.groupby usage and code examples

This article briefly describes the usage of pyspark.pandas.Series.groupby.

Usage: Series.groupby(by: Union[Any, Tuple[Any, ...], Series, List[Union[Any, Tuple[Any, ...], Series]]], axis: Union[int, str] = 0, as_index: bool = True, dropna: bool = True) → SeriesGroupBy

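Group a DataFrame or Series using a Series of columns.

A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on those groups.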
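The split, apply, and combine steps can be modeled in plain Python. This is only a conceptual sketch using the standard library, not how pandas-on-Spark executes a groupby (which runs distributed on Spark). The helper name `split_apply_combine` is hypothetical and is not part of any library.

```python
from collections import defaultdict
from statistics import mean


def split_apply_combine(keys, values, func):
    """Hypothetical helper: group `values` by `keys`, then reduce each group with `func`."""
    # Split: collect the values that share a key.
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        groups[key].append(value)
    # Apply and combine: reduce each group and gather the results by key.
    return {key: func(group) for key, group in sorted(groups.items())}


animals = ["Falcon", "Falcon", "Parrot", "Parrot"]
max_speed = [380.0, 370.0, 24.0, 26.0]
result = split_apply_combine(animals, max_speed, mean)
print(result)  # {'Falcon': 375.0, 'Parrot': 25.0}
```

Passing a different reducer such as `sum` or `max` as `func` changes only the apply step; the split and combine steps stay the same.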

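Parameters:

by: Series, label, or list of labels: Used to determine the groups for the groupby. If a Series is passed, the Series or dict VALUES will be used to determine the groups. A label or list of labels may be passed to group by the columns in self.
axis: int, default 0 or 'index': Can currently only be set to 0.
as_index: bool, default True: For aggregated output, return an object with the group labels as the index. Only relevant for DataFrame input. as_index=False is effectively "SQL-style" grouped output.
dropna: bool, default True: If True, and the group keys contain NA values, the NA values together with their row/column are dropped. If False, NA values are also treated as keys in the groups.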

Returns:

DataFrameGroupBy or SeriesGroupBy: Depends on the calling object; returns a groupby object that contains information about the groups.

Examples:

>>> import pyspark.pandas as ps
>>> df = ps.DataFrame({'Animal': ['Falcon', 'Falcon',
...                               'Parrot', 'Parrot'],
...                    'Max Speed': [380., 370., 24., 26.]},
...                   columns=['Animal', 'Max Speed'])
>>> df
   Animal  Max Speed
0  Falcon      380.0
1  Falcon      370.0
2  Parrot       24.0
3  Parrot       26.0

>>> df.groupby(['Animal']).mean().sort_index()  
        Max Speed
Animal
Falcon      375.0
Parrot       25.0

>>> df.groupby(['Animal'], as_index=False).mean().sort_values('Animal')
... # doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
   Animal  Max Speed
...Falcon      375.0
...Parrot       25.0

We can also choose whether to include NA in the group keys by setting the dropna parameter, which defaults to True:

>>> l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
>>> df = ps.DataFrame(l, columns=["a", "b", "c"])
>>> df.groupby(by=["b"]).sum().sort_index()  
     a  c
b
1.0  2  3
2.0  2  5

>>> df.groupby(by=["b"], dropna=False).sum().sort_index()  
     a  c
b
1.0  2  3
2.0  2  5
NaN  1  4
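The effect of the as_index and dropna parameters on the shape of the result can be modeled in plain Python. The sketch below is an illustration under stated assumptions, not the pandas-on-Spark implementation: `is_na` and `group_sum` are hypothetical helpers, rows are plain dicts, and missing keys are represented as None (or float NaN). It reuses the a/b/c data from the example above.

```python
import math


def is_na(key):
    """Treat None and float NaN as missing group keys."""
    return key is None or (isinstance(key, float) and math.isnan(key))


def group_sum(rows, by, as_index=True, dropna=True):
    """Hypothetical plain-Python model of groupby(by, ...).sum() over dict rows."""
    totals = {}
    for row in rows:
        key = row[by]
        if is_na(key):
            if dropna:
                continue  # dropna=True: rows with a missing key are skipped
            key = None  # dropna=False: all missing keys share one group
        group = totals.setdefault(key, {})
        for column, value in row.items():
            if column != by:
                group[column] = group.get(column, 0) + value
    # Sort by key, placing the missing-key group last.
    ordered = sorted(totals.items(), key=lambda item: (item[0] is None, item[0] or 0))
    if as_index:
        return dict(ordered)  # group labels act as the index
    return [{by: key, **sums} for key, sums in ordered]  # "SQL-style" rows


rows = [
    {"a": 1, "b": 2, "c": 3},
    {"a": 1, "b": None, "c": 4},
    {"a": 2, "b": 1, "c": 3},
    {"a": 1, "b": 2, "c": 2},
]
print(group_sum(rows, "b"))
# {1: {'a': 2, 'c': 3}, 2: {'a': 2, 'c': 5}}
print(group_sum(rows, "b", dropna=False))
# {1: {'a': 2, 'c': 3}, 2: {'a': 2, 'c': 5}, None: {'a': 1, 'c': 4}}
print(group_sum(rows, "b", as_index=False))
# [{'b': 1, 'a': 2, 'c': 3}, {'b': 2, 'a': 2, 'c': 5}]
```

With as_index=True the group label is the lookup key of the result, while as_index=False keeps it as an ordinary column in each output row, which mirrors the "SQL-style" output described in the parameter list.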

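Note: This article was selected and compiled by 纯净天空 from the original English work pyspark.pandas.Series.groupby on spark.apache.org. Unless otherwise stated, the copyright of the original code belongs to the original author. Please do not reproduce or copy this translation without permission or authorization.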