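Python Pandas DataFrame groupby method: usage and code examples

Pandas' DataFrame.groupby(~) splits your DataFrame into groups according to the specified criteria. The returned object is useful because it lets you compute statistics (such as the mean and minimum) and apply transformations group by group.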

Parameters

1. by | scalar or array-like or dict

The criteria by which to divide the DataFrame.
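As a rough illustration of the three kinds of value that by accepts, the sketch below groups the same small DataFrame by a column label, by an array of keys, and by a dict. The names keys and mapping, and the group names inside them, are made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [200, 300, 700, 900],
                   "brand": ["apple", "google", "apple", "google"]})

# Scalar: a column label.
by_label = df.groupby("brand")["price"].sum()

# Array-like: one group key per row, in row order (hypothetical keys).
keys = np.array(["cheap", "cheap", "pricey", "pricey"])
by_array = df.groupby(keys)["price"].sum()

# Dict: maps index labels (0..3 here) to group names (hypothetical mapping).
mapping = {0: "first_half", 1: "first_half", 2: "second_half", 3: "second_half"}
by_dict = df.groupby(mapping)["price"].sum()

print(by_label, by_array, by_dict, sep="\n\n")
```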

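2. axis | int or string | optional

Whether to split the DataFrame along rows or columns:

Axis	Description
`0` or `"index"`	Rows are grouped, so the DataFrame is split by rows.
`1` or `"columns"`	Columns are grouped, so the DataFrame is split by columns.

By default, axis=0. In pandas 2.1 and later, axis=1 is deprecated in favor of transposing the DataFrame first.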

3. level | int or string | optional

The index level to group by. This is only relevant when the source DataFrame has a MultiIndex. By default, level=None.
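The level parameter is not demonstrated elsewhere on this page, so here is a minimal sketch. It assumes a DataFrame whose MultiIndex has levels named brand and device; df_multi is a name introduced only for this example.

```python
import pandas as pd

# Build a DataFrame with a two-level index (brand, device).
index = pd.MultiIndex.from_tuples(
    [("apple", "phone"), ("google", "phone"),
     ("apple", "computer"), ("google", "phone")],
    names=["brand", "device"],
)
df_multi = pd.DataFrame({"price": [200, 300, 700, 900]}, index=index)

# Group by an index level, referred to either by name or by position.
by_name = df_multi.groupby(level="brand")["price"].mean()
by_position = df_multi.groupby(level=0)["price"].mean()

print(by_name)
```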

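4. as_index | boolean | optional

Whether the group labels are used as the index of the resulting DataFrame. By default, as_index=True.

5. sort | boolean | optional

Whether to sort the groups by their group labels. By default, sort=True. For better performance, consider passing False when this behavior is not needed.

6. group_keys | boolean | optional

Whether to include the group labels in the index when the function we apply changes the index. See the examples below for clarification. By default, group_keys=True.

7. squeeze | boolean | optional

Whether to return a reduced type when possible. See the examples below for clarification. By default, squeeze=False. This parameter was deprecated in pandas 1.1 and removed in pandas 2.0.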

Return value

A DataFrameGroupBy object.

Examples

Consider the following DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({"price":[200,300,700,900], "brand": ["apple","google","apple","google"], "device":["phone","phone","computer","phone"]})
df



   price  brand   device
0  200    apple   phone
1  300    google  phone
2  700    apple   computer
3  900    google  phone

Grouping by a single column

To split the DataFrame by the categories of the brand column:

groups_brand = df.groupby("brand")      # Returns a groupby object

We can view the partition using the groups property:

df.groupby("brand").groups



{'apple': Int64Index([0, 2], dtype='int64'),
 'google': Int64Index([1, 3], dtype='int64')}

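We can see that the DataFrame is split into two groups: the apple brand and the google brand. The rows corresponding to the apple brand are at index 0 and index 2, while the rows corresponding to the google brand are at index 1 and 3.

You can perform many operations with the returned groupby object, such as computing the mean price for each brand:

groups_brand.mean()   # Returns a DataFrame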



        price
brand
apple   450
google  600

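Note that the mean of the price column is computed while the device column is skipped, even though we did not explicitly specify any columns. This is because the output above comes from a pandas version in which aggregations silently skip non-numeric columns. In pandas 2.0 and later, numeric_only defaults to False, so this call raises a TypeError unless you pass numeric_only=True or select the numeric columns first.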
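For readers on a recent pandas release, the following sketch shows two ways to restrict the aggregation to numeric columns. The variable names mean_numeric and mean_selected are introduced only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({"price": [200, 300, 700, 900],
                   "brand": ["apple", "google", "apple", "google"],
                   "device": ["phone", "phone", "computer", "phone"]})

# Option 1: ask the aggregation to ignore non-numeric columns.
mean_numeric = df.groupby("brand").mean(numeric_only=True)

# Option 2: select the numeric column(s) before aggregating.
mean_selected = df.groupby("brand")[["price"]].mean()

print(mean_numeric)
```

Both calls return a DataFrame with a single price column indexed by brand.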

Grouping by multiple columns

To determine the mean price of apple phones, apple computers, google phones and google computers:

df.groupby(["brand","device"]).mean()



                    price
brand    device	
apple    computer   700
         phone      200
google   phone      600

Note that google computers do not exist in df, so you do not see them in the output.
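Grouping by several columns produces a result indexed by a MultiIndex, so an individual group can be looked up with a tuple. A small sketch, with means as an illustrative variable name:

```python
import pandas as pd

df = pd.DataFrame({"price": [200, 300, 700, 900],
                   "brand": ["apple", "google", "apple", "google"],
                   "device": ["phone", "phone", "computer", "phone"]})

# Series indexed by (brand, device) pairs.
means = df.groupby(["brand", "device"])["price"].mean()

# Look up one combination with a tuple key.
apple_phone = means.loc[("apple", "phone")]

# Combinations that never occur in df are simply absent from the index.
has_google_computer = ("google", "computer") in means.index

print(apple_phone, has_google_computer)
```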

For your reference, here is df again:

df



   price  brand   device
0   200   apple   phone
1   300   google  phone
2   700   apple   computer
3   900   google  phone

Iterating over the groupby object

To iterate over all the groups of the groupby object:

for group_name, group in df.groupby("brand"):
    print("group_name:", group_name)
    print(group)   # DataFrame



group_name: apple
   price  brand  device
0   200   apple  phone
2   700   apple  computer

group_name: google
   price  brand   device
1   300   google  phone
3   900   google  phone
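When only one group is needed, looping is not necessary. The sketch below uses get_group to fetch a single group and ngroups to count the groups; grouped and apple_rows are names chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({"price": [200, 300, 700, 900],
                   "brand": ["apple", "google", "apple", "google"],
                   "device": ["phone", "phone", "computer", "phone"]})

grouped = df.groupby("brand")

# Fetch the rows of one group as a DataFrame.
apple_rows = grouped.get_group("apple")

# Number of distinct groups.
n_groups = grouped.ngroups

print(apple_rows)
print(n_groups)
```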

Using aggregate functions

To compute the mean price of apple and google devices:

df.groupby("brand").agg("mean")   # Returns a DataFrame



        price
brand
apple   450
google  600

Note that the groupby object has instance methods for common aggregations:

df.groupby("brand").mean()



        price
brand	
apple   450
google  600

You can also pass a list of functions to agg to compute multiple aggregations:

df.groupby("brand").agg(["mean", np.max])



        price
        mean  amax
brand      
apple   450   700
google  600   900
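agg also accepts a dict that maps column labels to aggregations, which is handy when different columns need different functions. A minimal sketch, where the choice of "mean" for price and "nunique" for device is only for illustration:

```python
import pandas as pd

df = pd.DataFrame({"price": [200, 300, 700, 900],
                   "brand": ["apple", "google", "apple", "google"],
                   "device": ["phone", "phone", "computer", "phone"]})

# One aggregation per column: mean price and number of distinct devices.
summary = df.groupby("brand").agg({"price": "mean", "device": "nunique"})

print(summary)
```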


Applying transformations by group

Here is df again for your reference:

df



   price  brand   device
0   200   apple   phone
1   300   google  phone
2   700   apple   computer
3   900   google  phone

To apply a transformation by group, use transform(~):

df.groupby("brand").transform(lambda col: col - col.max())



   price
0  -500
1  -600
2   0
3   0

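Note the following:

This is a completely arbitrary example in which we shift each value by the maximum of the group it belongs to.
We get -500 at index 0 because the maximum price of the brand=apple group is 700, so 200-700=-500.
The argument passed to the function (col) is of type Series and represents a single column within a group.
The return type of the whole snippet is a DataFrame.

In practice, we often apply transformations by group for standardization.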
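As a sketch of what per-group standardization might look like, the snippet below converts each price into a z-score relative to its own brand. It assumes the sample standard deviation (the pandas default for std) is the desired scale; z is an illustrative variable name.

```python
import pandas as pd

df = pd.DataFrame({"price": [200, 300, 700, 900],
                   "brand": ["apple", "google", "apple", "google"]})

# For each brand, subtract the group mean and divide by the group
# sample standard deviation. The result keeps the original row index.
z = df.groupby("brand")["price"].transform(lambda s: (s - s.mean()) / s.std())

print(z)
```

Because each group here has exactly two rows, the lower price in each brand maps to roughly -0.707 and the higher one to roughly 0.707.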

Including only a subset of columns in the returned result

Consider the following DataFrame:

df = pd.DataFrame({"price":[200,300,700,900], "rating":[3,4,5,3], "brand":["apple","google","apple","google"]})
df



   price    rating  brand
0   200       3     apple
1   300       4     google
2   700       5     apple
3   900       3     google

Here we have two numeric columns: price and rating. By default, calling an aggregation such as mean() computes the aggregate for all numeric columns. For example:

df.groupby("brand").mean()     # Returns a DataFrame



         price   rating
brand		
apple    450.0   4.0
google   600.0   3.5

To compute the aggregate of the price column only, use the [] notation directly on the groupby object:

df.groupby("brand")["price"].mean()   # Returns a Series



brand
apple     450.0
google    600.0
Name: price, dtype: float64
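The return type depends on how the column is selected: a single label gives a Series, while a list of labels gives a DataFrame, even when the list has one entry. A short sketch with illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({"price": [200, 300, 700, 900],
                   "rating": [3, 4, 5, 3],
                   "brand": ["apple", "google", "apple", "google"]})

# Single label: SeriesGroupBy, so the aggregate is a Series.
as_series = df.groupby("brand")["price"].mean()

# List of labels: DataFrameGroupBy, so the aggregate is a DataFrame.
as_frame = df.groupby("brand")[["price"]].mean()
both = df.groupby("brand")[["price", "rating"]].mean()

print(type(as_series).__name__, type(as_frame).__name__)
print(both)
```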

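Using keyword arguments to name columns

When using the agg method, you can assign column labels to the resulting DataFrame by supplying keyword arguments: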

df.groupby("brand")["price"].agg(mean_price="mean")



        mean_price
brand   
apple      450
google     600

For this keyword=function form to work, you must first select the column to aggregate ("price" in this case).
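When no single column is selected beforehand, pandas also supports named aggregation with (column, function) tuples, which lets one call name several output columns. A minimal sketch; the output names mean_price and max_rating are chosen for illustration:

```python
import pandas as pd

df = pd.DataFrame({"price": [200, 300, 700, 900],
                   "rating": [3, 4, 5, 3],
                   "brand": ["apple", "google", "apple", "google"]})

# Each keyword is an output column; each value is (input column, function).
summary = df.groupby("brand").agg(
    mean_price=("price", "mean"),
    max_rating=("rating", "max"),
)

print(summary)
```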

Specifying as_index

Consider the following DataFrame:

df = pd.DataFrame({"price":[200,300,700,900], "brand": ["C","B","A","B"]}, index=["a","b","c","d"])
df



    price   brand
a    200     C
b    300     B
c    700     A
d    900     B

By default, as_index=True, which means the group labels are used as the index of the resulting DataFrame:

df.groupby("brand", as_index=True).mean()



       price
brand   
A       700
B       600
C       200

Setting as_index=False turns the group labels into a regular column:

df.groupby("brand", as_index=False).mean()



    brand   price
0     A     700
1     B     600
2     C     200

Note that the index is now the default integer index ([0,1,2]).
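For a simple aggregation like this one, as_index=False gives the same table as aggregating with the default and then calling reset_index(). The sketch below compares the two; flat and also_flat are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"price": [200, 300, 700, 900],
                   "brand": ["C", "B", "A", "B"]},
                  index=["a", "b", "c", "d"])

# Group labels kept as a column directly.
flat = df.groupby("brand", as_index=False).mean()

# Group labels first placed in the index, then moved back to a column.
also_flat = df.groupby("brand").mean().reset_index()

print(flat)
```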

Specifying sort

Consider the following DataFrame:

df = pd.DataFrame({"price":[200,300,700,900], "brand":["C","B","A","B"]}, index=["a","b","c","d"])
df



   price  brand
a  200      C
b  300      B
c  700      A
d  900      B

By default, sort=True, which means the group labels are sorted:

df.groupby("brand", sort=True).mean()



        price
brand   
A       700
B       600
C       200

Note how the index of the resulting DataFrame is sorted in ascending order.

If we do not need this behavior, set sort=False like so:

df.groupby("brand", sort=False).mean()



       price
brand   
C       200
B       600
A       700
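With sort=False, the groups appear in the order in which each label is first encountered in the DataFrame. The following sketch checks that ordering on the same data; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"price": [200, 300, 700, 900],
                   "brand": ["C", "B", "A", "B"]},
                  index=["a", "b", "c", "d"])

# Groups in order of first appearance: C, then B, then A.
unsorted_means = df.groupby("brand", sort=False)["price"].mean()

# Default behavior: groups sorted by label.
sorted_means = df.groupby("brand")["price"].mean()

print(list(unsorted_means.index), list(sorted_means.index))
```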

Specifying group_keys

Consider the following DataFrame:

df = pd.DataFrame({"price":[200,300,700, 900], "brand":["apple","google","apple","google"]}, index=["a","b","c","d"])
df



   price  brand
a  200    apple
b  300    google
c  700    apple
d  900    google

By default, group_keys=True, which means the group names appear in the index of the resulting DataFrame:

df.groupby("brand", group_keys=True).apply(lambda x: x.reset_index())



           index   price   brand
brand            
apple   0    a     200     apple
        1    c     700     apple
google  0    b     300     google
        1    d     900     google

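Note that group_keys only takes effect when the function we apply changes the index of the resulting DataFrame, as the DataFrame reset_index method does.

Setting group_keys=False removes the group names from the index of the resulting DataFrame: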

df.groupby("brand", group_keys=False).apply(lambda x: x.reset_index())



    index   price   brand
0     a     200     apple
1     c     700     apple
0     b     300     google
1     d     900     google

Note how the brands no longer appear in the index.
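The same effect can be seen on a single column. The sketch below picks the most expensive row per brand with nlargest, a function that returns a subset of each group, and compares the resulting index with and without group keys. The variable names are introduced for this example only.

```python
import pandas as pd

df = pd.DataFrame({"price": [200, 300, 700, 900],
                   "brand": ["apple", "google", "apple", "google"]},
                  index=["a", "b", "c", "d"])

# group_keys=True: the brand is prepended as an outer index level.
top_with_keys = (df.groupby("brand", group_keys=True)["price"]
                   .apply(lambda s: s.nlargest(1)))

# group_keys=False: only the original row labels remain in the index.
top_without_keys = (df.groupby("brand", group_keys=False)["price"]
                      .apply(lambda s: s.nlargest(1)))

print(top_with_keys)
print(top_without_keys)
```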

Specifying squeeze

By default, squeeze=False, which means the return type is not reduced even when that is possible.

Consider the following DataFrame:

df = pd.DataFrame({"A":[2,3]})
df



   A
0  2
1  3

This is a completely arbitrary example, but suppose we group by column A and then, for each group, apply a function that simply returns a Series containing "b":

df.groupby("A").apply(lambda x: pd.Series(["b"]))



   0
A
2  b
3  b

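Here, the return type is a DataFrame, which could be reduced to a Series.

To reduce our return type to a Series, set squeeze=True in groupby(~) like so (this only works on pandas versions before 2.0, where the parameter still exists):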

df.groupby("A", squeeze=True).apply(lambda x: pd.Series(["b"]))



0    b
0    b
dtype: object

Here, the return type is a Series.
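On pandas versions where the squeeze parameter is no longer available, one possible substitute is to call DataFrame.squeeze on the result. This is a different technique from squeeze=True and is only a sketch: the resulting Series is indexed by the group labels (2 and 3) rather than by 0 and 0 as in the output above. Some recent pandas 2.x releases may also emit a deprecation warning about apply operating on the grouping column.

```python
import pandas as pd

df = pd.DataFrame({"A": [2, 3]})

# One-column DataFrame indexed by the group labels.
result = df.groupby("A").apply(lambda x: pd.Series(["b"]))

# Collapse the single column into a Series.
as_series = result.squeeze("columns")

print(type(as_series).__name__)
print(as_series)
```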


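Note: This article was selected and compiled by 纯净天空 from the original English work "Pandas DataFrame | groupby method" by Isshin Inada. Unless otherwise stated, the copyright of the original code belongs to the original author. Do not reproduce or copy this translation without permission or authorization.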