Python Pandas DataFrame groupby method: usage and code examples

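The DataFrame.groupby(~) method in Pandas splits a DataFrame into groups according to the given criteria. The returned object is useful because it lets you compute statistics (such as the mean and minimum) and apply transformations group by group.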

Parameters

1. by | scalar or array-like or dict

The criteria by which to divide the DataFrame.
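The examples further down only pass column labels to by. As a minimal sketch of the other accepted forms, the snippet below groups by a dict and by an array-like. The DataFrame and the group labels ("cheap", "pricey", "x", "y") are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: four prices with string index labels.
df = pd.DataFrame({"price": [200, 300, 700, 900]}, index=["a", "b", "c", "d"])

# A dict maps each index label to the group it belongs to.
label_to_group = {"a": "cheap", "b": "cheap", "c": "pricey", "d": "pricey"}
by_dict = df.groupby(label_to_group)["price"].sum()

# An array with one entry per row assigns a group label to each row by position.
row_groups = np.array(["x", "y", "x", "y"])
by_array = df.groupby(row_groups)["price"].sum()

print(by_dict)
print(by_array)
```

With a dict, rows are matched by index label; with an array, they are matched by position, so the array must be as long as the DataFrame.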

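2. axis | int or string | optional

Whether to group the rows or the columns of the DataFrame:

Axis	Description
`0` or `"index"`	Rows are grouped.
`1` or `"columns"`	Columns are grouped.

By default, axis=0. Note that axis is deprecated as of Pandas 2.1.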

3. level | int or string | optional

The index level to group by. This is only relevant when the source DataFrame has a MultiIndex. By default, level=None.
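None of the examples below use level, so here is a minimal sketch of grouping by one level of a MultiIndex. The DataFrame df_multi and its level names are assumptions made for this illustration.

```python
import pandas as pd

# Hypothetical DataFrame whose index has two levels: brand and device.
index = pd.MultiIndex.from_tuples(
    [("apple", "phone"), ("apple", "computer"), ("google", "phone")],
    names=["brand", "device"],
)
df_multi = pd.DataFrame({"price": [200, 700, 300]}, index=index)

# The level can be referred to by name or by integer position.
by_level_name = df_multi.groupby(level="brand")["price"].sum()
by_level_pos = df_multi.groupby(level=0)["price"].sum()

print(by_level_name)
```

Both calls group on the outermost index level, so they produce the same result.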

4. as_index | boolean | optional

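Whether to use the group labels as the index of the resulting DataFrame. By default, as_index=True.

5. sort | boolean | optional

Whether to sort the groups by their group labels. By default, sort=True. For better performance, consider passing False when this behavior is not needed.

6. group_keys | boolean | optional

Whether to include the group labels in the index when the function we apply changes the index. See the example below for clarification. By default, group_keys=True.

7. squeeze | boolean | optional

Whether to return a simplified type where possible. See the example below for clarification. By default, squeeze=False. This parameter was deprecated in Pandas 1.1 and removed in Pandas 2.0.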

Return value

A DataFrameGroupBy object.

Examples

Consider the following DataFrame:

import pandas as pd

df = pd.DataFrame({"price":[200,300,700,900], "brand": ["apple","google","apple","google"], "device":["phone","phone","computer","phone"]})
df



   price  brand   device
0  200    apple   phone
1  300    google  phone
2  700    apple   computer
3  900    google  phone

Grouping by a single column

To split the DataFrame by the categories in the brand column:

groups_brand = df.groupby("brand")      # Returns a groupby object

We can inspect the partitions with the groups attribute:

df.groupby("brand").groups



{'apple': Int64Index([0, 2], dtype='int64'),
 'google': Int64Index([1, 3], dtype='int64')}

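We can see that the DataFrame is split into two groups: the apple brand and the google brand. The rows for apple are at index 0 and index 2, while the rows for google are at index 1 and index 3.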

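You can perform many operations on the returned groupby object, such as computing the mean price of each brand:

groups_brand.mean(numeric_only=True)   # Returns a DataFrame



        price
brand
apple   450.0
google  600.0

Note that the mean is computed for the price column but not for the device column. Older versions of Pandas silently dropped non-numeric columns from such aggregations. Since Pandas 2.0, calling mean() on groups that contain string columns raises a TypeError unless you pass numeric_only=True or select the numeric columns yourself.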
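As a small sketch of how to restrict an aggregation to numeric columns, the snippet below shows two options that should behave the same across Pandas versions: passing numeric_only=True, or selecting the columns before aggregating. The variable names are chosen for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "price": [200, 300, 700, 900],
    "brand": ["apple", "google", "apple", "google"],
    "device": ["phone", "phone", "computer", "phone"],
})

# Option 1: ask the aggregation to skip non-numeric columns.
mean_numeric = df.groupby("brand").mean(numeric_only=True)

# Option 2: select the columns to aggregate explicitly.
mean_selected = df.groupby("brand")[["price"]].mean()

print(mean_numeric)
```

Selecting columns explicitly is the more defensive choice, since the result does not change if a new numeric column is later added to df.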

Grouping by multiple columns

To find the mean price of Apple phones, Apple computers, Google phones, and Google computers:

df.groupby(["brand","device"]).mean()



                    price
brand    device
apple    computer   700.0
         phone      200.0
google   phone      600.0

Note that df contains no google computer, so that combination does not appear in the output.

For reference, here is df again:

df



   price  brand   device
0   200   apple   phone
1   300   google  phone
2   700   apple   computer
3   900   google  phone

Iterating over a groupby object

To iterate over all the groups of a groupby object:

for group_name, group in df.groupby("brand"):
    print("group_name:", group_name)
    print(group)   # DataFrame



group_name: apple
   price  brand  device
0   200   apple  phone
2   700   apple  computer

group_name: google
   price  brand   device
1   300   google  phone
3   900   google  phone

Using aggregation functions

To compute the mean price of apple and google devices:

df.groupby("brand")[["price"]].agg("mean")   # Returns a DataFrame



        price
brand
apple   450.0
google  600.0

Note that groupby objects have instance methods for the common aggregations:

df.groupby("brand")[["price"]].mean()



        price
brand
apple   450.0
google  600.0

You can also pass a list of functions to agg to compute several aggregations at once:

df.groupby("brand")[["price"]].agg(["mean", "max"])



        price
        mean   max
brand
apple   450.0  700
google  600.0  900
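The list passed to agg can also mix built-in aggregation names with your own functions. The following is a minimal sketch; price_range is a hypothetical helper written for this illustration, not part of Pandas.

```python
import pandas as pd

df = pd.DataFrame({
    "price": [200, 300, 700, 900],
    "brand": ["apple", "google", "apple", "google"],
    "device": ["phone", "phone", "computer", "phone"],
})

def price_range(prices):
    """Hypothetical helper: spread between the highest and lowest price in a group."""
    return prices.max() - prices.min()

# Each entry in the list becomes one column of the result,
# named after the string or the function.
summary = df.groupby("brand")["price"].agg(["mean", "max", price_range])

print(summary)
```

Because a single column is selected before calling agg, the result has one flat level of column labels rather than a price header above them.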


Applying transformations by group

Here is df again for reference:

df



   price  brand   device
0   200   apple   phone
1   300   google  phone
2   700   apple   computer
3   900   google  phone

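To apply a transformation group by group, use transform(~). The price column is selected explicitly here because subtraction is not defined for the string column device:

df.groupby("brand")[["price"]].transform(lambda col: col - col.max())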



   price
0  -500
1  -600
2   0
3   0

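Note the following:

This is a completely arbitrary example in which each value is shifted by the maximum of the group it belongs to.
We get -500 at index 0 because the highest price in the brand=apple group is 700, so 200-700=-500.
The argument passed to the function (col) is a Series that represents a single column of the group.
The return type of the whole snippet is a DataFrame.

In practice, group-wise transformations are often used for standardization.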
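As a minimal sketch of group-wise standardization, the snippet below converts each price into a z-score relative to its own brand. The name standardized is an assumption for this illustration, and the sample standard deviation (the Pandas default for std) is used.

```python
import pandas as pd

df = pd.DataFrame({
    "price": [200, 300, 700, 900],
    "brand": ["apple", "google", "apple", "google"],
})

# Subtract the group mean and divide by the group standard deviation.
standardized = df.groupby("brand")["price"].transform(
    lambda s: (s - s.mean()) / s.std()
)

print(standardized)
```

The result keeps the original index of df, so it can be assigned back as a new column without any realignment.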

Including only a subset of columns in the result

Consider the following DataFrame:

df = pd.DataFrame({"price":[200,300,700,900], "rating":[3,4,5,3], "brand":["apple","google","apple","google"]})
df



   price    rating  brand
0   200       3     apple
1   300       4     google
2   700       5     apple
3   900       3     google

Here we have two numeric columns: price and rating. By default, calling an aggregation such as mean() computes the aggregate for every numeric column. For example:

df.groupby("brand").mean()     # Returns a DataFrame



         price   rating
brand		
apple    450.0   4.0
google   600.0   3.5

To compute the aggregate for the price column only, use [] notation directly on the groupby object:

df.groupby("brand")["price"].mean()   # Returns a Series



brand
apple     450.0
google    600.0
Name: price, dtype: float64

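Naming columns with keyword arguments

When using the agg method, you can assign column labels to the resulting DataFrame by passing keyword arguments:

df.groupby("brand")["price"].agg(mean_price="mean")



        mean_price
brand   
apple      450.0
google     600.0

For this to work, you must specify the column to aggregate (in this case "price").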
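Keyword arguments can also name aggregations over several columns at once when each value is a (column, function) tuple. A minimal sketch follows; the output labels mean_price and max_rating are chosen for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "price": [200, 300, 700, 900],
    "rating": [3, 4, 5, 3],
    "brand": ["apple", "google", "apple", "google"],
})

# Each keyword is an output column label; each tuple is (source column, aggregation).
summary = df.groupby("brand").agg(
    mean_price=("price", "mean"),
    max_rating=("rating", "max"),
)

print(summary)
```

In this form the column is named inside each tuple, so no [] selection on the groupby object is needed.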

Specifying as_index

Consider the following DataFrame:

df = pd.DataFrame({"price":[200,300,700,900], "brand": ["C","B","A","B"]}, index=["a","b","c","d"])
df



    price   brand
a    200     C
b    300     B
c    700     A
d    900     B

By default, as_index=True, which means the group labels are used as the index of the resulting DataFrame:

df.groupby("brand", as_index=True).mean()



       price
brand   
A      700.0
B      600.0
C      200.0

Setting as_index=False turns the group labels into a regular column:

df.groupby("brand", as_index=False).mean()



    brand   price
0     A     700.0
1     B     600.0
2     C     200.0

Note that the index is now the default integer index ([0,1,2]).
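One way to read as_index=False is as a shortcut for aggregating with the default settings and then calling reset_index(). The sketch below compares the two; the variable names are assumptions for this illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"price": [200, 300, 700, 900], "brand": ["C", "B", "A", "B"]},
    index=["a", "b", "c", "d"],
)

# Group labels kept as a regular column.
flat = df.groupby("brand", as_index=False).mean()

# Group labels used as the index first, then moved back into a column.
via_reset = df.groupby("brand").mean().reset_index()

print(flat.equals(via_reset))
```

For this simple aggregation, both approaches produce a DataFrame with a brand column, a price column, and a default integer index.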

Specifying sort

Consider the following DataFrame:

df = pd.DataFrame({"price":[200,300,700,900], "brand":["C","B","A","B"]}, index=["a","b","c","d"])
df



   price  brand
a  200      C
b  300      B
c  700      A
d  900      B

By default, sort=True, which means the group labels are sorted:

df.groupby("brand", sort=True).mean()



        price
brand   
A       700.0
B       600.0
C       200.0

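Note how the index of the resulting DataFrame is sorted in ascending order.

If we do not need this behavior, we can set sort=False, in which case the groups appear in the order they are first encountered:

df.groupby("brand", sort=False).mean()



       price
brand   
C      200.0
B      600.0
A      700.0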

Specifying group_keys

Consider the following DataFrame:

df = pd.DataFrame({"price":[200,300,700, 900], "brand":["apple","google","apple","google"]}, index=["a","b","c","d"])
df



   price  brand
a  200    apple
b  300    google
c  700    apple
d  900    google

By default, group_keys=True, which means the group names appear in the index of the resulting DataFrame:

df.groupby("brand", group_keys=True).apply(lambda x: x.reset_index())



           index   price   brand
brand            
apple   0    a     200     apple
        1    c     700     apple
google  0    b     300     google
        1    d     900     google

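Note that group_keys only takes effect when the applied function changes the index of the resulting DataFrame, as the DataFrame reset_index method does.

Setting group_keys=False removes the group names from the resulting DataFrame: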

df.groupby("brand", group_keys=False).apply(lambda x: x.reset_index())



    index   price   brand
0     a     200     apple
1     c     700     apple
0     b     300     google
1     d     900     google

Note how the brands no longer appear in the index.

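Specifying squeeze

By default, squeeze=False, which means the return type is not simplified even when that is possible. The squeeze parameter was deprecated in Pandas 1.1 and removed in Pandas 2.0, so the squeeze=True example in this section runs only on older versions.

Consider the following DataFrame: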

df = pd.DataFrame({"A":[2,3]})
df



   A
0  2
1  3

This is a completely arbitrary example, but suppose we group by column A and then, for each group, apply a function that simply returns a Series containing "b":

df.groupby("A").apply(lambda x: pd.Series(["b"]))



   0
A
2  b
3  b

Here the return type is a DataFrame, which can be simplified to a Series.

To simplify the return type to a Series, set squeeze=True in groupby(~), like so:

df.groupby("A", squeeze=True).apply(lambda x: pd.Series(["b"]))



0    b
0    b
dtype: object

Here the return type is a Series.
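On Pandas versions where groupby(~) no longer accepts squeeze, one possible way to get a similar simplification is to call the DataFrame squeeze method on the result. This is a sketch of an alternative, not an exact reproduction of the old squeeze=True output: the resulting index is not necessarily the same. Depending on the Pandas version, the apply call may also emit a deprecation warning about grouping columns.

```python
import pandas as pd

df = pd.DataFrame({"A": [2, 3]})

# Each group returns a one-element Series, so the combined result
# is a DataFrame with a single column labelled 0.
result = df.groupby("A").apply(lambda x: pd.Series(["b"]))

# Collapse the single column into a Series.
as_series = result.squeeze("columns")

print(type(as_series).__name__)
```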


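Note: This article was compiled by 純淨天空 from the original English work "Pandas DataFrame | groupby method" by Isshin Inada. Unless otherwise stated, copyright of the original code belongs to the original author. Please do not reproduce or copy this translation without permission or authorization.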