R dplyr group_by: Group by one or more variables

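Most data operations are done on groups defined by variables. group_by() takes an existing tbl and converts it into a grouped tbl, where operations are performed "by group". ungroup() removes grouping.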

Usage

group_by(.data, ..., .add = FALSE, .drop = group_by_drop_default(.data))

ungroup(x, ...)

Arguments

.data

A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details.

...

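In group_by(), variables or computations to group by. Computations are always done on the ungrouped data frame. To perform computations on the grouped data, you need to use a separate mutate() step before the group_by(). Computations are not allowed in nest_by(). In ungroup(), variables to remove from the grouping.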
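As a small illustration of the ungroup() half of this argument, the sketch below groups mtcars by two variables and then removes only one of them. It assumes a dplyr version in which ungroup() accepts variables (1.0.0 or later); the object names by_vs_am and by_vs are chosen for illustration only.

```r
library(dplyr)

# Group by two variables, then drop only `am` from the grouping
by_vs_am <- mtcars %>% group_by(vs, am)
by_vs <- by_vs_am %>% ungroup(am)

group_vars(by_vs_am)
#> [1] "vs" "am"
group_vars(by_vs)
#> [1] "vs"
```

Calling ungroup() with no variables removes all grouping instead.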

.add

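When FALSE, the default, group_by() will override existing groups. To add to the existing groups, use .add = TRUE.

This argument was previously called add, but that prevented creating a new grouping variable called add, and conflicted with our naming conventions.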

.drop

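Drop groups formed by factor levels that don't appear in the data? The default is TRUE, except when .data has previously been grouped with .drop = FALSE. See group_by_drop_default() for details.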
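The following sketch shows how .drop interacts with an unused factor level and how the choice carries over to later calls. The tibble df and the name kept are made up for this illustration.

```r
library(dplyr)

df <- tibble(
  x = 1:4,
  f = factor(c("a", "a", "c", "c"), levels = c("a", "b", "c"))
)

# Default (.drop = TRUE): the unused level "b" forms no group
df %>% group_by(f) %>% n_groups()
#> [1] 2

# .drop = FALSE keeps an empty group for "b"
kept <- df %>% group_by(f, .drop = FALSE)
n_groups(kept)
#> [1] 3

# The choice is remembered and becomes the default for later group_by() calls
group_by_drop_default(kept)
#> [1] FALSE
```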

x

A tbl()

Value

A grouped data frame with class grouped_df, unless the combination of ... and .add yields an empty set of grouping columns, in which case a tibble will be returned.
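To make the two return types concrete, here is a minimal sketch that compares a grouped result with the result of a call that supplies no grouping columns. The object names grouped and ungrouped are illustrative.

```r
library(dplyr)

# At least one grouping column: the result is a grouped_df
grouped <- mtcars %>% group_by(cyl)
class(grouped)
#> [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

# No grouping columns at all: the result is an ordinary tibble
ungrouped <- mtcars %>% group_by()
is_grouped_df(ungrouped)
#> [1] FALSE
class(ungrouped)
#> [1] "tbl_df"     "tbl"        "data.frame"
```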

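Methods

These functions are generics, which means that packages can provide implementations (methods) for other classes. See the documentation of individual methods for extra arguments and differences in behaviour.

Methods available in currently loaded packages: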

group_by(): dbplyr (tbl_lazy), dplyr (data.frame).
ungroup(): dbplyr (tbl_lazy), dplyr (data.frame, grouped_df, rowwise_df).

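Ordering

Currently, group_by() internally orders the groups in ascending order. This results in ordered output from functions that aggregate groups, such as summarise().

When used as grouping columns, character vectors are ordered in the C locale for performance and reproducibility across R sessions. If the resulting ordering of your grouped operation matters and depends on the locale, you should follow up the grouped operation with an explicit call to arrange() and set the .locale argument. For example: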

data %>%
  group_by(chr) %>%
  summarise(avg = mean(x)) %>%
  arrange(chr, .locale = "en")

This is often useful as a preliminary step before generating content intended for humans, such as an HTML table.
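A self-contained sketch of the ordering described above, assuming dplyr 1.1.0 or later with the dplyr.legacy_locale option unset. The tibble df and its values are invented for the illustration, and the locale-aware arrange() call is guarded because it relies on the stringi package being installed.

```r
library(dplyr)

df <- tibble(
  chr = c("b", "A", "a", "B"),
  x = c(1, 2, 3, 4)
)

# Groups come back in C-locale order: uppercase sorts before lowercase
by_c <- df %>%
  group_by(chr) %>%
  summarise(avg = mean(x))
by_c$chr
#> [1] "A" "B" "a" "b"

# An explicit, locale-aware sort for display (needs the stringi package)
if (requireNamespace("stringi", quietly = TRUE)) {
  by_c %>% arrange(chr, .locale = "en")
}
```

Keeping the locale-aware sort as a final, explicit step means the grouped computation itself stays reproducible across machines.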

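Legacy behavior

Prior to dplyr 1.1.0, character vector grouping columns were ordered in the system locale. If you need to temporarily revert to this behavior, you can set the global option dplyr.legacy_locale to TRUE, but this should be used sparingly, and you should expect this option to be removed in a future version of dplyr. It is better to update existing code to explicitly call arrange(.locale = ) instead. Note that setting dplyr.legacy_locale will also force calls to arrange() to use the system locale.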
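If the legacy option really is needed for a short stretch of code, one possible pattern is to set it with base R's options() and restore the previous value straight afterwards. This is only a sketch: the tibble df is made up, and the resulting group order depends on the system locale, so no output is shown.

```r
library(dplyr)

df <- tibble(chr = c("b", "A", "a", "B"), x = 1:4)

# Switch on the legacy behaviour, remembering the previous option values
old <- options(dplyr.legacy_locale = TRUE)

# Group order now follows the system locale, so it can differ by machine
legacy_result <- df %>%
  group_by(chr) %>%
  summarise(avg = mean(x))

# Restore the previous settings as soon as possible
options(old)
```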

See also

Other grouping functions: group_map(), group_nest(), group_split(), group_trim()

Examples

library(dplyr)

by_cyl <- mtcars %>% group_by(cyl)

# grouping doesn't change how the data looks (apart from listing
# how it's grouped):
by_cyl
#> # A tibble: 32 × 11
#> # Groups:   cyl [3]
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # ℹ 22 more rows

# It changes how it acts with the other dplyr verbs:
by_cyl %>% summarise(
  disp = mean(disp),
  hp = mean(hp)
)
#> # A tibble: 3 × 3
#>     cyl  disp    hp
#>   <dbl> <dbl> <dbl>
#> 1     4  105.  82.6
#> 2     6  183. 122. 
#> 3     8  353. 209. 
by_cyl %>% filter(disp == max(disp))
#> # A tibble: 3 × 11
#> # Groups:   cyl [3]
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#> 2  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#> 3  10.4     8  472    205  2.93  5.25  18.0     0     0     3     4

# Each call to summarise() removes a layer of grouping
by_vs_am <- mtcars %>% group_by(vs, am)
by_vs <- by_vs_am %>% summarise(n = n())
#> `summarise()` has grouped output by 'vs'. You can override using the
#> `.groups` argument.
by_vs
#> # A tibble: 4 × 3
#> # Groups:   vs [2]
#>      vs    am     n
#>   <dbl> <dbl> <int>
#> 1     0     0    12
#> 2     0     1     6
#> 3     1     0     7
#> 4     1     1     7
by_vs %>% summarise(n = sum(n))
#> # A tibble: 2 × 2
#>      vs     n
#>   <dbl> <int>
#> 1     0    18
#> 2     1    14

# To remove grouping, use ungroup()
by_vs %>%
  ungroup() %>%
  summarise(n = sum(n))
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    32

# By default, group_by() overrides existing grouping
by_cyl %>%
  group_by(vs, am) %>%
  group_vars()
#> [1] "vs" "am"

# Use .add = TRUE to instead append
by_cyl %>%
  group_by(vs, am, .add = TRUE) %>%
  group_vars()
#> [1] "cyl" "vs"  "am" 

# You can group by expressions: this is a short-hand
# for a mutate() followed by a group_by()
mtcars %>%
  group_by(vsam = vs + am)
#> # A tibble: 32 × 12
#> # Groups:   vsam [3]
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb  vsam
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4     1
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4     1
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1     2
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2     0
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4     0
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2     1
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2     1
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4     1
#> # ℹ 22 more rows

# The implicit mutate() step is always performed on the
# ungrouped data. Here we get 3 groups:
mtcars %>%
  group_by(vs) %>%
  group_by(hp_cut = cut(hp, 3))
#> # A tibble: 32 × 12
#> # Groups:   hp_cut [3]
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # ℹ 22 more rows
#> # ℹ 1 more variable: hp_cut <fct>

# If you want it to be performed by groups,
# you have to use an explicit mutate() call.
# Here we get 3 groups per value of vs
mtcars %>%
  group_by(vs) %>%
  mutate(hp_cut = cut(hp, 3)) %>%
  group_by(hp_cut)
#> # A tibble: 32 × 12
#> # Groups:   hp_cut [6]
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # ℹ 22 more rows
#> # ℹ 1 more variable: hp_cut <fct>

# when factors are involved and .drop = FALSE, groups can be empty
tbl <- tibble(
  x = 1:10,
  y = factor(rep(c("a", "c"), each  = 5), levels = c("a", "b", "c"))
)
tbl %>%
  group_by(y, .drop = FALSE) %>%
  group_rows()
#> <list_of<integer>[3]>
#> [[1]]
#> [1] 1 2 3 4 5
#> 
#> [[2]]
#> integer(0)
#> 
#> [[3]]
#> [1]  6  7  8  9 10
#>

Source: R/group-by.R

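Note: This article was selected and compiled by 純淨天空 from the original English work "Group by one or more variables" by Hadley Wickham et al. Unless otherwise stated, copyright of the original code belongs to the original authors; please do not reproduce or copy this translation without permission or authorization.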