R ggplot2 geom_histogram: Histograms and frequency polygons

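Visualise the distribution of a single continuous variable by dividing the x axis into bins and counting the number of observations in each bin. Histograms (geom_histogram()) display the counts with bars; frequency polygons (geom_freqpoly()) display the counts with lines. Frequency polygons are more suitable when you want to compare the distribution across the levels of a categorical variable.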

Usage

geom_freqpoly(
  mapping = NULL,
  data = NULL,
  stat = "bin",
  position = "identity",
  ...,
  na.rm = FALSE,
  show.legend = NA,
  inherit.aes = TRUE
)

geom_histogram(
  mapping = NULL,
  data = NULL,
  stat = "bin",
  position = "stack",
  ...,
  binwidth = NULL,
  bins = NULL,
  na.rm = FALSE,
  orientation = NA,
  show.legend = NA,
  inherit.aes = TRUE
)

stat_bin(
  mapping = NULL,
  data = NULL,
  geom = "bar",
  position = "stack",
  ...,
  binwidth = NULL,
  bins = NULL,
  center = NULL,
  boundary = NULL,
  breaks = NULL,
  closed = c("right", "left"),
  pad = FALSE,
  na.rm = FALSE,
  orientation = NA,
  show.legend = NA,
  inherit.aes = TRUE
)

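Arguments

mapping

Set of aesthetic mappings created by aes(). If specified and inherit.aes = TRUE (the default), it is combined with the default mapping at the top level of the plot. You must supply mapping if there is no plot mapping.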

data

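The data to be displayed in this layer. There are three options:

If NULL, the default, the data is inherited from the plot data as specified in the call to ggplot().

A data.frame, or other object, will override the plot data. All objects will be fortified to produce a data frame. See fortify() for which variables will be created.

A function will be called with a single argument, the plot data. The return value must be a data.frame, and will be used as the layer data. A function can be created from a formula (e.g. ~ head(.x, 10)).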

position

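Position adjustment, either as a string naming the adjustment (e.g. "jitter" to use position_jitter), or the result of a call to a position adjustment function. Use the latter if you need to change the settings of the adjustment.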

...

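Other arguments passed on to layer(). These are often aesthetics, used to set an aesthetic to a fixed value, like colour = "red" or size = 3. They may also be parameters to the paired geom/stat.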

na.rm

If FALSE, the default, missing values are removed with a warning. If TRUE, missing values are silently removed.

show.legend

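Logical. Should this layer be included in the legends? NA, the default, includes it if any aesthetics are mapped. FALSE never includes it, and TRUE always includes it. It can also be a named logical vector to finely select the aesthetics to display.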

inherit.aes

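If FALSE, overrides the default aesthetics, rather than combining with them. This is most useful for helper functions that define both data and aesthetics and shouldn't inherit behaviour from the default plot specification, e.g. borders().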

binwidth

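The width of the bins. Can be specified as a numeric value or as a function that calculates the width from unscaled x. Here, "unscaled x" refers to the original x values in the data, before any scale transformation is applied. When a function is specified together with a grouping structure, the function is called once per group. The default is to use the number of bins given in bins, covering the range of the data. You should always override this value, exploring multiple widths to find the one that best illustrates the stories in your data.

For a date variable, the bin width is measured in days; for a time variable, it is measured in seconds.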
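To illustrate the unit used for date variables, here is a minimal sketch. The `events` data frame and its `day` column are made up for this illustration; `layer_data()` is used only to inspect the bins that ggplot2 computed.

```r
library(ggplot2)

# Hypothetical data: 200 dates drawn from one calendar year
set.seed(1)
events <- data.frame(
  day = as.Date("2023-01-01") + sample(0:364, 200, replace = TRUE)
)

# For a Date variable, binwidth = 7 asks for bins that are 7 days wide
p <- ggplot(events, aes(day)) +
  geom_histogram(binwidth = 7)

# Inspect the computed bins; xmin/xmax are expressed in days
bins <- layer_data(p)
bin_widths <- bins$xmax - bins$xmin
```

For a date-time (POSIXct) variable the same bins would presumably be requested with binwidth = 7 * 24 * 60 * 60, since the unit there is seconds.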

bins

Number of bins. Overridden by binwidth. Defaults to 30.

orientation

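The orientation of the layer. The default (NA) automatically determines the orientation from the aesthetic mapping. In the rare event that this fails, it can be given explicitly by setting orientation to either "x" or "y". See the Orientation section for more detail.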

geom, stat

Use to override the default connection between geom_histogram()/geom_freqpoly() and stat_bin().

center, boundary

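Bin position specifiers. Only one of center or boundary may be specified for a single plot. center specifies the center of one of the bins. boundary specifies the boundary between two bins. Note that if either is above or below the range of the data, it is shifted by the appropriate integer multiple of binwidth. For example, to center bins on integers, use binwidth = 1 and center = 0, even if 0 is outside the range of the data. Alternatively, the same alignment can be specified with binwidth = 1 and boundary = 0.5, even if 0.5 is outside the range of the data.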
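The equivalence between center = 0 and boundary = 0.5 for binwidth = 1 can be checked with a small sketch. The data frame `df` is hypothetical, and `layer_data()` is used here only to read back the bin edges.

```r
library(ggplot2)

# Hypothetical integer-valued data; note that 0 and 0.5 lie outside its range
df <- data.frame(x = c(3, 3, 4, 5, 5, 5, 7))

by_center <- layer_data(
  ggplot(df, aes(x)) + geom_histogram(binwidth = 1, center = 0)
)
by_boundary <- layer_data(
  ggplot(df, aes(x)) + geom_histogram(binwidth = 1, boundary = 0.5)
)

# Both specifications should place the bin edges at ..., 2.5, 3.5, 4.5, ...
same_edges  <- isTRUE(all.equal(by_center$xmin, by_boundary$xmin))
same_counts <- identical(by_center$count, by_boundary$count)
```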

breaks

Alternatively, you can supply a numeric vector giving the bin boundaries. Overrides binwidth, bins, center, and boundary.

closed

One of "right" or "left", indicating whether the right or left edge of each bin is included in the bin.
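The interaction of breaks and closed is easiest to see with observations that sit exactly on a bin boundary. The sketch below uses a made-up data frame `df` in which two observations equal the interior break at 2.

```r
library(ggplot2)

# Hypothetical data: two of the four values lie exactly on the break at 2
df <- data.frame(x = c(1, 2, 2, 3))

right <- layer_data(
  ggplot(df, aes(x)) +
    geom_histogram(breaks = c(1, 2, 3), closed = "right")
)
left <- layer_data(
  ggplot(df, aes(x)) +
    geom_histogram(breaks = c(1, 2, 3), closed = "left")
)

# closed = "right": the values at 2 are counted in the bin ending at 2
right$count
# closed = "left": the values at 2 are counted in the bin starting at 2
left$count
```

The outermost edges appear to be included in either case, so the observations at 1 and 3 are not lost.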

pad

If TRUE, adds empty bins at either end of x. This ensures frequency polygons touch 0. Defaults to FALSE.
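As a rough illustration of pad, the following sketch calls stat_bin() directly with and without padding and compares the computed bins. The data frame `df` is invented for the example.

```r
library(ggplot2)

df <- data.frame(x = c(1, 2, 2, 3, 3, 3))  # hypothetical data

unpadded <- layer_data(
  ggplot(df, aes(x)) + stat_bin(binwidth = 1, pad = FALSE)
)
padded <- layer_data(
  ggplot(df, aes(x)) + stat_bin(binwidth = 1, pad = TRUE)
)

# Padding adds one empty bin at each end of the range
extra_bins <- nrow(padded) - nrow(unpadded)
end_counts <- padded$count[c(1, nrow(padded))]
```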

Details

stat_bin() is suitable only for continuous x data. If your x data is discrete, you probably want to use stat_count().
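For discrete x data, a minimal sketch of the stat_count() route looks like this. geom_bar() uses stat_count() by default; the `grades` data frame is hypothetical.

```r
library(ggplot2)

# Hypothetical discrete data: one row per observation
grades <- data.frame(grade = c("a", "b", "b", "c", "c", "c"))

# geom_bar() counts the cases at each distinct x value, with no binning
p <- ggplot(grades, aes(grade)) +
  geom_bar()

counts <- layer_data(p)$count
```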

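By default, the underlying computation (stat_bin()) uses 30 bins. This is not a good default, but the idea is to get you experimenting with different numbers of bins. You can also experiment with modifying the binwidth together with the center or boundary arguments. binwidth overrides bins, so you should make one change at a time. You may need to look at a few options to uncover the full story behind your data.

In addition to geom_histogram(), you can create a histogram by using scale_x_binned() with geom_bar(). By default, this method plots tick marks between each bar.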

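Orientation

This geom treats each axis differently and can therefore have two orientations. The orientation is usually easy to deduce from the combination of the given mappings and the types of positional scales in use, so ggplot2 tries by default to guess which orientation the layer should have. In rare circumstances the orientation is ambiguous and the guess may fail. In that case the orientation can be specified directly with the orientation parameter, which can be either "x" or "y". The value gives the axis that the geom should run along, "x" being the default orientation you would expect for the geom.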

Aesthetics

geom_histogram() uses the same aesthetics as geom_bar(); geom_freqpoly() uses the same aesthetics as geom_line().

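Computed variables

These are calculated by the 'stat' part of layers and can be accessed with delayed evaluation.

after_stat(count)
Number of points in the bin.
after_stat(density)
Density of points in the bin, scaled to integrate to 1.
after_stat(ncount)
Count, scaled to a maximum of 1.
after_stat(ndensity)
Density, scaled to a maximum of 1.
after_stat(width)
Width of the bins.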
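The scaling of density and ncount can be checked numerically. The sketch below uses the built-in faithful dataset and recomputes the bin width from the bin edges rather than relying on the width column.

```r
library(ggplot2)

# Put density, rather than the default count, on the y axis
p <- ggplot(faithful, aes(waiting)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5)

d <- layer_data(p)

# density integrates to 1: sum of (bar height * bar width) over all bins
total_area <- sum(d$density * (d$xmax - d$xmin))

# ncount is the count rescaled so that the tallest bin equals 1
tallest <- max(d$ncount)
```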

Dropped variables

weight: After binning, the weights of individual data points (if supplied) are no longer available.
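A small sketch of how weight feeds into the bin counts before it is dropped. The data frame `df` and its column `w` are made up for illustration.

```r
library(ggplot2)

# Hypothetical data: three observations carrying different weights
df <- data.frame(x = c(1, 2, 3), w = c(10, 20, 30))

p <- ggplot(df, aes(x, weight = w)) +
  geom_histogram(binwidth = 1)

d <- layer_data(p)

# Each bin's count is the sum of the weights that fell into it,
# so the total count equals the total weight
total <- sum(d$count)
```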

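See also

stat_count(), which counts the number of cases at each x position, without binning. It is suitable for both discrete and continuous x data, whereas stat_bin() is suitable only for continuous x data.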

Examples

ggplot(diamonds, aes(carat)) +
  geom_histogram()
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(diamonds, aes(carat)) +
  geom_histogram(binwidth = 0.01)

ggplot(diamonds, aes(carat)) +
  geom_histogram(bins = 200)

# Map values to y to flip the orientation
ggplot(diamonds, aes(y = carat)) +
  geom_histogram()
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


# For histograms with tick marks between each bin, use `geom_bar()` with
# `scale_x_binned()`.
ggplot(diamonds, aes(carat)) +
  geom_bar() +
  scale_x_binned()


# Rather than stacking histograms, it's easier to compare frequency
# polygons
ggplot(diamonds, aes(price, fill = cut)) +
  geom_histogram(binwidth = 500)

ggplot(diamonds, aes(price, colour = cut)) +
  geom_freqpoly(binwidth = 500)


# To make it easier to compare distributions with very different counts,
# put density on the y axis instead of the default count
ggplot(diamonds, aes(price, after_stat(density), colour = cut)) +
  geom_freqpoly(binwidth = 500)


if (require("ggplot2movies")) {
# Often we don't want the height of the bar to represent the
# count of observations, but the sum of some other variable.
# For example, the following plot shows the number of movies
# in each rating.
m <- ggplot(movies, aes(rating))
m + geom_histogram(binwidth = 0.1)

# If, however, we want to see the number of votes cast in each
# category, we need to weight by the votes variable
m +
  geom_histogram(aes(weight = votes), binwidth = 0.1) +
  ylab("votes")

# For transformed scales, binwidth applies to the transformed data.
# The bins have constant width on the transformed scale.
m +
  geom_histogram() +
  scale_x_log10()
m +
  geom_histogram(binwidth = 0.05) +
  scale_x_log10()

# For transformed coordinate systems, the binwidth applies to the
# raw data. The bins have constant width on the original scale.

# Using log scales does not work here, because the first
# bar is anchored at zero, and so when transformed becomes negative
# infinity. This is not a problem when transforming the scales, because
# no observations have 0 ratings.
m +
  geom_histogram(boundary = 0) +
  coord_trans(x = "log10")
# Use boundary = 0, to make sure we don't take sqrt of negative values
m +
  geom_histogram(boundary = 0) +
  coord_trans(x = "sqrt")

# You can also transform the y axis.  Remember that the base of the bars
# has value 0, so log transformations are not appropriate
m <- ggplot(movies, aes(x = rating))
m +
  geom_histogram(binwidth = 0.5) +
  scale_y_sqrt()
}


# You can specify a function for calculating binwidth, which is
# particularly useful when faceting along variables with
# different ranges because the function will be called once per facet
ggplot(economics_long, aes(value)) +
  facet_wrap(~variable, scales = 'free_x') +
  geom_histogram(binwidth = function(x) 2 * IQR(x) / (length(x)^(1/3)))

Source code: R/geom-freqpoly.R, R/geom-histogram.R, R/stat-bin.R

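Note: This article was selected and compiled by 純淨天空 from the original English work "Histograms and frequency polygons" by Hadley Wickham and others. Unless otherwise stated, copyright in the original code belongs to the original authors. Do not reprint or copy this translation without permission or authorization.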