Python pandas.cut: usage and code examples

Usage: pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True)

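Bin values into discrete intervals.

Use cut when you need to segment and sort data values into bins. This function is also useful for going from a continuous variable to a categorical variable. For example, cut could convert ages to groups of age ranges. It supports binning into a given number of equal-width bins, or into a pre-specified array of bins.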
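
As a minimal sketch of the age example mentioned above (the ages, bin edges, and label names are made up for illustration and are not part of the pandas documentation):

```python
import pandas as pd

# Hypothetical ages; the edges and labels below are illustrative choices.
ages = pd.Series([3, 17, 25, 42, 68, 80])
age_edges = [0, 12, 19, 64, 120]  # four bins need five edges
age_labels = ["child", "teen", "adult", "senior"]

# With the default right=True the bins are (0, 12], (12, 19], (19, 64], (64, 120].
age_groups = pd.cut(ages, bins=age_edges, labels=age_labels)
print(age_groups.tolist())
```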

Parameters:

x: array-like

The input array to be binned. Must be 1-dimensional.

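bins: int, sequence of scalars, or IntervalIndex

The criteria to bin by.

int: Defines the number of equal-width bins in the range of x. The range of x is extended by 0.1% on each side to include the minimum and maximum values of x.
sequence of scalars: Defines the bin edges, allowing for non-uniform width. No extension of the range of x is done.
IntervalIndex: Defines the exact bins to be used. Note that an IntervalIndex for bins must be non-overlapping.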
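
The three forms of bins can be compared side by side. This is a small sketch; the input values and the edges 0, 2, 6, 10 are arbitrary choices for illustration:

```python
import numpy as np
import pandas as pd

x = np.array([1, 7, 5, 4, 6, 3])

# int: three equal-width bins over the range of x; retbins=True exposes the edges.
_, int_edges = pd.cut(x, bins=3, retbins=True)

# Sequence of scalars: the edges are used as given, so widths may differ.
by_edges = pd.cut(x, bins=[0, 2, 6, 10])

# IntervalIndex: the exact, non-overlapping intervals to use.
intervals = pd.IntervalIndex.from_breaks([0, 2, 6, 10])
by_intervals = pd.cut(x, bins=intervals)

print(int_edges)
print(by_edges.codes)
print(by_intervals.codes)
```

In this sketch the explicit edges and the IntervalIndex describe the same right-closed intervals, so both calls assign every value to the same bin.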

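right: bool, default True

Indicates whether bins includes the rightmost edge or not. If right == True (the default), then the bins [1, 2, 3, 4] indicate (1,2], (2,3], (3,4]. This argument is ignored when bins is an IntervalIndex.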
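
A short sketch of the effect of right, using the edges [1, 2, 3, 4] from the description and labels=False so that the result is the integer bin index (NaN marks a value that falls in no bin):

```python
import numpy as np
import pandas as pd

values = [1, 2, 3, 4]
edges = [1, 2, 3, 4]

# right=True (default): (1, 2], (2, 3], (3, 4]; the value 1 falls outside.
right_closed = pd.cut(values, bins=edges, labels=False)

# right=False: [1, 2), [2, 3), [3, 4); the value 4 falls outside.
left_closed = pd.cut(values, bins=edges, right=False, labels=False)

print(right_closed)
print(left_closed)
```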

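labels: array or False, default None

Specifies the labels for the returned bins. Must be the same length as the resulting bins. If False, returns only integer indicators of the bins. This affects the type of the output container (see below). This argument is ignored when bins is an IntervalIndex. If True, raises an error. When ordered=False, labels must be provided.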
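
The length requirement and the labels=True case can be checked with a small sketch. The data, edges, and label names are illustrative, and the exception type (ValueError) is what recent pandas versions raise; treat it as an assumption for older releases:

```python
import pandas as pd

data = [1, 5, 9]
edges = [0, 3, 6, 10]  # three bins

# One label per bin works.
ok = pd.cut(data, bins=edges, labels=["low", "mid", "high"])

# Too few labels, or labels=True, is rejected.
errors = []
for bad_labels in (["low", "high"], True):
    try:
        pd.cut(data, bins=edges, labels=bad_labels)
    except ValueError as exc:
        errors.append(str(exc))

print(list(ok))
print(errors)
```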

retbins: bool, default False

Whether to return the bins or not. Useful when bins is provided as a scalar.

precision: int, default 3

The precision at which to store and display the bin labels.
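
A rough sketch of how precision shows up in the interval labels. The data and the edges at thirds are illustrative; in this example the bin assignment is the same for both calls and only the rounded edges in the labels differ:

```python
import pandas as pd

data = [0.1, 0.5, 0.9]
edges = [0, 1 / 3, 2 / 3, 1]

# Same edges, different rounding of the edges in the interval labels.
default_labels = pd.cut(data, bins=edges)  # precision=3
coarse_labels = pd.cut(data, bins=edges, precision=2)

print(default_labels.categories)
print(coarse_labels.categories)
```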

include_lowest: bool, default False

Whether the first interval should be left-inclusive or not.
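
A minimal sketch of include_lowest, with values chosen so that one of them sits exactly on the lowest edge (labels=False returns the bin index, and NaN marks a value that falls in no bin):

```python
import numpy as np
import pandas as pd

values = [0, 5, 10]
edges = [0, 5, 10]

# Default: the first interval is (0, 5], so the value 0 is not assigned to any bin.
open_left = pd.cut(values, bins=edges, labels=False)

# include_lowest=True: the first interval also contains its left edge.
closed_left = pd.cut(values, bins=edges, labels=False, include_lowest=True)

print(open_left)
print(closed_left)
```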

duplicates: {default 'raise', 'drop'}, optional

If bin edges are not unique, raise ValueError or drop the non-unique edges.
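
The two settings can be contrasted with a short sketch; the data and the repeated edge are illustrative:

```python
import pandas as pd

data = [2, 4, 6, 8]
dup_edges = [0, 5, 10, 10]  # the edge 10 is repeated

# Default duplicates='raise': repeated edges are rejected with a ValueError.
try:
    pd.cut(data, bins=dup_edges)
    raised = False
except ValueError:
    raised = True

# duplicates='drop': the repeated edge is removed before binning.
_, kept_edges = pd.cut(data, bins=dup_edges, duplicates="drop", retbins=True)

print(raised)
print(kept_edges)
```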

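ordered: bool, default True

Whether the labels are ordered or not. Applies to the returned types Categorical and Series (with Categorical dtype). If True, the resulting categorical will be ordered. If False, the resulting categorical will be unordered (labels must be provided).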

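Returns:

out: Categorical, Series, or ndarray

An array-like object representing the respective bin for each value of x. The type depends on the value of labels.

None (default): returns a Series for Series x or a Categorical for all other inputs. The values stored within are Interval dtype.
sequence of scalars: returns a Series for Series x or a Categorical for all other inputs. The values stored within are whatever the type in the sequence is.
False: returns an ndarray of integers.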
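
The container types listed above can be inspected directly. This is a small sketch that reuses the array from the examples further down; the variable names are illustrative:

```python
import numpy as np
import pandas as pd

arr = np.array([1, 7, 5, 4, 6, 3])
ser = pd.Series(arr)

from_array = pd.cut(arr, 3)              # Categorical holding intervals
from_series = pd.cut(ser, 3)             # Series with category dtype
as_codes = pd.cut(arr, 3, labels=False)  # ndarray of integer bin indices

print(type(from_array).__name__)
print(type(from_series).__name__, from_series.dtype.name)
print(type(as_codes).__name__, as_codes)
```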

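bins: numpy.ndarray or IntervalIndex

The computed or specified bins. Only returned when retbins=True. For scalar or sequence bins, this is an ndarray with the computed bins. If duplicates='drop' is set, non-unique bins are dropped from bins. For an IntervalIndex bins, this is equal to bins.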

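Notes:

Any NA values in the input will be NA in the result. Out-of-bounds values will be NA in the resulting Series or Categorical object.

For more examples, see the user guide.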
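
A brief sketch of both cases, with made-up values: one missing input, and two values that lie outside the edges:

```python
import numpy as np
import pandas as pd

values = pd.Series([np.nan, -1, 2, 5, 11])

# The edges cover (0, 10]; NaN stays NaN, and -1 and 11 fall outside every bin.
binned = pd.cut(values, bins=[0, 5, 10])

print(binned)
print(binned.isna().tolist())
```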

Examples:

Discretize into three equal-sized bins.

>>> import numpy as np
>>> import pandas as pd
>>> pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3)
[(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], ...
Categories (3, interval[float64, right]): [(0.994, 3.0] < (3.0, 5.0] ...

>>> pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3, retbins=True)
([(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], ...
Categories (3, interval[float64, right]): [(0.994, 3.0] < (3.0, 5.0] ...
array([0.994, 3.   , 5.   , 7.   ]))

Discover the same bins, but assign them specific labels. Notice that the categories of the returned Categorical are the labels, and that they are ordered.

>>> pd.cut(np.array([1, 7, 5, 4, 6, 3]),
...        3, labels=["bad", "medium", "good"])
['bad', 'good', 'medium', 'medium', 'good', 'bad']
Categories (3, object): ['bad' < 'medium' < 'good']

ordered=False will result in unordered categories when labels are passed. This parameter can be used to allow non-unique labels:

>>> pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3,
...        labels=["B", "A", "B"], ordered=False)
['B', 'B', 'A', 'A', 'B', 'B']
Categories (2, object): ['A', 'B']

labels=False implies you just want the bin indices back.

>>> pd.cut([0, 1, 1, 2], bins=4, labels=False)
array([0, 1, 1, 3])

Passing a Series as an input returns a Series with categorical dtype:

>>> s = pd.Series(np.array([2, 4, 6, 8, 10]),
...               index=['a', 'b', 'c', 'd', 'e'])
>>> pd.cut(s, 3)
a    (1.992, 4.667]
b    (1.992, 4.667]
c    (4.667, 7.333]
d     (7.333, 10.0]
e     (7.333, 10.0]
dtype: category
Categories (3, interval[float64, right]): [(1.992, 4.667] < (4.667, ...

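Passing a Series as an input together with labels=False returns a Series of mapped values. It is used to map values numerically to intervals based on bins.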

>>> s = pd.Series(np.array([2, 4, 6, 8, 10]),
...               index=['a', 'b', 'c', 'd', 'e'])
>>> pd.cut(s, [0, 2, 4, 6, 8, 10], labels=False, retbins=True, right=False)
(a    1.0
 b    2.0
 c    3.0
 d    4.0
 e    NaN
 dtype: float64,
 array([ 0,  2,  4,  6,  8, 10]))

Use duplicates='drop' when the bin edges are not unique:

>>> pd.cut(s, [0, 2, 4, 6, 10, 10], labels=False, retbins=True,
...        right=False, duplicates='drop')
(a    1.0
 b    2.0
 c    3.0
 d    3.0
 e    NaN
 dtype: float64,
 array([ 0,  2,  4,  6, 10]))

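Passing an IntervalIndex for bins results in exactly those categories. Notice that values not covered by the IntervalIndex are set to NaN. 0 is to the left of the first bin (which is closed on the right), and 1.5 falls between two bins.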

>>> bins = pd.IntervalIndex.from_tuples([(0, 1), (2, 3), (4, 5)])
>>> pd.cut([0, 0.5, 1.5, 2.5, 4.5], bins)
[NaN, (0.0, 1.0], NaN, (2.0, 3.0], (4.0, 5.0]]
Categories (3, interval[int64, right]): [(0, 1] < (2, 3] < (4, 5]]

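Note: This article was selected and compiled by 純淨天空 from the original English documentation for pandas.cut at pandas.pydata.org. Unless otherwise stated, copyright of the original code belongs to the original authors; do not reproduce or copy this translation without permission or authorization.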
