Python sklearn StratifiedGroupKFold: usage and code examples

This article briefly introduces the usage of sklearn.model_selection.StratifiedGroupKFold in Python.

Usage: class sklearn.model_selection.StratifiedGroupKFold(n_splits=5, shuffle=False, random_state=None)

Stratified K-Folds iterator variant with non-overlapping groups.

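This cross-validation object is a variation of StratifiedKFold that attempts to return stratified folds with non-overlapping groups. The folds are made by preserving the percentage of samples for each class.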

The same group will not appear in two different folds (the number of distinct groups has to be at least equal to the number of folds).

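The difference between GroupKFold and StratifiedGroupKFold is that the former attempts to create balanced folds such that the number of distinct groups is approximately the same in each fold, whereas StratifiedGroupKFold attempts to create folds which preserve the percentage of samples for each class as much as possible, given the constraint of non-overlapping groups between splits.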

Read more in the User Guide.

Parameters:

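n_splits: int, default=5: Number of folds. Must be at least 2.
shuffle: bool, default=False: Whether to shuffle each class's samples before splitting into batches. Note that the samples within each split will not be shuffled. This implementation can only shuffle groups that have approximately the same y distribution; no global shuffle will be performed.
random_state: int or RandomState instance, default=None: When shuffle is True, random_state affects the ordering of the indices, which controls the randomness of each fold for each class. Otherwise, leave random_state as None. Pass an int for reproducible output across multiple function calls. See the Glossary.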

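Notes:

The implementation is designed to:

Mimic the behavior of StratifiedKFold as much as possible for trivial groups (e.g. when each group contains only one sample).
Be invariant to class labels: relabelling y = ["Happy", "Sad"] to y = [1, 0] should not change the indices generated.
Stratify based on samples as much as possible while keeping the non-overlapping groups constraint. That means that in some cases, when there is a small number of groups containing a large number of samples, stratification will not be possible and the behavior will be close to GroupKFold.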

Examples:

>>> import numpy as np
>>> from sklearn.model_selection import StratifiedGroupKFold
>>> X = np.ones((17, 2))
>>> y = np.array([0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
>>> groups = np.array([1, 1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 6, 7, 8, 8])
>>> cv = StratifiedGroupKFold(n_splits=3)
>>> for train_idxs, test_idxs in cv.split(X, y, groups):
...     print("TRAIN:", groups[train_idxs])
...     print("      ", y[train_idxs])
...     print(" TEST:", groups[test_idxs])
...     print("      ", y[test_idxs])
TRAIN: [1 1 2 2 4 5 5 5 5 8 8]
       [0 0 1 1 1 0 0 0 0 0 0]
 TEST: [3 3 3 6 6 7]
       [1 1 1 0 0 0]
TRAIN: [3 3 3 4 5 5 5 5 6 6 7]
       [1 1 1 1 0 0 0 0 0 0 0]
 TEST: [1 1 2 2 8 8]
       [0 0 1 1 0 0]
TRAIN: [1 1 2 2 3 3 3 6 6 7 8 8]
       [0 0 1 1 1 1 1 0 0 0 0 0]
 TEST: [4 5 5 5 5]
       [1 0 0 0 0]


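Note: this article was selected and compiled by 纯净天空 from the original English work sklearn.model_selection.StratifiedGroupKFold on scikit-learn.org. Unless otherwise stated, copyright of the original code belongs to the original authors. Please do not reproduce or copy this translation without permission or authorization.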