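A family of functions for lumping together levels that meet some criteria.
- fct_lump_min(): lumps levels that appear fewer than min times.
- fct_lump_prop(): lumps levels that appear fewer than (or equal to) prop * n times.
- fct_lump_n(): lumps all levels except for the n most frequent (or the n least frequent if n < 0).
- fct_lump_lowfreq(): lumps together the least frequent levels, ensuring that "other" is still the smallest level.
fct_lump() exists primarily for historical reasons, as it automatically picks between these different methods depending on its arguments. We no longer recommend that you use it.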
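To make the relationship between fct_lump() and the explicit functions concrete, the sketch below compares them on a small made-up factor. The data and variable names are illustrative assumptions, not part of the original page. Going by the description above, each comparison should agree with the corresponding explicit function.

```r
library(forcats)

# Toy data (assumed for illustration): one dominant level, one mid-sized
# level, and three rare ones.
f <- factor(rep(c("a", "b", "c", "d", "e"), times = c(20, 10, 3, 2, 1)))

# Only n supplied: fct_lump() behaves like fct_lump_n()
same_n <- identical(fct_lump(f, n = 2), fct_lump_n(f, n = 2))

# Only prop supplied: fct_lump() behaves like fct_lump_prop()
same_prop <- identical(fct_lump(f, prop = 0.1), fct_lump_prop(f, prop = 0.1))

# Neither supplied: fct_lump() behaves like fct_lump_lowfreq()
same_lowfreq <- identical(fct_lump(f), fct_lump_lowfreq(f))

c(same_n, same_prop, same_lowfreq)
```

Calling the explicit function states the lumping rule in the code itself, which is presumably why the page recommends it over fct_lump().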
Usage
fct_lump(
  f,
  n,
  prop,
  w = NULL,
  other_level = "Other",
  ties.method = c("min", "average", "first", "last", "random", "max")
)
fct_lump_min(f, min, w = NULL, other_level = "Other")
fct_lump_prop(f, prop, w = NULL, other_level = "Other")
fct_lump_n(
  f,
  n,
  w = NULL,
  other_level = "Other",
  ties.method = c("min", "average", "first", "last", "random", "max")
)
fct_lump_lowfreq(f, w = NULL, other_level = "Other")
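Arguments
- f: A factor (or character vector).
- n: Positive n preserves the most common n values. Negative n preserves the least common -n values. If there are ties, you will get at least abs(n) values.
- prop: Positive prop lumps values which do not appear at least prop of the time. Negative prop lumps values that do not appear at most -prop of the time.
- w: An optional numeric vector giving weights for the frequency of each value (not level) in f.
- other_level: Value of the level used for "other" values. Always placed at the end of the levels.
- ties.method: A character string specifying how ties are treated. See rank() for details.
- min: Preserve levels that appear at least min times.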
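The interaction between n, ties.method, w and other_level is easier to see on a tiny input. The following is a minimal sketch on a made-up character vector. The data, weights and the "Rare" label are assumptions chosen for illustration.

```r
library(forcats)

# Counts: a = 3, b = 2, c = 2, d = 1, e = 1 (b and c are tied for second place)
f <- c("a", "a", "a", "b", "b", "c", "c", "d", "e")

# With the default ties.method = "min", both tied levels are kept,
# so n = 2 yields three preserved levels.
by_min <- fct_lump_n(f, n = 2)
levels(by_min)

# ties.method = "first" breaks the tie by position, keeping exactly two.
by_first <- fct_lump_n(f, n = 2, ties.method = "first")
levels(by_first)

# w has one entry per value (not per level); here the single "d"
# observation is given weight 10, making it the heaviest level.
w <- c(1, 1, 1, 1, 1, 1, 1, 10, 1)
weighted <- fct_lump_n(f, n = 1, w = w)
levels(weighted)

# other_level renames the lumped level, which is still placed last.
renamed <- fct_lump_min(f, min = 2, other_level = "Rare")
levels(renamed)
```

Because w is matched to values rather than levels, it must have the same length as f.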
See also
fct_other() to convert specified levels to other.
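Where the fct_lump_*() functions choose levels by frequency, fct_other() lets you name them yourself. A short sketch follows, with a made-up factor that is assumed purely for illustration.

```r
library(forcats)

f <- factor(c("a", "b", "c", "d", "a", "b"))

# Keep the named levels and replace everything else with "Other"
kept <- fct_other(f, keep = c("a", "b"))
levels(kept)

# Or name the levels to replace and keep the rest
dropped <- fct_other(f, drop = c("a", "b"))
levels(dropped)
```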
Examples
library(forcats)
x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))
x %>% table()
#> .
#> A B C D E F G H I
#> 40 10 5 27 1 1 1 1 1
x %>%
  fct_lump_n(3) %>%
  table()
#> .
#> A B D Other
#> 40 10 27 10
x %>%
  fct_lump_prop(0.10) %>%
  table()
#> .
#> A B D Other
#> 40 10 27 10
x %>%
  fct_lump_min(5) %>%
  table()
#> .
#> A B C D Other
#> 40 10 5 27 5
x %>%
  fct_lump_lowfreq() %>%
  table()
#> .
#> A D Other
#> 40 27 20
x <- factor(letters[rpois(100, 5)])
x
#> [1] b e d f f g e e a e c e b g c d d b i c d d f b d c g g h e d g b i
#> [35] c h j d d f g c d c h h g d d c b a e e e e g a f c b d b f c d g i
#> [69] b f d d d b e e c a e h e k d g e g d g h d f g a e i g k g l e
#> Levels: a b c d e f g h i j k l
table(x)
#> x
#> a b c d e f g h i j k l
#> 5 10 11 20 17 8 15 6 4 1 2 1
table(fct_lump_lowfreq(x))
#>
#> a b c d e f g h i j k l
#> 5 10 11 20 17 8 15 6 4 1 2 1
# Use positive values to collapse the rarest
fct_lump_n(x, n = 3)
#> [1] Other e d Other Other g e e Other e Other
#> [12] e Other g Other d d Other Other Other d d
#> [23] Other Other d Other g g Other e d g Other
#> [34] Other Other Other Other d d Other g Other d Other
#> [45] Other Other g d d Other Other Other e e e
#> [56] e g Other Other Other Other d Other Other Other d
#> [67] g Other Other Other d d d Other e e Other
#> [78] Other e Other e Other d g e g d g
#> [89] Other d Other g Other e Other g Other g Other
#> [100] e
#> Levels: d e g Other
fct_lump_prop(x, prop = 0.1)
#> [1] Other e d Other Other g e e Other e c
#> [12] e Other g c d d Other Other c d d
#> [23] Other Other d c g g Other e d g Other
#> [34] Other c Other Other d d Other g c d c
#> [45] Other Other g d d c Other Other e e e
#> [56] e g Other Other c Other d Other Other c d
#> [67] g Other Other Other d d d Other e e c
#> [78] Other e Other e Other d g e g d g
#> [89] Other d Other g Other e Other g Other g Other
#> [100] e
#> Levels: c d e g Other
# Use negative values to collapse the most common
fct_lump_n(x, n = -3)
#> [1] Other Other Other Other Other Other Other Other Other Other Other
#> [12] Other Other Other Other Other Other Other Other Other Other Other
#> [23] Other Other Other Other Other Other Other Other Other Other Other
#> [34] Other Other Other j Other Other Other Other Other Other Other
#> [45] Other Other Other Other Other Other Other Other Other Other Other
#> [56] Other Other Other Other Other Other Other Other Other Other Other
#> [67] Other Other Other Other Other Other Other Other Other Other Other
#> [78] Other Other Other Other k Other Other Other Other Other Other
#> [89] Other Other Other Other Other Other Other Other k Other l
#> [100] Other
#> Levels: j k l Other
fct_lump_prop(x, prop = -0.1)
#> [1] b Other Other f f Other Other Other a Other Other
#> [12] Other b Other Other Other Other b i Other Other Other
#> [23] f b Other Other Other Other h Other Other Other b
#> [34] i Other h j Other Other f Other Other Other Other
#> [45] h h Other Other Other Other b a Other Other Other
#> [56] Other Other a f Other b Other b f Other Other
#> [67] Other i b f Other Other Other b Other Other Other
#> [78] a Other h Other k Other Other Other Other Other Other
#> [89] h Other f Other a Other i Other k Other l
#> [100] Other
#> Levels: a b f h i j k l Other
# Use weighted frequencies
w <- c(rep(2, 50), rep(1, 50))
fct_lump_n(x, n = 5, w = w)
#> [1] b e d Other Other g e e Other e c
#> [12] e b g c d d b Other c d d
#> [23] Other b d c g g Other e d g b
#> [34] Other c Other Other d d Other g c d c
#> [45] Other Other g d d c b Other e e e
#> [56] e g Other Other c b d b Other c d
#> [67] g Other b Other d d d b e e c
#> [78] Other e Other e Other d g e g d g
#> [89] Other d Other g Other e Other g Other g Other
#> [100] e
#> Levels: b c d e g Other
# Use ties.method to control how tied factors are collapsed
fct_lump_n(x, n = 6)
#> [1] b e d f f g e e Other e c
#> [12] e b g c d d b Other c d d
#> [23] f b d c g g Other e d g b
#> [34] Other c Other Other d d f g c d c
#> [45] Other Other g d d c b Other e e e
#> [56] e g Other f c b d b f c d
#> [67] g Other b f d d d b e e c
#> [78] Other e Other e Other d g e g d g
#> [89] Other d f g Other e Other g Other g Other
#> [100] e
#> Levels: b c d e f g Other
fct_lump_n(x, n = 6, ties.method = "max")
#> [1] b e d f f g e e Other e c
#> [12] e b g c d d b Other c d d
#> [23] f b d c g g Other e d g b
#> [34] Other c Other Other d d f g c d c
#> [45] Other Other g d d c b Other e e e
#> [56] e g Other f c b d b f c d
#> [67] g Other b f d d d b e e c
#> [78] Other e Other e Other d g e g d g
#> [89] Other d f g Other e Other g Other g Other
#> [100] e
#> Levels: b c d e f g Other
# Use fct_lump_min() to lump together all levels with fewer than `min` values
table(fct_lump_min(x, min = 10))
#>
#> b c d e g Other
#> 10 11 20 17 15 27
table(fct_lump_min(x, min = 15))
#>
#> d e g Other
#> 20 17 15 48
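In the examples above, fct_lump_lowfreq(x) leaves the second x unchanged, which can look surprising. The sketch below uses two made-up factors (assumed for illustration) to show the rule stated in the description: levels are lumped only as long as the resulting "Other" stays smaller than every level that is kept.

```r
library(forcats)

# Skewed counts: a = 10, b = 5, c = 2, d = 1.
# Lumping b, c and d gives an "Other" of 8, still smaller than a.
skewed <- factor(rep(c("a", "b", "c", "d"), times = c(10, 5, 2, 1)))
lumped <- fct_lump_lowfreq(skewed)
table(lumped)

# Balanced counts: a = b = c = 3.
# Lumping two or more levels would make "Other" at least as large as a
# kept level, so nothing is lumped.
balanced <- factor(rep(c("a", "b", "c"), times = c(3, 3, 3)))
unchanged <- fct_lump_lowfreq(balanced)
levels(unchanged)
```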
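Related usage
- R forcats fct_relevel: Reorder factor levels by hand
- R forcats fct_anon: Anonymise factor levels
- R forcats fct_inorder: Reorder factor levels by first appearance, frequency, or numeric order
- R forcats fct_rev: Reverse the order of factor levels
- R forcats fct_match: Test for the presence of levels in a factor
- R forcats fct_relabel: Relabel factor levels with a function, collapsing as necessary
- R forcats fct_drop: Drop unused levels
- R forcats fct_c: Concatenate factors, combining levels
- R forcats fct_collapse: Collapse factor levels into manually defined groups
- R forcats fct_shuffle: Randomly permute factor levels
- R forcats fct_cross: Combine levels from two or more factors to create a new factor
- R forcats fct_other: Manually replace levels with "other"
- R forcats fct_recode: Change factor levels by hand
- R forcats fct_na_value_to_level: Convert between NA values and NA levels
- R forcats fct_unique: Unique values of a factor, as a factor
- R forcats fct_shift: Shift factor levels to the left or right, wrapping around at the end
- R forcats fct_unify: Unify the levels in a list of factors
- R forcats fct_count: Count entries in a factor
- R forcats fct_expand: Add additional levels to a factor
- R forcats fct_reorder: Reorder factor levels by sorting along another variable
- R forcats fct: Create a factor
- R forcats as_factor: Convert input to a factor
- R forcats lvls_union: Find all levels in a list of factors
- R forcats lvls: Low-level functions for manipulating levels
- R forcats gss_cat: A sample of categorical variables from the General Social Survey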
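Note: this article was selected and compiled by 纯净天空 from the original English work by Hadley Wickham et al., Lump uncommon factor together levels into "other". Unless otherwise stated, copyright of the original code belongs to the original authors. Do not reproduce or copy this translation without permission or authorization.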