R tidyr separate_wider_delim: Split a string into columns

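Each of these functions takes a string column and splits it into multiple new columns:

separate_wider_delim() splits by delimiter.
separate_wider_position() splits at fixed widths.
separate_wider_regex() splits with regular expression matches.

These functions are equivalent to separate() and extract(), but use stringr as the underlying string manipulation engine, and their interfaces reflect what we learned from unnest_wider() and unnest_longer().

Usage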

separate_wider_delim(
  data,
  cols,
  delim,
  ...,
  names = NULL,
  names_sep = NULL,
  names_repair = "check_unique",
  too_few = c("error", "debug", "align_start", "align_end"),
  too_many = c("error", "debug", "drop", "merge"),
  cols_remove = TRUE
)

separate_wider_position(
  data,
  cols,
  widths,
  ...,
  names_sep = NULL,
  names_repair = "check_unique",
  too_few = c("error", "debug", "align_start"),
  too_many = c("error", "debug", "drop"),
  cols_remove = TRUE
)

separate_wider_regex(
  data,
  cols,
  patterns,
  ...,
  names_sep = NULL,
  names_repair = "check_unique",
  too_few = c("error", "debug", "align_start"),
  cols_remove = TRUE
)

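Arguments

data

A data frame.

cols

<tidy-select> Columns to separate.

delim

For separate_wider_delim(), a string giving the delimiter between values. By default, it is interpreted as a fixed string; use stringr::regex() and friends to split in other ways.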
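
As a minimal sketch of the regex option, the snippet below splits on either of two delimiters. The data frame and its column names are made up for illustration, and it assumes tidyr 1.3.0 or later with stringr installed.

```r
library(tidyr)

# Hypothetical data: the delimiter is "-" in one row and "_" in the other
df <- tibble(x = c("a-1", "b_2"))

# Wrapping the pattern in stringr::regex() makes delim a regular expression
# instead of a fixed string
out <- separate_wider_delim(
  df, x,
  delim = stringr::regex("[-_]"),
  names = c("letter", "number")
)
out
```

A plain string such as "-" would be matched literally, so the second row would then be reported as too short.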

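...

These dots are for future extensions and must be empty.

names

For separate_wider_delim(), a character vector of output column names. Use NA for any component that you don't want to appear in the output; the number of non-NA elements determines the number of new columns in the result.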
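
To illustrate dropping a component with NA, here is a small sketch that keeps the year and day of a date string and discards the month. The data and column names are hypothetical.

```r
library(tidyr)

# Hypothetical data: ISO-style dates stored as strings
df <- tibble(x = c("2023-01-15", "2024-12-31"))

# Three pieces are expected, but the middle one is named NA,
# so only `year` and `day` appear in the result
out <- separate_wider_delim(
  df, x,
  delim = "-",
  names = c("year", NA, "day")
)
out
```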

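names_sep

If supplied, output names will be composed of the input column name, followed by the separator, followed by the new column name. Required when cols selects multiple columns.

For separate_wider_delim() you can specify names_sep instead of names, in which case the names will be generated from the source column name, names_sep, and a numeric suffix.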
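
The following sketch shows names_sep used on its own while splitting two columns at once. The data frame is invented for the example.

```r
library(tidyr)

# Hypothetical data with two delimited columns
df <- tibble(a = c("1-2", "3-4"), b = c("x-y", "z-w"))

# With names_sep and no names, each new column is named
# <source column><names_sep><piece number>
out <- separate_wider_delim(df, c(a, b), delim = "-", names_sep = "_")
names(out)
```

Here the generated names should be a_1, a_2, b_1 and b_2, which keeps the pieces of each source column distinguishable.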

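names_repair

Used to check that the output data frame has valid names. Must be one of the following options:

"minimal": no name repair or checks, beyond basic existence,
"unique": make sure names are unique and not empty,
"check_unique": (the default), no name repair, but check they are unique,
"universal": make the names unique and syntactic
a function: apply custom name repair.
tidyr_legacy: use the name repair from tidyr 0.8.
a formula: a purrr-style anonymous function (see rlang::as_function())

See vctrs::vec_as_names() for more details on these terms and the strategies used to enforce them.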
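
As a hedged illustration of when name repair matters, the sketch below creates a new column whose name collides with an existing one. The data is hypothetical, and the exact repaired names depend on vctrs, so the code only inspects them rather than assuming a particular spelling.

```r
library(tidyr)

# Hypothetical data: `id` already exists, and one new column is also called `id`
df <- tibble(id = 1:2, x = c("a-1", "b-2"))

# With the default "check_unique" this call is expected to fail because
# `id` would appear twice; "unique" deduplicates the names instead
out <- separate_wider_delim(
  df, x,
  delim = "-",
  names = c("id", "value"),
  names_repair = "unique"
)
names(out)
```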

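too_few

What should happen if a value separates into too few pieces?

"error", the default, will throw an error.
"debug" adds additional columns to the output to help you locate and resolve the underlying problem. This option is intended to help you debug the issue and address it, and should not generally remain in your final code.
"align_start" aligns starts of short matches, adding NA on the end to pad to the correct length.
"align_end" (separate_wider_delim() only) aligns the ends of short matches, adding NA at the start to pad to the correct length.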

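The difference between the two alignment options can be sketched as follows. The data and the column names p1 to p3 are made up for the example.

```r
library(tidyr)

# Hypothetical data: between one and three pieces per value
df <- tibble(x = c("c", "b-c", "a-b-c"))
nms <- c("p1", "p2", "p3")

# Short values fill columns from the left; missing pieces become NA on the right
start <- separate_wider_delim(df, x, delim = "-", names = nms,
                              too_few = "align_start")

# Short values fill columns from the right; missing pieces become NA on the left
end <- separate_wider_delim(df, x, delim = "-", names = nms,
                            too_few = "align_end")

start
end
```

With "align_end", the last piece of every value lands in p3, which is convenient when the trailing component is the one that is always present.
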
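too_many

What should happen if a value separates into too many pieces?

"error", the default, will throw an error.
"debug" will add additional columns to the output to help you locate and resolve the underlying problem.
"drop" will silently drop any extra pieces.
"merge" (separate_wider_delim() only) will merge together any additional pieces.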

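A short sketch contrasting "drop" and "merge" on the same hypothetical input:

```r
library(tidyr)

# Hypothetical data: the second value has more pieces than requested
df <- tibble(x = c("a-b", "a-b-c-d"))
nms <- c("first", "second")

# "drop" keeps the first two pieces and discards the rest
dropped <- separate_wider_delim(df, x, delim = "-", names = nms,
                                too_many = "drop")

# "merge" keeps the extra pieces by leaving them unsplit in the last column
merged <- separate_wider_delim(df, x, delim = "-", names = nms,
                               too_many = "merge")

dropped$second
merged$second
```
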
cols_remove

Should the input cols be removed from the output? Always FALSE if too_few or too_many are set to "debug".
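
For example, keeping the source column alongside the new ones might look like this. The data is borrowed from the style of the examples further down; the object name out is just for illustration.

```r
library(tidyr)

df <- tibble(x = c("m-123", "f-455"))

# cols_remove = FALSE keeps the original `x` next to the new columns
out <- separate_wider_delim(
  df, x,
  delim = "-",
  names = c("gender", "unit"),
  cols_remove = FALSE
)
names(out)
```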

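widths

A named numeric vector where the names become column names, and the values specify the column width. Unnamed components will match, but not be included in the output.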
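
A minimal sketch of a widths vector, using a hypothetical fixed-width date string. The unnamed entries of width 1 consume the hyphens without producing columns.

```r
library(tidyr)

# Hypothetical data: every value is exactly 10 characters wide
df <- tibble(x = c("2023-01-15", "2024-12-31"))

# 4 characters for year, skip 1, 2 for month, skip 1, 2 for day
out <- separate_wider_position(
  df, x,
  widths = c(year = 4, 1, month = 2, 1, day = 2)
)
out
```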

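patterns

A named character vector where the names become column names and the values are regular expressions that match the contents of the vector. Unnamed components will match, but not be included in the output.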
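
The sketch below shows how named and unnamed patterns combine. The input format ("id:<digits> name:<word>") is invented for the example.

```r
library(tidyr)

# Hypothetical data in a simple key:value layout
df <- tibble(x = c("id:42 name:ann", "id:7 name:bob"))

# Unnamed patterns ("id:" and " name:") must match but are not returned;
# named patterns become the output columns
out <- separate_wider_regex(
  df, x,
  patterns = c("id:", id = "\\d+", " name:", name = "\\w+")
)
out
```

Note that the extracted values stay as character strings; converting id to a number is a separate step.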

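Value

A data frame based on data. It has the same rows, but different columns:

- The primary purpose of these functions is to create new columns from components of the string. For separate_wider_delim() the names of the new columns come from names. For separate_wider_position() the names come from the names of widths. For separate_wider_regex() the names come from the names of patterns.
- If too_few or too_many is "debug", the output will contain additional columns useful for debugging:
  - {col}_ok: a logical vector which tells you if the input was ok or not. Use it to quickly find the problematic rows.
  - {col}_remainder: any text remaining after separation.
  - {col}_pieces, {col}_width, {col}_matches: number of pieces, number of characters, and number of matches for separate_wider_delim(), separate_wider_position() and separate_wider_regex() respectively.
- If cols_remove = TRUE (the default), the input cols will be removed from the output.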
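
One way to use the {col}_ok column is to filter down to the rows that need attention. This is a sketch on made-up data; the object names debugged and problems are illustrative, and the call emits a warning announcing debug mode.

```r
library(tidyr)

df <- tibble(id = 1:4, x = c("x", "x y", "x y z", NA))

# Debug mode keeps `x` and adds x_ok, x_pieces and x_remainder
debugged <- separate_wider_delim(
  df, x,
  delim = " ",
  names = c("a", "b"),
  too_few = "debug",
  too_many = "debug"
)

# Keep only the rows whose input did not split into exactly two pieces
problems <- debugged[!debugged$x_ok, ]
problems$id
```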

Examples

library(tidyr)

df <- tibble(id = 1:3, x = c("m-123", "f-455", "f-123"))
# There are three basic ways to split up a string into pieces:
# 1. with a delimiter
df %>% separate_wider_delim(x, delim = "-", names = c("gender", "unit"))
#> # A tibble: 3 × 3
#>      id gender unit 
#>   <int> <chr>  <chr>
#> 1     1 m      123  
#> 2     2 f      455  
#> 3     3 f      123  
# 2. by length
df %>% separate_wider_position(x, c(gender = 1, 1, unit = 3))
#> # A tibble: 3 × 3
#>      id gender unit 
#>   <int> <chr>  <chr>
#> 1     1 m      123  
#> 2     2 f      455  
#> 3     3 f      123  
# 3. defining each component with a regular expression
df %>% separate_wider_regex(x, c(gender = ".", ".", unit = "\\d+"))
#> # A tibble: 3 × 3
#>      id gender unit 
#>   <int> <chr>  <chr>
#> 1     1 m      123  
#> 2     2 f      455  
#> 3     3 f      123  

# Sometimes you split on the "last" delimiter
df <- tibble(var = c("race_1", "race_2", "age_bucket_1", "age_bucket_2"))
# _delim won't help because it always splits on the first delimiter
try(df %>% separate_wider_delim(var, "_", names = c("var1", "var2")))
#> Error in separate_wider_delim(., var, "_", names = c("var1", "var2")) : 
#>   Expected 2 pieces in each element of `var`.
#> ! 2 values were too long.
#> ℹ Use `too_many = "debug"` to diagnose the problem.
#> ℹ Use `too_many = "drop"/"merge"` to silence this message.
df %>% separate_wider_delim(var, "_", names = c("var1", "var2"), too_many = "merge")
#> # A tibble: 4 × 2
#>   var1  var2    
#>   <chr> <chr>   
#> 1 race  1       
#> 2 race  2       
#> 3 age   bucket_1
#> 4 age   bucket_2
# Instead, you can use _regex
df %>% separate_wider_regex(var, c(var1 = ".*", "_", var2 = ".*"))
#> # A tibble: 4 × 2
#>   var1       var2 
#>   <chr>      <chr>
#> 1 race       1    
#> 2 race       2    
#> 3 age_bucket 1    
#> 4 age_bucket 2    
# this works because * is greedy; you can mimic the _delim behaviour with .*?
df %>% separate_wider_regex(var, c(var1 = ".*?", "_", var2 = ".*"))
#> # A tibble: 4 × 2
#>   var1  var2    
#>   <chr> <chr>   
#> 1 race  1       
#> 2 race  2       
#> 3 age   bucket_1
#> 4 age   bucket_2

# If the number of components varies, it's most natural to split into rows
df <- tibble(id = 1:4, x = c("x", "x y", "x y z", NA))
df %>% separate_longer_delim(x, delim = " ")
#> # A tibble: 7 × 2
#>      id x    
#>   <int> <chr>
#> 1     1 x    
#> 2     2 x    
#> 3     2 y    
#> 4     3 x    
#> 5     3 y    
#> 6     3 z    
#> 7     4 NA   
# But separate_wider_delim() provides some tools to deal with the problem
# The default behaviour tells you that there's a problem
try(df %>% separate_wider_delim(x, delim = " ", names = c("a", "b")))
#> Error in separate_wider_delim(., x, delim = " ", names = c("a", "b")) : 
#>   Expected 2 pieces in each element of `x`.
#> ! 1 value was too short.
#> ℹ Use `too_few = "debug"` to diagnose the problem.
#> ℹ Use `too_few = "align_start"/"align_end"` to silence this message.
#> ! 1 value was too long.
#> ℹ Use `too_many = "debug"` to diagnose the problem.
#> ℹ Use `too_many = "drop"/"merge"` to silence this message.
# You can get additional insight by using the debug options
df %>%
  separate_wider_delim(
    x,
    delim = " ",
    names = c("a", "b"),
    too_few = "debug",
    too_many = "debug"
  )
#> Warning: Debug mode activated: adding variables `x_ok`, `x_pieces`, and
#> `x_remainder`.
#> # A tibble: 4 × 7
#>      id a     b     x     x_ok  x_pieces x_remainder
#>   <int> <chr> <chr> <chr> <lgl>    <int> <chr>      
#> 1     1 x     NA    x     FALSE        1 ""         
#> 2     2 x     y     x y   TRUE         2 ""         
#> 3     3 x     y     x y z FALSE        3 " z"       
#> 4     4 NA    NA    NA    TRUE        NA  NA        

# But you can suppress the warnings
df %>%
  separate_wider_delim(
    x,
    delim = " ",
    names = c("a", "b"),
    too_few = "align_start",
    too_many = "merge"
  )
#> # A tibble: 4 × 3
#>      id a     b    
#>   <int> <chr> <chr>
#> 1     1 x     NA   
#> 2     2 x     y    
#> 3     3 x     y z  
#> 4     4 NA    NA   

# Or choose to automatically name the columns, producing as many as needed
df %>% separate_wider_delim(x, delim = " ", names_sep = "", too_few = "align_start")
#> # A tibble: 4 × 4
#>      id x1    x2    x3   
#>   <int> <chr> <chr> <chr>
#> 1     1 x     NA    NA   
#> 2     2 x     y     NA   
#> 3     3 x     y     z    
#> 4     4 NA    NA    NA

Source: R/separate-wider.R

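Note: This article was selected and compiled by 纯净天空 from the original English work "Split a string into columns" by Hadley Wickham and others. Unless otherwise stated, copyright in the original code belongs to the original authors; do not reproduce or copy this translation without permission or authorization.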