R tidyr separate_wider_delim: Split a string into columns

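Each of these functions takes a string column and splits it into multiple new columns:

separate_wider_delim() splits by delimiter.
separate_wider_position() splits at fixed widths.
separate_wider_regex() splits with regular expression matches.

These functions are equivalent to separate() and extract(), but use stringr as the underlying string manipulation engine, and their interfaces reflect what we've learned from unnest_wider() and unnest_longer().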

Usage

separate_wider_delim(
  data,
  cols,
  delim,
  ...,
  names = NULL,
  names_sep = NULL,
  names_repair = "check_unique",
  too_few = c("error", "debug", "align_start", "align_end"),
  too_many = c("error", "debug", "drop", "merge"),
  cols_remove = TRUE
)

separate_wider_position(
  data,
  cols,
  widths,
  ...,
  names_sep = NULL,
  names_repair = "check_unique",
  too_few = c("error", "debug", "align_start"),
  too_many = c("error", "debug", "drop"),
  cols_remove = TRUE
)

separate_wider_regex(
  data,
  cols,
  patterns,
  ...,
  names_sep = NULL,
  names_repair = "check_unique",
  too_few = c("error", "debug", "align_start"),
  cols_remove = TRUE
)

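Arguments

data

A data frame.

cols

<tidy-select> Columns to separate.

delim

For separate_wider_delim(), a string giving the delimiter between values. By default, it is interpreted as a fixed string; use stringr::regex() and friends to split in other ways.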
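
As a minimal sketch of a non-fixed delimiter, the example below passes stringr::regex() as delim so that a comma followed by any amount of whitespace counts as one separator. The data and the column names left and right are hypothetical, chosen only for illustration.

```r
library(tidyr)

# Hypothetical data: values separated by a comma plus optional whitespace
df_regex <- tibble(x = c("a, b", "c,d", "e,   f"))

# Treat "," followed by zero or more whitespace characters as the delimiter
out_regex <- df_regex %>%
  separate_wider_delim(
    x,
    delim = stringr::regex(",\\s*"),
    names = c("left", "right")
  )

out_regex
```

With a plain string such as delim = ",", the leading spaces would stay attached to the second piece.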

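...

These dots are for future extensions and must be empty.

names

For separate_wider_delim(), a character vector of output column names. Use NA if there are components that you don't want to appear in the output; the number of non-NA elements determines the number of new columns in the result.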
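
The following sketch illustrates the NA convention described above: the middle component is matched but dropped from the output. The data frame and the names year and day are assumptions made for this example.

```r
library(tidyr)

# Hypothetical data: ISO-style dates stored as strings
df_dates <- tibble(x = c("2023-01-15", "2024-06-30"))

# NA in `names` skips the second piece (the month),
# so only two new columns are created
out_dates <- df_dates %>%
  separate_wider_delim(x, delim = "-", names = c("year", NA, "day"))

out_dates
```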

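names_sep

If supplied, output names will be composed of the input column name, followed by the separator, followed by the new column name. Required when cols selects multiple columns.

For separate_wider_delim() you can supply names_sep instead of names, in which case the names will be generated from the source column name, names_sep, and a numeric suffix.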
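
A short sketch of the multi-column case, where names_sep is required: each new column name is built from the input column name, the separator, and the entry in names. The columns a and b and the names x and y are hypothetical.

```r
library(tidyr)

# Hypothetical data with two string columns to split
df_multi <- tibble(a = c("1-2", "3-4"), b = c("5-6", "7-8"))

# Splitting several columns at once; names_sep keeps the results apart
out_multi <- df_multi %>%
  separate_wider_delim(
    c(a, b),
    delim = "-",
    names = c("x", "y"),
    names_sep = "_"
  )

# Expected new columns: a_x, a_y, b_x, b_y
names(out_multi)
```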

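names_repair

Used to check that the output data frame has valid names. Must be one of the following options:

"minimal": no name repair or checks, beyond basic existence,
"unique": make sure names are unique and not empty,
"check_unique": (the default), no name repair, but check they are unique,
"universal": make the names unique and syntactic
a function: apply custom name repair.
tidyr_legacy: use the name repair from tidyr 0.8.
a formula: a purrr-style anonymous function (see rlang::as_function())

See vctrs::vec_as_names() for more details on these terms and the strategies used to enforce them.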

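too_few

What should happen if a value separates into too few pieces?

"error", the default, will throw an error.
"debug" adds additional columns to the output to help you locate and resolve the underlying problem. This option is intended to help you debug the issue and address it, and should not generally remain in your final code.
"align_start" aligns starts of short matches, adding NA on the end to pad to the correct length.
"align_end" (separate_wider_delim() only) aligns the ends of short matches, adding NA at the start to pad to the correct length.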

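To make the "align_end" option for too_few concrete, here is a minimal sketch in which shorter values are padded with NA on the left so that the last piece always lands in the last column. The data and column names are hypothetical.

```r
library(tidyr)

# Hypothetical data: values with one, two, or three pieces
df_short <- tibble(x = c("a-b-c", "b-c", "c"))

# Short values are right-aligned; missing leading pieces become NA
out_end <- df_short %>%
  separate_wider_delim(
    x,
    delim = "-",
    names = c("first", "second", "third"),
    too_few = "align_end"
  )

out_end
```

Here the column third is "c" in every row, while first is NA for the two shorter values.
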
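too_many

What should happen if a value separates into too many pieces?

"error", the default, will throw an error.
"debug" will add additional columns to the output to help you locate and resolve the underlying problem.
"drop" will silently drop any extra pieces.
"merge" (separate_wider_delim() only) will merge together any additional pieces.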

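As a small sketch of the "drop" option for too_many, the example below keeps the first two pieces of each value and silently discards the rest. The data and the names p and q are hypothetical.

```r
library(tidyr)

# Hypothetical data: the second value has one piece too many
df_extra <- tibble(x = c("a-b", "a-b-c"))

# Extra pieces beyond the two requested columns are discarded
out_drop <- df_extra %>%
  separate_wider_delim(
    x,
    delim = "-",
    names = c("p", "q"),
    too_many = "drop"
  )

out_drop
```

With too_many = "merge" the second row of q would instead hold "b-c".
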
cols_remove

Should the input cols be removed from the output? Always FALSE if too_few or too_many are set to "debug".
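
A brief sketch of cols_remove = FALSE, which keeps the original column next to the newly created ones. The data frame is hypothetical and mirrors the style of the examples in this page.

```r
library(tidyr)

# Hypothetical data
df_keep <- tibble(id = 1:2, x = c("m-123", "f-455"))

# Keep the source column `x` alongside the new columns
out_keep <- df_keep %>%
  separate_wider_delim(
    x,
    delim = "-",
    names = c("gender", "unit"),
    cols_remove = FALSE
  )

names(out_keep)
```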

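widths

A named numeric vector where the names become column names, and the values specify the column width. Unnamed components will match, but not be included in the output.

patterns

A named character vector where the names become column names and the values are regular expressions that match the contents of the vector. Unnamed components will match, but not be included in the output.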

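Value

A data frame based on data. It has the same rows, but different columns:

The primary purpose of the functions is to create new columns from components of the string. For separate_wider_delim() the names of the new columns come from names. For separate_wider_position() the names come from the names of widths. For separate_wider_regex() the names come from the names of patterns.
If too_few or too_many is "debug", the output will contain additional columns useful for debugging:
- {col}_ok: a logical vector which tells you if the input was ok or not. Use it to quickly find the problematic rows.
- {col}_remainder: any text remaining after separation.
- {col}_pieces, {col}_width, {col}_matches: number of pieces, number of characters, and number of matches for separate_wider_delim(), separate_wider_position() and separate_wider_regex() respectively.
If cols_remove = TRUE (the default), the input cols will be removed from the output.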
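
The debug columns listed above can be inspected with a sketch like the following, which uses separate_wider_position() on a value that is one character too short. The data is hypothetical; debug mode emits a warning, which is suppressed here only to keep the output quiet.

```r
library(tidyr)

# Hypothetical data: the second value has 3 characters, but 4 are expected
df_fixed <- tibble(x = c("m123", "f45"))

# Debug mode adds x_ok, x_width and x_remainder and keeps `x` itself
out_debug <- suppressWarnings(
  df_fixed %>%
    separate_wider_position(
      x,
      widths = c(gender = 1, unit = 3),
      too_few = "debug"
    )
)

names(out_debug)
```

Filtering on the x_ok column is then a quick way to isolate the rows that need attention.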

Examples

library(tidyr)

df <- tibble(id = 1:3, x = c("m-123", "f-455", "f-123"))
# There are three basic ways to split up a string into pieces:
# 1. with a delimiter
df %>% separate_wider_delim(x, delim = "-", names = c("gender", "unit"))
#> # A tibble: 3 × 3
#>      id gender unit 
#>   <int> <chr>  <chr>
#> 1     1 m      123  
#> 2     2 f      455  
#> 3     3 f      123  
# 2. by length
df %>% separate_wider_position(x, c(gender = 1, 1, unit = 3))
#> # A tibble: 3 × 3
#>      id gender unit 
#>   <int> <chr>  <chr>
#> 1     1 m      123  
#> 2     2 f      455  
#> 3     3 f      123  
# 3. defining each component with a regular expression
df %>% separate_wider_regex(x, c(gender = ".", ".", unit = "\\d+"))
#> # A tibble: 3 × 3
#>      id gender unit 
#>   <int> <chr>  <chr>
#> 1     1 m      123  
#> 2     2 f      455  
#> 3     3 f      123  

# Sometimes you split on the "last" delimiter
df <- tibble(var = c("race_1", "race_2", "age_bucket_1", "age_bucket_2"))
# _delim won't help because it always splits on the first delimiter
try(df %>% separate_wider_delim(var, "_", names = c("var1", "var2")))
#> Error in separate_wider_delim(., var, "_", names = c("var1", "var2")) : 
#>   Expected 2 pieces in each element of `var`.
#> ! 2 values were too long.
#> ℹ Use `too_many = "debug"` to diagnose the problem.
#> ℹ Use `too_many = "drop"/"merge"` to silence this message.
df %>% separate_wider_delim(var, "_", names = c("var1", "var2"), too_many = "merge")
#> # A tibble: 4 × 2
#>   var1  var2    
#>   <chr> <chr>   
#> 1 race  1       
#> 2 race  2       
#> 3 age   bucket_1
#> 4 age   bucket_2
# Instead, you can use _regex
df %>% separate_wider_regex(var, c(var1 = ".*", "_", var2 = ".*"))
#> # A tibble: 4 × 2
#>   var1       var2 
#>   <chr>      <chr>
#> 1 race       1    
#> 2 race       2    
#> 3 age_bucket 1    
#> 4 age_bucket 2    
# this works because * is greedy; you can mimic the _delim behaviour with .*?
df %>% separate_wider_regex(var, c(var1 = ".*?", "_", var2 = ".*"))
#> # A tibble: 4 × 2
#>   var1  var2    
#>   <chr> <chr>   
#> 1 race  1       
#> 2 race  2       
#> 3 age   bucket_1
#> 4 age   bucket_2

# If the number of components varies, it's most natural to split into rows
df <- tibble(id = 1:4, x = c("x", "x y", "x y z", NA))
df %>% separate_longer_delim(x, delim = " ")
#> # A tibble: 7 × 2
#>      id x    
#>   <int> <chr>
#> 1     1 x    
#> 2     2 x    
#> 3     2 y    
#> 4     3 x    
#> 5     3 y    
#> 6     3 z    
#> 7     4 NA   
# But separate_wider_delim() provides some tools to deal with the problem
# The default behaviour tells you that there's a problem
try(df %>% separate_wider_delim(x, delim = " ", names = c("a", "b")))
#> Error in separate_wider_delim(., x, delim = " ", names = c("a", "b")) : 
#>   Expected 2 pieces in each element of `x`.
#> ! 1 value was too short.
#> ℹ Use `too_few = "debug"` to diagnose the problem.
#> ℹ Use `too_few = "align_start"/"align_end"` to silence this message.
#> ! 1 value was too long.
#> ℹ Use `too_many = "debug"` to diagnose the problem.
#> ℹ Use `too_many = "drop"/"merge"` to silence this message.
# You can get additional insight by using the debug options
df %>%
  separate_wider_delim(
    x,
    delim = " ",
    names = c("a", "b"),
    too_few = "debug",
    too_many = "debug"
  )
#> Warning: Debug mode activated: adding variables `x_ok`, `x_pieces`, and
#> `x_remainder`.
#> # A tibble: 4 × 7
#>      id a     b     x     x_ok  x_pieces x_remainder
#>   <int> <chr> <chr> <chr> <lgl>    <int> <chr>      
#> 1     1 x     NA    x     FALSE        1 ""         
#> 2     2 x     y     x y   TRUE         2 ""         
#> 3     3 x     y     x y z FALSE        3 " z"       
#> 4     4 NA    NA    NA    TRUE        NA  NA        

# But you can suppress the warnings
df %>%
  separate_wider_delim(
    x,
    delim = " ",
    names = c("a", "b"),
    too_few = "align_start",
    too_many = "merge"
  )
#> # A tibble: 4 × 3
#>      id a     b    
#>   <int> <chr> <chr>
#> 1     1 x     NA   
#> 2     2 x     y    
#> 3     3 x     y z  
#> 4     4 NA    NA   

# Or choose to automatically name the columns, producing as many as needed
df %>% separate_wider_delim(x, delim = " ", names_sep = "", too_few = "align_start")
#> # A tibble: 4 × 4
#>      id x1    x2    x3   
#>   <int> <chr> <chr> <chr>
#> 1     1 x     NA    NA   
#> 2     2 x     y     NA   
#> 3     3 x     y     z    
#> 4     4 NA    NA    NA

Source code: R/separate-wider.R

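Note: This article was selected and compiled by 純淨天空 from the original English work "Split a string into columns" by Hadley Wickham et al. Unless otherwise stated, the copyright of the original code belongs to the original authors; do not reproduce or copy this translation without permission or authorization.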