R stringr modifiers: Control matching behaviour with modifier functions

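Modifier functions control the meaning of the pattern argument to stringr functions:

boundary(): Match boundaries between things.
coll(): Compare strings using standard Unicode collation rules.
fixed(): Compare literal bytes.
regex() (the default): Use ICU regular expressions.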
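
As a minimal sketch of what "the default" means in the list above: a bare string passed as pattern is interpreted as a regular expression, so wrapping it in regex() with no extra arguments should not change the result. The variable names here are illustrative only.

```r
library(stringr)

x <- c("abb", "a.b")

# A bare string is treated as a regular expression, where "." matches any character.
default_result <- str_detect(x, "a.b")

# Wrapping the same string in regex() without options gives the same matches.
explicit_result <- str_detect(x, regex("a.b"))

identical(default_result, explicit_result)
```

Reach for one of the other modifiers only when the regular-expression interpretation is not what you want.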

Usage

fixed(pattern, ignore_case = FALSE)

coll(pattern, ignore_case = FALSE, locale = "en", ...)

regex(
  pattern,
  ignore_case = FALSE,
  multiline = FALSE,
  comments = FALSE,
  dotall = FALSE,
  ...
)

boundary(
  type = c("character", "line_break", "sentence", "word"),
  skip_word_none = NA,
  ...
)

Arguments

pattern

The pattern whose matching behaviour is modified.

ignore_case

Should case differences be ignored in the match? For fixed(), this uses a simple algorithm that assumes a one-to-one mapping between upper and lower case letters.

locale

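Locale to use for comparisons. See stringi::stri_locale_list() for all possible options. Defaults to "en" (English) to ensure that the default behaviour is consistent across platforms.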

...

Other less frequently used arguments passed on to stringi::stri_opts_collator(), stringi::stri_opts_regex(), or stringi::stri_opts_brkiter().

multiline

If TRUE, ^ and $ match the beginning and end of each line. If FALSE, the default, they match only the start and end of the input.

comments

If TRUE, white space and comments beginning with # are ignored. Escape literal spaces with "\\ ".
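
The following is a small sketch of comments = TRUE, using a made-up pattern (three digits, a dash, four digits) and made-up input strings. It also illustrates that an unescaped space is dropped from the pattern while an escaped one is kept.

```r
library(stringr)

# Hypothetical pattern, spread over several lines with explanatory comments.
number_pattern <- regex("
  (\\d{3})  # three digits
  -         # a literal dash
  (\\d{4})  # four digits
", comments = TRUE)
extracted <- str_extract("call 555-1234 now", number_pattern)
extracted

# An unescaped space is ignored, so the pattern is effectively "ab".
unescaped <- str_detect("a b", regex("a b", comments = TRUE))

# An escaped space is kept as a literal space.
escaped <- str_detect("a b", regex("a\\ b", comments = TRUE))
c(unescaped, escaped)
```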

dotall

If TRUE, . will also match line terminators.

type

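Boundary type to detect.

character: Every character is a boundary.
line_break: Boundaries are places where it is acceptable to have a line break in the current locale.
sentence: The beginnings and ends of sentences are boundaries, using intelligent rules to avoid counting abbreviations.
word: The beginnings and ends of words are boundaries.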
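
To complement the word-boundary examples further down, here is a brief sketch of the character and sentence types. The input strings are invented for illustration.

```r
library(stringr)

# "character": count the individual characters in a string.
n_chars <- str_count("abc", boundary("character"))
n_chars

# "sentence": split a short text into its sentences.
text <- "This is a sentence. And another one."
sentences <- str_split(text, boundary("sentence"))[[1]]
length(sentences)
```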

skip_word_none

Ignore "words" that don't contain any characters or numbers (i.e. punctuation). The default, NA, skips such "words" only when splitting on word boundaries.
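
A rough sketch of the effect of skip_word_none, with an invented input string: under the default, pieces made up only of punctuation or white space are left out of the word count, while skip_word_none = FALSE keeps them.

```r
library(stringr)

x <- "Hi, there!"

# Default (NA): with type "word", punctuation-only and space-only pieces are skipped.
n_default <- str_count(x, boundary("word"))

# skip_word_none = FALSE: those pieces are counted as well.
n_all <- str_count(x, boundary("word", skip_word_none = FALSE))

c(n_default, n_all)
```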

Value

A stringr modifier object, i.e. a character vector with parent S3 class stringr_pattern.
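
One way to inspect the returned object, as a sketch that assumes an installed stringr version matching this page (older releases may use different class names):

```r
library(stringr)

p <- fixed("a.b")

# The modifier object is still a character vector underneath.
is.character(p)

# It carries the parent S3 class described above.
inherits(p, "stringr_pattern")

# Full class vector, for reference.
class(p)
```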

Examples

library(stringr)

pattern <- "a.b"
strings <- c("abb", "a.b")
str_detect(strings, pattern)
#> [1] TRUE TRUE
str_detect(strings, fixed(pattern))
#> [1] FALSE  TRUE
str_detect(strings, coll(pattern))
#> [1] FALSE  TRUE

# coll() is useful for locale-aware case-insensitive matching
i <- c("I", "\u0130", "i")
i
#> [1] "I" "İ" "i"
str_detect(i, fixed("i", TRUE))
#> [1]  TRUE FALSE  TRUE
str_detect(i, coll("i", TRUE))
#> [1]  TRUE FALSE  TRUE
str_detect(i, coll("i", TRUE, locale = "tr"))
#> [1] FALSE  TRUE  TRUE

# Word boundaries
words <- c("These are   some words.")
str_count(words, boundary("word"))
#> [1] 4
str_split(words, " ")[[1]]
#> [1] "These"  "are"    ""       ""       "some"   "words."
str_split(words, boundary("word"))[[1]]
#> [1] "These" "are"   "some"  "words"

# Regular expression variations
str_extract_all("The Cat in the Hat", "[a-z]+")
#> [[1]]
#> [1] "he"  "at"  "in"  "the" "at" 
#> 
str_extract_all("The Cat in the Hat", regex("[a-z]+", TRUE))
#> [[1]]
#> [1] "The" "Cat" "in"  "the" "Hat"
#> 

str_extract_all("a\nb\nc", "^.")
#> [[1]]
#> [1] "a"
#> 
str_extract_all("a\nb\nc", regex("^.", multiline = TRUE))
#> [[1]]
#> [1] "a" "b" "c"
#> 

str_extract_all("a\nb\nc", "a.")
#> [[1]]
#> character(0)
#> 
str_extract_all("a\nb\nc", regex("a.", dotall = TRUE))
#> [[1]]
#> [1] "a\n"
#>

Source code: R/modifiers.R


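Note: This article was selected and compiled by 纯净天空 from the original English work "Control matching behaviour with modifier functions" by Hadley Wickham and others. Unless otherwise stated, copyright in the original code belongs to the original authors; do not reproduce or copy this translation without permission or authorization.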