R regmatches: Extract or Replace Matched Substrings

The R function regmatches is in the base package.

Description

Extract or replace matched substrings from match data obtained by regexpr, gregexpr, regexec or gregexec.

Usage

regmatches(x, m, invert = FALSE)
regmatches(x, m, invert = FALSE) <- value

Arguments

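`x`	a character vector.
`m`	an object with match data.
`invert`	a logical: if `TRUE`, extract or replace the non-matched substrings.
`value`	an object with suitable replacement values for the matched or non-matched substrings (see `Details`).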

Details

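If invert is FALSE (the default), regmatches extracts the matched substrings as specified by the match data. For vector match data (as obtained from regexpr), empty matches are dropped, so elements of x without a match contribute nothing to the result; for list match data, empty matches give empty components (zero-length character vectors).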
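The difference between vector and list match data can be seen in a minimal sketch. The input vector and the names `first_match` and `all_matches` are illustrative assumptions, not part of the original page.

```r
# Illustrative input: the second element contains no digits.
x <- c("apple 12", "no digits", "7 and 42")

# Vector match data from regexpr(): at most one match per element,
# and elements without a match are dropped from the result.
first_match <- regmatches(x, regexpr("[0-9]+", x))
first_match
# [1] "12" "7"

# List match data from gregexpr(): one component per element of x,
# with a zero-length character vector where nothing matched.
all_matches <- regmatches(x, gregexpr("[0-9]+", x))
lengths(all_matches)
# [1] 1 0 2
```

Because the vector form drops unmatched elements, its result can be shorter than x; the list form always has the same length as x.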

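If invert is TRUE, regmatches extracts the non-matched substrings, i.e., the strings are split according to the matches, similar to strsplit (for vector match data, at most a single split is performed).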
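As a rough illustration of the splitting behaviour, the sketch below compares match data from regexpr and gregexpr on the same input. The input and the names `split_once` and `split_all` are assumptions made for this example.

```r
# Illustrative input: the second element has no comma to split on.
x <- c("a,b,c", "no comma")

# Vector match data: only the first comma is used, so at most one split.
split_once <- regmatches(x, regexpr(",", x), invert = TRUE)
split_once[[1]]
# [1] "a"   "b,c"

# List match data: every comma is used as a split point.
split_all <- regmatches(x, gregexpr(",", x), invert = TRUE)
split_all[[1]]
# [1] "a" "b" "c"

# An element without any match is returned unchanged in both cases.
split_all[[2]]
# [1] "no comma"
```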

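If invert is NA, regmatches extracts both non-matched and matched substrings, always starting and ending with a non-match (empty if the match occurred at the beginning or the end, respectively).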
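A small sketch of the interleaved result, assuming an installed R version that supports `invert = NA`. The input string and the name `parts` are illustrative.

```r
# Illustrative input that both starts and ends with a match.
x <- "2024-01-15"
m <- gregexpr("[0-9]+", x)

# Non-matches and matches alternate; because the string starts and ends
# with a match, the first and last non-matched pieces are empty strings.
parts <- regmatches(x, m, invert = NA)
parts[[1]]
# [1] ""     "2024" "-"    "01"   "-"    "15"   ""

# Pasting the pieces back together recovers the original string.
paste(parts[[1]], collapse = "")
# [1] "2024-01-15"
```

The odd positions of each component hold the non-matched pieces and the even positions hold the matches.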

Note that the match data can be obtained from regular expression matching on a modified version of x with the same numbers of characters.
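One simple way to picture this is to match against a lower-cased copy and then extract from the original, which works here because tolower leaves the number of characters of these ASCII strings unchanged. The input and the name `found` are assumptions for illustration; in this particular case `ignore.case = TRUE` would be an alternative.

```r
# Illustrative input with mixed case.
x <- c("Error: disk FULL", "warning: Low Memory")

# Match on a modified copy of x that has the same numbers of characters.
m <- regexpr("full|low", tolower(x))

# The positions in m are still valid for x, so the extracted substrings
# keep their original capitalisation.
found <- regmatches(x, m)
found
# [1] "FULL" "Low"
```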

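The replacement function can be used for replacing the matched or non-matched substrings. For vector match data, if invert is FALSE, value should be a character vector whose length is the number of matched elements in m. Otherwise, it should be a list of character vectors with the same length as m, each as long as the number of replacements needed. Replacement coerces values to character or list and generously recycles values as needed. Missing replacement values are not allowed.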
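The two shapes of value can be sketched as follows. The inputs and the placeholder strings such as `"<A>"` are illustrative assumptions.

```r
# Vector match data: one replacement per matched element of x.
x <- c("id 7", "none", "id 42")
m <- regexpr("[0-9]+", x)
y <- x
regmatches(y, m) <- c("<A>", "<B>")  # two elements of x have a match
y
# [1] "id <A>" "none"   "id <B>"

# List match data: value is a list with one character vector per element.
z <- "a1b22c333"
mg <- gregexpr("[0-9]+", z)
regmatches(z, mg) <- list(c("X", "Y", "Z"))
z
# [1] "aXbYcZ"

# A shorter replacement vector is recycled over all matches.
w <- "a1b22c333"
regmatches(w, gregexpr("[0-9]+", w)) <- list("#")
w
# [1] "a#b#c#"
```

Working on copies (y, z, w) keeps the original vector available, since the replacement function modifies its first argument in place.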

Value

For regmatches, a character vector with the matched substrings if m is a vector and invert is FALSE. Otherwise, a list with the matched or/and non-matched substrings.

For regmatches<-, the updated character vector.

Examples

x <- c("A and B", "A, B and C", "A, B, C and D", "foobar")
pattern <- "[[:space:]]*(,|and)[[:space:]]"
## Match data from regexpr()
m <- regexpr(pattern, x)
regmatches(x, m)
regmatches(x, m, invert = TRUE)
## Match data from gregexpr()
m <- gregexpr(pattern, x)
regmatches(x, m)
regmatches(x, m, invert = TRUE)

## Consider
x <- "John (fishing, hunting), Paul (hiking, biking)"
## Suppose we want to split at the comma (plus spaces) between the
## persons, but not at the commas in the parenthesized hobby lists.
## One idea is to "blank out" the parenthesized parts to match the
## parts to be used for splitting, and extract the persons as the
## non-matched parts.
## First, match the parenthesized hobby lists.
m <- gregexpr("\\([^)]*\\)", x)
## Create blank strings with given numbers of characters.
blanks <- function(n) strrep(" ", n)
## Create a copy of x with the parenthesized parts blanked out.
s <- x
regmatches(s, m) <- Map(blanks, lapply(regmatches(s, m), nchar))
s
## Compute the positions of the split matches (note that we cannot call
## strsplit() on x with match data from s).
m <- gregexpr(", *", s)
## And finally extract the non-matched parts.
regmatches(x, m, invert = TRUE)

## regexec() and gregexec() return overlapping ranges because the
## first match is the full match.  This conflicts with regmatches()<-
## and regmatches(..., invert=TRUE).  We can work-around by dropping
## the first match.
drop_first <- function(x) {
    if(!anyNA(x) && all(x > 0)) {
        ml <- attr(x, 'match.length')
        if(is.matrix(x)) x <- x[-1,] else x <- x[-1]
        attr(x, 'match.length') <- if(is.matrix(ml)) ml[-1,] else ml[-1]
    }
    x
}
m <- gregexec("(\\w+) \\(((?:\\w+(?:, )?)+)\\)", x)
regmatches(x, m)
try(regmatches(x, m, invert=TRUE))
regmatches(x, lapply(m, drop_first))
## invert=TRUE loses matrix structure because we are retrieving what
## is in between every sub-match
regmatches(x, lapply(m, drop_first), invert=TRUE)
y <- z <- x
## Notice **list**(...) on the RHS
regmatches(y, lapply(m, drop_first)) <- list(c("<NAME>", "<HOBBY-LIST>"))
y
regmatches(z, lapply(m, drop_first), invert=TRUE) <-
    list(sprintf("<%d>", 1:5))
z

## With `perl = TRUE` and `invert = FALSE` capture group names
## are preserved.  Collect functions and arguments in calls:
NEWS <- head(readLines(file.path(R.home(), 'doc', 'NEWS.2')), 100)
m <- gregexec("(?<fun>\\w+)\\((?<args>[^)]*)\\)", NEWS, perl = TRUE)
y <- regmatches(NEWS, m)
y[[16]]
## Make tabular, adding original line numbers
mdat <- as.data.frame(t(do.call(cbind, y)))
mdat <- cbind(mdat, line=rep(seq_along(y), lengths(y) / ncol(mdat)))
head(mdat)
NEWS[head(mdat[['line']])]


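Note: This article was selected and compiled by 纯净天空 from the original English work "Extract or Replace Matched Substrings" by the R-devel authors. Unless otherwise stated, copyright in the original code belongs to the original authors; please do not reproduce or copy this translation without permission or authorization.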