R rvest html_text: Get element text

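There are two ways to retrieve text from an element: html_text() and html_text2(). html_text() is a thin wrapper around xml2::xml_text() that returns only the raw underlying text. html_text2() simulates how text looks in a browser, using an approach inspired by JavaScript's innerText property. Roughly speaking, it converts <br /> to "\n", adds blank lines around <p> tags, and lightly formats tabular data.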
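As a rough illustration of the paragraph handling described above, the following sketch compares the two functions on a container holding two <p> elements. The markup and variable names are made up for this example.

```r
library(rvest)

# Hypothetical markup: two paragraphs inside one container
html <- minimal_html("<div><p>First</p><p>Second</p></div>")
div <- html_element(html, "div")

# Raw underlying text: the paragraph boundary is not represented
raw <- html_text(div)

# Browser-like text: the paragraphs are separated by line breaks
browser_like <- html_text2(div)

writeLines(raw)
writeLines(browser_like)
```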

html_text2() is usually what you want, but it is much slower than html_text(), so for simple applications where performance matters you may want to use html_text() instead.
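To check the speed difference on your own machine, a minimal timing sketch along these lines may help. The document size of 2000 paragraphs is an arbitrary assumption, and the timings will vary with hardware and package versions, so no expected numbers are shown.

```r
library(rvest)

# Build a document with many short paragraphs (the count is arbitrary)
paragraphs <- paste0("<p>Paragraph ", seq_len(2000), "</p>", collapse = "")
html <- minimal_html(paragraphs)
nodes <- html_elements(html, "p")

# Time each function over the same node set
time_raw <- system.time(raw <- html_text(nodes))
time_browser <- system.time(browser_like <- html_text2(nodes))

# Elapsed seconds for each approach
time_raw[["elapsed"]]
time_browser[["elapsed"]]
```

For anything beyond a quick check, a dedicated benchmarking package would give more reliable measurements than a single system.time() call.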

Usage

html_text(x, trim = FALSE)

html_text2(x, preserve_nbsp = FALSE)

Arguments

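x: A document, node, or node set.
trim: If TRUE, leading and trailing whitespace is trimmed.
preserve_nbsp: Should non-breaking spaces be preserved? By default, html_text2() converts them to ordinary spaces to ease further computation. When preserve_nbsp is TRUE, &nbsp; appears in strings as "\ua0". This often causes confusion because it prints the same way as " ".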
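The two optional arguments can be tried out with a short sketch like the one below. The markup and variable names are illustrative only. Because "\ua0" prints like an ordinary space, the sketch inspects code points instead of printed strings.

```r
library(rvest)

# trim = TRUE removes leading and trailing whitespace from the raw text
padded <- minimal_html("<p>  padded text  </p>")
untrimmed <- html_text(html_element(padded, "p"))
trimmed <- html_text(html_element(padded, "p"), trim = TRUE)

# preserve_nbsp = TRUE keeps the non-breaking space instead of converting it
nbsp <- minimal_html("<p>x&nbsp;y</p>")
converted <- html_text2(html_element(nbsp, "p"))
preserved <- html_text2(html_element(nbsp, "p"), preserve_nbsp = TRUE)

# Code point 160 is the non-breaking space, 32 is the regular space
utf8ToInt(converted)
utf8ToInt(preserved)
```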

Value

A character vector the same length as x.
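For instance, passing a node set with three elements should yield a character vector of length three. This is a small sketch using made-up list markup:

```r
library(rvest)

# Hypothetical list with three items
html <- minimal_html("<ul><li>one</li><li>two</li><li>three</li></ul>")
items <- html_elements(html, "li")

# One string per node in the node set
texts <- html_text2(items)
length(texts) == length(items)
```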

Examples

library(rvest)

# To understand the difference between html_text() and html_text2(),
# take the following html:

html <- minimal_html(
  "<p>This is a paragraph.
    This another sentence.<br>This should start on a new line"
)

# html_text() returns the raw underlying text, which includes whitespace
# that would be ignored by a browser, and ignores the <br>
html %>% html_element("p") %>% html_text() %>% writeLines()
#> This is a paragraph.
#>     This another sentence.This should start on a new line

# html_text2() simulates what a browser would display. Non-significant
# whitespace is collapsed, and <br> is turned into a line break
html %>% html_element("p") %>% html_text2() %>% writeLines()
#> This is a paragraph. This another sentence.
#> This should start on a new line

# By default, html_text2() also converts non-breaking spaces to regular
# spaces:
html <- minimal_html("<p>x&nbsp;y</p>")
x1 <- html %>% html_element("p") %>% html_text()
x2 <- html %>% html_element("p") %>% html_text2()

# When printed, non-breaking spaces look exactly like regular spaces
x1
#> [1] "x y"
x2
#> [1] "x y"
# But aren't actually the same:
x1 == x2
#> [1] FALSE
# Which you can confirm by looking at their underlying binary
# representation:
charToRaw(x1)
#> [1] 78 c2 a0 79
charToRaw(x2)
#> [1] 78 20 79

Source: R/text.R

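Note: This article was selected and compiled by 纯净天空 from the original English work "Get element text" by Hadley Wickham and others. Unless otherwise stated, copyright of the original code belongs to the original authors. Do not reproduce or copy this translation without permission or authorization.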