R rvest html_text: Get element text


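There are two ways to retrieve text from an element: html_text() and html_text2(). html_text() is a thin wrapper around xml2::xml_text() that returns only the raw underlying text. html_text2() simulates how text looks in a browser, using an approach inspired by JavaScript's innerText(). Roughly speaking, it converts <br /> to "\n", adds blank lines around <p> tags, and lightly formats tabular data.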
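A minimal sketch of the paragraph handling described above, assuming rvest is installed. The two-paragraph snippet and the variable names are made up for illustration.

```r
library(rvest)

# Two adjacent paragraphs with no whitespace between the tags
html <- minimal_html("<p>First</p><p>Second</p>")
body <- html %>% html_element("body")

# html_text() concatenates the raw text nodes as-is
raw <- body %>% html_text()

# html_text2() separates the paragraphs the way a browser would render them
rendered <- body %>% html_text2()

writeLines(raw)
writeLines(rendered)
```

Printing both results with writeLines() makes the difference visible: the raw version runs the two paragraphs together, while the rendered version separates them.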

html_text2() is usually what you want, but it is much slower than html_text(), so for simple applications where performance matters you may want to use html_text() instead.
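One rough way to check the speed difference on your own data is to time both functions with system.time(). The sketch below uses a synthetic document of 500 paragraphs; that size is an arbitrary assumption, and the timings will vary by machine and rvest version.

```r
library(rvest)

# Build a synthetic document (500 paragraphs is an arbitrary choice)
paras <- paste0("<p>Paragraph ", seq_len(500), "<br>second line</p>", collapse = "")
html <- minimal_html(paras)
nodes <- html %>% html_elements("p")

# Time each extraction once; repeat or use a benchmarking package
# if you need stable numbers
t_raw <- system.time(raw <- html_text(nodes))
t_sim <- system.time(sim <- html_text2(nodes))

print(c(html_text = t_raw[["elapsed"]], html_text2 = t_sim[["elapsed"]]))
```

A single system.time() call is noisy, so treat the printed numbers as an indication only.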

Usage

html_text(x, trim = FALSE)

html_text2(x, preserve_nbsp = FALSE)

Arguments

x

A document, node, or node set.

trim

If TRUE, leading and trailing spaces are trimmed.
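A small sketch of the trim argument, assuming rvest is installed. The padded paragraph is a made-up input.

```r
library(rvest)

# A paragraph whose text is padded with spaces on both sides
html <- minimal_html("<p>   padded text   </p>")
p <- html %>% html_element("p")

# Default: the surrounding whitespace is kept
untrimmed <- p %>% html_text()

# trim = TRUE strips leading and trailing whitespace
trimmed <- p %>% html_text(trim = TRUE)

nchar(untrimmed)
nchar(trimmed)
```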

preserve_nbsp

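Should non-breaking spaces be preserved? By default, html_text2() converts them to ordinary spaces to ease further computation. When preserve_nbsp is TRUE, &nbsp; will appear in strings as "\ua0". This often causes confusion because it prints the same way as " ".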
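The effect of preserve_nbsp can be checked by comparing the result against string literals, since the two kinds of space look identical when printed. This is a sketch assuming rvest is installed; the variable names are illustrative.

```r
library(rvest)

html <- minimal_html("<p>x&nbsp;y</p>")
p <- html %>% html_element("p")

# Default: the non-breaking space becomes a regular space
converted <- p %>% html_text2()

# preserve_nbsp = TRUE keeps the non-breaking space (U+00A0)
preserved <- p %>% html_text2(preserve_nbsp = TRUE)

converted == "x y"
preserved == "x\u00a0y"
```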

Value

A character vector the same length as x.
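To illustrate the return value, the sketch below extracts text from a node set of three list items and checks that one string comes back per node. It assumes rvest is installed; the list markup is made up.

```r
library(rvest)

html <- minimal_html("<ul><li>a</li><li>b</li><li>c</li></ul>")
items <- html %>% html_elements("li")

# One string is returned for each node in the set
txt <- items %>% html_text2()

length(items)
length(txt)
txt
```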

Examples

library(rvest)

# To understand the difference between html_text() and html_text2()
# take the following html:

html <- minimal_html(
  "<p>This is a paragraph.
    This another sentence.<br>This should start on a new line"
)

# html_text() returns the raw underlying text, which includes whitespace
# that would be ignored by a browser, and ignores the <br>
html %>% html_element("p") %>% html_text() %>% writeLines()
#> This is a paragraph.
#>     This another sentence.This should start on a new line

# html_text2() simulates what a browser would display. Non-significant
# whitespace is collapsed, and <br> is turned into a line break
html %>% html_element("p") %>% html_text2() %>% writeLines()
#> This is a paragraph. This another sentence.
#> This should start on a new line

# By default, html_text2() also converts non-breaking spaces to regular
# spaces:
html <- minimal_html("<p>x&nbsp;y</p>")
x1 <- html %>% html_element("p") %>% html_text()
x2 <- html %>% html_element("p") %>% html_text2()

# When printed, non-breaking spaces look exactly like regular spaces
x1
#> [1] "x y"
x2
#> [1] "x y"
# But aren't actually the same:
x1 == x2
#> [1] FALSE
# Which you can confirm by looking at their underlying binary
# representation:
charToRaw(x1)
#> [1] 78 c2 a0 79
charToRaw(x2)
#> [1] 78 20 79

Source code: R/text.R

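Note: This article was selected and compiled by 纯净天空 from the original English work "Get element text" by Hadley Wickham and others. Unless otherwise stated, copyright of the original code belongs to the original authors. Please do not reproduce or copy this translation without permission or authorization.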