R rvest html_text: Get element text

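There are two ways to retrieve text from an element: html_text() and html_text2(). html_text() is a thin wrapper around xml2::xml_text() that returns just the raw underlying text. html_text2() simulates how text looks in a browser, using an approach inspired by JavaScript's innerText(). Roughly speaking, it converts <br /> to "\n", adds blank lines around <p> tags, and lightly formats tabular data.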
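As a small illustration of the paragraph and line-break handling described above, the sketch below runs both functions on the same node. The HTML snippet is made up for this example, and the exact layout printed by html_text2() may vary slightly between rvest versions, so only the raw result is stated.

```r
library(rvest)

# Hypothetical snippet: two paragraphs, the second containing a <br>
html <- minimal_html(
  "<div><p>First paragraph</p><p>Second<br>paragraph</p></div>"
)
div <- html %>% html_element("div")

# Raw text: tags are dropped and the pieces are simply concatenated
raw_text <- html_text(div)
raw_text
#> [1] "First paragraphSecondparagraph"

# Browser-like text: paragraphs and the <br> are separated by newlines
rendered_text <- html_text2(div)
writeLines(rendered_text)
```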

html_text2() is usually what you want, but it is much slower than html_text(), so for simple applications where performance matters you may want to use html_text() instead.
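One way to check the speed difference on your own data is a rough timing with system.time(). This is only a sketch: the 500-paragraph document is synthetic, and the timings depend on the machine and rvest version, so no figures are shown.

```r
library(rvest)

# Build a synthetic document with 500 short paragraphs (illustrative only)
paras <- paste0(
  "<p>Paragraph ", seq_len(500), "<br>second line</p>",
  collapse = ""
)
html <- minimal_html(paras)
nodes <- html %>% html_elements("p")

# Time each extractor over the same node set
system.time(fast <- html_text(nodes))
system.time(slow <- html_text2(nodes))
```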

Usage

html_text(x, trim = FALSE)

html_text2(x, preserve_nbsp = FALSE)

Arguments

x

A document, node, or node set.

trim

If TRUE, leading and trailing whitespace is trimmed.

preserve_nbsp

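Should non-breaking spaces be preserved? By default, html_text2() converts them to ordinary spaces to ease further computation. When preserve_nbsp is TRUE, &nbsp; will appear in strings as "\ua0". This often causes confusion because it prints the same way as " ".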
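The two optional arguments can be sketched as follows. The input strings are invented for illustration; the variable names are not part of rvest.

```r
library(rvest)

# trim = TRUE strips leading and trailing whitespace from html_text() output
padded <- minimal_html("<p>  padded text  </p>")
raw_text <- padded %>% html_element("p") %>% html_text()
trimmed_text <- padded %>% html_element("p") %>% html_text(trim = TRUE)
nchar(raw_text) > nchar(trimmed_text)

# preserve_nbsp = TRUE keeps &nbsp; as "\ua0" instead of a regular space
nbsp <- minimal_html("<p>x&nbsp;y</p>")
kept <- nbsp %>% html_element("p") %>% html_text2(preserve_nbsp = TRUE)
converted <- nbsp %>% html_element("p") %>% html_text2()
kept == "x\ua0y"
converted == "x y"
```

Both comparisons at the end should evaluate to TRUE; comparing against an explicit "\ua0" is a convenient way to detect non-breaking spaces that are otherwise invisible when printed.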

Value

A character vector the same length as x.
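To make the return value concrete, here is a minimal sketch with a node set of three elements; the list contents are hypothetical.

```r
library(rvest)

# A node set with three <li> elements
html <- minimal_html("<ul><li>apple</li><li>banana</li><li>cherry</li></ul>")
items <- html %>% html_elements("li")

# One string per node, in document order
texts <- html_text2(items)
length(texts) == length(items)
texts
#> [1] "apple"  "banana" "cherry"
```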

Examples

library(rvest)

# To understand the difference between html_text() and html_text2()
# take the following html:

html <- minimal_html(
  "<p>This is a paragraph.
    This another sentence.<br>This should start on a new line"
)

# html_text() returns the raw underlying text, which includes whitespace
# that would be ignored by a browser, and ignores the <br>
html %>% html_element("p") %>% html_text() %>% writeLines()
#> This is a paragraph.
#>     This another sentence.This should start on a new line

# html_text2() simulates what a browser would display. Non-significant
# whitespace is collapsed, and <br> is turned into a line break
html %>% html_element("p") %>% html_text2() %>% writeLines()
#> This is a paragraph. This another sentence.
#> This should start on a new line

# By default, html_text2() also converts non-breaking spaces to regular
# spaces:
html <- minimal_html("<p>x&nbsp;y</p>")
x1 <- html %>% html_element("p") %>% html_text()
x2 <- html %>% html_element("p") %>% html_text2()

# When printed, non-breaking spaces look exactly like regular spaces
x1
#> [1] "x y"
x2
#> [1] "x y"
# But aren't actually the same:
x1 == x2
#> [1] FALSE
# Which you can confirm by looking at their underlying binary
# representation:
charToRaw(x1)
#> [1] 78 c2 a0 79
charToRaw(x2)
#> [1] 78 20 79

Source code: R/text.R

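Note: This article was curated by 純淨天空 from the original English work "Get element text" by Hadley Wickham et al. Unless otherwise stated, copyright of the original code belongs to the original authors. Do not reproduce or copy this translation without permission.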