R rvest html_text: Get element text

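There are two ways to retrieve text from an element: html_text() and html_text2(). html_text() is a thin wrapper around xml2::xml_text() that returns just the raw underlying text. html_text2() simulates how text looks in a browser, using an approach inspired by JavaScript's innerText. Roughly speaking, it converts <br /> to "\n", adds blank lines around <p> tags, and lightly formats tabular data.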

html_text2() is usually what you want, but it is much slower than html_text(), so for simple applications where performance matters you may want to use html_text() instead.
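The difference described above can be sketched on a small document. This is a minimal illustration, not part of the original page: the HTML snippet and the variable names `body`, `raw_text`, and `browser_text` are made up for the example, and it assumes rvest is installed.

```r
library(rvest)

# Hypothetical snippet: two paragraphs, the second containing a <br>
html <- minimal_html("<p>First paragraph</p><p>Second<br>paragraph</p>")
body <- html %>% html_element("body")

# Raw underlying text: no line breaks are introduced for <p> or <br>
raw_text <- body %>% html_text()

# Browser-like text: <br> becomes "\n" and paragraphs are separated
browser_text <- body %>% html_text2()

# print() shows the escaped strings, so the added "\n" are visible
print(raw_text)
print(browser_text)
```

Printing with print() rather than writeLines() keeps the escapes visible, which makes it easier to see exactly where html_text2() inserted line breaks.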

Usage

html_text(x, trim = FALSE)

html_text2(x, preserve_nbsp = FALSE)

Arguments

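x: A document, node, or node set.
trim: If TRUE, leading and trailing whitespace is trimmed.
preserve_nbsp: Should non-breaking spaces be preserved? By default, html_text2() converts them to ordinary spaces to ease further computation. When preserve_nbsp is TRUE, &nbsp; will appear in strings as "\ua0". This often causes confusion because it prints the same way as " ".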
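A short sketch of the two optional arguments follows. The HTML strings and the variable names (`padded`, `trimmed`, `nbsp_html`, `kept`, `converted`) are illustrative assumptions, not taken from the original page.

```r
library(rvest)

# trim = TRUE strips leading and trailing whitespace from the raw text
padded <- minimal_html("<p>   padded text   </p>")
trimmed <- padded %>% html_element("p") %>% html_text(trim = TRUE)

# preserve_nbsp = TRUE keeps &nbsp; as U+00A0 instead of a regular space
nbsp_html <- minimal_html("<p>x&nbsp;y</p>")
kept <- nbsp_html %>% html_element("p") %>% html_text2(preserve_nbsp = TRUE)
converted <- nbsp_html %>% html_element("p") %>% html_text2()

# The two strings print alike, so compare the bytes to tell them apart
charToRaw(enc2utf8(kept))
charToRaw(enc2utf8(converted))
```

Comparing raw bytes is one way to check which kind of space a string holds, since the printed form alone does not distinguish them.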

Value

A character vector the same length as x.
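To illustrate the return value, the sketch below extracts text from a node set with several elements. The list markup and the names `items` and `labels` are hypothetical and only serve the example.

```r
library(rvest)

# A node set with three <li> elements
html <- minimal_html("<ul><li>a</li><li>b</li><li>c</li></ul>")
items <- html %>% html_elements("li")

# One string per node, in document order
labels <- html_text(items)

length(items)
length(labels)
labels
```

Because the result lines up element by element with the input, it can be stored next to other per-node values, for example as a column in a data frame.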

Examples

library(rvest)

# To understand the difference between html_text() and html_text2()
# take the following html:

html <- minimal_html(
  "<p>This is a paragraph.
    This another sentence.<br>This should start on a new line"
)

# html_text() returns the raw underlying text, which includes whitespace
# that would be ignored by a browser, and ignores the <br>
html %>% html_element("p") %>% html_text() %>% writeLines()
#> This is a paragraph.
#>     This another sentence.This should start on a new line

# html_text2() simulates what a browser would display. Non-significant
# whitespace is collapsed, and <br> is turned into a line break
html %>% html_element("p") %>% html_text2() %>% writeLines()
#> This is a paragraph. This another sentence.
#> This should start on a new line

# By default, html_text2() also converts non-breaking spaces to regular
# spaces:
html <- minimal_html("<p>x&nbsp;y</p>")
x1 <- html %>% html_element("p") %>% html_text()
x2 <- html %>% html_element("p") %>% html_text2()

# When printed, non-breaking spaces look exactly like regular spaces
x1
#> [1] "x y"
x2
#> [1] "x y"
# But aren't actually the same:
x1 == x2
#> [1] FALSE
# Which you can confirm by looking at their underlying binary
# representation:
charToRaw(x1)
#> [1] 78 c2 a0 79
charToRaw(x2)
#> [1] 78 20 79

Source code: R/text.R

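Note: This article was compiled by 純淨天空 from the original English work "Get element text" by Hadley Wickham and others. Unless otherwise stated, copyright in the original code belongs to the original authors. Do not reproduce or copy this translation without permission.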