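R rvest html_element: Select elements from an HTML document

html_element() and html_elements() find HTML elements using CSS selectors or XPath expressions. CSS selectors are particularly useful in conjunction with https://selectorgadget.com/, which makes it easy to discover the selector you need.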

Usage

html_element(x, css, xpath)

html_elements(x, css, xpath)

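Arguments

x: Either a document, a node set, or a single node.
css, xpath: The elements to select. Supply exactly one of css or xpath, depending on whether you want to use a CSS selector or an XPath 1.0 expression.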
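As a minimal sketch of the two argument styles, the snippet below selects the same paragraph once with css and once with xpath. The HTML fragment and the variable names by_css and by_xpath are made up for illustration.

```r
library(rvest)

# Hypothetical fragment used only for this illustration.
html <- minimal_html("
  <p id='first'>This is a paragraph</p>
  <p class='important'>This is an important paragraph</p>
")

# Select by CSS selector (the css argument).
by_css <- html %>% html_elements(css = "p.important")

# Select the same element with an XPath 1.0 expression (the xpath argument).
by_xpath <- html %>% html_elements(xpath = "//p[@class='important']")

# Both calls should find the same single paragraph.
html_text2(by_css)
html_text2(by_xpath)
```

Supply only one of the two arguments in a given call; which one to use is mostly a matter of which syntax expresses the selection more clearly.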

Value

html_element() returns a node set the same length as the input. html_elements() flattens the output, so there is no direct way to map the output back to the input.
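One place where the length-preserving behaviour matters is when building a table with one row per input node. The following is a sketch under assumed data: the HTML fragment and the droids data frame are hypothetical, not part of the original documentation.

```r
library(rvest)

# Hypothetical two-item list; the second item has no weight.
html <- minimal_html("
  <ul>
    <li><b>C-3PO</b> weighs <span class='weight'>167 kg</span></li>
    <li><b>R4-P17</b> is a droid</li>
  </ul>
")
li <- html %>% html_elements("li")

# html_element() keeps one result per <li>, so the columns line up
# and the missing weight becomes NA.
droids <- data.frame(
  name = li %>% html_element("b") %>% html_text2(),
  weight = li %>% html_element(".weight") %>% html_text2()
)
droids

# html_elements() drops the missing match, so the positions are lost.
length(li %>% html_elements(".weight"))
```

Because html_elements() returns only the matches it finds, its result could not be used as a column here without some other way of knowing which list item each match came from.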

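CSS selector support

CSS selectors are translated to XPath selectors by the selectr package, which is a port of the Python cssselect library, https://pythonhosted.org/cssselect/.

It implements the majority of CSS3 selectors, as described in https://www.w3.org/TR/2011/REC-css3-selectors-20110929/. The exceptions are listed below:

Pseudo-selectors that require interactivity are ignored: :hover, :active, :focus, :target, :visited.
The following pseudo-classes don't work with the wildcard element *: *:first-of-type, *:last-of-type, *:nth-of-type, *:nth-last-of-type, *:only-of-type.
It supports :contains(text).
You can use !=; [foo!=bar] is the same as :not([foo=bar]).
:not() accepts a sequence of simple selectors, not just a single simple selector.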
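The :contains() and != extensions listed above can be tried out with a short sketch like the one below. The HTML fragment and the variable names are assumptions made for illustration.

```r
library(rvest)

# Hypothetical fragment: one paragraph without a class, two with one.
html <- minimal_html("
  <p id='first'>This is a paragraph</p>
  <p class='important'>This is an important paragraph</p>
  <p class='note'>This is a note</p>
")

# :contains(text) keeps elements whose text includes the given string.
with_text <- html %>% html_elements("p:contains('important')")
html_text2(with_text)

# [class!='important'] is documented as equivalent to
# :not([class='important']), so both calls should select the same nodes.
not_equal <- html %>% html_elements("p[class!='important']")
negated <- html %>% html_elements("p:not([class='important'])")
html_text2(not_equal)
html_text2(negated)
```

Note that both negated forms also match the first paragraph, which has no class attribute at all.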

Examples

library(rvest)

html <- minimal_html("
  <h1>This is a heading</h1>
  <p id='first'>This is a paragraph</p>
  <p class='important'>This is an important paragraph</p>
")

html %>% html_element("h1")
#> {html_node}
#> <h1>
html %>% html_elements("p")
#> {xml_nodeset (2)}
#> [1] <p id="first">This is a paragraph</p>
#> [2] <p class="important">This is an important paragraph</p>
html %>% html_elements(".important")
#> {xml_nodeset (1)}
#> [1] <p class="important">This is an important paragraph</p>
html %>% html_elements("#first")
#> {xml_nodeset (1)}
#> [1] <p id="first">This is a paragraph</p>

# html_element() vs html_elements() --------------------------------------
html <- minimal_html("
  <ul>
    <li><b>C-3PO</b> is a <i>droid</i> that weighs <span class='weight'>167 kg</span></li>
    <li><b>R2-D2</b> is a <i>droid</i> that weighs <span class='weight'>96 kg</span></li>
    <li><b>Yoda</b> weighs <span class='weight'>66 kg</span></li>
    <li><b>R4-P17</b> is a <i>droid</i></li>
  </ul>
")
li <- html %>% html_elements("li")

# When applied to a node set, html_elements() returns all matching elements
# beneath any of the inputs, flattening results into a new node set.
li %>% html_elements("i")
#> {xml_nodeset (3)}
#> [1] <i>droid</i>
#> [2] <i>droid</i>
#> [3] <i>droid</i>

# When applied to a node set, html_element() always returns a vector the
# same length as the input, using a "missing" element where needed.
li %>% html_element("i")
#> {xml_nodeset (4)}
#> [1] <i>droid</i>
#> [2] <i>droid</i>
#> [3] <NA>
#> [4] <i>droid</i>
# and html_text() and html_attr() will return NA
li %>% html_element("i") %>% html_text2()
#> [1] "droid" "droid" NA      "droid"
li %>% html_element("span") %>% html_attr("class")
#> [1] "weight" "weight" "weight" NA

Source code: R/selectors.R

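Note: This article was selected and compiled by 純淨天空 from the original English work "Select elements from an HTML document" by Hadley Wickham and others. Unless otherwise stated, copyright in the original code belongs to the original authors; please do not reproduce or copy this translation without permission or authorization.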