This article collects typical usage examples of the Python method fetcher.Fetcher.raw_fetch_url, for readers who want to see how Fetcher.raw_fetch_url is called in practice. The method belongs to the class fetcher.Fetcher.
One code example that uses Fetcher.raw_fetch_url is shown below.
Example 1: urls2fetch
# Required import: Fetcher is the class that provides raw_fetch_url
import logging

import feedparser

from fetcher import Fetcher


def urls2fetch(self, root, helper):
    """ Returns a set of URLs to fetch. If the scraper helper class has
        associated RSS feed URLs, these are used to acquire article URLs.
        Otherwise, the URLs are found by scraping the root website and
        searching for links to subpages. """
    fetch_set = set()
    feeds = helper.feeds
    if feeds:
        for feed_url in feeds:
            logging.info("Fetching feed {0}".format(feed_url))
            try:
                d = feedparser.parse(feed_url)
            except Exception as e:
                logging.warning(
                    "Error fetching/parsing feed {0}: {1}".format(feed_url, str(e))
                )
                continue
            for entry in d.entries:
                # Keep every entry that has a link, unless the helper rejects it
                if entry.link and not helper.skip_rss_entry(entry):
                    fetch_set.add(entry.link)
    else:
        # Fetch the root URL and scrape all child URLs
        # that refer to the same domain suffix
        logging.info("Fetching root {0}".format(root.url))
        # Read the HTML document at the root URL
        html_doc = Fetcher.raw_fetch_url(root.url)
        if not html_doc:
            logging.warning("Unable to fetch root {0}".format(root.url))
            # Nothing to scrape: this path returns None rather than an empty set
            return None
        # Parse the HTML document
        soup = Fetcher.make_soup(html_doc)
        # Obtain the set of child URLs to fetch
        fetch_set = Fetcher.children(root, soup)
    return fetch_set
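
The else branch above hands the parsed page to Fetcher.children(root, soup), whose implementation is not shown in this example. As a rough illustration of what "scrape all child URLs that refer to the same domain suffix" could look like, here is a minimal sketch that uses only the standard library. The names same_domain_children and _LinkCollector are hypothetical, and the sketch parses raw HTML with html.parser instead of the soup object used above, so it should not be read as the actual Fetcher code.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_domain_children(root_url, html_doc):
    """Return the set of absolute http(s) URLs found in html_doc whose host
    equals the root's host or is a subdomain of it (an assumed rule)."""
    root_host = urlparse(root_url).hostname or ""
    collector = _LinkCollector()
    collector.feed(html_doc)
    children = set()
    for href in collector.hrefs:
        # Resolve relative links against the root and drop any #fragment
        absolute, _fragment = urldefrag(urljoin(root_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript: and similar links
        host = parsed.hostname or ""
        if host == root_host or host.endswith("." + root_host):
            children.add(absolute)
    return children


sample_html = """
<a href="/news/1">one</a>
<a href="https://example.com/news/2#top">two</a>
<a href="https://other.org/x">elsewhere</a>
<a href="mailto:me@example.com">mail</a>
"""
print(sorted(same_domain_children("https://example.com/", sample_html)))
# → ['https://example.com/news/1', 'https://example.com/news/2']
```

Returning a set mirrors fetch_set in the example: a link that appears several times on the root page is fetched only once.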