This article collects typical usage examples of the Python method scrapy.contrib.linkextractors.sgml.BaseSgmlLinkExtractor._process_links. If you are wondering how BaseSgmlLinkExtractor._process_links works or how to use it, the curated code examples below may help. You can also read more about the containing class, scrapy.contrib.linkextractors.sgml.BaseSgmlLinkExtractor.
Two code examples of BaseSgmlLinkExtractor._process_links are shown below, sorted by popularity by default.
Example 1: _process_links
# Required import: from scrapy.contrib.linkextractors.sgml import BaseSgmlLinkExtractor [as alias]
# Or: from scrapy.contrib.linkextractors.sgml.BaseSgmlLinkExtractor import _process_links [as alias]
def _process_links(self, links):
    # Drop links whose URL fails basic validation.
    links = [link for link in links if _is_valid_url(link.url)]
    # Apply the standard allow/deny regex and domain filters.
    if self.allow_res:
        links = [link for link in links if _matches(link.url, self.allow_res)]
    if self.deny_res:
        links = [link for link in links if not _matches(link.url, self.deny_res)]
    if self.allow_domains:
        links = [link for link in links if url_is_from_any_domain(link.url, self.allow_domains)]
    if self.deny_domains:
        links = [link for link in links if not url_is_from_any_domain(link.url, self.deny_domains)]
    # Rewrite each product link to its Amazon product-reviews page,
    # keyed by the ASIN (the sixth path segment of the URL).
    new_links = []
    for link in links:
        ASIN = link.url.split('/')[5]
        if not self._ignore_identifier(ASIN):
            log.msg("Found ASIN: " + ASIN, level=log.DEBUG)
            link.url = ("http://www.amazon.com/product-reviews/" + ASIN
                        + "/ref%3Ddp_top_cm_cr_acr_txt?ie=UTF8&showViewpoints=0")
            new_links.append(link)
    links = new_links
    if self.canonicalize:
        for link in links:
            link.url = canonicalize_url(link.url)
    # Delegate any remaining processing (e.g. deduplication) to the base class.
    links = BaseSgmlLinkExtractor._process_links(self, links)
    return links
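The ASIN-rewriting step in Example 1 can be sketched as a standalone function with no Scrapy dependency (the function name asin_review_url is hypothetical, introduced here for illustration):

```python
def asin_review_url(product_url):
    # Amazon product URLs have the shape
    # http://www.amazon.com/<title>/dp/<ASIN>/..., so splitting on "/"
    # puts the ASIN at index 5, exactly as in the example above.
    asin = product_url.split('/')[5]
    return ("http://www.amazon.com/product-reviews/" + asin
            + "/ref%3Ddp_top_cm_cr_acr_txt?ie=UTF8&showViewpoints=0")
```

Note that this indexing is brittle: it assumes every matched URL follows the title/dp/ASIN layout, which is why the extractor above combines it with allow/deny filters.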
Example 2: _process_links
# Required import: from scrapy.contrib.linkextractors.sgml import BaseSgmlLinkExtractor [as alias]
# Or: from scrapy.contrib.linkextractors.sgml.BaseSgmlLinkExtractor import _process_links [as alias]
def _process_links(self, links):
    # Validate URLs only when self.check_url is set.
    links = [link for link in links if not self.check_url or _is_valid_url(link.url)]
    # Apply the standard allow/deny regex and domain filters.
    if self.allow_res:
        links = [link for link in links if _matches(link.url, self.allow_res)]
    if self.deny_res:
        links = [link for link in links if not _matches(link.url, self.deny_res)]
    if self.allow_domains:
        links = [link for link in links if url_is_from_any_domain(link.url, self.allow_domains)]
    if self.deny_domains:
        links = [link for link in links if not url_is_from_any_domain(link.url, self.deny_domains)]
    if self.canonicalize:
        for link in links:
            link.url = canonicalize_url(link.url)
    # Delegate any remaining processing to the base class implementation.
    links = BaseSgmlLinkExtractor._process_links(self, links)
    return links
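Both examples rely on Scrapy's url_is_from_any_domain helper for the allow_domains/deny_domains filters. A simplified sketch of its matching rule (a URL matches when its host equals one of the domains or is a subdomain of it; this is not Scrapy's actual implementation) looks like:

```python
from urllib.parse import urlparse

def url_matches_any_domain(url, domains):
    # Simplified stand-in for Scrapy's url_is_from_any_domain:
    # match on exact host or any subdomain of a listed domain.
    host = urlparse(url).netloc.lower()
    return any(host == d or host.endswith('.' + d)
               for d in (dm.lower() for dm in domains))
```

The endswith('.' + d) check matters: without the leading dot, a deny rule for example.com would also reject badexample.com.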