

Python BaseSgmlLinkExtractor._process_links Method Code Examples

This article collects typical usage examples of the Python method scrapy.contrib.linkextractors.sgml.BaseSgmlLinkExtractor._process_links. If you are unsure what BaseSgmlLinkExtractor._process_links does or how to use it, the curated code examples below should help. You can also explore other usage examples of scrapy.contrib.linkextractors.sgml.BaseSgmlLinkExtractor, the class this method belongs to.


Two code examples of BaseSgmlLinkExtractor._process_links are shown below, sorted by popularity by default. Upvoting the examples you find useful helps the system recommend better Python code samples.

Example 1: _process_links

# Required import: from scrapy.contrib.linkextractors.sgml import BaseSgmlLinkExtractor [as alias]
# Or: from scrapy.contrib.linkextractors.sgml.BaseSgmlLinkExtractor import _process_links [as alias]
    def _process_links(self, links):
        links = [link for link in links if _is_valid_url(link.url)]

        if self.allow_res:
            links = [link for link in links if _matches(link.url, self.allow_res)]
        if self.deny_res:
            links = [link for link in links if not _matches(link.url, self.deny_res)]
        if self.allow_domains:
            links = [link for link in links if url_is_from_any_domain(link.url, self.allow_domains)]
        if self.deny_domains:
            links = [link for link in links if not url_is_from_any_domain(link.url, self.deny_domains)]

        new_links = []
        for link in links:
            # Assumes Amazon product URLs of the form
            # http://www.amazon.com/<title>/dp/<ASIN>/..., where the ASIN
            # is the sixth path segment.
            ASIN = link.url.split('/')[5]
            if not self._ignore_identifier(ASIN):
                log.msg("Found ASIN: " + ASIN, level=log.DEBUG)
                # Rewrite the product URL to point at its reviews page.
                link.url = "http://www.amazon.com/product-reviews/" + ASIN + "/ref%3Ddp_top_cm_cr_acr_txt?ie=UTF8&showViewpoints=0"
                new_links.append(link)

        links = new_links

        if self.canonicalize:
            for link in links:
                link.url = canonicalize_url(link.url)

        links = BaseSgmlLinkExtractor._process_links(self, links)
        return links
Developer: shahin · Project: hippolyte · Lines: 30 · Source file: linkextractors.py
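The ASIN handling in the first example above can be demonstrated in isolation. The sketch below is illustrative and not part of the original project: it replicates the `split('/')[5]` extraction and the product-reviews URL rewriting on a made-up Amazon product URL, so you can see what the spider's rewritten links look like.

```python
# Illustrative sketch (not from the original project): reproduces the
# ASIN extraction and review-URL rewriting logic shown above.

def asin_from_url(url):
    # The example takes the sixth path segment: for URLs shaped like
    # http://www.amazon.com/<title>/dp/<ASIN>/... this is the ASIN.
    return url.split('/')[5]

def review_url(asin):
    # Rebuild the product-reviews URL exactly as the example does.
    return ("http://www.amazon.com/product-reviews/" + asin +
            "/ref%3Ddp_top_cm_cr_acr_txt?ie=UTF8&showViewpoints=0")

# Made-up example URL for demonstration only.
url = "http://www.amazon.com/Example-Widget/dp/B000TEST99/ref=sr_1_1"
asin = asin_from_url(url)
print(asin)               # B000TEST99
print(review_url(asin))
```

Note that this positional split is brittle: it only works for URLs whose ASIN sits in that exact path position, which is why the original code also filters links through `_is_valid_url` first.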

Example 2: _process_links

# Required import: from scrapy.contrib.linkextractors.sgml import BaseSgmlLinkExtractor [as alias]
# Or: from scrapy.contrib.linkextractors.sgml.BaseSgmlLinkExtractor import _process_links [as alias]
    def _process_links(self, links):
        links = [link for link in links if not self.check_url or _is_valid_url(link.url)]

        if self.allow_res:
            links = [link for link in links if _matches(link.url, self.allow_res)]
        if self.deny_res:
            links = [link for link in links if not _matches(link.url, self.deny_res)]
        if self.allow_domains:
            links = [link for link in links if url_is_from_any_domain(link.url, self.allow_domains)]
        if self.deny_domains:
            links = [link for link in links if not url_is_from_any_domain(link.url, self.deny_domains)]

        if self.canonicalize:
            for link in links:
                link.url = canonicalize_url(link.url)

        links = BaseSgmlLinkExtractor._process_links(self, links)
        return links
Developer: qpwang · Project: CareerTalkCrawler · Lines: 20 · Source file: linkextractor.py
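Both examples share the same allow/deny filtering chain. The sketch below is a standalone illustration, not code from either project: it mimics that chain with plain Python, using stand-in helpers where the originals (`_matches`, `url_is_from_any_domain`) come from Scrapy's internals.

```python
import re

# Illustrative sketch of the allow/deny regex and domain filtering chain.
# The helper implementations are simplified stand-ins, not Scrapy's own.

def matches(url, regexes):
    # Stand-in for Scrapy's _matches: any regex hit keeps the URL.
    return any(r.search(url) for r in regexes)

def url_is_from_any_domain(url, domains):
    # Simplified domain check: exact host match or subdomain match.
    host = url.split('/')[2]
    return any(host == d or host.endswith('.' + d) for d in domains)

def filter_links(urls, allow_res=(), deny_res=(),
                 allow_domains=(), deny_domains=()):
    # Apply the four filters in the same order as the examples above.
    if allow_res:
        urls = [u for u in urls if matches(u, allow_res)]
    if deny_res:
        urls = [u for u in urls if not matches(u, deny_res)]
    if allow_domains:
        urls = [u for u in urls if url_is_from_any_domain(u, allow_domains)]
    if deny_domains:
        urls = [u for u in urls if not url_is_from_any_domain(u, deny_domains)]
    return urls

urls = ["http://example.com/jobs/1", "http://example.com/about",
        "http://ads.example.org/jobs/2"]
print(filter_links(urls, allow_res=[re.compile(r'/jobs/')],
                   deny_domains=['example.org']))
# → ['http://example.com/jobs/1']
```

Each filter only runs when the corresponding attribute is non-empty, so an extractor constructed with no patterns passes every link through unchanged.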


Note: the scrapy.contrib.linkextractors.sgml.BaseSgmlLinkExtractor._process_links examples in this article were collected from open-source projects hosted on platforms such as GitHub and MSDocs. Copyright in the code snippets remains with their original authors; please consult each project's license before using or redistributing the code. Do not republish without permission.