Python BaseSgmlLinkExtractor.extract_links Method Code Examples

This article collects typical usage examples of the Python method scrapy.contrib.linkextractors.sgml.BaseSgmlLinkExtractor.extract_links. If you are unsure how to call BaseSgmlLinkExtractor.extract_links or want to see it used in practice, the selected examples below may help. They also illustrate the containing class, scrapy.contrib.linkextractors.sgml.BaseSgmlLinkExtractor.


Six code examples of BaseSgmlLinkExtractor.extract_links are shown below, sorted by popularity by default.

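Example 1: test_extraction_encoding

# Required imports (Python 2-era Scrapy):
#   from scrapy.contrib.linkextractors.sgml import BaseSgmlLinkExtractor
#   from scrapy.http import HtmlResponse
#   from scrapy.link import Link
# extract_links is a method; call it on a BaseSgmlLinkExtractor instance.
# get_testdata is a helper from Scrapy's own test suite.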
    def test_extraction_encoding(self):
        body = get_testdata('link_extractor', 'linkextractor_noenc.html')
        response_utf8 = HtmlResponse(
            url='http://example.com/utf8', body=body, headers={'Content-Type': ['text/html; charset=utf-8']})
        response_noenc = HtmlResponse(
            url='http://example.com/noenc', body=body)
        body = get_testdata('link_extractor', 'linkextractor_latin1.html')
        response_latin1 = HtmlResponse(
            url='http://example.com/latin1', body=body)

        lx = BaseSgmlLinkExtractor()
        self.assertEqual(lx.extract_links(response_utf8), [
            Link(url='http://example.com/sample_%C3%B1.html', text=''),
            Link(url='http://example.com/sample_%E2%82%AC.html',
                 text='sample \xe2\x82\xac text'.decode('utf-8')),
        ])

        self.assertEqual(lx.extract_links(response_noenc), [
            Link(url='http://example.com/sample_%C3%B1.html', text=''),
            Link(url='http://example.com/sample_%E2%82%AC.html',
                 text='sample \xe2\x82\xac text'.decode('utf-8')),
        ])

        self.assertEqual(lx.extract_links(response_latin1), [
            Link(url='http://example.com/sample_%F1.html', text=''),
            Link(url='http://example.com/sample_%E1.html',
                 text='sample \xe1 text'.decode('latin1')),
        ])
Developer ID: pyarnold, project: scrapy, lines of code: 30, source: test_contrib_linkextractors.py
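
The expected URLs in this test differ because the same character maps to different bytes under each page encoding, and those bytes are what get percent-escaped. The following is a minimal standard-library sketch in Python 3, not Scrapy's own implementation; the `percent_encode` helper name is hypothetical.

```python
from urllib.parse import quote


def percent_encode(text, encoding):
    """Hypothetical helper: encode text to bytes, then percent-escape them."""
    return quote(text.encode(encoding))


# The same character yields different escapes depending on the page encoding.
utf8_n = percent_encode("\u00f1", "utf-8")      # n with tilde as UTF-8 bytes
latin1_n = percent_encode("\u00f1", "latin1")   # n with tilde as one Latin-1 byte
utf8_euro = percent_encode("\u20ac", "utf-8")   # euro sign as UTF-8 bytes
latin1_a = percent_encode("\u00e1", "latin1")   # a with acute accent in Latin-1

print(utf8_n, latin1_n, utf8_euro, latin1_a)
```

This matches the pattern in the assertions above: `%C3%B1` and `%E2%82%AC` for the UTF-8 responses, `%F1` and `%E1` for the Latin-1 response.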

Example 2: test_base_url

# Required imports (Python 2-era Scrapy):
#   from scrapy.contrib.linkextractors.sgml import BaseSgmlLinkExtractor
#   from scrapy.http import HtmlResponse
#   from scrapy.link import Link
# extract_links is a method; call it on a BaseSgmlLinkExtractor instance.
    def test_base_url(self):
        html = """<html><head><title>Page title<title><base href="http://otherdomain.com/base/" />
        <body><p><a href="item/12.html">Item 12</a></p>
        </body></html>"""
        response = HtmlResponse(
            "http://example.org/somepage/index.html", body=html)

        lx = BaseSgmlLinkExtractor()  # default: tag=a, attr=href
        self.assertEqual(lx.extract_links(response),
                         [Link(url='http://otherdomain.com/base/item/12.html', text='Item 12')])

        # base url is an absolute path and relative to host
        html = """<html><head><title>Page title<title><base href="/" />
        <body><p><a href="item/12.html">Item 12</a></p></body></html>"""
        response = HtmlResponse(
            "https://example.org/somepage/index.html", body=html)
        self.assertEqual(lx.extract_links(response),
                         [Link(url='https://example.org/item/12.html', text='Item 12')])

        # base url has no scheme
        html = """<html><head><title>Page title<title><base href="//noschemedomain.com/path/to/" />
        <body><p><a href="item/12.html">Item 12</a></p></body></html>"""
        response = HtmlResponse(
            "https://example.org/somepage/index.html", body=html)
        self.assertEqual(lx.extract_links(response),
                         [Link(url='https://noschemedomain.com/path/to/item/12.html', text='Item 12')])
Developer ID: pyarnold, project: scrapy, lines of code: 28, source: test_contrib_linkextractors.py
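
The three cases in this test follow ordinary relative-URL resolution: the `<base href>` value is first resolved against the page URL, and each link is then resolved against the result. The sketch below reproduces that with the standard library's `urljoin` in Python 3. It is an illustration under that assumption, not the extractor's actual code, and `resolve_with_base` is a hypothetical helper name.

```python
from urllib.parse import urljoin


def resolve_with_base(page_url, base_href, link_href):
    """Hypothetical helper: apply <base href>, then resolve the link against it."""
    base_url = urljoin(page_url, base_href)
    return urljoin(base_url, link_href)


page = "https://example.org/somepage/index.html"

# Base is a full URL on another domain.
absolute_base = resolve_with_base(page, "http://otherdomain.com/base/", "item/12.html")
# Base is an absolute path, so it stays on the page's host.
root_base = resolve_with_base(page, "/", "item/12.html")
# Base has no scheme, so it inherits the page's scheme.
schemeless_base = resolve_with_base(page, "//noschemedomain.com/path/to/", "item/12.html")

print(absolute_base, root_base, schemeless_base)
```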

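Example 3: test_extraction_encoding

# Required imports (Python 2-era Scrapy):
#   from scrapy.contrib.linkextractors.sgml import BaseSgmlLinkExtractor
#   from scrapy.http import HtmlResponse
#   from scrapy.link import Link
# extract_links is a method; call it on a BaseSgmlLinkExtractor instance.
# get_testdata is a helper from Scrapy's own test suite.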
    def test_extraction_encoding(self):
        body = get_testdata("link_extractor", "linkextractor_noenc.html")
        response_utf8 = HtmlResponse(
            url="http://example.com/utf8", body=body, headers={"Content-Type": ["text/html; charset=utf-8"]}
        )
        response_noenc = HtmlResponse(url="http://example.com/noenc", body=body)
        body = get_testdata("link_extractor", "linkextractor_latin1.html")
        response_latin1 = HtmlResponse(url="http://example.com/latin1", body=body)

        lx = BaseSgmlLinkExtractor()
        self.assertEqual(
            lx.extract_links(response_utf8),
            [
                Link(url="http://example.com/sample_%C3%B1.html", text=""),
                Link(url="http://example.com/sample_%E2%82%AC.html", text="sample \xe2\x82\xac text".decode("utf-8")),
            ],
        )

        self.assertEqual(
            lx.extract_links(response_noenc),
            [
                Link(url="http://example.com/sample_%C3%B1.html", text=""),
                Link(url="http://example.com/sample_%E2%82%AC.html", text="sample \xe2\x82\xac text".decode("utf-8")),
            ],
        )

        self.assertEqual(
            lx.extract_links(response_latin1),
            [
                Link(url="http://example.com/sample_%F1.html", text=""),
                Link(url="http://example.com/sample_%E1.html", text="sample \xe1 text".decode("latin1")),
            ],
        )
Developer ID: serkanh, project: scrapy, lines of code: 35, source: test_contrib_linkextractors.py

Example 4: test_link_text_wrong_encoding

# Required imports (Python 2-era Scrapy):
#   from scrapy.contrib.linkextractors.sgml import BaseSgmlLinkExtractor
#   from scrapy.http import HtmlResponse
#   from scrapy.link import Link
# extract_links is a method; call it on a BaseSgmlLinkExtractor instance.
    def test_link_text_wrong_encoding(self):
        html = """<body><p><a href="item/12.html">Wrong: \xed</a></p></body></html>"""
        response = HtmlResponse("http://www.example.com", body=html, encoding='utf-8')
        lx = BaseSgmlLinkExtractor()
        self.assertEqual(lx.extract_links(response), [
            Link(url='http://www.example.com/item/12.html', text=u'Wrong: \ufffd'),
        ])
Developer ID: 505555998, project: scrapy, lines of code: 9, source: test_contrib_linkextractors.py
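
The `\ufffd` in the expected link text is the Unicode replacement character: the byte `0xED` is not a complete UTF-8 sequence, so it cannot be decoded under the declared encoding. A short Python 3 illustration of that behavior follows. Whether the extractor uses exactly this call internally is not shown in the source, so treat it as a sketch.

```python
raw = b"Wrong: \xed"  # 0xED alone is not a complete UTF-8 sequence

# errors="replace" substitutes U+FFFD for bytes that cannot be decoded.
decoded = raw.decode("utf-8", errors="replace")

# The same byte is a valid character in Latin-1 (i with acute accent).
as_latin1 = raw.decode("latin1")

print(repr(decoded), repr(as_latin1))
```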

Example 5: test_base_url

# Required imports (Python 2-era Scrapy):
#   from scrapy.contrib.linkextractors.sgml import BaseSgmlLinkExtractor
#   from scrapy.http import HtmlResponse
#   from scrapy.link import Link
# extract_links is a method; call it on a BaseSgmlLinkExtractor instance.
    def test_base_url(self):
        html = """<html><head><title>Page title<title><base href="http://otherdomain.com/base/" />
        <body><p><a href="item/12.html">Item 12</a></p>
        </body></html>"""
        response = HtmlResponse("http://example.org/somepage/index.html", body=html)

        lx = BaseSgmlLinkExtractor()  # default: tag=a, attr=href
        self.assertEqual(
            lx.extract_links(response), [Link(url="http://otherdomain.com/base/item/12.html", text="Item 12")]
        )
Developer ID: serkanh, project: scrapy, lines of code: 12, source: test_contrib_linkextractors.py

Example 6: test_basic

# Required imports (Python 2-era Scrapy):
#   from scrapy.contrib.linkextractors.sgml import BaseSgmlLinkExtractor
#   from scrapy.http import HtmlResponse
#   from scrapy.link import Link
# extract_links is a method; call it on a BaseSgmlLinkExtractor instance.
    def test_basic(self):
        html = """<html><head><title>Page title<title>
        <body><p><a href="item/12.html">Item 12</a></p>
        <p><a href="/about.html">About us</a></p>
        <img src="/logo.png" alt="Company logo (not a link)" />
        <p><a href="../othercat.html">Other category</a></p>
        <p><a href="/" /></p>
        </body></html>"""
        response = HtmlResponse("http://example.org/somepage/index.html", body=html)

        lx = BaseSgmlLinkExtractor()  # default: tag=a, attr=href
        self.assertEqual(lx.extract_links(response),
                         [Link(url='http://example.org/somepage/item/12.html', text='Item 12'), 
                          Link(url='http://example.org/about.html', text='About us'),
                          Link(url='http://example.org/othercat.html', text='Other category'), 
                          Link(url='http://example.org/', text='')])
Developer ID: kenzouyeh, project: scrapy, lines of code: 18, source: test_contrib_linkextractors.py
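
To make the default behavior in this test concrete (`tag=a`, `attr=href`, with `<img>` ignored), here is a rough Python 3 sketch built on the standard library's `html.parser`. It is not how BaseSgmlLinkExtractor is implemented; the `AnchorCollector` class and `extract_anchor_links` function are hypothetical names, and the sample HTML is a simplified version of the test page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Hypothetical helper: collects [absolute_url, text] for each <a href>."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.links = []          # list of [url, text] pairs
        self._in_anchor = False  # True while inside an <a href=...> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                # Resolve the href against the page URL to get an absolute URL.
                self.links.append([urljoin(self.page_url, href), ""])
                self._in_anchor = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        # Accumulate text only while an anchor is open.
        if self._in_anchor:
            self.links[-1][1] += data


def extract_anchor_links(html, page_url):
    """Return a list of (absolute_url, link_text) tuples for <a href> tags."""
    parser = AnchorCollector(page_url)
    parser.feed(html)
    parser.close()
    return [(url, text) for url, text in parser.links]


sample_html = (
    '<p><a href="item/12.html">Item 12</a></p>'
    '<p><a href="/about.html">About us</a></p>'
    '<img src="/logo.png" alt="Company logo (not a link)">'
    '<p><a href="../othercat.html">Other category</a></p>'
)
links = extract_anchor_links(sample_html, "http://example.org/somepage/index.html")
for url, text in links:
    print(url, text)
```

The sketch skips `<base href>` handling and encoding detection, both of which the tests above exercise on the real extractor.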


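Note: The scrapy.contrib.linkextractors.sgml.BaseSgmlLinkExtractor.extract_links examples in this article were compiled by 纯净天空 from open-source code and documentation platforms such as GitHub and MSDocs. The snippets were selected from open-source projects, and copyright in the source code remains with the original authors. For distribution and use, refer to the License of the corresponding project. Do not reproduce without permission.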