This article collects typical usage examples of scrapy.spiders in Python. If you have been wondering what scrapy.spiders does, how to use it, or what using it looks like in practice, the curated code examples below may help. You can also explore further usage examples from the enclosing scrapy package.
The following shows 3 code examples of scrapy.spiders, sorted by popularity by default. You can upvote the examples you like or find useful; your feedback helps the system recommend better Python code examples.
Example 1: start_requests
# Required import: import scrapy [as alias]
# Or: from scrapy import spiders [as alias]
# This snippet also assumes: import pprint; import logging; logger = logging.getLogger(__name__)
def start_requests(self):
    """Generate the initial requests of ArchiveSpider.

    See 'http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider.start_requests'.

    The most important part of this function is setting the request meta
    'archive_meta' according to the site's 'archive_rules'. This meta is
    used to parse article URLs from the response and to generate the
    next request.
    """
    for page in self.page_templates:
        url = page.format(p_num=self.p_kw['start'])
        meta = dict(archive_meta=dict(
            last_urls=dict(),
            p_num=self.p_kw['start'],
            next_tries=0,
            max_next_tries=self.p_kw['max_next_tries'],
            page=page))
        logger.debug('Page format meta info:\n%s', pprint.pformat(meta))
        yield scrapy.Request(url, callback=self.parse, meta=meta)
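For context, here is a minimal sketch of a class this method could live in, assuming the start_requests above is pasted into it; the class name, URL template, and p_kw values are hypothetical placeholders, not part of the original source.

import logging
import pprint

import scrapy
from scrapy.spiders import Spider

logger = logging.getLogger(__name__)

class ArchiveSpider(Spider):
    name = 'archive'
    # Hypothetical values: one listing-page URL template with {p_num}
    # as the page-number placeholder consumed by page.format(p_num=...).
    page_templates = ['http://example.com/archive?page={p_num}']
    p_kw = {'start': 1, 'max_next_tries': 3}

    # ... start_requests from Example 1 goes here ...

    def parse(self, response):
        # 'archive_meta', set in start_requests, travels with the request
        # and is read back from response.meta in this callback.
        meta = response.meta['archive_meta']
        logger.debug('Parsing page %s of template %s',
                     meta['p_num'], meta['page'])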
Example 2: is_this_request_from_same_traversal
# Required import: import scrapy [as alias]
# Or: from scrapy import spiders [as alias]
def is_this_request_from_same_traversal(response, traversal):
    """
    Check whether the current request came from the given traversal, so
    that a max-pages condition can be applied per traversal; across
    different traversals of different spiders, a max_page limit doesn't
    make sense.
    """
    traversal_id = traversal['traversal_id']
    current_request_traversal_id = response.meta.get('current_request_traversal_id')
    return current_request_traversal_id == traversal_id
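This check only works if the requesting side stamps each request with its traversal's id in the request meta. A minimal sketch of that side, with a hypothetical helper name, might look like this:

import scrapy

def make_traversal_request(url, traversal, callback):
    # Hypothetical helper: tag the request with the traversal's id so
    # is_this_request_from_same_traversal() can recognize it later.
    return scrapy.Request(
        url,
        callback=callback,
        meta={'current_request_traversal_id': traversal['traversal_id']},
    )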
Example 3: __init__
# Required import: import scrapy [as alias]
# Or: from scrapy import spiders [as alias]
def __init__(self, domains, urls, *args, **kwargs):
    """Constructor of FeedSpider.

    Parameters
    ----------
    domains : list
        A list of domains for the site.
    urls : list
        A list of feed URLs of the site.
    provider : string
        The provider of the RSS feed.
    url_regex : string
        URL pattern regular expression.

    If you use this spider to store items into a database, additional
    keyword arguments are required:

    platform_id : int
        The id of a platform instance.
    session : object
        An instance of an SQLAlchemy session.

    Other keyword arguments specify how to parse the XML; see
    http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.XMLFeedSpider.
    """
    self.platform_id = kwargs.pop('platform_id', None)
    self.session = kwargs.pop('session', None)
    self.url_regex = kwargs.pop('url_regex', None)
    self.provider = kwargs.pop('provider', 'self')
    self.iterator = kwargs.pop('iterator', 'iternodes')
    self.itertag = kwargs.pop('itertag', 'item')
    self.allowed_domains = domains
    self.start_urls = urls
    super(FeedSpider, self).__init__(*args, **kwargs)
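A hedged usage sketch follows; the domain, feed URL, regex, and spider name below are illustrative placeholders, and FeedSpider is assumed to subclass scrapy.spiders.XMLFeedSpider, as the docstring suggests. The database-related keywords (platform_id, session) are omitted here since no storage is involved.

spider = FeedSpider(
    domains=['example.com'],
    urls=['http://example.com/feed.xml'],
    name='example_feed',  # passed through to the XMLFeedSpider constructor
    provider='example',
    url_regex=r'https?://example\.com/article/\d+',
    iterator='iternodes',
    itertag='item',
)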