How to limit number of followed pages per site in Python Scrapy
I am trying to build a spider that efficiently scrapes text information from many websites. Since I am a Python user I was referred to Scrapy. However, in order to avoid scraping huge websites, I want to limit the spider to scrape no more than 20 pages of a certain "depth" per website. Here is my spider:
from urlparse import urlparse

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class DownloadSpider(CrawlSpider):
    name = 'downloader'
    download_path = '/home/myprojects/crawler'
    rules = (Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),)

    def __init__(self, *args, **kwargs):
        super(DownloadSpider, self).__init__(*args, **kwargs)
        self.urls_file_path = [kwargs.get('urls_file')]
        data = open(self.urls_file_path[0], 'r').readlines()
        self.allowed_domains = [urlparse(i).hostname.strip() for i in data]
        self.start_urls = ['http://' + domain for domain in self.allowed_domains]

    def parse_start_url(self, response):
        return self.parse_item(response)

    def parse_item(self, response):
        # append the scraped url to a per-site text file
        self.fname = self.download_path + urlparse(response.url).hostname.strip()
        open(str(self.fname) + '.txt', 'a').write(response.url)
        open(str(self.fname) + '.txt', 'a').write('\n')
Here urls_file is a path to a text file with urls. I have also set the max depth in the settings file. Here is my problem: if I set the CLOSESPIDER_PAGECOUNT exception, it closes the spider when the total number of scraped pages (regardless of site) reaches the exception value. However, I need to stop scraping when I have scraped, say, 20 pages from each url. I also tried keeping count with a variable like self.parsed_number += 1, but this didn't work either -- it seems scrapy doesn't go url by url but mixes them up. Any advice is appreciated!
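For reference, the relevant part of my settings.py looks roughly like this (the values are illustrative):

# settings.py (values are illustrative)
DEPTH_LIMIT = 20              # maximum crawl depth for followed links
CLOSESPIDER_PAGECOUNT = 20    # closes the whole spider after 20 pages in total, not per site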
I'd make a per-class variable, initialize it with stats = defaultdict(int), and increment self.stats[response.url] (or the key may be a tuple (website, depth) in your case) in parse_item. This is how I imagine it - it should work in theory. Let me know if you need an example.
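For illustration, a minimal sketch of what I mean could look like the following (the class name, the page_limit of 20 and the placeholder parse_item body are assumptions on my part; also note this only skips processing of extra pages, it does not stop Scrapy from requesting them):

from collections import defaultdict
from urlparse import urlparse

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class LimitedDownloadSpider(CrawlSpider):
    name = 'limited_downloader'
    page_limit = 20  # per-site budget from the question
    rules = (Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),)

    def __init__(self, *args, **kwargs):
        super(LimitedDownloadSpider, self).__init__(*args, **kwargs)
        # allowed_domains / start_urls would be set up here as in your original __init__
        self.stats = defaultdict(int)  # hostname -> number of pages scraped so far

    def parse_item(self, response):
        site = urlparse(response.url).hostname
        self.stats[site] += 1
        if self.stats[site] > self.page_limit:
            return  # over the per-site budget: drop this page (already-queued requests may still arrive)
        # ... handle the page as in your original parse_item, e.g. append response.url to a per-site file ...

Keying the counter by hostname rather than by the full url is what gives you a per-site count instead of a global one.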
FYI, you can extract the base url and calculate the depth with the help of urlparse.urlparse (see the docs).
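For example (assuming "depth" here is taken to mean the number of non-empty path segments in the url):

>>> from urlparse import urlparse
>>> parsed = urlparse('http://example.com/section/page.html')
>>> parsed.hostname
'example.com'
>>> len([segment for segment in parsed.path.split('/') if segment])
2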