How to limit number of followed pages per site in Python Scrapy
I am trying to build a spider that efficiently scrapes text information from many websites. Since I am a Python user I was referred to Scrapy. However, in order to avoid scraping huge websites, I want to limit the spider to scrape no more than 20 pages of a certain "depth" per website. Here is my spider:
from urlparse import urlparse

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class DownloadSpider(CrawlSpider):
    name = 'downloader'
    download_path = '/home/myprojects/crawler'
    rules = (Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),)

    def __init__(self, *args, **kwargs):
        super(DownloadSpider, self).__init__(*args, **kwargs)
        self.urls_file_path = [kwargs.get('urls_file')]
        data = open(self.urls_file_path[0], 'r').readlines()
        self.allowed_domains = [urlparse(i).hostname.strip() for i in data]
        self.start_urls = ['http://' + domain for domain in self.allowed_domains]

    def parse_start_url(self, response):
        return self.parse_item(response)

    def parse_item(self, response):
        # append the scraped url to a per-site text file
        self.fname = self.download_path + urlparse(response.url).hostname.strip()
        open(str(self.fname) + '.txt', 'a').write(response.url)
        open(str(self.fname) + '.txt', 'a').write('\n')
Here urls_file is a path to a text file with urls. I have also set the max depth in the settings file. Here is my problem: if I set the CLOSESPIDER_PAGECOUNT exception, it closes the spider when the total number of scraped pages (regardless of site) reaches the exception value. However, I need to stop scraping when I have scraped, say, 20 pages from each url. I also tried keeping count with a variable like self.parsed_number += 1, but this didn't work either -- it seems scrapy doesn't go url by url but mixes them up. Any advice is appreciated!
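For reference, the relevant part of my settings.py looks roughly like this (the values are illustrative):

# settings.py (values are illustrative)
DEPTH_LIMIT = 20              # maximum crawl depth for followed links
CLOSESPIDER_PAGECOUNT = 20    # closes the whole spider after 20 pages in total, not per site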
I'd make a per-class variable, initialize it with stats = defaultdict(int), and increment self.stats[response.url] (or the key may be a tuple (website, depth) in your case) in parse_item. This is how I imagine it - it should work in theory. Let me know if you need an example.
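For illustration, a minimal sketch of what I mean could look like the following (the class name, the page_limit of 20 and the placeholder parse_item body are assumptions on my part; also note this only skips processing of extra pages, it does not stop Scrapy from requesting them):

from collections import defaultdict
from urlparse import urlparse

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class LimitedDownloadSpider(CrawlSpider):
    name = 'limited_downloader'
    page_limit = 20  # per-site budget from the question
    rules = (Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),)

    def __init__(self, *args, **kwargs):
        super(LimitedDownloadSpider, self).__init__(*args, **kwargs)
        # allowed_domains / start_urls would be set up here as in your original __init__
        self.stats = defaultdict(int)  # hostname -> number of pages scraped so far

    def parse_item(self, response):
        site = urlparse(response.url).hostname
        self.stats[site] += 1
        if self.stats[site] > self.page_limit:
            return  # over the per-site budget: drop this page (already-queued requests may still arrive)
        # ... handle the page as in your original parse_item, e.g. append response.url to a per-site file ...

Keying the counter by hostname rather than by the full url is what gives you a per-site count instead of a global one.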
FYI, you can extract the base url and calculate the depth with the help of urlparse.urlparse (see the docs).
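For example (assuming "depth" here is taken to mean the number of non-empty path segments in the url):

>>> from urlparse import urlparse
>>> parsed = urlparse('http://example.com/section/page.html')
>>> parsed.hostname
'example.com'
>>> len([segment for segment in parsed.path.split('/') if segment])
2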