web scraping - Scrape website with Ruby based on embedded CSS styles

May 15, 2012

In the past, I have used Nokogiri to scrape websites with a simple Ruby script. For a current project, I need to scrape a website that uses only inline CSS. As you can imagine, it is an old website.

What possibilities do I have to target specific elements on the page based on the inline CSS of those elements? It seems this is not possible with Nokogiri, or have I overlooked something?

Update: An example can be found here. I need the main content without the footnotes. The latter have a smaller font size and are grouped below each section.

I'm going to teach you how to fish. Instead of trying to find what you want, it's a lot easier to find what you don't want and remove it.

Start with this code:

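require 'nokogiri'
require 'open-uri'

url = 'http://www.eximsystems.com/laverdad/antiguo/gn/genesis.htm'

# CSS selectors matching the footnotes, joined into one selector list.
footnote_accessors = [
  'span[style*="font-size: 8.0pt"]',
  'span[style*="font-size:8.0pt"]',
  'span[style*="font-size: 7.5pt"]',
  'span[style*="font-size:7.5pt"]',
  'font[size="1"]'
].join(',')

# URI.open comes from open-uri (Ruby 2.5+); older Rubies used open(url).
doc = Nokogiri::HTML(URI.open(url))

# Remove every node that matches one of the footnote selectors.
doc.search(footnote_accessors).each do |footnote|
  footnote.remove
end

# Save the cleaned page under the remote file's name (genesis.htm).
File.write(File.basename(URI.parse(url).path), doc.to_html)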

Run it, then open the resulting HTML file in your browser. Scroll through the file looking for footnotes you want to remove. Select part of their text, then use "Inspect Element", or whatever tool you have, to find the selected text in the source of the page. Find something unique in that markup that makes it possible to isolate it from the text you want to keep. For instance, I locate the footnotes using the font sizes in <span> and <font> tags.
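If inspecting elements one by one gets tedious, a quick tally of the font sizes used in inline styles can hint at which values belong to footnotes. This is only a rough sketch: it scans the raw HTML with a regular expression instead of parsing it, the helper name font_size_counts and the sample markup are made up for illustration, and Hash#tally needs Ruby 2.7 or newer.

```ruby
# Count how often each inline font-size value occurs in a chunk of HTML.
# A regex scan is a rough survey, not a substitute for a real parser.
def font_size_counts(html)
  html.scan(/font-size:\s*([\d.]+\s*(?:pt|px|em|%))/i)
      .flatten
      .map { |size| size.gsub(/\s+/, '').downcase }
      .tally
end

# Hypothetical sample standing in for the downloaded page.
sample = <<~HTML
  <span style="font-size: 12.0pt">Main text</span>
  <span style="font-size:8.0pt">First footnote</span>
  <span style="font-size: 8.0pt">Second footnote</span>
HTML

# List the sizes, most frequent first.
font_size_counts(sample).sort_by { |_size, count| -count }.each do |size, count|
  puts "#{size}: #{count}"
end
```

Sizes that are clearly smaller than the body text are good candidates for the accessor list.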

Keep adding accessors to the footnote_accessors array until you have all the undesirable elements removed.
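As the list grows, typing both the "font-size: 8.0pt" and "font-size:8.0pt" spellings by hand gets repetitive. One possible way to generate them is sketched here; the helper footnote_selectors and its tag: keyword are hypothetical names, not part of Nokogiri.

```ruby
# Build CSS attribute selectors for each font size, covering the spelling
# with and without a space after the colon.
def footnote_selectors(sizes, tag: 'span')
  sizes.flat_map do |size|
    ["font-size: #{size}", "font-size:#{size}"].map do |declaration|
      %(#{tag}[style*="#{declaration}"])
    end
  end
end

# Reproduces the accessor list used in the answer's code.
footnote_accessors = footnote_selectors(%w[8.0pt 7.5pt]) + ['font[size="1"]']
puts footnote_accessors
```

Adding another footnote size then means adding one string to the size list instead of two selectors.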

This code isn't complete, nor is it written as tightly as I'd do it for this sort of task, but it will give you an idea of how to go about this particular task.

This version is a bit more flexible:

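require 'nokogiri'
require 'open-uri'

url = 'http://www.eximsystems.com/laverdad/antiguo/gn/genesis.htm'

# Each entry is searched on its own, so it can be CSS or XPath.
footnote_accessors = [
  'span[style*="font-size: 8.0pt"]',
  'span[style*="font-size:8.0pt"]',
  'span[style*="font-size: 7.5pt"]',
  'span[style*="font-size:7.5pt"]',
  'font[size="1"]',
]

# URI.open comes from open-uri (Ruby 2.5+); older Rubies used open(url).
doc = Nokogiri::HTML(URI.open(url))

# Run one search per accessor and remove every node it matches.
footnote_accessors.each do |accessor|
  doc.search(accessor).each do |footnote|
    footnote.remove
  end
end

# Save the cleaned page under the remote file's name (genesis.htm).
File.write(File.basename(URI.parse(url).path), doc.to_html)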

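The major difference is that the previous version assumed all entries in footnote_accessors were CSS. With this change, XPath can be used as well. The code will take a little longer to run because the entries are iterated over one at a time, but the ability to dig in with XPath might make it worthwhile for you.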
