Web scraping - Scrape website with Ruby based on embedded CSS styles
In the past, I have used Nokogiri to scrape websites using a simple Ruby script. For my current project, I need to scrape a website that uses inline CSS. As you can imagine, it is an old website.
What possibilities do I have to target specific elements on the page based on the inline CSS of those elements? It seems this is not possible with Nokogiri, or have I overlooked something?
Update: An example can be found here. I need the main content without the footnotes. The latter have a smaller font size and are grouped below each section.
I'm going to teach you how to fish. Instead of trying to find what you want, it's a lot easier to find what you don't want and remove it.
Start with this code:
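    require 'nokogiri'
    require 'open-uri'

    url = 'http://www.eximsystems.com/laverdad/antiguo/gn/genesis.htm'

    # CSS selectors for the footnotes, joined into one selector group.
    # Both "font-size: 8.0pt" and "font-size:8.0pt" spellings are listed
    # because the attribute match is a plain substring test.
    footnote_accessors = [
      'span[style*="font-size: 8.0pt"]',
      'span[style*="font-size:8.0pt"]',
      'span[style*="font-size: 7.5pt"]',
      'span[style*="font-size:7.5pt"]',
      'font[size="1"]'
    ].join(',')

    # URI.open comes from open-uri; a bare open(url) no longer fetches URLs on Ruby 3.
    doc = Nokogiri::HTML(URI.open(url))
    doc.search(footnote_accessors).each do |footnote|
      footnote.remove
    end

    # Save the cleaned page under the same file name as the remote document.
    File.write(File.basename(URI.parse(url).path), doc.to_html)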
Run it, then open the resulting HTML file in your browser. Scroll through the file looking for footnotes you want to remove. Select part of their text, then use "Inspect Element", or whatever tool you have, to find that selected text in the source of the page. Find something unique about that text that makes it possible to isolate it from the text you want to keep. For instance, I locate the footnotes using the font sizes in <span> and <font> tags.
Keep adding accessors to the footnote_accessors array until you have all the undesirable elements removed.
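If the list of font-size spellings keeps growing, one alternative is to parse the style attribute and compare the size numerically. The following is only a sketch in plain Ruby: parse_inline_style and small_font? are hypothetical helper names, and the 8.0pt threshold is an assumption taken from the selectors above, not something the page guarantees.

```ruby
# Hypothetical helper: turn an inline style string into a Hash that maps
# lowercase property names to their (stripped) values.
def parse_inline_style(style)
  style.to_s.split(';').each_with_object({}) do |declaration, props|
    name, value = declaration.split(':', 2)
    next if value.nil? # skip fragments that have no "name: value" shape

    props[name.strip.downcase] = value.strip
  end
end

# Hypothetical helper: true when the style declares a font size in points
# that is at or below max_pt (8.0 is an assumed footnote threshold).
def small_font?(style, max_pt = 8.0)
  size = parse_inline_style(style)['font-size']
  return false unless size

  match = size.match(/\A(\d+(?:\.\d+)?)\s*pt\z/i)
  !match.nil? && match[1].to_f <= max_pt
end

puts small_font?('font-size: 8.0pt')              # spacing after the colon
puts small_font?('FONT-SIZE:7.5pt; color: black') # no spacing, upper case
puts small_font?('font-size:12.0pt')              # body text size
```

With Nokogiri, a node's style string is available as node['style'], so a check like small_font?(node['style']) could be applied to every span that has a style attribute before deciding whether to remove it.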
This code isn't complete, nor is it written as tightly as I'd do it for this sort of task, but it will give you an idea of how to go about this particular task.
This version is a bit more flexible:
    require 'nokogiri'
    require 'open-uri'

    url = 'http://www.eximsystems.com/laverdad/antiguo/gn/genesis.htm'

    # Each entry is searched on its own, so it may be CSS or XPath.
    footnote_accessors = [
      'span[style*="font-size: 8.0pt"]',
      'span[style*="font-size:8.0pt"]',
      'span[style*="font-size: 7.5pt"]',
      'span[style*="font-size:7.5pt"]',
      'font[size="1"]'
    ]

    # URI.open comes from open-uri; a bare open(url) no longer fetches URLs on Ruby 3.
    doc = Nokogiri::HTML(URI.open(url))
    footnote_accessors.each do |accessor|
      doc.search(accessor).each do |footnote|
        footnote.remove
      end
    end

    # Save the cleaned page under the same file name as the remote document.
    File.write(File.basename(URI.parse(url).path), doc.to_html)
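The major difference is that the previous version assumed all entries in footnote_accessors were CSS. With this change, XPath can also be used. The code will take a little longer to run because the entries are iterated over one at a time, but the ability to dig in with XPath might make it worthwhile for you.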