website - Python - cannot access a specific div [Urllib, BeautifulSoup, maybe Mechanize?]

June 15, 2013

I have been beating my head against the wall for a couple of days now, so I thought I would ask the community. I want a Python script that, among other things, can hit 'accept' buttons on forms on websites in order to download files. To that end, though, I need to get access to the form.

This is an example of the kind of file I want to download. I know that within it, there is an unnamed form with an action to accept the terms and download the file. I also know that the div the form can be found in is the main-content div.

However, whenever I parse the webpage with BeautifulSoup, I cannot get the main-content div. The closest I've managed to get is the main_content link right before it, which does not provide me with any information through BeautifulSoup's object.

Here's a bit of code from my script:

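    # Excerpt from inside a function (hence the return statements).
    web_soup = soup(urllib2.urlopen(url))
    parsed = list(urlparse(url))
    ext = extr[1:]  # drop the leading dot from the extension

    # First loop: look for a direct download link whose text mentions the extension.
    for downloadable in web_soup.findAll("a"):
        encode = unicodedata.normalize('NFKD', downloadable.text).encode('UTF-8', 'ignore')
        if ext in str.lower(encode):
            if downloadable['href'] in url:
                return ("http://%s%s" % (parsed[1], downloadable['href']))

    # Second loop: try to find the main-content div.
    for div in web_soup.findAll("div"):
        if div.has_key('class'):
            print(div['class'])
            if div['class'] == "main-content":
                print("yep")
    return False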

url is the URL I am looking at (so the URL I posted earlier). extr is the type of file I am hoping to download, in the form .extension, but that is not really relevant to my question. The relevant code is the second for loop, the one where I am attempting to loop through the divs. The first loop goes through to grab download links in another case (when the URL the script is given is a 'download link' marked by a file extension such as .zip with a content type of text/html), so feel free to ignore it. I added it in just for context.
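For context, the idea behind that first loop can be sketched on its own, without any network access. This is only a rough illustration, assuming Python 3 and the standard library (the script above is Python 2); first_matching_link, the example.com URLs and the sample link list are made up for the example and are not part of the original script. It uses urljoin rather than formatting "http://host/path" by hand, so relative hrefs resolve as well.

```python
from urllib.parse import urljoin


def first_matching_link(page_url, links, extension):
    """Return the first link whose text mentions the extension, as an absolute URL.

    links is an iterable of (text, href) pairs; returns None if nothing matches.
    """
    wanted = extension.lstrip(".").lower()
    for text, href in links:
        if wanted in text.lower():
            # urljoin resolves both root-relative and page-relative hrefs
            return urljoin(page_url, href)
    return None


# Hypothetical page and links, for illustration only.
page_url = "http://www.example.com/reports/index.html"
links = [
    ("About this page", "/about.html"),
    ("Pricing file (ZIP)", "/downloads/pricing.zip"),
]
result = first_matching_link(page_url, links, ".zip")
missing = first_matching_link(page_url, links, ".pdf")
print(result)
```

A helper like this returns None instead of False when nothing matches, which is an assumption about what the caller wants rather than what the original function does.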

I hope I provided enough detail, though I am sure I did not. Let me know if you need any more information on what I am doing, and I will be happy to oblige. Thanks, Stack.

Here's the code for getting the main-content div and the form action:

    import re
    import urllib2
    from bs4 import BeautifulSoup as soup

    url = "http://www.cms.gov/apps/ama/license.asp?file=/mcrpartbdrugavgsalesprice/downloads/apr-13-asp-pricing-file.zip"
    web_soup = soup(urllib2.urlopen(url))

    # get the main-content div
    main_div = web_soup.find(name="div", attrs={'class': 'main-content'})
    print main_div

    # get the form whose action points at the .zip file
    form = web_soup.find(name="form", attrs={'action': re.compile(r'.*\.zip.*')})
    print form['action']

Though, if you need, I can provide examples using lxml, mechanize or selenium.
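If you want to check the same lookup without installing anything or hitting the site, here is a rough sketch that swaps BeautifulSoup for the standard library's html.parser. It assumes Python 3, and FormActionFinder, SAMPLE_HTML and the example.zip path are invented for the illustration; the real page's markup may well differ. It collects the action of any form nested inside a div whose class list contains main-content.

```python
import re
from html.parser import HTMLParser


class FormActionFinder(HTMLParser):
    """Collect matching form actions found inside a div with a given class."""

    def __init__(self, div_class, action_pattern):
        super().__init__()
        self.div_class = div_class
        self.action_pattern = re.compile(action_pattern)
        self.div_depth = 0  # div nesting depth inside the target div (0 = outside)
        self.actions = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            # class can hold several space-separated names
            classes = (attrs.get("class") or "").split()
            if self.div_depth or self.div_class in classes:
                self.div_depth += 1
        elif tag == "form" and self.div_depth:
            action = attrs.get("action") or ""
            if self.action_pattern.search(action):
                self.actions.append(action)

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1


# Made-up markup that only mimics the structure described in the question.
SAMPLE_HTML = """
<div class="main_content"><a href="#main">skip</a></div>
<div class="main-content wide">
  <div class="inner">
    <form action="/apps/ama/license.asp?file=/downloads/example.zip" method="post">
      <input type="submit" value="Accept">
    </form>
  </div>
</div>
<form action="/search"></form>
"""

finder = FormActionFinder("main-content", r"\.zip")
finder.feed(SAMPLE_HTML)
print(finder.actions)
```

Note that the class attribute is split on whitespace before matching, so a div with class="main-content wide" still counts, while the similarly named main_content div does not.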

Hope that helps.
