How to extract the following HTML snippet with Python
I have the following Python code:
    import re

    def getaddress(text):
        # Strip tabs and newlines so each result box sits on one line
        text = re.sub('\t', '', text)
        text = re.sub('\n', '', text)
        blocks = re.findall(r'<div class="result-box" itemscope itemtype="http://schema.org/localbusiness">([a-zA-Z0-9 ",;:\.#&_=()\'<>/\\\t\n\-]*)</span>follow company</span>', text)
        name = ''
        strasse = ''
        locality = ''
        plz = ''
        region = ''
        i = 0
        for block in blocks:
            names = re.findall(r'class="url">(.*)</a>', block)
            strassen = re.findall(r'<span itemprop="streetaddress">([a-zA-Z0-9 ,;:\.&#]*)</span>', block)
            localities = re.findall(r'<span itemprop="addresslocality">([a-zA-Z0-9 ,;:&]*)</span>', block)
            plzs = re.findall(r'<span itemprop="postalcode">([0-9]*)</span>', block)
            regions = re.findall(r'<span itemprop="addressregion">([a-zA-Z]*)</span>', block)
            try:
                # Take only the first hit for each field and strip leftover tags
                for name in names:
                    name = str(name)
                    name = re.sub('<[^<]+?>', '', name)
                    break
                for strasse in strassen:
                    strasse = str(strasse)
                    strasse = re.sub('<[^<]+?>', '', strasse)
                    break
                for locality in localities:
                    locality = str(locality)
                    locality = re.sub('<[^<]+?>', '', locality)
                    break
                for plz in plzs:
                    plz = str(plz)
                    plz = re.sub('<[^<]+?>', '', plz)
                    break
                for region in regions:
                    region = str(region)
                    region = re.sub('<[^<]+?>', '', region)
                    break
            except:
                continue
            print(i)
            i = i + 1
            if plz == '':
                plz = getzipcode(strasse, locality, region)
            address = '"' + name + '"' + ';' + '"' + strasse + '";' + locality + ';' + str(plz) + ';' + region + '\n'
            #savetocsv(address)
I want to filter out the HTML snippet below. The snippet is repeated several times. I want the function to return one entry for each snippet; instead it returns one entry for both snippets. What do I have to change?
    <div class="result-box" itemscope itemtype="http://schema.org/localbusiness">
      <div class="clear">
        <h2 itemprop="name"><a href="http://www.manta.com/c/mxlk5yt/belgium-jewelers-corp" class="url">belgium jewelers corp</a></h2>
      </div>
      <div itemprop="address" itemscope itemtype="http://schema.org/postaladdress">
        <span itemprop="addresslocality">lawrenceville</span>
        <span itemprop="addressregion">nj</span>
      </div>
      <a href="#" class="followcompany" data-emid="mxlk5yt" data-companyname="belgium jewelers corp" data-location="listingfollowbutton" data-location-page="megabrowse">
        <span class="followmsg"><span class="followicon mrs"></span>follow company</span>
        <span class="followingmsg"><span class="followicon mrs"></span>following</span>
        <span class="unfollowmsg"><span class="followicon mrs"></span>unfollow company</span>
      </a>
      <p class="type">jewelry stores</p>
    </div>
    </li>
    <li>
      <div class="icons">
        <ul>
        </ul>
      </div>
Please put down the hammer; HTML is not a regular-expression shaped nail. Regular expressions that parse HTML get complicated fast, and they are fragile: they break easily when the HTML changes subtly.
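To see the fragility concretely, here is a minimal sketch of what is probably going wrong. The two-box sample and the variable names are made up for illustration and much shorter than the real page, but the pattern has the same shape as the one in the question: a character class that also accepts `<`, `>` and `/`, followed by a greedy `*`.

```python
import re

# Hypothetical, shortened sample: two result boxes back to back.
html = (
    '<div class="result-box"><a class="url">first corp</a>'
    '<span></span>follow company</span></div>'
    '<div class="result-box"><a class="url">second corp</a>'
    '<span></span>follow company</span></div>'
)

# Greedy: the class matches tag characters too, so '*' runs on until the
# LAST 'follow company</span>' in the text and swallows both boxes.
greedy = re.findall(
    r'<div class="result-box">([a-z ="<>/-]*)</span>follow company</span>', html)

# Non-greedy: '*?' stops at the FIRST terminator after each opening tag.
lazy = re.findall(
    r'<div class="result-box">([a-z ="<>/-]*?)</span>follow company</span>', html)

print(len(greedy))  # 1
print(len(lazy))    # 2
```

Switching to `*?` treats this one symptom, but the pattern still depends on the exact attribute order, spacing and button text of the page, so it stays brittle.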
Use a proper HTML parser instead. BeautifulSoup makes this task trivial:
    from bs4 import BeautifulSoup

    # Naming the parser explicitly avoids bs4's "no parser specified" warning
    soup = BeautifulSoup(text, 'html.parser')
    for block in soup.find_all('div', class_="result-box",
                               itemtype="http://schema.org/localbusiness"):
        print(block.find('a', class_='url').string)
        street = block.find('span', itemprop="streetaddress")
        if street:
            print(street.string)
        locality = block.find('span', itemprop="addresslocality")
        if locality:
            print(locality.string)
        # .. etc. ..
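If you want one entry per snippet rather than printed output, the same idea can be wrapped in a function. This is only a sketch: `extract_addresses`, the dictionary keys and the two-box `sample` are names and data I made up, and the `itemprop` values are spelled as they appear in the snippet above (attribute values are matched case-sensitively, so adjust them to the real page).

```python
from bs4 import BeautifulSoup


def extract_addresses(html):
    """Return one dict per result-box div (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for block in soup.find_all("div", class_="result-box"):

        def text_of(itemprop):
            # A missing field becomes '' instead of raising AttributeError.
            tag = block.find("span", itemprop=itemprop)
            return tag.get_text(strip=True) if tag else ""

        link = block.find("a", class_="url")
        records.append({
            "name": link.get_text(strip=True) if link else "",
            "street": text_of("streetaddress"),
            "locality": text_of("addresslocality"),
            "postalcode": text_of("postalcode"),
            "region": text_of("addressregion"),
        })
    return records


# Made-up sample with two result boxes; the second one has no address.
sample = """
<div class="result-box" itemscope itemtype="http://schema.org/localbusiness">
  <h2 itemprop="name"><a href="#" class="url">belgium jewelers corp</a></h2>
  <div itemprop="address">
    <span itemprop="addresslocality">lawrenceville</span>
    <span itemprop="addressregion">nj</span>
  </div>
</div>
<div class="result-box" itemscope itemtype="http://schema.org/localbusiness">
  <h2 itemprop="name"><a href="#" class="url">second example corp</a></h2>
</div>
"""

records = extract_addresses(sample)
for record in records:
    print(record)
```

Each result box yields its own record, and fields that are absent (street and postal code in the snippet from the question) come back as empty strings, so a zip-code lookup like the one in the question can still be applied afterwards.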