How to extract the following HTML snippet with Python


I have the following Python code:

    import re

    def getaddress(text):
        text = re.sub('\t', '', text)
        text = re.sub('\n', '', text)
        blocks = re.findall('<div class="result-box" itemscope itemtype="http://schema.org/localbusiness">([a-zA-Z0-9 ",;:\.#&_=()\'<>/\\\t\n\-]*)</span>follow company</span>', text)
        name = ''
        strasse = ''
        locality = ''
        plz = ''
        region = ''
        i = 0

        for block in blocks:
            names = re.findall('class="url">(.*)</a>', block)
            strassen = re.findall('<span itemprop="streetaddress">([a-zA-Z0-9 ,;:\.&#]*)</span>', block)
            localities = re.findall('<span itemprop="addresslocality">([a-zA-Z0-9 ,;:&]*)</span>', block)
            plzs = re.findall('<span itemprop="postalcode">([0-9]*)</span>', block)
            regions = re.findall('<span itemprop="addressregion">([a-zA-Z]*)</span>', block)

            try:
                for name in names:
                    name = str(name)
                    name = re.sub('<[^<]+?>', '', name)
                    break

                for strasse in strassen:
                    strasse = str(strasse)
                    strasse = re.sub('<[^<]+?>', '', strasse)
                    break

                for locality in localities:
                    locality = str(locality)
                    locality = re.sub('<[^<]+?>', '', locality)
                    break

                for plz in plzs:
                    plz = str(plz)
                    plz = re.sub('<[^<]+?>', '', plz)
                    break

                for region in regions:
                    region = str(region)
                    region = re.sub('<[^<]+?>', '', region)
                    break
            except:
                continue
            print()
            i = i + 1

            if plz == '':
                plz = getzipcode(strasse, locality, region)
            address = '"' + name + '"' + ';' + '"' + strasse + '";' + locality + ';' + str(plz) + ';' + region + '\n'

            #savetocsv(address)

I want to extract the data from the HTML snippet below. The snippet is repeated several times on the page. I want the function to return one entry for each snippet; instead, it returns a single entry for both snippets. What do I have to change?

    <div class="result-box" itemscope itemtype="http://schema.org/localbusiness">
        <div class="clear">
            <h2 itemprop="name"><a href="http://www.manta.com/c/mxlk5yt/belgium-jewelers-corp" class="url">belgium jewelers corp</a></h2>
        </div>
        <div itemprop="address" itemscope itemtype="http://schema.org/postaladdress">
            <span itemprop="addresslocality">lawrenceville</span> <span itemprop="addressregion">nj</span>
        </div>
        <a href="#" class="followcompany" data-emid="mxlk5yt" data-companyname="belgium jewelers corp" data-location="listingfollowbutton" data-location-page="megabrowse">
            <span class="followmsg"><span class="followicon mrs"></span>follow company</span>
            <span class="followingmsg"><span class="followicon mrs"></span>following</span>
            <span class="unfollowmsg"><span class="followicon mrs"></span>unfollow company</span>
        </a>
        <p class="type">jewelry stores</p>
    </div>
    </li>
    <li>
    <div class="icons">
        <ul>
        </ul>
    </div>

Please put down the hammer; HTML is not a regular-expression-shaped nail. Regular expressions that parse HTML get complicated fast, and they are fragile: they break easily when the HTML changes even subtly.
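As an illustration of that fragility, here is a minimal sketch of one likely reason the question's pattern yields a single entry: the character class also admits `<`, `>` and `/`, so a greedy `*` runs on to the last `follow company</span>` on the page. The two blocks and the shortened patterns below are made up for the demonstration and are not taken from the real page.

```python
import re

# Two simplified, hypothetical result blocks; the markup is trimmed for illustration.
html = (
    '<div class="result-box"><a class="url">first corp</a>'
    '<span>follow company</span></div>'
    '<div class="result-box"><a class="url">second corp</a>'
    '<span>follow company</span></div>'
)

# Greedy: the class matches '<', '>' and '/', so '*' consumes everything up to
# the LAST 'follow company</span>' and swallows both blocks in one match.
greedy = re.findall(
    r'<div class="result-box">([a-zA-Z0-9 "=<>/-]*)follow company</span>', html
)

# Non-greedy: '*?' stops at the FIRST 'follow company</span>' after each opening div.
lazy = re.findall(
    r'<div class="result-box">([a-zA-Z0-9 "=<>/-]*?)follow company</span>', html
)

print(len(greedy))  # 1
print(len(lazy))    # 2
```

A non-greedy quantifier patches this one symptom, but the pattern still depends on the exact attribute order and spacing of the markup.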

Use a proper HTML parser instead. BeautifulSoup makes this task trivial:

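    from bs4 import BeautifulSoup

    # Name the parser explicitly so the result does not depend on what is installed.
    soup = BeautifulSoup(text, "html.parser")

    # Each matching div is one listing, so the loop body runs once per snippet.
    for block in soup.find_all('div', class_="result-box", itemtype="http://schema.org/localbusiness"):
        print(block.find('a', class_='url').string)

        # Optional fields: find() returns None when the span is absent.
        street = block.find('span', itemprop="streetaddress")
        if street:
            print(street.string)

        locality = block.find('span', itemprop="addresslocality")
        if locality:
            print(locality.string)

        # .. etc. ..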
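To get the "one entry per snippet" behaviour the question asks for, the loop can collect a row per block instead of printing. The following is a rough sketch under some assumptions: `get_addresses`, `FIELDS` and the sample markup are hypothetical names and data introduced here, the attribute values are spelled as they appear in the snippet above, and the `csv` module is swapped in for the hand-built row string.

```python
import csv
import io

from bs4 import BeautifulSoup

# itemprop values as they appear in the snippet above (hypothetical constant name).
FIELDS = ("streetaddress", "addresslocality", "postalcode", "addressregion")


def get_addresses(text):
    """Return one dict per result-box block; missing fields become ''."""
    soup = BeautifulSoup(text, "html.parser")
    rows = []
    for block in soup.find_all("div", class_="result-box"):
        link = block.find("a", class_="url")
        row = {"name": link.get_text(strip=True) if link else ""}
        for field in FIELDS:
            span = block.find("span", itemprop=field)
            row[field] = span.get_text(strip=True) if span else ""
        rows.append(row)
    return rows


# Two trimmed, made-up blocks in the shape of the question's snippet.
sample = """
<div class="result-box"><h2><a class="url" href="#">belgium jewelers corp</a></h2>
  <span itemprop="addresslocality">lawrenceville</span>
  <span itemprop="addressregion">nj</span></div>
<div class="result-box"><h2><a class="url" href="#">example corp</a></h2>
  <span itemprop="streetaddress">1 main st</span>
  <span itemprop="postalcode">08540</span></div>
"""

rows = get_addresses(sample)

# csv takes care of quoting, so the row strings need not be assembled by hand.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=("name",) + FIELDS, delimiter=";")
writer.writerows(rows)
print(out.getvalue())
```

Rows with an empty `postalcode` could then be passed to the question's `getzipcode` helper before writing.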
