PHP: DOMDocument: Remove Unwanted Text from a Nested Element

August 15, 2010

I have the following XML document:

<?xml version="1.0" encoding="utf-8"?>
<header level="2">my header</header>
<ul>
    <li>bulleted style text
        <ul>
            <li>
                <paragraph>1.sub bulleted style text</paragraph>
            </li>
        </ul>
    </li>
</ul>
<ul>
    <li>bulleted style text <strong>bold</strong>
        <ul>
            <li>
                <paragraph>2.sub bulleted <strong>bold</strong></paragraph>
            </li>
        </ul>
    </li>
</ul>

I need to remove the numbers preceding the sub-bulleted text: "1." and "2." in the given example.

This is the code I have so far:

<?php
class MyDocumentImporter
{
    const AWKWARD_BULLET_REGEX = '/(^[\s]?[\d]+[\.]{1})/i';

    protected $xml_string = '<some_tag><header level="2">my header</header><ul><li>bulleted style text<ul><li><paragraph>1.sub bulleted style text</paragraph></li></ul></li></ul><ul><li>bulleted style text <strong>bold</strong><ul><li><paragraph>2.sub bulleted <strong>bold</strong></paragraph></li></ul></li></ul></some_tag>';

    protected $dom;

    public function processListsText( $loop = null ){

        $this->dom = new DOMDocument('1.0', 'utf-8');

        $this->dom->loadXML($this->xml_string);

        if(!$loop){
            //get all the li tags
            $li_set = $this->dom->getElementsByTagName('li');
        }
        else{
            $li_set = $loop;
        }

        foreach($li_set as $li){

            //check for child nodes
            if(! $li->hasChildNodes() ){
                continue;
            }

            foreach($li->childNodes as $child){
                if( $child->hasChildNodes() ){
                    //this li has children, maybe a <strong> tag
                    $this->processListsText( $child->childNodes );
                }
                if( ! ( $child instanceof DOMElement ) ){
                    continue;
                }
                if( ( $child->localName != 'paragraph') ||  ( $child instanceof DOMText )){
                    continue;
                }
                if( preg_match(self::AWKWARD_BULLET_REGEX, $child->textContent) == 0 ){
                    continue;
                }

                $clean_content = preg_replace(self::AWKWARD_BULLET_REGEX, '', $child->textContent);

                //set the node to empty
                $child->nodeValue = '';

                //add the updated content to the node
                $child->appendChild($child->ownerDocument->createTextNode($clean_content));

                //$xml_output = $child->parentNode->ownerDocument->saveXML($child);
                //var_dump($xml_output);
            }
        }
    }
}

$importer = new MyDocumentImporter();
$importer->processListsText();

The issue, as far as I can see, is that $child->textContent returns only the plain text content of the node and strips out any child tags. So:

<paragraph>2.sub bulleted <strong>bold</strong></paragraph>

becomes

<paragraph>sub bulleted bold</paragraph>

The <strong> tag is no more.
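The behaviour can be reproduced in isolation. This is only a minimal sketch, assuming the second paragraph from the example is loaded on its own; the variable names $flat and $kinds are illustrative and not part of the importer class.

```php
<?php
// Load just the problematic paragraph from the example.
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadXML('<paragraph>2.sub bulleted <strong>bold</strong></paragraph>');

$para = $dom->documentElement;

// textContent concatenates the text of all descendants, so the markup is lost.
$flat = $para->textContent;

// The child nodes still hold the structure: a text node, then <strong>.
$kinds = [];
foreach ($para->childNodes as $child) {
    $kinds[] = $child->nodeName;
}

echo $flat, PHP_EOL;                 // 2.sub bulleted bold
echo implode(', ', $kinds), PHP_EOL; // #text, strong
```

So the structure is still in the tree; it is only lost once the flattened string is written back into the element.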

I'm a little stumped... Can anyone see a way to strip the unwanted characters and still retain the "inner child" <strong> tag?

The tag may not always be <strong>; it could also be a hyperlink <a href="#"> or <emphasize>.

Assuming your XML actually parses, you could use XPath to make your queries a lot easier:

$xp = new DOMXPath($this->dom);

foreach ($xp->query('//li/paragraph') as $para) {
    // Only touch the first child node, so nested tags such as <strong> are left intact.
    $para->firstChild->nodeValue = preg_replace('/^\s*\d+\.\s*/', '', $para->firstChild->nodeValue);
}

It does the text replacement on the first text node only, instead of on the whole tag contents.
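A self-contained sketch of the same idea, assuming the sample XML from the question wrapped in a single <some_tag> root. The XPath is a variation rather than the exact query above: node()[1][self::text()] selects a paragraph's first child only when that child is a text node, so a paragraph that opens with an element is left alone. The names $xml and $result are illustrative.

```php
<?php
// Sample input based on the question, wrapped in one root element.
$xml = '<some_tag><ul><li>bulleted style text <strong>bold</strong><ul><li>'
     . '<paragraph>2.sub bulleted <strong>bold</strong></paragraph>'
     . '</li></ul></li></ul></some_tag>';

$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadXML($xml);

$xp = new DOMXPath($dom);

// Select the first child of each paragraph, but only if it is a text node.
foreach ($xp->query('//li/paragraph/node()[1][self::text()]') as $text) {
    // Strip a leading number and dot from that text node only.
    $text->nodeValue = preg_replace('/^\s*\d+\.\s*/', '', $text->nodeValue);
}

// Serialize the paragraph to check that the nested tag survived.
$result = $dom->saveXML($xp->query('//paragraph')->item(0));
echo $result, PHP_EOL; // <paragraph>sub bulleted <strong>bold</strong></paragraph>
```

Because only the text node is modified, it should not matter whether the nested element is <strong>, <a href="#"> or <emphasize>.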
