PHP: DOMDocument: Remove Unwanted Text from a Nested Element
I have the following XML document:
<?xml version="1.0" encoding="utf-8"?>
<header level="2">my header</header>
<ul>
    <li>bulleted style text
        <ul>
            <li>
                <paragraph>1.sub bulleted style text</paragraph>
            </li>
        </ul>
    </li>
</ul>
<ul>
    <li>bulleted style text <strong>bold</strong>
        <ul>
            <li>
                <paragraph>2.sub bulleted <strong>bold</strong></paragraph>
            </li>
        </ul>
    </li>
</ul>
I need to remove the numbers preceding the sub-bulleted text ("1." and "2." in the given example).
This is the code I have so far:
<?php
class mydocumentimporter
{
    const awkward_bullet_regex = '/(^[\s]?[\d]+[\.]{1})/i';

    protected $xml_string = '<some_tag><header level="2">my header</header><ul><li>bulleted style text<ul><li><paragraph>1.sub bulleted style text</paragraph></li></ul></li></ul><ul><li>bulleted style text <strong>bold</strong><ul><li><paragraph>2.sub bulleted <strong>bold</strong></paragraph></li></ul></li></ul></some_tag>';

    protected $dom;

    public function processliststext($loop = null)
    {
        $this->dom = new DOMDocument('1.0', 'utf-8');
        $this->dom->loadXML($this->xml_string);

        if (!$loop) {
            // get li tags
            $li_set = $this->dom->getElementsByTagName('li');
        } else {
            $li_set = $loop;
        }

        foreach ($li_set as $li) {
            // check child nodes
            if (!$li->hasChildNodes()) {
                continue;
            }
            foreach ($li->childNodes as $child) {
                if ($child->hasChildNodes()) {
                    // this li has children, maybe <strong> tag
                    $this->processliststext($child->childNodes);
                }
                if (!($child instanceof DOMElement)) {
                    continue;
                }
                if (($child->localName != 'paragraph') || ($child instanceof DOMText)) {
                    continue;
                }
                if (preg_match(self::awkward_bullet_regex, $child->textContent) == 0) {
                    continue;
                }
                $clean_content = preg_replace(self::awkward_bullet_regex, '', $child->textContent);
                // set node empty
                $child->nodeValue = '';
                // add updated content node
                $child->appendChild($child->ownerDocument->createTextNode($clean_content));
                //$xml_output = $child->parentNode->ownerDocument->saveXML($child);
                //var_dump($xml_output);
            }
        }
    }
}

$importer = new mydocumentimporter();
$importer->processliststext();
The issue, as far as I can see, is that $child->textContent returns the plain text content of the node and strips any child tags. So:
<paragraph>2.sub bulleted <strong>bold</strong></paragraph>
becomes
<paragraph>sub bulleted bold</paragraph>
The <strong> tag is no more.
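To make the behaviour concrete, here is a minimal sketch that isolates one paragraph from the example. The variable names ($para, $flat, $kinds) are only illustrative and are not part of the importer class. It shows that textContent flattens the markup, while the node's children are still available as separate nodes.

```php
<?php
$dom = new DOMDocument('1.0', 'utf-8');
$dom->loadXML('<paragraph>2.sub bulleted <strong>bold</strong></paragraph>');
$para = $dom->documentElement;

// textContent concatenates the text of all descendants, so the markup is lost.
$flat = $para->textContent;
echo $flat, PHP_EOL;

// The children are still there: a text node followed by the <strong> element.
$kinds = [];
foreach ($para->childNodes as $child) {
    $kinds[] = $child->nodeName;
}
echo implode(', ', $kinds), PHP_EOL;
```

The first line printed is the flattened string "2.sub bulleted bold"; the second lists the child node names, "#text, strong".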
I'm a little stumped... Can anyone see a way to strip the unwanted characters and retain the "inner child" <strong> tag?
The tag may not always be <strong>; it could also be a hyperlink <a href="#"> or <emphasize>.
Assuming your XML parses, you could use XPath to make your queries a lot easier:
$xp = new DOMXPath($this->dom);
foreach ($xp->query('//li/paragraph') as $para) {
    $para->firstChild->nodeValue = preg_replace('/^\s*\d+\.\s*/', '', $para->firstChild->nodeValue);
}
It does the text replacement on just the first text node instead of on the whole tag's contents, so child elements such as <strong> are left untouched.
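For anyone who wants to try this outside the importer class, here is a self-contained sketch of the same idea. It uses one list from the question, wrapped in the <some_tag> root so that it parses. The instanceof DOMText guard is an addition of mine, on the assumption that some paragraphs may be empty or may start with an element rather than text. $xml, $first and $result are just illustrative names.

```php
<?php
// Sample input from the question, wrapped in a root element so it parses.
$xml = '<some_tag><ul><li>bulleted style text <strong>bold</strong><ul><li>'
     . '<paragraph>2.sub bulleted <strong>bold</strong></paragraph>'
     . '</li></ul></li></ul></some_tag>';

$dom = new DOMDocument('1.0', 'utf-8');
$dom->loadXML($xml);

$xp = new DOMXPath($dom);
foreach ($xp->query('//li/paragraph') as $para) {
    $first = $para->firstChild;
    // Skip paragraphs that are empty or that begin with an element, not text.
    if (!($first instanceof DOMText)) {
        continue;
    }
    // Only the leading text node changes; siblings such as <strong> stay intact.
    $first->nodeValue = preg_replace('/^\s*\d+\.\s*/', '', $first->nodeValue);
}

// Serialize just the paragraph to inspect the result.
$result = $dom->saveXML($dom->getElementsByTagName('paragraph')->item(0));
echo $result, PHP_EOL;
```

With this input the script prints <paragraph>sub bulleted <strong>bold</strong></paragraph>, i.e. the "2." is gone and the <strong> element survives. The same should hold for <a> or <emphasize>, since only the first text node is ever modified.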