Regex to parse HTML for sentences?
I know about HTML::Parser, and from reading around I've realized that trying to parse HTML with regex is generally a suboptimal way of doing things, but for a Perl class I'm trying to use regular expressions (hopefully a single match) to identify and store the sentences from a saved HTML doc. Eventually I want to be able to calculate the number of sentences, words per sentence, and average length of words on the page.
For now, I've just tried to isolate things that follow a ">" and precede a ". " to see what it isolates, but I can't get the code to run, even when manipulating the regular expression. I'm not sure if the issue is in the regex, somewhere else, or both. Any help would be appreciated!
    #!/usr/bin/perl
    #new
    use CGI qw(:standard);
    print header;

    open FILE, "< sample.html ";
    $html = join('', <FILE>);
    close FILE;

    print "<pre>";

    ###main program###
    &sentences;

    ###sentence identifier sub###
    sub sentences {
        @sentences;
        while ($html =~ />[^<]\. /gis) {
            push @sentences, $1;
        }
        #for debugging, comment out when running
        print join("\n",@sentences);
    }

    print "</pre>";
Your regex should be />[^<]*?\. /gis
The *? means match zero or more, non-greedily. As it stood, your regex would match only a single non-< character followed by a period and a space. This way it will match any run of non-< characters up to the first period.
There may be other problems.
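A small sketch of the difference, run on a made-up sample string (the string and the variable names are assumptions for illustration, not taken from your file). Parentheses are added around the part to keep, because $1 is only set when the pattern contains a capture group:

```perl
use strict;
use warnings;

# Hypothetical sample input, not taken from sample.html.
my $html = '<p>This is one sentence. Here is another. </p>';

# Original pattern: exactly one non-< character before the period and space.
my @single = $html =~ />([^<]\. )/gis;

# Non-greedy pattern: any run of non-< characters up to the first ". ".
my @lazy = $html =~ />([^<]*?\. )/gis;

print scalar(@single), " match(es) with the original pattern\n";
print "non-greedy match: [$_]\n" for @lazy;
```

On this sample the original pattern finds nothing, while the non-greedy one captures the text from the ">" up to the first ". ". Because every match has to start at a ">", only the first sentence after each tag is picked up, so later sentences in the same paragraph still need separate handling.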
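One possible way to get from there to the statistics you mentioned is sketched below. It is a naive approach under stated assumptions, not a robust HTML parser: it takes each run of text between tags, treats anything ending in a period as a sentence, and treats runs of word characters as words. The sample string and all variable names are hypothetical, and abbreviations, decimal numbers, "?" and "!" endings, and script or style contents are not handled.

```perl
use strict;
use warnings;

# Hypothetical sample input; in the CGI script this would be read from sample.html.
my $html = '<html><body><p>Perl is fun. Regexes are terse. </p>'
         . '<p>Parsing HTML is hard. </p></body></html>';

my @sentences;
# Grab each run of text between a ">" and the next "<" ...
while ($html =~ />([^<]+)/g) {
    my $text = $1;
    # ... then pull every chunk ending in a period out of that run.
    while ($text =~ /([^.]+\.)/g) {
        my $sentence = $1;
        $sentence =~ s/^\s+//;    # trim leading whitespace
        push @sentences, $sentence;
    }
}

my ($word_count, $char_count) = (0, 0);
for my $sentence (@sentences) {
    my @words = $sentence =~ /(\w+)/g;    # words are runs of word characters
    $word_count += @words;
    $char_count += length($_) for @words;
}

my $sentence_count = @sentences;
printf "sentences: %d\n", $sentence_count;
if ($sentence_count && $word_count) {
    printf "words per sentence: %.2f\n", $word_count / $sentence_count;
    printf "average word length: %.2f\n", $char_count / $word_count;
}
```

Splitting the work into two passes (text runs first, sentences second) avoids the problem of a single pattern that must start at ">" and therefore sees only the first sentence after each tag.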