Regex to parse HTML for sentences?

September 15, 2014

I know there is an HTML::Parser module out there, and from reading around I've realized that trying to parse HTML with regex is usually a suboptimal way of doing things. But for a Perl class I'm trying to use regular expressions (hopefully just a single match) to identify and store the sentences from a saved HTML doc. Eventually I want to be able to calculate the number of sentences, the words per sentence, and the average length of the words on the page.

For now, I've just tried to isolate things that follow a ">" and precede a ". " to see what, if anything, it isolates, but I can't get the code to run, even when manipulating the regular expression. So I'm not sure if the issue is in the regex, somewhere else, or both. Any help would be appreciated!

    #!/usr/bin/perl
    #new
    use CGI qw(:standard);
    print header;

    open FILE, "< sample.html ";
    $html = join('', <FILE>);
    close FILE;

    print "<pre>";

    ###main program###
    &sentences;

    ###sentence identifier sub###

    sub sentences {
        @sentences;
        while ($html =~ />[^<]\. /gis) {
            push @sentences, $1;
        }
        #for debugging, comment out when running
        print join("\n",@sentences);
    }

    print "</pre>";

Your regex should be />[^<]*?\./gis

The *? means match zero or more, non-greedily. As it stood, your regex would match only a single non-< character followed by a period and a space. This way it will match non-< characters up to the first period.
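To see the difference between the two patterns, here is a minimal sketch. The sample markup is made up for illustration (it stands in for whatever is in sample.html), and the parentheses are my own addition so that the match actually captures something.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample markup, standing in for the contents of sample.html.
my $html = '<p>This is one sentence. Here is another.</p><p>And a third.</p>';

# Pattern from the question: ">" followed by exactly ONE non-"<" character,
# then a period and a space. With no capture group, a list-context /g match
# returns the whole matched strings.
my @original = $html =~ />[^<]\. /gis;

# Non-greedy version: ">" followed by as few non-"<" characters as possible,
# up to the first period. The parentheses (an assumption added here) capture
# the text between the ">" and the period, inclusive of the period.
my @sentences = $html =~ />([^<]*?\.)/gis;

print scalar(@original), " match(es) with the original pattern\n";
print "$_\n" for @sentences;
```

Note that anchoring on ">" means this only picks up the first sentence after each tag; a sentence that follows another sentence inside the same element is skipped, so the pattern would probably need to be loosened further for real pages.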

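There may be other problems. For one, the pattern has no capture group, so $1 is never set and nothing useful gets pushed onto @sentences.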
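For the counts mentioned in the question (number of sentences, words per sentence, average word length), a rough sketch might look like the following. The sentence_stats helper and its definitions of "sentence" and "word" are assumptions of mine, not something from the original code, and stripping tags with a regex is only an approximation of what a real parser would do.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: crude text statistics for a string of HTML.
sub sentence_stats {
    my ($html) = @_;

    (my $text = $html) =~ s/<[^>]*>/ /g;   # replace each tag with a space
    $text =~ s/\s+/ /g;                    # collapse runs of whitespace

    # Assumption: a sentence is any run of text ending in ".", "!" or "?".
    my @sentences = $text =~ /([^.!?]+[.!?])/g;

    # Assumption: a word is a run of word characters or apostrophes.
    my @words = map { /([\w']+)/g } @sentences;

    # Total number of characters across all words.
    my $chars = 0;
    $chars += length($_) for @words;

    return (
        sentences          => scalar @sentences,
        words_per_sentence => @sentences ? @words / @sentences : 0,
        avg_word_length    => @words     ? $chars / @words     : 0,
    );
}

# Made-up sample input for illustration.
my %stats = sentence_stats(
    '<p>This is one sentence. Here is another.</p><p>And a third.</p>'
);
printf "%s: %s\n", $_, $stats{$_} for sort keys %stats;
```

Abbreviations such as "e.g." or decimal numbers will be miscounted as sentence breaks with this definition, so treat the numbers as estimates.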
