How can I strip HTML in a string using Perl?
Assuming the code is valid HTML (no stray < or > operators):
$htmlCode =~ s|<.+?>||g;
If you need to remove only b, h1 and br tags:
$htmlCode =~ s#</?(?:b|h1|br)\b.*?>##g;
You might also want to consider the HTML::Strip module.
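To see what the two substitutions above actually do, here is a small sketch run against a sample string. The sample markup and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Made-up sample input for illustration.
my $html = '<h1>Title</h1><p>Some <b>bold</b> text<br/>and more.</p>';

# Remove every tag; non-greedy, so each match stops at the first '>'.
(my $all_stripped = $html) =~ s|<.+?>||g;

# Remove only b, h1 and br tags (opening and closing), keep everything else.
(my $some_stripped = $html) =~ s#</?(?:b|h1|br)\b.*?>##g;

print "$all_stripped\n";     # TitleSome bold textand more.
print "$some_stripped\n";    # Title<p>Some bold textand more.</p>
```

Note that the \b after the alternation keeps the second pattern from matching tags such as <body>, and that neither pattern inserts whitespace where a tag used to be.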
The most correct way (albeit not the fastest) is to use HTML::Parser from CPAN. Another mostly correct way is to use HTML::FormatText which not only removes HTML but also attempts to do a little simple formatting of the resulting plain text.
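As a rough sketch of the HTML::Parser route, assuming the module is installed from CPAN: register a handler for text events only and concatenate what it receives. The helper name html_to_text and the sample input are hypothetical, and skipping script and style content is my own choice, not something the parser does by default.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: returns only the text content of an HTML string.
sub html_to_text {
    my ($html) = @_;
    my $text = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' hands the callback text with entities already decoded.
        text_h => [ sub { $text .= shift }, 'dtext' ],
    );

    # Assumption: script and style bodies are not wanted in the output.
    $parser->ignore_elements(qw(script style));

    $parser->parse($html);
    $parser->eof;

    return $text;
}

my $out = html_to_text(
    '<p>1 &lt; 2 <!-- <b>hidden</b> --> <img alt="A > B">done</p>'
);
print "$out\n";
```

Because tags, comments and quoted attribute values are handled by the parser rather than by a pattern, the commented-out text and the > inside the alt attribute should not leak into the result.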
Many folks attempt a simple-minded regular expression approach, like s/<.*?>//g, but that fails in many cases because the tags may continue over line breaks, they may contain quoted angle-brackets, or HTML comments may be present. Plus, folks forget to convert entities--like &lt; for example.
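A quick way to convince yourself of these failure modes is to run the simple-minded pattern over one input per problem. The three sample strings below are made up for illustration.

```perl
use strict;
use warnings;

# One made-up input per failure mode.
my $quoted    = '<img src="foo.gif" alt="A > B">caption';   # quoted '>'
my $multiline = "<a\nhref=\"x.html\">link</a>";             # tag spans lines
my $entity    = '<p>1 &lt; 2</p>';                          # entity in text

# Apply the naive substitution to each variable in place.
for ($quoted, $multiline, $entity) {
    s/<.*?>//g;    # no /s modifier, no quote handling, no entity decoding
}

print "[$quoted]\n";
print "[$multiline]\n";
print "[$entity]\n";
```

The first result still contains the tail of the tag ( B">caption), the second still contains the opening tag because . does not match a newline without /s, and the third keeps &lt; instead of a literal <.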
Here's one "simple-minded" approach, that works for most files:
#!/usr/bin/perl -p0777
s/<(?:[^>'"]*|(['"]).*?\1)*>//gs
If you want a more complete solution, see the 3-stage striphtml program in http://www.cpan.org/authors/id/T/TO/TOMC/scripts/striphtml.gz .
Here are some tricky cases that you should think about when picking a solution:
<IMG SRC = "foo.gif" ALT = "A > B">
<IMG SRC = "foo.gif"
 ALT = "A > B">
<!-- <A comment> -->
<script>if (a<b && a>c)</script>
<# Just data #>
<![INCLUDE CDATA [ >>>>>>>>>>>> ]]>
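It can help to run the quote-aware substitution from the one-liner against a few of these cases and see which ones it survives. This is only a sketch: strip_tags is a hypothetical wrapper name, and the trailing "text" in the inputs is added so there is something left to print.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the quote-aware substitution.
sub strip_tags {
    my ($html) = @_;
    $html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    return $html;
}

my $quoted    = strip_tags('<IMG SRC = "foo.gif" ALT = "A > B">text');
my $multiline = strip_tags(qq{<IMG SRC = "foo.gif"\n ALT = "A > B">text});
my $comment   = strip_tags('<!-- <A comment> -->text');
my $script    = strip_tags('<script>if (a<b && a>c)</script>');

print "[$quoted]\n";       # [text]
print "[$multiline]\n";    # [text]
print "[$comment]\n";      # [ -->text]
print "[$script]\n";       # [if (ac)]
```

So the quoted > and the line break are handled, but the comment is cut off at its first > and the script body loses everything between its < and >.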
If HTML comments include other tags, those solutions would also break on text like this:
<!-- This section commented out. <B>You can't see me!</B> -->
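One possible workaround, sketched here under the assumption that comments are well-formed and never appear inside script bodies or attribute values, is to delete whole comments in a first pass and strip the remaining tags in a second. The surrounding "Visible ... text" wording is made up, and this is not the approach used by the striphtml program.

```perl
use strict;
use warnings;

my $input = "Visible <!-- This section commented out. "
          . "<B>You can't see me!</B> --> text";

# Single pass with the tag pattern only: the comment is cut at its first '>'.
(my $one_pass = $input) =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;

# Two passes: drop whole comments first, then the remaining tags.
(my $two_pass = $input) =~ s/<!--.*?-->//gs;
$two_pass =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;

print "[$one_pass]\n";    # [Visible You can't see me! --> text]
print "[$two_pass]\n";    # [Visible  text]
```

The two-pass version hides the commented-out sentence, at the cost of one more pattern that can itself be fooled by unusual input.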
You should definitely have a look at HTML::Restrict, which allows you to strip away or restrict the HTML tags allowed. A minimal example that strips away all HTML tags:
use HTML::Restrict;
my $hr = HTML::Restrict->new();
my $processed = $hr->process('<b>i am bold</b>'); # returns 'i am bold'
I would recommend staying away from HTML::Strip because it breaks UTF-8 encoding.