Regular Expression to Extract HTML Body Content
Would this work?
((?:.(?!<body[^>]*>))+.<body[^>]*>)|(</body\>.+)
Of course, you need to add the necessary \s in order to take into account < body ...> (an element with spaces), as in:
((?:.(?!<\s*body[^>]*>))+.<\s*body[^>]*>)|(<\s*/\s*body\s*\>.+)
On second thought, I am not sure why I needed a negative look-ahead... This should also work (for a well-formed XHTML document):
(.*<\s*body[^>]*>)|(<\s*/\s*body\s*\>.+)
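For what it's worth, here is a rough Java sketch of how that last pattern could be used: it matches everything outside the body, so replacing every match with an empty string leaves the body content. The class and method names are made up for illustration, and Pattern.DOTALL is my own assumption so that . also matches line breaks in a multi-line document. Note that the second alternative ends in .+ and therefore needs at least one character after the closing body tag, which a well-formed document has (</html>).

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class BodyStripper {
    // First alternative: everything up to and including the opening body tag.
    // Second alternative: the closing body tag and everything after it.
    private static final Pattern OUTSIDE_BODY = Pattern.compile(
            "(.*<\\s*body[^>]*>)|(<\\s*/\\s*body\\s*>.+)",
            Pattern.DOTALL); // assumption: '.' must also match line breaks

    // Removes everything the pattern matches, leaving only the body content.
    static String bodyContent(String xhtml) {
        return OUTSIDE_BODY.matcher(xhtml).replaceAll("");
    }

    public static void main(String[] args) {
        String doc = "<html><head><title>t</title></head>\n"
                + "<body class=\"x\">\n<p>Hello</p>\n</body>\n</html>";
        System.out.println(bodyContent(doc));
    }
}
```

The line breaks directly inside the body element are kept, so you may want to trim() the result.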
XHTML would be more easily parsed with an XML parser than with a regex. I know it's not what you're asking, but an XML parser would be able to quickly navigate to the body node and give you back its content without any of the tag-matching problems that the regex is giving you.
EDIT: In response to a comment here that an XML parser is too slow.
There are two kinds of XML parser. One, called DOM, is big and heavy but easy and friendly: it builds a tree out of the document before you can do anything. The other, called SAX, is fast and light but more work: it reads the file sequentially. You will want SAX to find the body tag.
The DOM method is good for multiple uses, such as pulling tags and finding which node is whose child. The SAX parser reads across the file in order and will quickly get the information you are after. The regex won't be any faster than a SAX parser, because they both simply walk across the file and pattern match, with the exception that the regex won't quit looking after it has found a body tag, because regex has no built-in knowledge of XML. In fact, your SAX parser probably uses small pieces of regex to find each tag.
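To make the SAX suggestion a bit more concrete, here is a minimal sketch using the SAX parser that ships with the JDK. The class and method names are hypothetical, and I am assuming a well-formed XHTML string with no external DTD to fetch. It collects only the character data inside the body element (not the nested markup), which may or may not be what you need.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of any library.
class SaxBodyText {
    // Returns the text found between <body> and </body>, read sequentially.
    static String bodyText(String xhtml) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inBody = false;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                // The default factory is not namespace-aware, so qName is the plain tag name.
                if (qName.equals("body")) inBody = true;
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals("body")) inBody = false;
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times per text node; append in order.
                if (inBody) text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String doc = "<html><head><title>t</title></head>"
                + "<body><p>Hello</p> <p>World</p></body></html>";
        System.out.println(bodyText(doc));
    }
}
```

If you need the inner markup as well, you would have to re-serialize the start and end tags inside the handler, which is where SAX becomes "more work".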
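import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sample input with the body element on a single line
String toMatch = "aaaaaaaaaaabcxx sldjfkvnlkfd <body>i m avinash</body>";
// DOTALL lets '.' match line breaks, so a body spanning several lines is captured too
Pattern pattern = Pattern.compile(".*?<body.*?>(.*?)</body>.*?", Pattern.DOTALL);
Matcher matcher = pattern.matcher(toMatch);
// matches() requires the pattern to cover the whole input; group(1) is the body content
if (matcher.matches()) {
    System.out.println(matcher.group(1));
}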
/.*?<body[^>]*>(.*)<\/body>.*/s
replace with
\1
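In Java the same substitution could look roughly like the sketch below. Java writes the backreference as $1 instead of \1 and the s flag as Pattern.DOTALL. The class and method names are made up, and the leading and trailing .* are there so that the text outside the body element is consumed by the match and dropped by the replacement.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class BodyReplace {
    // Matches the whole document and captures the body content in group 1.
    private static final Pattern WHOLE_DOC = Pattern.compile(
            ".*?<body[^>]*>(.*)</body>.*", Pattern.DOTALL);

    // Replaces the matched document with the captured body content.
    // If there is no body element, the input is returned unchanged.
    static String bodyContent(String html) {
        return WHOLE_DOC.matcher(html).replaceFirst("$1");
    }

    public static void main(String[] args) {
        String doc = "<html><body id=\"b\">line1\nline2</body></html>";
        System.out.println(bodyContent(doc));
    }
}
```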
Why can't you just split it by
</{0,1}body[^>]*>
and take the second string? I believe it will be much faster than matching one huge regexp.
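A quick sketch of the split idea in Java, assuming the document contains exactly one body element (the class and method names are hypothetical). I wrote /{0,1} as the equivalent /? and passed a limit of 3 so the input is cut into at most "before", "body content" and "after".

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class BodySplit {
    // Matches either the opening or the closing body tag.
    private static final Pattern BODY_TAG = Pattern.compile("</?body[^>]*>");

    // Splits on the body tags and returns the second piece (the body content),
    // or an empty string when no body tag is present.
    static String bodyContent(String html) {
        String[] parts = BODY_TAG.split(html, 3);
        return parts.length >= 2 ? parts[1] : "";
    }

    public static void main(String[] args) {
        String doc = "<html><head></head><body>\n<p>Hi</p>\n</body></html>";
        System.out.println(bodyContent(doc));
    }
}
```

No DOTALL flag is needed here, because [^>] already matches line breaks.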
Match the opening body tag: <\s*body.*?>
Match the closing body tag: <\s*/\s*body.*?>
(note: we account for spaces in the middle of the tags, which is completely valid markup btw)
Combine them like this and you will get everything in between, including the body tags: <\s*body.*?>.*?<\s*/\s*body.*?>
And make sure you are using Singleline mode, which makes . match line breaks as well.
This works in VB.NET, and hopefully others too!
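For the "hopefully others" part, here is how the combined pattern might be tried in Java. This is only a sketch with made-up class and method names; the one thing to know is that Java's counterpart of .NET's Singleline option is Pattern.DOTALL.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class BodyWithTags {
    // Opening tag, lazily matched content, closing tag; DOTALL lets '.' cross line breaks.
    private static final Pattern BODY_ELEMENT = Pattern.compile(
            "<\\s*body.*?>.*?<\\s*/\\s*body.*?>", Pattern.DOTALL);

    // Returns the body element including its tags, or null when none is found.
    static String bodyElement(String html) {
        Matcher m = BODY_ELEMENT.matcher(html);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String doc = "<html>\n<body class=\"c\">\n<p>Hi</p>\n</body >\n</html>";
        System.out.println(bodyElement(doc));
    }
}
```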