Java: Best way to remove JavaScript from HTML (xss)

Accepted answer
Score: 11

jsoup has a simple method for sanitizing HTML based on a whitelist. See http://jsoup.org/cookbook/cleaning-html/whitelist-sanitizer
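As a minimal sketch of that cookbook approach: the snippet below assumes jsoup 1.14.2 or later is on the classpath, where the class is called `Safelist` (older releases expose the same presets under the name `Whitelist`). The class name `JsoupCleanSketch`, the `sanitize` helper, and the sample input are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class JsoupCleanSketch {

    // Hypothetical helper: keeps only what the chosen preset allows.
    // Safelist.basic() permits simple text formatting and links;
    // <script> elements and on* event attributes are not on the list.
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        String dirty = "<p onclick=\"steal()\">Hello <b>world</b>"
                + "<script>alert(1)</script></p>";
        System.out.println(sanitize(dirty));
    }
}
```

Which preset to use (`Safelist.none()`, `Safelist.basic()`, `Safelist.relaxed()`, or a customized one) depends on how much markup the field should accept.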

jsoup uses a whitelist, which is safer than the blacklist approach DeXSS uses. From the DeXSS page:

There are still a number of known XSS attacks that DeXSS does not yet detect.

A blacklist only disallows known unsafe constructions, while a whitelist only allows known safe constructions. So only a whitelist protects against unknown, possibly unsafe constructions.
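To make the difference concrete, here is a small sketch in plain Java. The two tag lists are invented for illustration and are not the lists jsoup or DeXSS actually use; the point is only how each check treats a tag that neither list author thought about.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class AllowVsBlock {

    // Hypothetical lists, chosen only for this example.
    static final Set<String> BLOCKED = Set.of("script", "iframe", "object");
    static final Set<String> ALLOWED = Set.of("b", "i", "em", "strong", "p");

    // Blacklist check: everything passes unless it is known to be bad.
    static boolean passesBlacklist(String tag) {
        return !BLOCKED.contains(tag.toLowerCase(Locale.ROOT));
    }

    // Whitelist check: nothing passes unless it is known to be good.
    static boolean passesWhitelist(String tag) {
        return ALLOWED.contains(tag.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // "svg" stands in for a construction missing from both lists.
        for (String tag : List.of("b", "script", "svg")) {
            System.out.println(tag
                    + " blacklist=" + passesBlacklist(tag)
                    + " whitelist=" + passesWhitelist(tag));
        }
    }
}
```

The tag missing from both lists gets through the blacklist check but is rejected by the whitelist check, which is the failure mode the DeXSS note describes.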

Score: 1

The easiest way would be to not have those in the first place... It probably would make sense to allow only very simple tags to be used in free-form fields and to disallow any kind of attributes.

Probably not the answer you're going for, but in many cases you only want to provide markup capabilities, not a full editing suite.

Similarly, another even easier approach would be to provide a text-based syntax, like Markdown, for editing. There are not that many ways you can exploit the SO edit area, for instance: it uses Markdown syntax plus a limited tag list without attributes.

Score: 1

You could try dom4j (http://dom4j.sourceforge.net/dom4j-1.6.1/). This is a DOM parser (as opposed to SAX) and allows you to easily traverse and manipulate the DOM, removing node attributes like onmouseover, for example (or entire elements like <script>), before writing back out or streaming somewhere. Depending on how wild your HTML is, you may need to clean it up first; jtidy (http://jtidy.sourceforge.net/) is good.

But obviously doing all this involves some overhead if you're doing it at page render time.
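The traversal described in this answer could look roughly like the sketch below. It deliberately swaps dom4j for the DOM API that ships with the JDK (`javax.xml.parsers`, `org.w3c.dom`) so that it runs without extra dependencies; the idea carries over to dom4j, but the calls do not. It assumes the input is already well-formed XHTML (for example after a tidy pass) and has no DOCTYPE. The class and method names are invented. Note that it removes only `<script>` elements and `on*` attributes, so it is a blacklist-style pass and misses other vectors such as `javascript:` URLs.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomStripSketch {

    // Hypothetical helper: parse, strip, and serialize a well-formed fragment.
    static String strip(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Reject DOCTYPE declarations so untrusted input cannot trigger XXE.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            clean(doc.getDocumentElement());

            Transformer out = TransformerFactory.newInstance().newTransformer();
            out.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter writer = new StringWriter();
            out.transform(new DOMSource(doc), new StreamResult(writer));
            return writer.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not process input as XML", e);
        }
    }

    // Removes on* attributes from this element and <script> descendants.
    // The root element itself is never removed, only its contents.
    static void clean(Element element) {
        NamedNodeMap attrs = element.getAttributes();
        List<String> eventAttrs = new ArrayList<>();
        for (int i = 0; i < attrs.getLength(); i++) {
            String name = attrs.item(i).getNodeName();
            if (name.toLowerCase(Locale.ROOT).startsWith("on")) {
                eventAttrs.add(name);
            }
        }
        for (String name : eventAttrs) {
            element.removeAttribute(name);
        }

        // Copy the children first: the NodeList is live and changes on removal.
        NodeList children = element.getChildNodes();
        List<Node> snapshot = new ArrayList<>();
        for (int i = 0; i < children.getLength(); i++) {
            snapshot.add(children.item(i));
        }
        for (Node child : snapshot) {
            if (child instanceof Element) {
                Element childElement = (Element) child;
                if ("script".equalsIgnoreCase(childElement.getTagName())) {
                    element.removeChild(childElement);
                } else {
                    clean(childElement);
                }
            }
        }
    }

    public static void main(String[] args) {
        String dirty = "<div onmouseover=\"evil()\"><p>Hi</p>"
                + "<script>alert(1)</script></div>";
        System.out.println(strip(dirty));
    }
}
```

Parsing, walking, and re-serializing the whole tree on every request is where the overhead mentioned above comes from, so a pass like this is usually better run once when the content is saved.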
