Most of the solutions found on the web replace or encode sensitive characters using regular expressions or utility classes. If the client submits content through a rich text editor, this approach is not suitable: such submissions contain many HTML tags and style attributes, which are essential to WYSIWYG editing and cannot simply be encoded.
So we need another, more “intelligent” way of filtering. This approach allows us to keep the HTML tags and attributes that we consider safe, and even force some attributes to be added to the HTML node.
Jsoup: Java HTML Parser
The following is the introduction from the official website.
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for fetching URLs and extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.
jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do.
- scrape and parse HTML from a URL, file, or string
- find and extract data, using DOM traversal or CSS selectors
- manipulate the HTML elements, attributes, and text
- clean user-submitted content against a safelist, to prevent XSS attacks
- output tidy HTML
jsoup is designed to deal with all varieties of HTML found in the wild; from pristine and validating, to invalid tag-soup; jsoup will create a sensible parse tree.
We can use the fourth feature in this list to strip disallowed tags and attributes from content submitted by the client.
Safe-lists define what HTML (elements and attributes) to allow through the cleaner. Everything else is removed.
Creating Safelist Instances
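jsoup provides several preset safelists as static factory methods (none(), simpleText(), basic(), basicWithImages() and relaxed()), and an empty one can be created with the constructor. The following is a minimal sketch, assuming jsoup 1.14.2 or later, where the class is named Safelist. The class and helper names (SafelistPresets, textOnly, basicFormatting) are made up for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class SafelistPresets {

    // Safelist.none() allows no tags at all: only text nodes survive.
    public static String textOnly(String html) {
        return Jsoup.clean(html, Safelist.none());
    }

    // Safelist.basic() allows simple text formatting tags such as p, b, i, a.
    public static String basicFormatting(String html) {
        return Jsoup.clean(html, Safelist.basic());
    }

    public static void main(String[] args) {
        String html = "<p onclick=\"steal()\">Hi <b>there</b><script>alert(1)</script></p>";
        System.out.println(textOnly(html));
        System.out.println(basicFormatting(html));

        // An empty safelist to customize from scratch.
        Safelist custom = new Safelist().addTags("p", "b");
        System.out.println(Jsoup.clean(html, custom));
    }
}
```

The presets are a convenient starting point: each factory returns a new instance, so it can be extended with the methods described below without affecting other callers.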
Commonly used methods
public Safelist addTags(String... tags)
Add a list of allowed elements to a safelist. (If a tag is not allowed, it will be removed from the HTML.)
public Safelist addAttributes(String tag, String... attributes)
Add a list of allowed attributes to a tag. (If an attribute is not allowed on an element, it will be removed.) E.g.:
addAttributes("a", "href", "class") allows href and class attributes on a tags. To make an attribute valid for all tags, use the pseudo tag :all, e.g. addAttributes(":all", "class").
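A small sketch of addTags and addAttributes working together may help here. It assumes an empty safelist as the starting point; the class name TagsAndAttributesDemo and the sample markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class TagsAndAttributesDemo {

    public static String clean(String html) {
        Safelist safelist = new Safelist()            // starts with nothing allowed
                .addTags("p", "a", "span")            // any other tag is removed
                .addAttributes("a", "href", "class")  // allowed on <a> only
                .addAttributes(":all", "title");      // allowed on every permitted tag
        return Jsoup.clean(html, safelist);
    }

    public static void main(String[] args) {
        String html = "<p title=\"t\" style=\"color:red\">"
                + "<a href=\"https://example.com\" class=\"ext\" onclick=\"x()\">link</a>"
                + "<img src=\"x.png\"></p>";
        // The style and onclick attributes and the img tag are not on the
        // safelist, so they should be absent from the printed result.
        System.out.println(clean(html));
    }
}
```

Note that allowing href this way does not restrict which URLs it may contain; that is what addProtocols is for.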
public Safelist addEnforcedAttribute(String tag, String attribute, String value)
Add an enforced attribute to a tag. An enforced attribute will always be added to the element. If the element already has the attribute set, it will be overridden with this value. E.g.:
addEnforcedAttribute("a", "rel", "nofollow") will make all a tags output as
<a href="..." rel="nofollow">
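To see the override behavior in practice, here is a minimal sketch. The class name EnforcedAttributeDemo and the sample link are assumptions for illustration; rel is deliberately allowed on a tags so that the incoming value would otherwise survive.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class EnforcedAttributeDemo {

    public static String clean(String html) {
        Safelist safelist = new Safelist()
                .addTags("a")
                .addAttributes("a", "href", "rel")
                .addProtocols("a", "href", "http", "https")
                // Always emitted on <a>, replacing any rel value from the input.
                .addEnforcedAttribute("a", "rel", "nofollow");
        return Jsoup.clean(html, safelist);
    }

    public static void main(String[] args) {
        String html = "<a href=\"https://example.com\" rel=\"author\" target=\"_blank\">x</a>";
        // rel="author" is replaced by rel="nofollow"; target is dropped
        // because it is not an allowed attribute.
        System.out.println(clean(html));
    }
}
```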
public Safelist addProtocols(String tag, String attribute, String... protocols)
Add allowed URL protocols for an element’s URL attribute. This restricts the possible values of the attribute to URLs with the defined protocol. E.g.:
addProtocols("a", "href", "ftp", "http", "https")
To allow a link to an in-page URL anchor (i.e. <a href="#anchor">), add a #. E.g.:
addProtocols("a", "href", "#")
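The effect of protocol restrictions is easiest to see on a javascript: link. The sketch below is illustrative only; the class name ProtocolsDemo and the sample links are hypothetical, and no base URI is passed to the cleaner.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class ProtocolsDemo {

    public static String clean(String html) {
        Safelist safelist = new Safelist()
                .addTags("a")
                .addAttributes("a", "href")
                // Only http, https and in-page anchors are accepted in href.
                .addProtocols("a", "href", "http", "https", "#");
        return Jsoup.clean(html, safelist);
    }

    public static void main(String[] args) {
        String html = "<a href=\"javascript:alert(1)\">bad</a> "
                + "<a href=\"https://example.com/\">good</a> "
                + "<a href=\"#top\">anchor</a>";
        // The javascript: href is removed (the <a> element itself stays);
        // the https and #top links are kept.
        System.out.println(clean(html));
    }
}
```

One thing to keep in mind: once protocols are defined, relative links such as /page are also removed unless a base URI is supplied to the cleaner or preserveRelativeLinks(true) is set on the safelist.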
With some knowledge of Safelist, you can build an XSS cleaner of your own. The following code demonstrates how it works.
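What follows is a sketch of such a cleaner rather than a definitive implementation. The class name XssCleaner, the choice of Safelist.relaxed() as the base, and the extra class and style attributes are all assumptions made for a rich text editor scenario; adjust them to what your editor actually emits.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Safelist;

public final class XssCleaner {

    // Assumed policy: the relaxed preset plus class/style on every allowed tag.
    // jsoup does not inspect the CSS inside style values, so allow it only
    // if your editor really needs inline styles.
    private static final Safelist SAFELIST = Safelist.relaxed()
            .addAttributes(":all", "class", "style")
            .addEnforcedAttribute("a", "rel", "nofollow");

    private XssCleaner() {
    }

    public static String clean(String html) {
        // Disable pretty printing so the cleaner does not reformat the markup.
        Document.OutputSettings settings = new Document.OutputSettings().prettyPrint(false);
        // Empty base URI: relative URLs in protocol-checked attributes are dropped.
        return Jsoup.clean(html, "", SAFELIST, settings);
    }

    public static void main(String[] args) {
        String html = "<p style=\"color:red\" onclick=\"steal()\">Hello "
                + "<a href=\"javascript:alert(1)\">click</a>"
                + "<script>alert(1)</script>"
                + "<img src=\"https://example.com/a.png\" onerror=\"alert(1)\"></p>";
        System.out.println(clean(html));
    }
}
```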
Everything looks normal. The tags and attributes that are not in the safe list are cleaned up.