XSS

An XSS (cross-site scripting) attack injects malicious code into a web page by exploiting a vulnerability left during the page’s development, so that the user’s browser loads and executes a script written by the attacker. These scripts are usually JavaScript, but they can also be Java, VBScript, ActiveX, Flash, or even plain HTML. When the attack succeeds, the attacker may gain, among other things, elevated privileges (such as the ability to perform certain actions) and access to private page content, sessions, and cookies.

Most solutions found on the web replace or encode sensitive characters using regular expressions or utility classes. That approach is not suitable when the client submits content from a rich text editor. Such submissions contain many HTML tags and style attributes, which are essential to WYSIWYG editing and cannot simply be encoded.
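
To make the limitation concrete, here is a minimal sketch of the encoding approach applied to rich text editor output. The class name HtmlEscapeDemo and the escape method are hypothetical and exist only for this illustration; real projects would normally use an established utility instead.

```java
// Minimal sketch: naive entity encoding applied to rich text editor output.
// HtmlEscapeDemo and escape() are hypothetical names used only for illustration.
class HtmlEscapeDemo {

    // Encode the five characters that are significant in HTML.
    static String escape(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (char c : input.toCharArray()) {
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#39;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String richText = "<p style=\"color: red\">Hello <b>world</b></p>";
        // Every tag is neutralized, including the formatting the editor produced.
        System.out.println(escape(richText));
    }
}
```

The encoded result is safe to render, but the browser now shows the tags as literal text, so the formatting the user created in the editor is lost.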

So we need another, more “intelligent” way of filtering. This approach allows us to keep the HTML tags and attributes that we consider safe, and even force some attributes to be added to the HTML node.

Jsoup: Java HTML Parser

The following is the introduction from the official website.

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for fetching URLs and extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.

jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do.

  1. scrape and parse HTML from a URL, file, or string
  2. find and extract data, using DOM traversal or CSS selectors
  3. manipulate the HTML elements, attributes, and text
  4. clean user-submitted content against a safelist, to prevent XSS attacks
  5. output tidy HTML

jsoup is designed to deal with all varieties of HTML found in the wild; from pristine and validating, to invalid tag-soup; jsoup will create a sensible parse tree.

We can use the fourth feature to strip disallowed tags and attributes from client-submitted content.
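
As a first impression, the following sketch runs a small fragment through Jsoup.clean with the built-in Safelist.basic() preset. The class name BasicCleanDemo and the sample input are assumptions made up for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

// Minimal sketch, assuming jsoup is on the classpath.
// BasicCleanDemo and the sample input are hypothetical.
class BasicCleanDemo {

    // Keep only what Safelist.basic() allows; everything else is removed.
    static String clean(String html) {
        return Jsoup.clean(html, Safelist.basic());
    }

    public static void main(String[] args) {
        String dirty = "<p onclick=\"steal()\">Hello <b>world</b><script>alert(1)</script></p>";
        // The onclick attribute and the script element are not on the safelist.
        System.out.println(clean(dirty));
    }
}
```

The p and b elements survive because the basic preset allows them, while the event handler attribute and the script element are dropped.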

Maven

<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.14.3</version>
</dependency>

Safelist

org.jsoup.safety.Safelist

Safe-lists define what HTML (elements and attributes) to allow through the cleaner. Everything else is removed.

Creating Safelist Instances

// This safelist allows only text nodes: all HTML will be stripped.
public static Safelist none()

// This safelist allows only simple text formatting: b, em, i, strong, u. All other HTML (tags and attributes) will be removed.
public static Safelist simpleText()

// This safelist allows a fuller range of text nodes: a, b, blockquote, br, cite, code, dd, dl, dt, em, i, li, ol, p, pre, q, small, span, strike, strong, sub, sup, u, ul, and appropriate attributes.
//
// Links (a elements) can point to http, https, ftp, mailto, and have an enforced rel=nofollow attribute.
//
// Does not allow images.
public static Safelist basic()

// This safelist allows the same text tags as basic, and also allows img tags, with appropriate attributes, with src pointing to http or https.
public static Safelist basicWithImages()

// This safelist allows a full range of text and structural body HTML: a, b, blockquote, br, caption, cite, code, col, colgroup, dd, div, dl, dt, em, h1, h2, h3, h4, h5, h6, i, img, li, ol, p, pre, q, small, span, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, u, ul
//
// Links do not have an enforced rel=nofollow attribute, but you can add that if desired.
public static Safelist relaxed()
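
To see how the presets differ in practice, the sketch below cleans one fragment with each of them. The class name PresetComparison, the helper cleanWithAllPresets, and the sample input are assumptions for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

// Minimal sketch, assuming jsoup is on the classpath.
// PresetComparison, cleanWithAllPresets and INPUT are hypothetical.
class PresetComparison {

    static final String INPUT =
            "<h1>Title</h1><p><b>Bold</b> <a href=\"https://example.com\">link</a> "
            + "<img src=\"https://example.com/a.png\"></p>";

    // Clean the same HTML with every built-in preset, keyed by preset name.
    static Map<String, String> cleanWithAllPresets(String html) {
        Map<String, String> results = new LinkedHashMap<>();
        results.put("none", Jsoup.clean(html, Safelist.none()));
        results.put("simpleText", Jsoup.clean(html, Safelist.simpleText()));
        results.put("basic", Jsoup.clean(html, Safelist.basic()));
        results.put("basicWithImages", Jsoup.clean(html, Safelist.basicWithImages()));
        results.put("relaxed", Jsoup.clean(html, Safelist.relaxed()));
        return results;
    }

    public static void main(String[] args) {
        cleanWithAllPresets(INPUT).forEach((preset, cleaned) ->
                System.out.println(preset + " -> " + cleaned));
    }
}
```

Running it shows the progression described in the comments above: none keeps only text, simpleText keeps the b element, basic adds the link with rel=nofollow, basicWithImages adds the image, and relaxed also keeps the heading.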

Commonly used methods

  • public Safelist addTags(String... tags)

    Add a list of allowed elements to a safelist. (If a tag is not allowed, it will be removed from the HTML.)

  • public Safelist addAttributes(String tag, String... attributes)

    Add a list of allowed attributes to a tag. (If an attribute is not allowed on an element, it will be removed.) E.g.: addAttributes("a", "href", "class") allows href and class attributes on a tags. To make an attribute valid for all tags, use the pseudo tag :all, e.g. addAttributes(":all", "class").

  • public Safelist addEnforcedAttribute(String tag, String attribute, String value)

    Add an enforced attribute to a tag. An enforced attribute will always be added to the element. If the element already has the attribute set, it will be overridden with this value. E.g.: addEnforcedAttribute("a", "rel", "nofollow") will make all a tags output as <a href="..." rel="nofollow">

  • public Safelist addProtocols(String tag, String attribute, String... protocols)

    Add allowed URL protocols for an element’s URL attribute. This restricts the possible values of the attribute to URLs with the defined protocol. E.g.: addProtocols("a", "href", "ftp", "http", "https"). To allow a link to an in-page URL anchor (i.e. <a href="#anchor">), add a #. E.g.: addProtocols("a", "href", "#")
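
The four methods can be chained, since each returns the Safelist. The following sketch builds a small safelist for user comments; the class name CustomSafelistDemo, the method commentSafelist, and the chosen tags are assumptions for this example, not a recommended policy.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

// Minimal sketch, assuming jsoup is on the classpath.
// CustomSafelistDemo and commentSafelist() are hypothetical names.
class CustomSafelistDemo {

    static Safelist commentSafelist() {
        return Safelist.none()
                .addTags("p", "b", "i", "a")                      // allowed elements
                .addAttributes("a", "href")                       // href only on a tags
                .addAttributes(":all", "title")                   // title on every allowed tag
                .addEnforcedAttribute("a", "rel", "nofollow")     // always added to a tags
                .addProtocols("a", "href", "http", "https", "#"); // URL schemes and in-page anchors
    }

    static String clean(String html) {
        return Jsoup.clean(html, commentSafelist());
    }

    public static void main(String[] args) {
        String dirty = "<p title=\"t\" class=\"c\">"
                + "<a href=\"javascript:alert(1)\">x</a> "
                + "<a href=\"https://example.com\">y</a> "
                + "<a href=\"#top\">z</a></p>";
        System.out.println(clean(dirty));
    }
}
```

In this sketch the class attribute and the javascript: link target are removed, while the https link and the in-page anchor keep their href and gain rel="nofollow".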

XssCleaner

With some knowledge of Safelist, you can build an XSS cleaner of your own. The following code demonstrates how it works.

package io.springboot.demo.utils;

import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Safelist;

import com.google.gson.Gson;
import com.google.gson.GsonBuilder;

import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;

@Data
@Builder
@NoArgsConstructor
@AllArgsConstructor
public class XssCleaner {
    private List<String> safeTags;
    private List<String> safeAttributes;    // Attributes allowed for all tags
    private Map<String, List<String>> safeTagAttributes;
    private Map<String, Map<String, String>> enforcedAttributes;
    private Map<String, Map<String, List<String>>> safeProtocols;

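    /**
     * Cleans the given HTML fragment, keeping only the configured tags, attributes and protocols.
     */
    public String clean(String content) {

        if (content == null) {
            return content;
        }

        // Start from an empty safelist and add only what has been configured.
        Safelist safelist = Safelist.none();

        safelist.addTags(toArray(safeTags));

        // jsoup rejects an empty attribute list, so skip the call when nothing is configured.
        if (safeAttributes != null && !safeAttributes.isEmpty()) {
            safelist.addAttributes(":all", toArray(safeAttributes));
        }
        if (safeTagAttributes != null) {
            safeTagAttributes.forEach((tag, attributes) -> {
                if (attributes != null && !attributes.isEmpty()) {
                    safelist.addAttributes(tag, toArray(attributes));
                }
            });
        }
        if (enforcedAttributes != null) {
            enforcedAttributes.forEach((tag, attributes) ->
                    attributes.forEach((name, value) -> safelist.addEnforcedAttribute(tag, name, value)));
        }
        if (safeProtocols != null) {
            safeProtocols.forEach((tag, attributes) ->
                    attributes.forEach((name, protocols) -> safelist.addProtocols(tag, name, toArray(protocols))));
        }

        // The base URI is empty, so relative URLs are not resolved.
        // Pretty-printing is disabled to keep the submitted whitespace as it is.
        return Jsoup.clean(content, "", safelist,
                new Document.OutputSettings().prettyPrint(false).charset(StandardCharsets.UTF_8));
    }

    // Converts a possibly null list into the varargs array expected by Safelist.
    private String[] toArray(List<String> list) {
        return list == null ? new String[0] : list.toArray(new String[0]);
    }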

    public String json() {
        return new GsonBuilder().setPrettyPrinting().create().toJson(this);
    }

    public static XssCleaner fromJson(String json) {
        return new Gson().fromJson(json, XssCleaner.class);
    }
}

Test

package io.springcloud.test;

import java.util.List;
import java.util.Map;


import io.springboot.demo.utils.XssCleaner;

public class MainTest {

    public static void main(String[] args) throws Exception {
        String content = """
                    <DIV class="content" style="width: 100px;">

                        <a src="https://springcloud.io">springcloud</a>

                        <script>
                            console.log("Hi");
                        </script>

                        <img src="https://springcloud.io/logo.png"/>

                        <a href="javascript:alert();">click this link</a>

                        <A href="mailto:admin@springcloud.io">Email</a>
                    </DIV>
                """;

        XssCleaner xssCleaner = XssCleaner.builder()
                .safeTags(List.of("a", "abbr", "acronym", "address", "area", "article", "aside", "audio", "b", "bdi",
                        "big", "blockquote", "br", "caption", "cite", "code", "col", "colgroup", "datalist", "dd",
                        "del", "details", "div", "dl", "dt", "em", "fieldset", "figcaption", "figure", "footer", "h1",
                        "h2", "h3", "h4", "h5", "h6", "hr", "i", "img", "li", "ins", "ol", "p", "pre", "q", "ul",
                        "small", "span"))
                .safeAttributes(List.of("style", "title"))
                .safeTagAttributes(Map.of("a", List.of("href"), "img", List.of("src")))
                .enforcedAttributes(Map.of("a", Map.of("rel", "nofollow")))
                .safeProtocols(Map.of("a", Map.of("href", List.of("#", "http", "https", "ftp", "mailto")), "blockquote",
                        Map.of("cite", List.of("http", "https")), "cite", Map.of("cite", List.of("http", "https")), "q",
                        Map.of("cite", List.of("http", "https"))))
                .build();

        System.out.println(xssCleaner.clean(content));

        System.out.println(xssCleaner.json());
    }
}

Console output:

    <div style="width: 100px;">

        <a rel="nofollow">springcloud</a>

        

        <img src="https://springcloud.io/logo.png">

        <a rel="nofollow">click this link</a>

        <a href="mailto:admin@springcloud.io" rel="nofollow">Email</a>
    </div>

{
  "safeTags": [
    "a",
    "abbr",
    "acronym",
    "address",
    "area",
    "article",
    "aside",
    "audio",
    "b",
    "bdi",
    "big",
    "blockquote",
    "br",
    "caption",
    "cite",
    "code",
    "col",
    "colgroup",
    "datalist",
    "dd",
    "del",
    "details",
    "div",
    "dl",
    "dt",
    "em",
    "fieldset",
    "figcaption",
    "figure",
    "footer",
    "h1",
    "h2",
    "h3",
    "h4",
    "h5",
    "h6",
    "hr",
    "i",
    "img",
    "li",
    "ins",
    "ol",
    "p",
    "pre",
    "q",
    "ul",
    "small",
    "span"
  ],
  "safeAttributes": [
    "style",
    "title"
  ],
  "safeTagAttributes": {
    "a": [
      "href"
    ],
    "img": [
      "src"
    ]
  },
  "enforcedAttributes": {
    "a": {
      "rel": "nofollow"
    }
  },
  "safeProtocols": {
    "a": {
      "href": [
        "#",
        "http",
        "https",
        "ftp",
        "mailto"
      ]
    },
    "q": {
      "cite": [
        "http",
        "https"
      ]
    },
    "blockquote": {
      "cite": [
        "http",
        "https"
      ]
    },
    "cite": {
      "cite": [
        "http",
        "https"
      ]
    }
  }
}

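Everything works as expected. The tags and attributes that are not on the safelist have been removed: the script element, the class attribute, the src attribute on the a tag, and the javascript: link target are gone, and rel="nofollow" has been added to every link.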
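
One detail of the clean method is worth checking against your own content: it passes an empty string as the base URI. When protocols are configured for an attribute, jsoup has to resolve a relative URL to an absolute one before it can check the protocol, so with an empty base URI a relative href is dropped. The sketch below illustrates this under assumed names (RelativeLinkDemo, linkSafelist, and the base URI https://example.com/ are hypothetical); preserveRelativeLinks is an existing Safelist method.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Safelist;

// Minimal sketch, assuming jsoup is on the classpath.
// RelativeLinkDemo, linkSafelist() and the base URI are hypothetical.
class RelativeLinkDemo {

    static final Document.OutputSettings SETTINGS =
            new Document.OutputSettings().prettyPrint(false);

    static Safelist linkSafelist() {
        return Safelist.none()
                .addTags("a")
                .addAttributes("a", "href")
                .addProtocols("a", "href", "http", "https");
    }

    // Empty base URI: a relative href cannot be matched to an allowed protocol.
    static String cleanWithoutBase(String html) {
        return Jsoup.clean(html, "", linkSafelist(), SETTINGS);
    }

    // With a base URI the link resolves to https, and preserveRelativeLinks(true)
    // keeps the attribute value in its original relative form.
    static String cleanWithBase(String html, String baseUri) {
        return Jsoup.clean(html, baseUri, linkSafelist().preserveRelativeLinks(true), SETTINGS);
    }

    public static void main(String[] args) {
        String html = "<a href=\"/docs\">Docs</a>";
        System.out.println(cleanWithoutBase(html));
        System.out.println(cleanWithBase(html, "https://example.com/"));
    }
}
```

If the rich text you accept contains site-relative links, passing your site's base URI together with preserveRelativeLinks(true) is one way to keep them.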