How to make a Jsoup whitelist to accept certain attribute content

I'm using Jsoup with the relaxed whitelist. It works well, but I would like to keep embedded image tags like <img alt="" src="data:;base64.

Is there a way to modify the whitelist so that it also accepts those img tags?

Edit:

If I use Whitelist.relaxed().addProtocols("img", "src", "data"), those img tags are no longer removed. However, that accepts anything after "data:", and I would like to keep them only if the src content starts with "data:;base64". Is that possible with jsoup?

Bouse asked 16/3, 2014 at 22:50 Comment(2)
For me, I don't even have to whitelist it to keep that. Some more source HTML would be good, as well as your parsing code. – Dexterous
Daniel: I'm using jsoup 1.7.2 with just Jsoup.clean(..., Whitelist.relaxed()). Any img in the mentioned format is removed. – Bouse

You can extend Whitelist and override isSafeAttribute to perform custom checks. As there's no way to extend Whitelist.relaxed() directly, you'll have to copy some code to set up the same list:

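import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Whitelist;

public class RelaxedPlusDataBase64Images extends Whitelist {
    public RelaxedPlusDataBase64Images() {
        // Copied from Whitelist.relaxed(), since that factory method cannot be extended directly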
        addTags("a", "b", "blockquote", "br", "caption", "cite", "code", "col",
                "colgroup", "dd", "div", "dl", "dt", "em", "h1", "h2", "h3", "h4", "h5", "h6",
                "i", "img", "li", "ol", "p", "pre", "q", "small", "strike", "strong",
                "sub", "sup", "table", "tbody", "td", "tfoot", "th", "thead", "tr", "u",
                "ul");
        addAttributes("a", "href", "title");
        addAttributes("blockquote", "cite");
        addAttributes("col", "span", "width");
        addAttributes("colgroup", "span", "width");
        addAttributes("img", "align", "alt", "height", "src", "title", "width");
        addAttributes("ol", "start", "type");
        addAttributes("q", "cite");
        addAttributes("table", "summary", "width");
        addAttributes("td", "abbr", "axis", "colspan", "rowspan", "width");
        addAttributes("th", "abbr", "axis", "colspan", "rowspan", "scope", "width");
        addAttributes("ul", "type");
        addProtocols("a", "href", "ftp", "http", "https", "mailto");
        addProtocols("blockquote", "cite", "http", "https");
        addProtocols("cite", "cite", "http", "https");
        addProtocols("img", "src", "http", "https");
        addProtocols("q", "cite", "http", "https");
    }

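    // Accept img src values that start with "data:;base64"; everything else
    // falls through to the default attribute and protocol checks.
    @Override
    protected boolean isSafeAttribute(String tagName, Element el, Attribute attr) {
        return ("img".equals(tagName)
                && "src".equals(attr.getKey())
                && attr.getValue().startsWith("data:;base64"))
            || super.isSafeAttribute(tagName, el, attr);
    }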
}

As you haven't provided the code you're using to parse or the HTML you're sanitizing, I haven't tested this.
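
Note that a plain startsWith("data:;base64") check still lets through anything that follows the prefix. If you want something stricter, one option is to validate the whole value before accepting it. The following is only a sketch using the JDK, with no jsoup dependency: the class name DataUriCheck, the method isBase64Image, and the list of allowed image types are my own assumptions, not part of jsoup. It accepts an empty media type (as in the question) or a few image types, and requires the payload to decode as Base64.

```java
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup.
final class DataUriCheck {
    // "data:" + optional image media type + ";base64,".
    // The set of allowed image types is an assumption; adjust as needed.
    private static final Pattern PREFIX = Pattern.compile(
            "data:(image/(png|gif|jpeg))?;base64,", Pattern.CASE_INSENSITIVE);

    private DataUriCheck() {
    }

    static boolean isBase64Image(String src) {
        if (src == null) {
            return false;
        }
        Matcher m = PREFIX.matcher(src);
        if (!m.lookingAt()) {
            return false; // does not start with the expected prefix
        }
        String payload = src.substring(m.end());
        if (payload.isEmpty()) {
            return false;
        }
        try {
            // Throws IllegalArgumentException on characters outside the Base64 alphabet.
            Base64.getDecoder().decode(payload);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }
}
```

In the class above, the startsWith condition inside isSafeAttribute could then be replaced by DataUriCheck.isBase64Image(attr.getValue()). Like the rest of this answer, that combination is untested against jsoup itself.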

Brokendown answered 30/6, 2014 at 19:57 Comment(0)
