Escape Unicode Character 'POPCORN' to HTML Entity

Asked 17/8, 2019 at 1:36 Answered 17/8, 2019 at 16:39

Solved java apache-commons html-escape-characters unicode-escapes

I have a string with an emoji in it

I love 🍿

I need to escape that popcorn emoji with its HTML entity so I get

I love &#x1f37f;

I am writing my code in Java and have tried different StringEscapeUtils libraries, but haven't gotten any of them to work. Please help me figure out what I can use to escape special characters like Popcorn.

For reference:

Unicode Character Information

Unicode 8.0 (June 2015)

Westmorland answered 17/8, 2019 at 1:36 Comment(1)

If the receiving system expects an HTML document with a document encoding of US-ASCII, why not just serialize the entire document as such? Why focus on specific characters? – Castled 18/8, 2019 at 19:57

I would use CharSequence::codePoints to get an IntStream of the code points and map them to strings, and then collect them, concatenating to a single string:

import java.util.stream.Collectors;

public String escape(final String s) {
    return s.codePoints()
        // Escape anything outside ASCII; keep ASCII code points as they are.
        .mapToObj(codePoint -> codePoint > 127 ?
            "&#x" + Integer.toHexString(codePoint) + ";" :
            new String(Character.toChars(codePoint)))
        .collect(Collectors.joining());
}

For the specified input, this produces:

I love &#x1f37f;
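To see why the answer works on code points rather than on individual chars, here is a small self-contained sketch. The class name CodePointEscapeDemo is only for illustration, and the popcorn emoji is written as the surrogate pair \uD83C\uDF7F so the source stays pure ASCII.

```java
import java.util.stream.Collectors;

public class CodePointEscapeDemo {

    // Same approach as the answer above: escape every code point above 127.
    public static String escape(String s) {
        return s.codePoints()
                .mapToObj(cp -> cp > 127
                        ? "&#x" + Integer.toHexString(cp) + ";"
                        : new String(Character.toChars(cp)))
                .collect(Collectors.joining());
    }

    public static void main(String[] args) {
        // U+1F37F POPCORN is stored as two UTF-16 chars (a surrogate pair).
        String text = "I love \uD83C\uDF7F";

        System.out.println(text.length());                             // 9 chars
        System.out.println(text.codePointCount(0, text.length()));     // 8 code points
        System.out.println(Integer.toHexString(text.charAt(7)));       // d83c (high surrogate only)
        System.out.println(Integer.toHexString(text.codePointAt(7)));  // 1f37f (full code point)
        System.out.println(escape(text));                              // I love &#x1f37f;
    }
}
```

Iterating with charAt would see the two surrogates separately, which is why the stream of code points gives one entity per emoji.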

Gigantean answered 17/8, 2019 at 16:39 Comment(0)

This is a little hacky, because I don't believe there is a ready-made library for it. Assuming you can't simply use UTF-8 (or UTF-16) on your HTML page (which should be able to render 🍿 as is), you can use Character.codePointAt(CharSequence, int) and Character.offsetByCodePoints(CharSequence, int, int)¹ to perform the conversion whenever a character is outside the ASCII range. Something like,

String str = "I love 🍿";
StringBuilder sb = new StringBuilder();
for (int i = 0; i < str.length(); i++) {
    char ch = str.charAt(i);
    if (ch > 127) {
        sb.append(String.format("&#x%x;", Character.codePointAt(str, i)));
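        // offsetByCodePoints returns an absolute index, so assign it (not +=);
        // the -1 compensates for the i++ in the for loop.
        i = Character.offsetByCodePoints(str, i, 1) - 1;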
    } else {
        sb.append(ch);
    }
}
System.out.println(sb);

which outputs (as requested)

I love &#x1f37f;

¹ Edited based on helpful comments from Andreas.
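As a rough sketch of the same loop written so the index arithmetic is harder to get wrong, the version below advances by Character.charCount(codePoint) instead of adjusting i inside the body. The class name LoopEscapeDemo and the test string are assumptions made for illustration; the input mixes a non-ASCII character that fits in one char (U+0148, ň) with one that needs a surrogate pair (U+1F37F).

```java
public class LoopEscapeDemo {

    // Walks the string one code point at a time. Character.charCount reports
    // how many UTF-16 chars (1 or 2) the current code point occupies.
    public static String escape(String str) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < str.length(); ) {
            int cp = str.codePointAt(i);
            if (cp > 127) {
                sb.append(String.format("&#x%x;", cp));
            } else {
                sb.append((char) cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The characters right after each escaped code point must survive.
        System.out.println(escape("\u0148a\uD83C\uDF7Fb")); // &#x148;a&#x1f37f;b
    }
}
```

If the character following ň were dropped, the output would be missing the "a", which makes this input a convenient regression check.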

Monotonous answered 17/8, 2019 at 2:5 Comment(4)

I’m not actually rendering this on an html page. I’m passing it to another system and my focus is on keeping the behavior the same as a legacy system. – Westmorland 17/8, 2019 at 2:9

You should encode anything above 127, not 255, so the result only consists of ASCII characters. – Prosector 17/8, 2019 at 2:41

Character.codePointCount(str, i, i + 1) always returns 1. I believe you meant i = Character.offsetByCodePoints(str, i, 1) - 1;, with the -1 at the end needed to offset the i++ in the for loop. --- To see the problem, insert e.g. ň in the string, and the character immediately following will be skipped. – Prosector 17/8, 2019 at 2:49

I would prefer using str.codePoints() to get a stream and process the code points that way. Using codePointCount and offsetByCodePoints is too low-level, tedious, and easy to get wrong. – Gigantean 17/8, 2019 at 16:20

unbescape library

You can use the unbescape library, which provides fast and simple escape/unescape operations in Java.

Example

Add the dependency to the pom.xml file:

<dependency>
    <groupId>org.unbescape</groupId>
    <artifactId>unbescape</artifactId>
    <version>1.1.6.RELEASE</version>
</dependency>

The usage:

import org.unbescape.html.HtmlEscape;
import org.unbescape.html.HtmlEscapeLevel;
import org.unbescape.html.HtmlEscapeType;

<…>

final String inputString = "\uD83C\uDF7F";
final String escapedString = HtmlEscape.escapeHtml(
    inputString,
    HtmlEscapeType.HEXADECIMAL_REFERENCES,
    HtmlEscapeLevel.LEVEL_2_ALL_NON_ASCII_PLUS_MARKUP_SIGNIFICANT
);

// Here `escapedString` has the value: `&#x1f37f;`.

For your use case, you probably want either HtmlEscapeType.HTML4_NAMED_REFERENCES_DEFAULT_TO_HEXA or HtmlEscapeType.HTML5_NAMED_REFERENCES_DEFAULT_TO_HEXA instead of HtmlEscapeType.HEXADECIMAL_REFERENCES.

Catalepsy answered 17/8, 2019 at 2:41 Comment(0)

UPDATE: Library no longer maintained.

emoji4j library

Normally the emoji4j library works. It has a simple htmlify method for HTML encoding.

For example:

String text = "I love 🍿";

EmojiUtils.htmlify(text); //returns "I love &#127871"

EmojiUtils.hexHtmlify(text); //returns "I love &#x1f37f"
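Since the library is noted as no longer maintained, here is a minimal standard-library sketch that produces both a decimal and a hexadecimal numeric reference. It is not emoji4j's implementation: the class NumericReferenceDemo and its escape(String, boolean) helper are hypothetical, it escapes every non-ASCII code point rather than only emoji, and it always emits the terminating semicolon.

```java
public class NumericReferenceDemo {

    // Hypothetical helper (not part of emoji4j): escapes each code point
    // above 127 as a decimal (&#NNN;) or hexadecimal (&#xHHH;) reference.
    public static String escape(String s, boolean hex) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp <= 127) {
                sb.appendCodePoint(cp);
            } else if (hex) {
                sb.append("&#x").append(Integer.toHexString(cp)).append(';');
            } else {
                sb.append("&#").append(cp).append(';');
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "I love \uD83C\uDF7F"; // U+1F37F POPCORN

        System.out.println(escape(text, false)); // I love &#127871;
        System.out.println(escape(text, true));  // I love &#x1f37f;
    }
}
```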

Glanders answered 17/8, 2019 at 2:5 Comment(2)

This doesn't really answer the question. Given a string that contains an emoji plus other characters, this doesn't provide any way to escape that string. – Gigantean 19/8, 2019 at 16:48

@DavidConrad Thanks for pointing that out! I edited my answer so it uses the library's method for converting emojis to HTML. – Glanders 22/8, 2019 at 0:27
