Escape Unicode Character 'POPCORN' to HTML Entity

I have a string with an emoji in it

I love 🍿

I need to escape that popcorn emoji with its HTML entity so I get

I love &#x1f37f;

I am writing my code in Java and I have been trying different StringEscapeUtils libraries but haven't gotten it to work. Please help me figure out what I can use to escape special characters like Popcorn.

For reference:

Unicode Character Information

Unicode 8.0 (June 2015)

Westmorland answered 17/8, 2019 at 1:36 Comment(1)
If the receiving system expects an HTML document with a document encoding of US-ASCII, why not just serialize the entire document as such? Why focus on specific characters? – Castled
G
2

I would use CharSequence::codePoints to get an IntStream of the code points and map them to strings, and then collect them, concatenating to a single string:

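import java.util.stream.Collectors;

// Replace every code point above the ASCII range (127) with a hexadecimal character reference.
public String escape(final String s) {
    return s.codePoints()
        .mapToObj(codePoint -> codePoint > 127
            ? "&#x" + Integer.toHexString(codePoint) + ";"
            : new String(Character.toChars(codePoint)))
        .collect(Collectors.joining());
}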

For the specified input, this produces:

I love &#x1f37f;
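
To try the method above in isolation, here is a minimal self-contained sketch. The class name HtmlEntityEscaper and the main method are assumptions added for illustration and are not part of the original answer; the mapping logic is the same stream-based approach.

```java
import java.util.stream.Collectors;

// Hypothetical wrapper class, named only for this sketch.
class HtmlEntityEscaper {

    // Replaces each code point above 127 with a hexadecimal character reference;
    // ASCII code points are copied through unchanged.
    static String escape(final String s) {
        return s.codePoints()
            .mapToObj(cp -> cp > 127
                ? "&#x" + Integer.toHexString(cp) + ";"
                : new String(Character.toChars(cp)))
            .collect(Collectors.joining());
    }

    public static void main(String[] args) {
        // "\uD83C\uDF7F" is the surrogate pair for U+1F37F POPCORN.
        System.out.println(escape("I love \uD83C\uDF7F")); // I love &#x1f37f;
    }
}
```

Note that this only rewrites non-ASCII code points. Markup-significant ASCII characters such as & and < pass through unchanged, so a separate step would be needed if those must be escaped as well.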
Gigantean answered 17/8, 2019 at 16:39 Comment(0)
M
3

It's a little hacky, because I don't believe there is a ready-made library to do this. Assuming you can't simply use UTF-8 (or UTF-16) on your HTML page (which should be able to render 🍿 as is), you can use Character.codePointAt(CharSequence, int) and Character.offsetByCodePoints(CharSequence, int, int) [1] to perform the conversion if the given character is outside the normal ASCII range. Something like:

String str = "I love 🍿";
StringBuilder sb = new StringBuilder();
for (int i = 0; i < str.length(); i++) {
    char ch = str.charAt(i);
    if (ch > 127) {
        // Emit the full code point, which may span two chars (a surrogate pair).
        sb.append(String.format("&#x%x;", Character.codePointAt(str, i)));
        // Move to the last char of this code point; the loop's i++ then steps past it.
        i = Character.offsetByCodePoints(str, i, 1) - 1;
    } else {
        sb.append(ch);
    }
}
System.out.println(sb);

which outputs (as requested)

I love &#x1f37f;

[1] Edited based on helpful comments from Andreas.
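
As a cross-check on the index arithmetic, here is a variant sketch that reads the code point first and then advances by Character.charCount, so there is no interaction with a for-loop increment. The class name and the sample string are made up for illustration; the sample puts ň (U+0148) directly before ASCII text to confirm that the following character is not skipped.

```java
// Hypothetical class, named only for this sketch.
class CodePointLoopEscaper {

    static String escape(String str) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < str.length()) {
            // Read a whole code point, even if it occupies two chars.
            int cp = str.codePointAt(i);
            if (cp > 127) {
                sb.append("&#x").append(Integer.toHexString(cp)).append(';');
            } else {
                sb.append((char) cp);
            }
            // Advance by 1 for BMP code points, 2 for surrogate pairs.
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // \u0148 is ň; \uD83C\uDF7F is the surrogate pair for U+1F37F POPCORN.
        System.out.println(escape("Ca\u0148 you \uD83C\uDF7F?")); // Ca&#x148; you &#x1f37f;?
    }
}
```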

Monotonous answered 17/8, 2019 at 2:5 Comment(4)
I'm not actually rendering this on an html page. I'm passing it to another system and my focus is on keeping the behavior the same as a legacy system. – Westmorland
You should encode anything above 127, not 255, so the result only consists of ASCII characters. – Prosector
Character.codePointCount(str, i, i + 1) always returns 1. I believe you meant i = Character.offsetByCodePoints(str, i, 1) - 1;, with the -1 at the end needed to offset the i++ in the for loop. --- To see the problem, insert e.g. ň in the string, and the character immediately following will be skipped. – Prosector
I would prefer using str.codePoints() to get a stream and process the code points that way. Using codePointCount and offsetByCodePoints is too low-level, tedious, and easy to get wrong. – Gigantean
C
2

unbescape library

You may use the library unbescape for powerful, fast, and easy escape/unescape operations in Java.

Example

Add the dependency into the pom.xml file:

<dependency>
    <groupId>org.unbescape</groupId>
    <artifactId>unbescape</artifactId>
    <version>1.1.6.RELEASE</version>
</dependency>

The usage:

import org.unbescape.html.HtmlEscape;
import org.unbescape.html.HtmlEscapeLevel;
import org.unbescape.html.HtmlEscapeType;

<…>

final String inputString = "\uD83C\uDF7F";
final String escapedString = HtmlEscape.escapeHtml(
    inputString,
    HtmlEscapeType.HEXADECIMAL_REFERENCES,
    HtmlEscapeLevel.LEVEL_2_ALL_NON_ASCII_PLUS_MARKUP_SIGNIFICANT
);

// Here `escapedString` has the value: `&#x1f37f;`.

For your use case, probably, either HtmlEscapeType.HTML4_NAMED_REFERENCES_DEFAULT_TO_HEXA or HtmlEscapeType.HTML5_NAMED_REFERENCES_DEFAULT_TO_HEXA should be used instead of HtmlEscapeType.HEXADECIMAL_REFERENCES.

Catalepsy answered 17/8, 2019 at 2:41 Comment(0)
G
1

UPDATE: Library no longer maintained.

emoji4j library

Normally the emoji4j library works. It has a simple htmlify method for HTML encoding.

For example:

String text = "I love 🍿";

EmojiUtils.htmlify(text); //returns "I love &#127871"

EmojiUtils.hexHtmlify(text); //returns "I love &#x1f37f"
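
Since the library is no longer maintained, the two output shapes (decimal and hexadecimal references) can be approximated with the standard library alone. This is only a sketch and not emoji4j's implementation: the class and method names are hypothetical, it escapes every code point above 127 rather than only emoji, and it always appends the terminating semicolon.

```java
import java.util.function.IntFunction;
import java.util.stream.Collectors;

// Hypothetical helper, named only for this sketch; not part of emoji4j.
class NumericReferences {

    // Decimal form, e.g. U+1F37F becomes &#127871;
    static String decimal(String s) {
        return escape(s, cp -> "&#" + cp + ";");
    }

    // Hexadecimal form, e.g. U+1F37F becomes &#x1f37f;
    static String hex(String s) {
        return escape(s, cp -> "&#x" + Integer.toHexString(cp) + ";");
    }

    // Applies the given formatter to every code point above the ASCII range.
    private static String escape(String s, IntFunction<String> reference) {
        return s.codePoints()
            .mapToObj(cp -> cp > 127
                ? reference.apply(cp)
                : new String(Character.toChars(cp)))
            .collect(Collectors.joining());
    }

    public static void main(String[] args) {
        String text = "I love \uD83C\uDF7F"; // surrogate pair for U+1F37F POPCORN
        System.out.println(decimal(text)); // I love &#127871;
        System.out.println(hex(text));     // I love &#x1f37f;
    }
}
```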
Glanders answered 17/8, 2019 at 2:5 Comment(2)
This doesn't really answer the question. Given a string that contains an emoji plus other characters, this doesn't provide any way to escape that string. – Gigantean
@DavidConrad Thanks for pointing that out! I edited my answer so it uses the library's method for converting emojis to HTML. – Glanders
