Java convert character stream into human "readable" String
I have a bunch of strings that look something like this:

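&#1050;&#1086;&#1084;&#1091;&#1085;&#1080;&#1082;&#1072;&#1094;&#1080;&#1086;&#1085;&#1085;&#1072; &#1082;&#1072;&#1073;&#1077;&#1083;&#1085;&#1072; &#1089;&#1080;&#1089;&#1090;&#1077;&#1084;&#1072;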

and sometimes I have a mix like this:

G&#233;n&#233;ralit&#233;s

The first translates into:

Комуникационна кабелна система

and the second to:

Généralités

I can see this when I paste them into the body of an HTML page and open it in a browser.

But how can I make Java output the "real" characters? What is the above encoding called?

I have tried a couple of things, most recently this (which did not work):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

public class EntityAttempt {
    public static void main(String[] args) {
        List<String> lst = new ArrayList<String>();
        lst.add("&#1050;");
        lst.add("&#1086;");
        for (String s : lst) {
            Charset utf8charset = Charset.forName("UTF-8");
            Charset iso88591charset = Charset.forName("ISO-8859-1");

            ByteBuffer inputBuffer = ByteBuffer.wrap(s.getBytes());

            // decode UTF-8
            CharBuffer data = utf8charset.decode(inputBuffer);

            // encode ISO-8859-1
            ByteBuffer outputBuffer = iso88591charset.encode(data);
            byte[] outputData = outputBuffer.array();

            // Prints the entity text unchanged: re-encoding the bytes
            // does not interpret the &#...; references.
            System.out.println(new String(outputData));
        }
    }
}
Haemal answered 14/3, 2012 at 14:47 Comment(2)
Those are called entities. If you search for entity-to-Unicode conversion, you may find what you're looking for. - Rozina
@Rozina Thanks for clarifying! Not the easiest thing to search the web for :) - Haemal
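
To make the comment above concrete: the number inside a decimal entity such as &#1050; is the Unicode code point of the character, so no charset conversion is involved. A minimal sketch in plain Java, assuming the input is exactly one well-formed decimal entity (the class and method names are made up for illustration):

```java
public class EntityCodePoint {
    // Hypothetical helper: turns one decimal entity like "&#1050;" into its character.
    static String decodeOne(String entity) {
        // Strip the leading "&#" and the trailing ";" to get the decimal number.
        int codePoint = Integer.parseInt(entity.substring(2, entity.length() - 1));
        // Character.toChars also handles code points above U+FFFF.
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(decodeOne("&#1050;")); // Cyrillic capital Ka
        System.out.println(decodeOne("&#233;"));  // Latin small e with acute
    }
}
```

Whether the console shows the characters correctly still depends on the terminal's encoding.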

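You can use Apache Commons Lang to unescape this sort of thing: StringEscapeUtils.unescapeHtml converts HTML entities such as &#233; back into the characters they stand for. In Groovy: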

@Grab( 'commons-lang:commons-lang:2.6' )
import org.apache.commons.lang.StringEscapeUtils as SEU

def str = 'G&#233;n&#233;ralit&#233;s'

println SEU.unescapeHtml( str )
Wheelbarrow answered 14/3, 2012 at 14:56 Comment(0)
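
For readers who want the same result in Java rather than Groovy, here is a hedged, dependency-free sketch. It is not the Commons Lang implementation: it is a hand-rolled regex decoder (the names EntityDecoder and decodeNumericEntities are invented here) that only handles numeric references such as &#233; or &#x41A; and leaves named entities like &eacute; untouched. With commons-lang 2.6 on the classpath, calling StringEscapeUtils.unescapeHtml(str) from Java should behave like the Groovy snippet above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityDecoder {
    // Matches decimal (&#233;) and hexadecimal (&#xE9;) numeric character references.
    private static final Pattern NUMERIC_ENTITY =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    // Hypothetical helper: replaces every numeric reference with its character.
    static String decodeNumericEntities(String input) {
        Matcher m = NUMERIC_ENTITY.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = (m.group(1) != null)
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            // Leave the reference as-is if the number is not a valid code point.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decodeNumericEntities("G&#233;n&#233;ralit&#233;s"));
        System.out.println(decodeNumericEntities("&#1050;&#1086;"));
    }
}
```

For the string from the answer, decodeNumericEntities returns Généralités. Mixed input is fine because plain characters pass through unchanged. What the console actually displays still depends on the terminal's encoding.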
