How to properly Normalize a String with composite characters? - McMap

Asked 22/1, 2018 at 15:32 Answered 10/5, 2018 at 15:32

Solved java unicode-normalization

1

23

Java's Normalizer already lets me take accented characters and output non-accented characters. It does not, however, seem to handle composite characters (Œ, Æ) well at all.

Is there a way for Java to deal with these characters natively? I'd like to avoid having to keep a Map of these characters (as that was the reason we moved to using Normalizer in the first place).

For example, an input of "Œ" should return "OE", in much the same way it already neatly decomposes characters such as "½" into "1/2".

Andean answered 22/1, 2018 at 15:32 Comment(13)

Please elaborate on "It does not, however, seem to deal with composite characters (Œ, Æ) very well at all". – Phraseograph 22/1, 2018 at 16:02

@SotiriosDelimanolis I think he wants Normalizer.normalize("Œ", Normalizer.Form.NFD).equals("OE"); to be true. Me too. – Asberry 22/1, 2018 at 16:09

@SotiriosDelimanolis I hope this clarifies it :) – Andean 22/1, 2018 at 16:20

See the diagram in unicode.org/reports/tr15/tr15-23.html. It implies that you need Normalizer.Form.NFKD instead. – Eltonelucidate 22/1, 2018 at 16:29

@Eltonelucidate hmm, that does not seem to be enough (I get an empty string as a result too) – Asberry 22/1, 2018 at 16:46

@Eltonelucidate I am using NFKD, as that DOES help for the Ǌ composite - but not here. – Andean 23/1, 2018 at 8:49

See the comments following this answer: https://mcmap.net/q/550659/-separating-unicode-ligature-characters – Eltonelucidate 23/1, 2018 at 10:44

@Eltonelucidate I don't quite see how that helps? The issue remained unsolved in those comments. – Andean 23/1, 2018 at 16:23

@WeckarE. I know, it helps in the sense that it's telling you it can't be solved ;-) – Eltonelucidate 23/1, 2018 at 16:25

@Eltonelucidate I choose to believe a solution not having been found yet and no solution existing are two very different things. – Andean 25/1, 2018 at 9:23

Yeah. The fact your question is upvoted so much means that a solution would be highly desirable to many. – Eltonelucidate 25/1, 2018 at 10:17

Ok, try this: lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lvg/2013/docs/… - especially the bit near the end that mentions icu4j. – Eltonelucidate 25/1, 2018 at 11:21

AFAIK it's possible only with icu4j. – Elementary 27/1, 2018 at 8:24

1

TL;DR: No, there is no way to handle these uniformly with native Java.

Long Answer

As noted in this question, Separating Unicode ligature characters, the Java Normalizer implementation does not support all of the ligatures that exist in written language.

The reason is that Unicode itself does not treat every ligature as decomposable: characters such as Œ and Æ are encoded as letters in their own right and have no decomposition mapping, so no normalization form splits them. Ligatures are a debated subject when it comes to storing written language, because one can argue that they are unimportant from a data viewpoint yet important from a layout viewpoint.

The Data viewpoint claims that no information is lost and so it makes more sense to only use the decomposed forms and that the composed forms should not be in Unicode.

The Layout viewpoint claims that the composed ligature represents the proper layout of the written form of language and so should be represented in the data with a special code.
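To make the limitation concrete, here is a minimal sketch that runs a few characters through the JDK's java.text.Normalizer with NFKD. The class name NormalizerLigatureDemo and the helper nfkd are made up for illustration; the sample characters are written as Unicode escapes so the source does not depend on file encoding.

```java
import java.text.Normalizer;

final class NormalizerLigatureDemo {
  // Returns the NFKD (compatibility decomposition) form using only the JDK.
  static String nfkd(String input) {
    return Normalizer.normalize(input, Normalizer.Form.NFKD);
  }

  public static void main(String[] args) {
    String[] samples = {
      "\u00E9", // é: decomposes to 'e' followed by combining acute U+0301
      "\u01CA", // Ǌ: has a compatibility mapping, becomes "NJ"
      "\uFB01", // ﬁ: has a compatibility mapping, becomes "fi"
      "\u0152", // Œ: no decomposition mapping, comes back unchanged
      "\u00C6"  // Æ: no decomposition mapping, comes back unchanged
    };
    for (String sample : samples) {
      String result = nfkd(sample);
      String note = result.equals(sample) ? " (unchanged)" : "";
      System.out.println(sample + " -> " + result + note);
    }
  }
}
```

This matches what the comments on the question report: NFKD helps for Ǌ but not for Œ or Æ.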

Possible Solution

I would suggest creating a service whose interface handles ligatures only, and supplying a concrete implementation that covers everything you currently need. If more ligatures are needed later, they can be added without modifying the original code: put a new JAR on the program's classpath that contains an implementation for the missing ligatures.

The skeletal implementation may look like this.

Please note I have omitted the code that actually uses the ServiceLoader to locate the LigatureDecoder and LigatureEncoder implementations.

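final class Ligatures {
  private Ligatures () {} // static utility class, not meant to be instantiated

  // ServiceLoader lookup of the LigatureEncoder implementations omitted.
  public static CharSequence compose ( CharSequence decomposedCharacters ) {
    throw new UnsupportedOperationException( "ServiceLoader lookup omitted" );
  }

  // ServiceLoader lookup of the LigatureDecoder implementations omitted.
  public static CharSequence decompose ( CharSequence composedCharacters ) {
    throw new UnsupportedOperationException( "ServiceLoader lookup omitted" );
  }
}

// Turns ligatures into their component letters, e.g. "Œ" into "OE".
interface LigatureDecoder {
  CharSequence decompose ( CharSequence composedCharacters );
}

// Turns letter sequences back into ligatures, e.g. "OE" into "Œ".
interface LigatureEncoder {
  CharSequence compose ( CharSequence decomposedCharacters );
}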
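As a rough sketch of how the decoding half could be filled in, the block below adds a hypothetical LatinLigatureDecoder with a small hand-picked map and a Ligatures.decompose that chains whatever decoders it is given. The map contents, the class names LatinLigatureDecoder and LigatureDemo, and the fold helper are all assumptions for illustration, not part of any library. Everything sits in one file for brevity; for a real ServiceLoader setup the implementation would live in its own JAR, be listed in a META-INF/services file named after the interface, and, as far as I know, need to be a public class with a public no-argument constructor.

```java
import java.text.Normalizer;
import java.util.List;
import java.util.Map;
import java.util.ServiceLoader;

final class LigatureDemo {
  // Decomposes ligatures first, then strips accents via NFKD plus mark removal.
  static String fold(String input, Iterable<LigatureDecoder> decoders) {
    String noLigatures = Ligatures.decompose(input, decoders).toString();
    return Normalizer.normalize(noLigatures, Normalizer.Form.NFKD)
        .replaceAll("\\p{M}+", "");
  }

  public static void main(String[] args) {
    // Decoders are passed in explicitly, so no provider registration is needed here.
    List<LigatureDecoder> decoders = List.of(new LatinLigatureDecoder());
    System.out.println(Ligatures.decompose("\u0152uvre", decoders)); // OEuvre
    System.out.println(fold("C\u00E6sar caf\u00E9", decoders));      // Caesar cafe
  }
}

interface LigatureDecoder {
  CharSequence decompose(CharSequence composedCharacters);
}

// Hypothetical implementation that covers only four Latin ligatures.
final class LatinLigatureDecoder implements LigatureDecoder {
  private static final Map<Character, String> REPLACEMENTS = Map.of(
      '\u0152', "OE",  // Œ
      '\u0153', "oe",  // œ
      '\u00C6', "AE",  // Æ
      '\u00E6', "ae"); // æ

  @Override
  public CharSequence decompose(CharSequence composedCharacters) {
    StringBuilder out = new StringBuilder(composedCharacters.length());
    for (int i = 0; i < composedCharacters.length(); i++) {
      char c = composedCharacters.charAt(i);
      String replacement = REPLACEMENTS.get(c);
      if (replacement != null) {
        out.append(replacement);
      } else {
        out.append(c);
      }
    }
    return out;
  }
}

final class Ligatures {
  private Ligatures() {}

  // Uses every LigatureDecoder that ServiceLoader can find on the classpath.
  public static CharSequence decompose(CharSequence composedCharacters) {
    return decompose(composedCharacters, ServiceLoader.load(LigatureDecoder.class));
  }

  // Applies each decoder in turn; with no decoders the input is returned as-is.
  static CharSequence decompose(CharSequence composedCharacters,
                                Iterable<LigatureDecoder> decoders) {
    CharSequence result = composedCharacters;
    for (LigatureDecoder decoder : decoders) {
      result = decoder.decompose(result);
    }
    return result;
  }
}
```

This still relies on a map, but the map is now isolated behind the interface, so Normalizer keeps handling everything Unicode can decompose and the decoder only covers the handful of characters it cannot.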

Patency answered 10/5, 2018 at 15:32 Comment(0)
