Which form of unicode normalization is appropriate for text mining?

I've been reading a lot on the subject of Unicode, but I remain very confused about normalization and its different forms. In short, I am working on a project that involves extracting text from PDF files and performing some semantic text analysis.

I've managed to satisfactorily extract the text using a simple python script, but now I need to make sure that all equivalent orthographic strings have one (and only one) representation. For example, the 'fi' typographic ligature should be decomposed into 'f' and 'i'.

I see that python's unicodedata.normalize function offers several algorithms for normalizing unicode code points. Could someone please explain the difference between:

  • NFC
  • NFKC
  • NFD
  • NFKD

I read the relevant wikipedia article, but it was far too opaque for my feeble brain to understand. Could someone kindly explain this to me in plain English?

Also, could you please make a recommendation for the normalization method best adapted to a natural language processing project?

Forefront answered 27/6, 2012 at 19:5 Comment(0)

Characters like é can be written either as a single character or as a sequence of two, a regular e plus the accent (a diacritic). Normalization chooses consistently among such alternatives, and will order multiple diacritics in a consistent way.
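For example, here is a minimal sketch with Python's unicodedata module (assuming Python 3; the strings are just illustrative):

    import unicodedata

    composed = "\u00e9"      # é as one precomposed code point (LATIN SMALL LETTER E WITH ACUTE)
    decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

    print(composed == decomposed)                                # False: different code point sequences
    print(unicodedata.normalize("NFC", decomposed) == composed)  # True: both collapsed to the composed form
    print(unicodedata.normalize("NFD", composed) == decomposed)  # True: both expanded to the decomposed form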

Since you need to deal with ligatures, you should use "compatibility (de)composition", NFKD or NFKC, which normalizes ligatures. It's probably fine to use either the composed or the decomposed form, but if you also want to do lossy matching (e.g., match é even if the user types a plain e), you can use the compatibility decomposition NFKD and discard the diacritics for loose matching.
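A short sketch of that pipeline, again with unicodedata (the sample text is made up; NFKD splits the ligature, and dropping combining marks gives the loose-matching form):

    import unicodedata

    text = "\ufb01le caf\u00e9"   # contains the 'fi' ligature and a precomposed é

    # Compatibility decomposition splits the ligature into plain 'f' + 'i'
    # and expands é into 'e' + combining accent.
    nfkd = unicodedata.normalize("NFKD", text)
    print(nfkd)        # 'file café' (the é is now two code points)

    # For loose matching, additionally drop the combining marks.
    stripped = "".join(ch for ch in nfkd if not unicodedata.combining(ch))
    print(stripped)    # 'file cafe'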

Fruition answered 27/6, 2012 at 23:28 Comment(3)
Alexis, thank you for your reply. Allow me to summarize your post to make sure I understand correctly. In short, I should select the "compatibility" criterion because (as I understand it) this is what separates ligatures into equivalent separate-character forms. Also, using decomposition means that diacritics are separated into two characters, correct? In this way, I can look separately at the base letter (e, for example) and its diacritic. Furthermore, NFKD is the form of normalization that performs compatibility decomposition. Did I miss anything?Forefront
I think you got it. Look carefully to see exactly what you can expect from the compatibility forms.Fruition
Note that NFKD is nice because it converts ligatures such as "ﬁ" into the separate characters "fi". Whether its behavior of converting E=mc² into E=mc2 is desirable depends on your use case.Spiller
