I've been reading a lot on the subject of Unicode, but I remain very confused about normalization and its different forms. In short, I am working on a project that involves extracting text from PDF files and performing some semantic text analysis.
I've managed to satisfactorily extract the text using a simple Python script, but now I need to make sure that all equivalent orthographic strings have one (and only one) representation. For example, the 'ﬁ' typographic ligature (a single code point, U+FB01) should be decomposed into 'f' and 'i'.
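To make the ligature problem concrete, here is a minimal sketch of what I am seeing. The word "define" is just my own test string, not something from the actual PDFs:

```python
import unicodedata

ligature = "\ufb01"  # LATIN SMALL LIGATURE FI, a single code point
word = "de" + ligature + "ne"  # looks like "define" but is only 5 code points long

# Try each of the four forms and show the result and its length.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    normalized = unicodedata.normalize(form, word)
    print(form, repr(normalized), len(normalized))
```

As far as I can tell, only the two "K" forms turn the ligature into a plain 'f' followed by 'i'; the other two leave it untouched.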
I see that Python's unicodedata.normalize function offers several algorithms for normalizing Unicode code points. Could someone please explain the difference between:
- NFC
- NFKC
- NFD
- NFKD
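In case it helps anyone answering, this is the small experiment I ran to compare the four forms. It is only a sketch: the sample string (a precomposed 'é' followed by the 'ﬁ' ligature) and the helper name code_points are my own choices.

```python
import unicodedata

# Precomposed 'é' (U+00E9) followed by the 'ﬁ' ligature (U+FB01).
sample = "\u00e9\ufb01"

def code_points(text):
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

# Normalize the same input with each form so the outputs can be compared.
results = {
    form: unicodedata.normalize(form, sample)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, text in results.items():
    print(f"{form:5} {code_points(text)}")

# Prints:
# NFC   U+00E9 U+FB01
# NFD   U+0065 U+0301 U+FB01
# NFKC  U+00E9 U+0066 U+0069
# NFKD  U+0065 U+0301 U+0066 U+0069
```

So the "D" forms seem to split 'é' into 'e' plus a combining accent, and the "K" forms seem to replace the ligature with plain letters, but I don't understand the reasoning behind the four combinations.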
I read the relevant Wikipedia article, but it was far too opaque for my feeble brain to understand. Could someone kindly explain this to me in plain English?
Also, could you please make a recommendation for the normalization method best adapted to a natural language processing project?
e, for example) and its diacritic. Furthermore, NFKD is the form of normalization that performs compatibility decomposition. Did I miss anything? – Forefront