If I accept full Unicode for passwords, how should I normalize the string before passing it to the hash function?
Goals
Without normalization, if someone sets their password to "mañana" (ma\u00F1ana) on one computer and tries to log in with "mañana" (ma\u006E\u0303ana) on another computer, the hashes will be different and the login will fail. Which of the two forms gets sent is under the control of the user agent or its operating system.
- I'd like to ensure that those hash to the same thing.
- I am not concerned about homoglyphs such as Α, А, and A (Greek, Cyrillic, Latin).
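To make the failure concrete, here is a minimal Python sketch of the "mañana" case. It assumes the password is hashed over its UTF-8 bytes, and it uses SHA-256 purely as a stand-in for whatever password hash is really in use; the `digest` helper is a hypothetical name for this illustration. It does not settle which form is the right one, it only shows that any single canonical form makes the two inputs agree.

```python
import hashlib
import unicodedata
from typing import Optional

composed = "ma\u00F1ana"          # ñ as a single code point (U+00F1)
decomposed = "ma\u006E\u0303ana"  # n followed by combining tilde (U+0303)


def digest(password: str, form: Optional[str] = None) -> str:
    """Hash the UTF-8 bytes of password, optionally normalizing first.

    SHA-256 is only a placeholder here (assumption); a real system
    would use a dedicated password-hashing function.
    """
    if form is not None:
        password = unicodedata.normalize(form, password)
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


# Without normalization the two spellings hash differently.
print(digest(composed) == digest(decomposed))                # False
# With one canonical form applied to both, they match.
print(digest(composed, "NFC") == digest(decomposed, "NFC"))  # True
print(digest(composed, "NFD") == digest(decomposed, "NFD"))  # True
```

Whichever form is picked, it has to be applied both when the password is set and on every login attempt.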
Reference
Unicode normalization forms: http://unicode.org/reports/tr15/#Norm_Forms
Considerations
- Any normalization procedure may cause collisions, e.g. "oﬃce" (o\uFB03ce, written with the ﬃ ligature) == "office" under the compatibility forms NFKC and NFKD.
- Normalization can change the number of bytes in the string.
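Both considerations can be checked with a short Python sketch. It assumes UTF-8 as the byte encoding, and `utf8_len` is a hypothetical helper named only for this example. The ligature used is U+FB03 (LATIN SMALL LIGATURE FFI), which only the compatibility forms fold into "ffi".

```python
import unicodedata

ligature = "o\uFB03ce"  # "office" spelled with U+FB03, the ffi ligature
plain = "office"

# Canonical forms keep the ligature distinct from "ffi".
print(unicodedata.normalize("NFC", ligature) == plain)   # False
# Compatibility forms fold it, so the two passwords collide.
print(unicodedata.normalize("NFKC", ligature) == plain)  # True


def utf8_len(text: str, form: str) -> int:
    """Number of UTF-8 bytes in text after applying the given form."""
    return len(unicodedata.normalize(form, text).encode("utf-8"))


composed = "ma\u00F1ana"
# Decomposing ñ into n + U+0303 costs one extra byte in UTF-8.
print(utf8_len(composed, "NFC"))  # 7
print(utf8_len(composed, "NFD"))  # 8
```

The byte count matters if anything between the client and the hash function enforces a length limit on the encoded password.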
Further questions
- What happens if the server receives a byte sequence that is not valid UTF-8 (or whatever encoding it expects)? Should it reject the input, since it can't be normalized?
- What happens if the server receives characters that are unassigned in its version of Unicode?
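For reference, this is roughly how both situations could be detected in Python; whether to reject is exactly what I'm asking. The sketch assumes the server expects UTF-8, and `decode_password` and `has_unassigned` are hypothetical helper names. Note that the "unassigned" check depends on the Unicode version bundled with the interpreter, which is the version-skew problem in miniature.

```python
import unicodedata


def decode_password(raw: bytes) -> str:
    """Decode strictly; raises UnicodeDecodeError on invalid UTF-8."""
    return raw.decode("utf-8", errors="strict")


def has_unassigned(text: str) -> bool:
    """True if any code point has general category Cn in this build's
    Unicode data. Cn covers unassigned code points and noncharacters.
    """
    return any(unicodedata.category(ch) == "Cn" for ch in text)


# The Unicode version this interpreter knows about.
print(unicodedata.unidata_version)

try:
    decode_password(b"ma\xf1ana")  # Latin-1 bytes, not valid UTF-8
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```

A code point that is unassigned for the server today may be assigned, and gain a decomposition, in a later Unicode version, so the same password could normalize differently after an upgrade.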