Where is it specified whether Unicode identifiers should be allowed in a Haskell implementation?

I wanted to write some educational code in Haskell with non-Latin Unicode characters in the identifiers, so that the identifiers look nice and natural to speakers of a language that is not written in the Latin script. So I set out to find a Haskell implementation that would allow this.

But where is this feature specified in the language specification? How would I refer to this feature when looking for a conforming implementation? (And which Haskell implementations are known to actually support Unicode identifiers?)

It turned out that one Haskell implementation accepted my code with Unicode identifiers, whereas another one rejected it. I would like a way to formalize this requirement of my code, perhaps in the form of a language feature switch, so that if I or someone else tries to run my code, it is immediately clear whether the implementation at hand lacks the required feature and another one is needed. (There could also be a wiki page for this feature, "Unicode identifiers", listing which of the existing implementations support it, so that one would know where to go if one needs it.)

(BTW, I have put a "syntax" tag on this question, but I actually perceive it to be an issue of the level of lexing, a lower level than the syntax of a language. Is there a tag here for features of the lexing level of a language, rather than for features of the syntax specification of a language?)

Lockage answered 1/4, 2011 at 18:34 Comment(3)
As for the last paragraph: It is part of the syntax. It's probably not part of the grammar, but the distinction between tokenization and parsing only exists in practice. And there are systems (e.g. Parsing Expression Grammars) which skip the tokenization step and operate directly on the actual source code. – Hypophosphite
Details of my experiments: ghc-6.10.4-alt2 (IIRC) didn't accept Unicode identifiers, ghc-6.12.3-alt4 and ghc-7.0.1-alt1 allow Unicode identifiers, and hugs98-20060921-alt5 doesn't allow Unicode identifiers (which is unfortunate, because I thought it might be better for education than ghc because of its greater simplicity, i.e., perhaps simpler error messages). – Lockage
More related experiments with Unicode ids: Ah, and I also gave curry-0.9.11 a try (for I might want to show some code with Curry "extensions"): this Muenster Curry Compiler didn't allow Unicode identifiers (IIRC). – Lockage

The Online Report documents this under Lexemes. It also notes early on that "Haskell uses the Unicode character set. However, source programs are currently biased toward the ASCII character set used in earlier versions of Haskell."

Actual compilers may or may not support Unicode identifiers. GHC does, but keep in mind that Unicode codepoints must obey the same rules as ASCII characters. Type names must start with a codepoint that is classed as uppercase or titlecase. Variable names must start with one classed as lowercase (although de facto this is relaxed to any alphabetic codepoint that is not uppercase/titlecase; it might be worth asking the language committee for a clarification). Operators must consist of punctuation or symbol codepoints. (This means that you can't declare types in Arabic, for example, unless you prefix them with a character in some other script that is uppercase/titlecase.)
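The rules above can be sketched in code. The following is a minimal illustration, not GHC's actual lexer: it declares a type and a function with Cyrillic names (a cased script, so the usual capitalization rules apply) and uses `Data.Char` to show how a few sample characters are classified. The names `Фигура`, `площадь`, `canStartTypeName`, `canStartVarName`, and `samples` are made up for this example, and `canStartVarName` only approximates the relaxed "alphabetic and not uppercase/titlecase" behavior described above. It assumes the source file is saved as UTF-8.

```haskell
import Data.Char (generalCategory, isAlpha, isUpper)
import System.IO (hSetEncoding, stdout, utf8)

-- Identifiers in a cased, non-Latin script (Cyrillic).
-- The type and its constructors start with uppercase letters,
-- the function name starts with a lowercase letter.
data Фигура = Круг Double | Квадрат Double

площадь :: Фигура -> Double
площадь (Круг r)    = pi * r * r
площадь (Квадрат a) = a * a

-- Hypothetical helper: may this character start a type or constructor name?
-- Data.Char.isUpper is True for uppercase and titlecase letters.
canStartTypeName :: Char -> Bool
canStartTypeName = isUpper

-- Hypothetical helper: may this character start a variable name?
-- Approximates the relaxed rule: any letter that is not upper/titlecase,
-- plus the underscore.
canStartVarName :: Char -> Bool
canStartVarName c = c == '_' || (isAlpha c && not (isUpper c))

-- Sample characters, written as escapes to avoid rendering issues.
samples :: [(String, Char)]
samples =
  [ ("Cyrillic capital DE (U+0414)", '\x0414')
  , ("Cyrillic small DE (U+0434)",   '\x0434')
  , ("Arabic BEH (U+0628)",          '\x0628')
  , ("Hebrew ALEF (U+05D0)",         '\x05D0')
  ]

main :: IO ()
main = do
  hSetEncoding stdout utf8
  print (площадь (Квадрат 3))
  mapM_ report samples
  where
    report (label, c) =
      putStrLn (label ++ ": " ++ show (generalCategory c)
                ++ ", type start: " ++ show (canStartTypeName c)
                ++ ", variable start: " ++ show (canStartVarName c))
```

The Arabic and Hebrew letters are unicase, so they never qualify as the first character of a type name under this classification, which is the limitation the parenthetical remark above refers to.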

As to collecting Unicode support information: while I don't know of a single page that provides it, searching for "unicode" on the Haskell Wiki finds information about Unicode support in a number of Haskell compilers.

Assignat answered 1/4, 2011 at 18:58 Comment(2)
Thanks for such a quick and elaborate answer with links! – Lockage
It's intriguing to learn about such unfortunate cases as Arabic ids. (Tested your predictions with Hebrew.) It could make sense for Haskell to relax these syntax rules for unicase writing systems, or to state them differently: given the observation that it's quite natural in maths to use only Latin or Greek letters in formulae, tolerate the unavailability of unicase letters for locally bound ids (and all var ids), and allow unicase letters in type names. Then global functions are left with no "localized" names, but at least there's _, which can be a neutral prefix for localized global function names. – Lockage
