Numbers localization in desktop applications

Asked 16/12, 2012 at 8:41 Answered 25/12, 2012 at 1:16

unicode localization numbers character desktop-application

In the decimal number category (Nd) of Unicode, 460 decimal characters are defined (see this page for some examples). Unfortunately I could not find any character that represents a digit regardless of its appearance. As a result, most software currently understands only Western Arabic numeral characters as digits. So, for example, you cannot enter other number characters in MS Excel.

Table of digits in various writing systems

If Unicode had (at least) 10 code points for the digits 0 to 9 as pure numbers rather than glyphs, we could use them in almost all normal usage, and the host environment could show localized number glyphs according to the user's locale. We could also use any of the 460 decimal Unicode number characters when we want to work with number glyphs as a string.

On the other hand, if we accept the current characters U+0030 to U+0039 as pure digits, then we need ten new characters for the Western Arabic numerals. This approach also seems more backward compatible. Also, the names of the characters U+0030 to U+0039 do not refer to any specific appearance of the number.

Obviously we can hard-code all 460 decimal numeral characters in the app and internally treat them as numbers, but I am looking for a more suitable solution. The issue becomes more complicated if we also consider the 224+464 other Unicode number characters (Nl category + No category), which include Roman and Old Persian numbers.

How can we solve this issue with an OS wide solution?

Lordly answered 16/12, 2012 at 8:41 Comment(9)

There's a flaw in your reasoning. You are assuming that there are 10, and 10 only digits in all locales. In Chinese there are "digits" for 10, 100, 1,000, 10,000 and 100,000,000. There may be other languages with different systems. Besides, Unicode encodes characters, not concepts. – Corrinacorrine 16/12, 2012 at 12:25

@Corrinacorrine Thanks for the comment. I did mention the 460 decimal numbers, the 224 letter numbers and the 464 other numbers. – Lordly 16/12, 2012 at 13:44

@Corrinacorrine Also note that all 46 decimal character sets can be treated identically in a mathematical app, but to accept other numbers as input for computation, a method for converting them to decimal numbers would also have to be provided to the app. Also thanks for your edits. – Lordly 16/12, 2012 at 14:3

I don't get the point of this question. Unicode is just a list of characters. What characters are accepted by a calculator program is entirely up to the programmer of that app, Unicode doesn't have anything to do with it. If you want to write one that accepts Roman numerals then it is entirely up to you. – Plunder 16/12, 2012 at 14:38

@HansPassant Unicode is not just a list of characters; most of the work in Unicode is about the properties of characters and their relationships – Apian 16/12, 2012 at 17:30

I don't understand what question PHPst is asking either. The title is a statement, not a question. The first two paragraphs talk about one topic (number characters in Unicode), but ask no question. The third paragraph asks a question, about a different topic. @Hans says, write an app that behaves the way you want. That seems like the best answer to me. – Eugenieeugenio 19/12, 2012 at 1:42

You seem to be missing the 'Eastern Arabic-Indic' digits, U+06F0..U+06F9. – Tonsure 19/12, 2012 at 6:48

Just out of curiosity, are there any left-to-right vs right-to-left ordering issues to deal with? If the numbers are written in Arabic, is the storage order still with most significant digit before least significant digit? Is the display done with the MSD on the left or the right of the number? – Tonsure 25/12, 2012 at 2:31

@JonathanLeffler the Unicode standard calls for characters always to be stored in reading order, regardless of whether the display order is left-to-right or right-to-left. – Eugenieeugenio 25/12, 2012 at 5:20

I'm not exactly sure what you are asking, but the nearest thing to a specific question seems to be, "in the current situation, how should we handle numbers in mathematical applications in a manner where users can see their local number glyphs?"

Very simple: write your own mathematical application. It will have a Model of its data, for instance, an integer number or a real number. It will also have a View of that data, for instance, a character string expressing the number in a notation the user knows how to read. (These terms refer to the Model-View-Controller architecture.) In your own application, write code for your View that displays the number using Arabic number characters, or Bengali number characters, or Chinese number characters, or whatever representation you desire.
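To make the Model/View split concrete, here is a minimal Python sketch. The function name render_int is made up for illustration, and it assumes the target script's ten digits occupy contiguous code points starting at its zero, which holds for the decimal digit sets used here. The Model stays a plain integer; only the View changes.

```python
def render_int(value: int, zero: str = "0") -> str:
    """View: render an integer Model using the digit set that starts at `zero`.

    Assumption: the ten digits of the target script are contiguous code
    points beginning at `zero` (e.g. U+0660 for Arabic-Indic digits).
    """
    table = {ord("0") + d: ord(zero) + d for d in range(10)}
    # A leading minus sign is left untouched by the translation table.
    return str(value).translate(table)


model = 2012                        # the Model: an abstract integer
print(render_int(model))            # View with Western Arabic digits
print(render_int(model, "\u0660"))  # View with Arabic-Indic digits
print(render_int(model, "\u09E6"))  # View with Bengali digits
```

This only swaps digit glyphs. Grouping separators, decimal marks and non-positional systems are a separate matter and are where a library such as ICU earns its keep.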

As Esailija writes, the Common Locale Data Repository (CLDR) and the International Components for Unicode (ICU) libraries can help you write this application.

You write,

I could not find any character representing a digit regardless of its appearance. As a result, currently only Western Arabic numeral characters are understood by most (or perhaps all) software as numbers. So you can not enter other number characters in MS Excel.

I think these three sentences don't have a logical connection.

The reason you can't enter other number characters in Microsoft Excel is that Microsoft made a business decision that Excel was useful enough if it represented numbers only with Western digits, and that it was not necessary for them to build the multilingual spreadsheet you seek.

The reason currently only Western Arabic numeral characters are understood by most (or perhaps all) software as numbers is that many other software developers have made the same business decision as Microsoft. It is not because of how digits are encoded in Unicode.

You are correct that the Unicode standard has no character representing a digit regardless of its appearance. That is because the Unicode standard deals with characters, using a very detailed model of what are and are not characters. The Unicode Standard does (usually) not deal with other abstract data model entities.

So: go and write the mathematical application which has the behaviour you want. The platform and APIs are open to you. The Unicode Standard and CLDR and ICU provide you with tools. Do great things!

You add:

Obviously we can hard-code all 460 decimal numeral characters in the app and internally treat with them as numbers, but I am looking for a more suitable solution.... How can we solve this issue with an OS wide solution?

What are your criteria for declaring a solution "suitable"? Hard-coding the decimal numeral characters, or more specifically writing a set of language-specific routines to convert between abstract number data types and text representations in various languages, is the only way I see that will work. By "an OS wide solution", do you mean a solution which you can install into the OS, and which will change the behaviour of existing applications? Well, you can hope for that, but I don't think it will come to pass on current OSes.

Note that the language-specific routines could perhaps be implemented with the RuleBasedNumberFormat class of ICU. This class can format an abstract number as a string of text (e.g., 25,376 as "twenty-five thousand three hundred seventy-six" or "vingt-cinq mille trois cents soixante-seize" or "fünfundzwanzigtausenddreihundertsechsundsiebzig"). One can probably write code with this class to format numbers using any of the 46 sets of digits you identified. However, application software would still need to incorporate ICU and the number format code.

Update: modified my answer to track wording changes in original poster's question. Added response to call for "OS wide solution". Repaired a link to Wikipedia at "Model-view-controller".

Update: deleted spurious word "the".

Eugenieeugenio answered 19/12, 2012 at 2:1 Comment(1)

"I think these three sentences don't have a logical connection." Money shot. – Corrinacorrine 23/12, 2012 at 11:27

You can find the numbering systems in CLDR. The id-attribute descriptions can be found in the bcp file for numbers. A numbering system is either numeric or algorithmic, as specified in the type-attribute. If it's "numeric", then the digits attribute contains the digits of that system starting from 0. If it's "algorithmic", then the rules-attribute refers to the rules used. See: Reading numbering system files.

For the algorithmic rules for numbering systems, see the root.xml file in the rbnf (Rule-based number formatting) folder. More about reading rbnf files.

The ICU libraries already implement this but you can also roll your own based on the data from above links, to convert from any numbering system characters to integers or vice versa.
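As a rough idea of what rolling your own could look like for the "numeric" type, here is a Python sketch. The XML below is a hand-written excerpt in the style of the CLDR numbering system data, not the real file, and the function names are made up for illustration. It only handles non-negative integers and ignores the algorithmic systems.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt modelled on CLDR's numbering system entries.
SAMPLE = """
<numberingSystems>
  <numberingSystem id="latn" type="numeric" digits="0123456789"/>
  <numberingSystem id="arab" type="numeric" digits="٠١٢٣٤٥٦٧٨٩"/>
  <numberingSystem id="beng" type="numeric" digits="০১২৩৪৫৬৭৮৯"/>
  <numberingSystem id="roman" type="algorithmic" rules="roman-upper"/>
</numberingSystems>
"""


def load_numeric_systems(xml_text):
    """Map each numeric system id to its ten-digit string (index = value)."""
    root = ET.fromstring(xml_text)
    return {
        ns.get("id"): ns.get("digits")
        for ns in root.iter("numberingSystem")
        if ns.get("type") == "numeric"  # algorithmic systems need RBNF rules
    }


def to_int(text, digits):
    """Parse a non-negative integer written with the given digit string."""
    value = 0
    for ch in text:
        value = value * 10 + digits.index(ch)  # ValueError on a foreign char
    return value


def from_int(value, digits):
    """Write a non-negative integer with the given digit string."""
    if value == 0:
        return digits[0]
    out = []
    while value > 0:
        value, d = divmod(value, 10)
        out.append(digits[d])
    return "".join(reversed(out))


systems = load_numeric_systems(SAMPLE)
bengali = from_int(2012, systems["beng"])
print(bengali, to_int(bengali, systems["beng"]))
```

The digits attribute doubles as the lookup table in both directions, which is why a single string per system is enough for the numeric type.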

Apian answered 16/12, 2012 at 10:36 Comment(0)

Unicode does not prescribe glyphs for characters. A character is considered to be an abstraction, independent of a specific shaping. So, in a sense, all characters are "regardless of appearance".

But to get to your question (I think), to perform this manner of localization would require a sequence of code points that represent a number to be first identified and converted to an actual number. I think no Unicode publication covers how to do this (even UTR 25 assumes Latin digits), and it's not necessarily going to be easy. For example, as noted, some code points have values outside the range 0-9, and numbers can appear left-to-right in otherwise right-to-left surrounding text.

Assuming you want to attempt this, however, you will need the Numeric Type and the Numeric Value of each code point; these are normative properties whose values are listed in UnicodeData.txt. They define the abstract value for each code point that represents a number (a number that is not necessarily a digit, mind). Once you have the abstract number, you would need to perform the reverse process of converting it to a locale-dependent sequence of code points that represents the same value.
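For what it's worth, Python's standard unicodedata module exposes these properties, so the first half of that process can be sketched as follows. The helper names are made up for illustration, and the parser deliberately handles only positional decimal digits (it would also accept digits mixed from different scripts, which a real application may want to reject).

```python
import unicodedata


def parse_decimal(text: str) -> int:
    """Turn a run of decimal-digit code points (any script) into an int."""
    value = 0
    for ch in text:
        # unicodedata.decimal raises ValueError if ch has no decimal value.
        value = value * 10 + unicodedata.decimal(ch)
    return value


def numeric_info(ch: str):
    """Return (general category, numeric value) for a single code point."""
    return unicodedata.category(ch), unicodedata.numeric(ch)


print(parse_decimal("\u06F2\u06F0\u06F1\u06F2"))  # Extended Arabic-Indic digits
print(numeric_info("\u2167"))  # ROMAN NUMERAL EIGHT: a number, not a digit
print(numeric_info("\u00BD"))  # VULGAR FRACTION ONE HALF
```

Python's built-in int() relies on the same decimal property, so int() on a string of Arabic-Indic digits already works. Characters such as the Roman numeral have a numeric value but no decimal value, which is exactly the case where positional parsing stops being enough.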

Dinosaurian answered 25/12, 2012 at 1:16 Comment(0)
