Localization of singular/plural words - what are the different language rules for grammatical numbers?

I have been developing a .NET string formatting library to assist with localization of an application. It's called SmartFormat and is open-source on GitHub.

One of the issues it tries to address is Grammatical Numbers. This is also known as "singular and plural forms" or "conditional formatting", and here's a snippet of what it looks like in English:

var message = "There {0:is|are} {0} {0:item|items} remaining";

// You can use the Smart.Format method just like using String.Format:
var output = Smart.Format(CultureInfo.CurrentUICulture, message, items.Count);

The English rule, as I'm sure you know, is that there are 2 forms (singular and plural) that can apply to nouns, verbs, and adjectives. If the quantity is 1 then singular is used, otherwise the plural is used.

I am now trying to "broaden my horizons" by implementing the correct rules for other languages! I have come to understand that some languages can have up to 4 plural forms, and it takes some logic to determine the correct form. I would like to expand my code to accommodate multiple languages. For example, I've heard that Russian, Polish, and Turkish have rules quite different from English, so those might be a great starting point.

However, I only speak English and Spanish, so how can I determine the correct grammatical rules for many common languages?

Edit: I also would like to know some good non-English "test phrases" for my unit tests here: What are some good non-English phrases with singular and plural forms that can be used to test an internationalization and localization library?

Enchase answered 21/8, 2011 at 5:40 Comment(6)
A good overview of the problem domain can be found in interglacial.com/tpj/13 -- they propose a solution for Perl, but the reasoning is applicable to any programming language, and very easy to follow. It points to problems with gettext, which you certainly ought to familiarize yourself with if you are trying to create a replacement for it.Natty
On the linguistic side, your bias to European languages is unsettling, albeit typical. Kudos for including Turkish, though. May I suggest adding Arabic (inflects by infix, rather than suffix or prefix) and Hindi (assimilation rules, the neighboring word affects this word's spelling).Natty
@Natty your comments are helpful, but would be better suited in the answer area.Enchase
Turkish only has one form! 1 ev, 2 ev, 3 evPalenque
@Natty - the link you provided seems to be broken - can you provide any other pointer to that material (like the name of the lib or whatever) as it sounds interesting.Deen
Issue 13 of The Perl Journal. The library is called Locale::Maketext and they seem to include a copy of the article in the distribution; search.cpan.org/~petdance/Locale-Maketext-1.12/lib/Locale/…Natty
G
10

Definitely, different languages have different pluralization rules. Arabic and Polish are especially interesting, since both have quite a few plural forms.

If you want to learn more about these rules, see the Unicode Common Locale Data Repository (CLDR), specifically its Language Plural Rules page.

CLDR contains a lot of interesting information; unfortunately, some of it is wrong. I hope the plural forms are correct (at least for Polish they are, as far as I could tell :) ).
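To give an idea of what such rules look like in code, here is a minimal sketch of the CLDR integer rules for English, Polish, and Russian as I understand them. The type and method names (`PluralCategory`, `PluralRuleSketch`) are made up for illustration and are not part of SmartFormat or CLDR; fractions and negative numbers are deliberately ignored.

```csharp
// Plural category names follow CLDR. Only non-negative integers are handled.
public enum PluralCategory { One, Few, Many, Other }

public static class PluralRuleSketch
{
    // English: "one" for exactly 1, "other" for everything else.
    public static PluralCategory English(int n) =>
        n == 1 ? PluralCategory.One : PluralCategory.Other;

    // Polish: 1 -> one; numbers ending in 2-4 (except 12-14) -> few;
    // every other integer -> many.
    public static PluralCategory Polish(int n)
    {
        if (n == 1) return PluralCategory.One;
        return EndsInTwoToFour(n) ? PluralCategory.Few : PluralCategory.Many;
    }

    // Russian: numbers ending in 1 (except 11) -> one;
    // numbers ending in 2-4 (except 12-14) -> few; every other integer -> many.
    public static PluralCategory Russian(int n)
    {
        if (n % 10 == 1 && n % 100 != 11) return PluralCategory.One;
        return EndsInTwoToFour(n) ? PluralCategory.Few : PluralCategory.Many;
    }

    // True for 2-4, 22-24, 32-34, ... but not for 12-14.
    private static bool EndsInTwoToFour(int n)
    {
        int mod10 = n % 10, mod100 = n % 100;
        return mod10 >= 2 && mod10 <= 4 && !(mod100 >= 12 && mod100 <= 14);
    }
}
```

Note the difference between the two Slavic rules: 21 is `One` in Russian but `Many` in Polish, which is exactly the kind of detail that is easy to get wrong without a reference like CLDR.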

Guthrie answered 21/8, 2011 at 9:32 Comment(10)
I am using the "Language Plural Rules" page to create my list of rules. Thanks for the link!Enchase
@Scott: No problem. I found it recently while searching for an unrelated problem (upper-/lower-casing information). That is strange, because I knew about CLDR earlier on...Jitney
Nice link, but it does not address word order, assimilation, or case inflection at all.Natty
@tripleee: That is true, but mind you that grammar rules can be pretty complex. This applies especially to Slavic languages. And you know what? The rules have been simplified over the course of a few centuries. Still, I don't think that anyone could capture the inflection rules for the Polish language (for example). And if you can do that, what about the exceptions (there are so many of them)?Jitney
aclweb.org/aclwiki/index.php?title=Resources_for_Polish has some links to morphological analyzers for Polish. No idea if these can be used for generation also, per se, but it's not like it cannot be done.Natty
I could use some help understanding the issue you guys are talking about. Is there a reason, other than grammatical numbers, that Polish would be difficult to localize? I'm not sure that "word order, assimilation, or case inflection" is an issue, since the entire localized phrase will specify the proper grammar. Or am I mistaken?Enchase
For example, the phrase "Found {0} {0:file:files}" could be translated to "Znaleziono {0} {0:plik:pliki:plików}" correct? This is my goal.Enchase
@Scott: The files example (pli-k/-ki/-ków) is one example of inflection, in this case depending on the quantity. The problem, however, is that a noun used in a sentence can also vary in form depending on the context. It will also vary with grammatical gender (masculine, feminine, neuter); this applies even to non-Slavic languages, for example German. It is really hard to cover all these cases, so it is best to avoid word concatenation and leave full sentences in resource files so that translators can take care of them. That's my point.Jitney
Thanks for the explanation. The SmartFormat library also has "Conditional Formatting", which can be used to determine masculine/feminine forms, etc. For example, assuming you know the user's gender, you could do: Smart.Format(culture, "{0:She|He} has {1} email{1::s} in {0:her|his} inbox", user.Gender, emails.Count);, or perhaps Smart.Format(culture, "{0:Ella|Él} tiene {1} email{1::s} en su bandeja de entrada", user.Gender, emails.Count); The translator doesn't need to use the gender or quantity if the language doesn't require it, but is allowed to use it freely otherwise.Enchase
The links are dead :(Lombard
D
1

It would be nice if the question body included a sample of the rules that you're using. What format do they take?

Anyway, in your example:

var message = "There {0:is:are} {0} {0:item:items} remaining";

you seem to assume that the selection in both choice segments is based on the same single rule, and that there is a direct correspondence between the two choices. That is, the same single rule would choose either (is, item) or (are, items).

This assumption is not necessarily correct for other languages. Take, for example, the fictitious language English-ez. (I find examples in foreign languages irritating, so to make things easier for the reader I'm borrowing from Arabic but simplifying a lot.) The rules for this language are as follows:

The first selection segment is the same as normal English:

is: count=1
are: count=0, count=2..infinity

The second selection segment has a different rule from normal English, assume the following simple rule:

item: count=1
item-da: count=2 # this language has a special dual form.
items: count=0, count=3..infinity 

Now the single rule solution would not be adequate - we can suggest a different form:

var message = "There {0:is:are@rule1} {0} {0:item:items@rule2} remaining";

This solution might have problems in other situations, but we are discussing the example you provided.
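As a rough sketch of the idea, the two English-ez rules above could be kept as separately named rules, with each selection segment naming the rule it wants. Everything here (`EnglishEzSketch`, `Choose`, `Remaining`, the rule table) is hypothetical and only illustrates the `@rule1` / `@rule2` suggestion; it is not SmartFormat syntax or API.

```csharp
using System;
using System.Collections.Generic;

public static class EnglishEzSketch
{
    // Hypothetical rule table: each named rule maps a count to a choice index.
    private static readonly Dictionary<string, Func<int, int>> Rules =
        new Dictionary<string, Func<int, int>>
        {
            // rule1: is (count = 1) | are (everything else)
            ["rule1"] = n => n == 1 ? 0 : 1,
            // rule2: item (count = 1) | item-da (count = 2) | items (0, 3 and up)
            ["rule2"] = n => n == 1 ? 0 : (n == 2 ? 1 : 2),
        };

    // Picks one of the choices using the named rule.
    public static string Choose(string ruleName, int count, params string[] choices)
    {
        int index = Rules[ruleName](count);
        return choices[index];
    }

    // Builds the example message, using a different rule for each segment.
    public static string Remaining(int count)
    {
        string verb = Choose("rule1", count, "is", "are");
        string noun = Choose("rule2", count, "item", "item-da", "items");
        return $"There {verb} {count} {noun} remaining";
    }
}
```

For a count of 2, the verb segment picks "are" while the noun segment picks the dual "item-da", so the two segments cannot be driven by one shared rule.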

Check gettext (which allows selection of the full message at a single level, i.e. on one variable) and ICU (which allows selection of the full message at multiple levels, i.e. on multiple variables).

Deen answered 11/2, 2012 at 7:27 Comment(5)
The rules also depend on the number of parameters. For example, the English rules with 2 parameters: {0:one|many}, with 3 parameters: {0:zero|one|many}, with 4 parameters: {0:negative|zero|one|many}. For your example, I would create a rule for EN-EZ that followed the {0:item:item-da:items} pattern.Enchase
Sorry for the inconsistent use of {0:x:x:x} syntax vs {0:x|x|x}. I temporarily tried the former, but currently use the latter.Enchase
By number of parameters I assume you mean number of choices. If I understand you correctly, you're proposing multiple selection rules per locale, and the applicable rule is chosen based on the number of choices in a selection segment. This is inadequate - assume two selection segments in a single message with the same number of choices, but the rule for each segment is different, then you need two rules, and the rule selector can't be the number of choices in the segment. Can you provide a pointer to clearer documentation - regards.Deen
Correct; the current rule set for any locale might vary based on the number of choices. Take a look near the end of this file. For example, the en rule for 4 choices looks like this: if (c == 4) return (n < 0) ? 0 : (n == 0) ? 1 : (n == 1) ? 2 : 3;Enchase
So, are you saying that basing the rule on only locale and number of choices is inadequate for some languages? This is the kind of information I'm trying to learn. What language has this kind of rule, and can you provide an example?Enchase
D
0

The approach you have taken might work in most cases for English and Spanish, but it most likely fails for many other languages. The problem is that you only have one pattern that tries to cover all grammatical numbers.

var message = "There {0:is|are} {0} {0:item|items} remaining";

You need one pattern for each grammatical number. Here I have combined two patterns into a single multi-pattern string.

var message = PluralFormat("one;There is {0} item remaining;other;There are {0} items remaining", count);

English uses two grammatical numbers: singular and plural. The keyword one starts the singular pattern and other starts the plural pattern.

When translated into, for example, Finnish, which uses the same number of grammatical numbers, you would use

"one;{0} kappale jäljellä;other;{0} kappaletta jäljellä"

However, Japanese uses only one grammatical number, so a Japanese translation would only use other. Polish uses three grammatical numbers, so it would contain one, few, and many.

Secondly, you need the proper rules to choose the right pattern among multiple patterns. The Unicode Consortium's CLDR contains these rules in an XML file.
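To make the mechanism concrete, here is a minimal sketch of how a multi-pattern string could be split and how a rule could pick the pattern. This is an illustration only, with made-up names (`MultiPatternSketch`, `Parse`, `EnglishRule`); it assumes the patterns contain no `;` characters and uses a hand-written English rule instead of real CLDR data.

```csharp
using System;
using System.Collections.Generic;

public static class MultiPatternSketch
{
    // Splits "one;pattern;other;pattern" into a category -> pattern map.
    // Assumes the patterns themselves contain no ';' characters.
    public static Dictionary<string, string> Parse(string multiPattern)
    {
        string[] parts = multiPattern.Split(';');
        var patterns = new Dictionary<string, string>();
        for (int i = 0; i + 1 < parts.Length; i += 2)
            patterns[parts[i]] = parts[i + 1];
        return patterns;
    }

    // Formats with the pattern for the category chosen by the rule,
    // falling back to "other" when that category was not supplied.
    public static string Format(string multiPattern, Func<int, string> rule, int count)
    {
        Dictionary<string, string> patterns = Parse(multiPattern);
        string category = rule(count);
        string pattern = patterns.TryGetValue(category, out var found)
            ? found
            : patterns["other"];
        return string.Format(pattern, count);
    }

    // Hand-written English rule: "one" for 1, "other" otherwise.
    public static string EnglishRule(int n) => n == 1 ? "one" : "other";
}
```

With this sketch, a language such as Japanese only needs to supply the other pattern, because every count falls back to it.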

I have implemented an open-source library that uses the CLDR rules (converted from XML into C# code and included in the library) and multi-pattern strings to support both grammatical numbers and grammatical genders.

https://github.com/jaska45/I18N

Using this library, your sample turns into

var message = MultiPattern.Format("one;There is {0} item remaining;other;There are {0} items remaining", count);
Deathblow answered 7/11, 2017 at 0:38 Comment(0)
