Language dependent sorting with R

1) How to sort correctly?

The task is to sort abbreviated US state names according to the English alphabet. But I noticed that R sorts character vectors according to some kind of operating system language or regional setting. E.g., in my language (Lithuanian), even the order of the plain Latin (non-Lithuanian) letters differs from their order in the English alphabet. Compare the order of the non-Lithuanian letters in both alphabets:

"ABCDEFGHI Y JKLMNOPRSTUVZ"

sort(LETTERS)
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "Y" "J" "K" "L" "M" "N"
[16] "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Z"

vs.

"ABCDEFGHIJKLMNOPQRSTUVWX Y Z"

LETTERS
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O"
[16] "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"

So the order of the sorted state abbreviations also differs (notice the last two; they should be "WV" and then "WY"):

sort(state.abb)
 [1] "AK" "AL" "AR" "AZ" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "IA"
[13] "ID" "IL" "IN" "KY" "KS" "LA" "MA" "MD" "ME" "MI" "MN" "MO"
[25] "MS" "MT" "NC" "ND" "NE" "NH" "NY" "NJ" "NM" "NV" "OH" "OK"
[37] "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT" "VA" "VT" "WA" "WI"
[49] "WY" "WV"

I tried Sys.setlocale("LC_TIME","English_United States.1252"). It helped to get English names of weekdays in plots, graphs and figures.

Now I need help to sort correctly in "English" way.

2) What are the other important language-dependent settings in R a beginner R user should pay attention to?

If you have advice, where R behaves language-dependently and how to deal with that, please list it.

Sequester answered 2/8, 2015 at 13:7 Comment(4)
Is your OS language/regional settings non-English?Sequester
Have you tried LC_ALL instead of LC_TIME? As you may surmise, LC_TIME only affects date/time related localisation. — Another thing, eschew Windows character encodings (such as codepage 1252). Use UTF-8 exclusively.Homicide
LC_ALL works: > sort(LETTERS) [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" [16] "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z" Thank you, Konrad Rudolph. Please write your comment as an answer; I want to accept it so that it is more helpful for others.Sequester
use stringi::stri_sort and pass in any locale you want via stri_opts_collator for language-dependent sorting w/o changing your environment.Ceja
H
6

LC_TIME only controls date/time-related localisation; collation is governed by LC_COLLATE. For your purposes, LC_ALL, which sets every category at once, should do the trick:

Sys.setlocale('LC_ALL', 'English_United States.1252')
sort(letters)

However, beware that these settings are operating system specific. The above would for instance not work on a typical Unix system. Instead, the string 'en_US.UTF-8' is generally a good setting — but under Windows, that itself may pose problems as R’s Unicode support is sketchy on Windows.
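
A minimal sketch of how this could be made portable, assuming you only want to change collation for a single sort: try a few locale names in turn and restore the previous setting afterwards. The helper name sort_with_collation and the candidate list are my own illustrations, not part of R. The "C" fallback sorts by byte value, which matches English order for plain upper-case ASCII such as state.abb.

```r
# Hypothetical helper: sort x under the first collation locale that the OS accepts,
# then restore whatever LC_COLLATE was in effect before the call.
sort_with_collation <- function(x,
                                candidates = c("en_US.UTF-8",
                                               "English_United States.1252",
                                               "C")) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  for (loc in candidates) {
    # Sys.setlocale() returns "" (with a warning) if the locale is unavailable
    res <- suppressWarnings(Sys.setlocale("LC_COLLATE", loc))
    if (nzchar(res)) break
  }
  sort(x)
}

sort_with_collation(state.abb)
```

Setting only LC_COLLATE leaves the other categories (dates, messages, number formats) untouched, which may be preferable to LC_ALL when only the sort order matters.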

Homicide answered 2/8, 2015 at 13:45 Comment(0)
R
7

I am not familiar with R, but it seems to have the same problem as many other programming languages: the lack of native Unicode support in the standard library. By "Unicode support" I mean chapter 3 of the Unicode Standard (http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf), the annexes to the Unicode Standard (especially the one that deals with collation, http://unicode.org/reports/tr10/) and up-to-date versions of CLDR (http://cldr.unicode.org/). Essentially, the rules for sorting are ambiguous and cannot be standardized without picking some "true" method and neglecting cultural differences. This has been partially mitigated by allowing multiple collation levels that ignore certain details (such as diacritic marks), by creating the case-folding algorithm (in some cases toLower(toUpper(str)) != toLower(str)), and by defining collation rules through the CLDR database, but the problem remains. There are also issues such as context-dependent comparison (http://unicode.org/reports/tr10/#Contextual_Sensitivity), which require a mature solution that conforms to the Unicode Standard if you want a 'correct' string comparison.

There is a well-known library called ICU (International Components for Unicode), which implements far more of the Unicode Standard than most other libraries out there. It has implementations in C/C++ and Java, all open source under a BSD-like license, and there are bindings to the C version for other languages, including R (https://cran.r-project.org/web/packages/stringi/, http://site.icu-project.org/related). So you could use the 'stringi' package for your text processing, with ICU locales and collation facilities.

Update: To use the ICU collation methods you need ICU4C (how you obtain it varies across OSes) and the stringi package for R:

install.packages('stringi')

Then load it:

library(stringi)

after which you can use the comparison functions (http://docs.rexamine.com/R-man/stringi/stri_compare.html). Additional arguments at the end of these calls are passed on to the collator being created (http://docs.rexamine.com/R-man/stringi/stri_opts_collator.html) and affect how the comparison is performed.

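# Is "WV" ordered before "WY"? The answer depends on the collation locale
stri_cmp_lt("WV", "WY", locale="lt_LT")
stri_cmp_lt("WV", "WY", locale="en_US")
# Three-way comparison at primary strength; strength is an integer from 1 to 4
stri_compare("WV", "WV", locale="en_US", strength=1L)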

For example, the 'strength' parameter above sets the so-called 'collation level' (http://unicode.org/reports/tr10/#Notation). The locale is given by language and country codes, as described here (http://userguide.icu-project.org/locale). You can use these functions to implement a custom sorting function (such as a quicksort that uses them for comparison), because the built-in functions do not seem to provide any way to change the ordering predicate.
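
To make the collation levels a bit more concrete, here is a small sketch, assuming stringi is installed. The test strings are my own; the code uses stri_cmp_equiv with the same locale and strength arguments as above. At level 1 only base letters count, level 2 additionally distinguishes accents, and level 3 (the default) also distinguishes case.

```r
library(stringi)

# "\u00e9" is the letter e with an acute accent, escaped to avoid encoding issues
accented <- "r\u00e9sum\u00e9"

# Level 1 (primary): accents and case are both ignored
primary_equal <- stri_cmp_equiv("resume", accented, locale = "en_US", strength = 1L)

# Level 2 (secondary): case is still ignored, but accents now count
secondary_case <- stri_cmp_equiv("resume", "Resume", locale = "en_US", strength = 2L)
secondary_accent <- stri_cmp_equiv("resume", accented, locale = "en_US", strength = 2L)

# Level 3 (tertiary): case differences count as well
tertiary_case <- stri_cmp_equiv("resume", "Resume", locale = "en_US", strength = 3L)

print(c(primary_equal, secondary_case, secondary_accent, tertiary_case))
```

A lower strength is useful when you want accent-insensitive or case-insensitive matching without rewriting the strings themselves.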

Update2: Or, even better than implementing your own sorting, just use the stri_sort function which allows you to specify a custom ICU collator (http://docs.rexamine.com/R-man/stringi/stri_order.html) as follows:

stri_sort(state.abb, locale="en_US")
 [1] "AK" "AL" "AR" "AZ" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "IA" "ID" "IL" "IN"
[16] "KS" "KY" "LA" "MA" "MD" "ME" "MI" "MN" "MO" "MS" "MT" "NC" "ND" "NE" "NH"
[31] "NJ" "NM" "NV" "NY" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT" "VA"
[46] "VT" "WA" "WI" "WV" "WY"

stri_sort(state.abb, locale="lt_LT")
 [1] "AK" "AL" "AR" "AZ" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "IA" "ID" "IL" "IN"
[16] "KY" "KS" "LA" "MA" "MD" "ME" "MI" "MN" "MO" "MS" "MT" "NC" "ND" "NE" "NH"
[31] "NY" "NJ" "NM" "NV" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT" "VA"
[46] "VT" "WA" "WI" "WY" "WV"

Notice that WV and WY are in different positions for different locales now.
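
If you sort in several places, it may be convenient to build the collator options once with stri_opts_collator and pass them around, as suggested in the comments on the question. The following is only a sketch, assuming stringi is installed. The variable names and the small data frame are my own illustration; stri_order returns a permutation, so it can reorder a whole table by one column.

```r
library(stringi)

# Collator options built once; only locale and strength are set explicitly here
en <- stri_opts_collator(locale = "en_US", strength = 3L)
lt <- stri_opts_collator(locale = "lt_LT", strength = 3L)

# Last two abbreviations under each collator
tail(stri_sort(state.abb, opts_collator = en), 2)
tail(stri_sort(state.abb, opts_collator = lt), 2)

# Reorder a data frame by abbreviation using English collation,
# independent of the session's LC_COLLATE setting
states <- data.frame(abb = state.abb, name = state.name,
                     stringsAsFactors = FALSE)
states_sorted <- states[stri_order(states$abb, opts_collator = en), ]
head(states_sorted)
```

This leaves the R session's locale alone, so other code that relies on the system settings is not affected.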

Rational answered 3/8, 2015 at 18:39 Comment(2)
this is more of an opinion piece than an answer to the question; the only part that directly addresses the question is your last sentence, which isn't much of an answer. Could you expand the last sentence to give some actual examples (e.g. as in @hrbrmstr's comment above)? (I think the rest of the commentary would be OK if it accompanied an answer ...)Peace
@BenBolker This was more like a pointer for the "how to deal with that" part but I am going to try to give an example related to the US states. The Unicode Standard is huge so there is much reading to be done to really understand why there is no 'true' way to sort a given sequence of strings.Rational
