I am not familiar with R, but it seems to share a problem with many other programming languages: the standard library lacks native Unicode support. By "Unicode support" I mean chapter 3 of the Unicode Standard (http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf), the accompanying Unicode annexes and technical standards (especially the one that deals with collation, http://unicode.org/reports/tr10/), and up-to-date versions of CLDR (http://cldr.unicode.org/). Essentially, sorting rules are ambiguous: they cannot be standardized without picking one "true" method and ignoring cultural differences. This has been partially mitigated by allowing multiple collation levels that ignore certain details (such as diacritic marks), by creating the case-folding algorithm (in some cases toLower(toUpper(str)) != toLower(str)), and by defining locale-specific collation rules in the CLDR database, but the underlying problem remains. There are also issues such as context-dependent comparison (http://unicode.org/reports/tr10/#Contextual_Sensitivity), so if you want a 'correct' string comparison you need a mature solution that conforms to the Unicode Standard.
There is a well-known library called ICU (International Components for Unicode), which implements more of the Unicode Standard than most other libraries out there. It has implementations in C/C++ and Java, all open source under a BSD-like license, and there are bindings to the C version for other languages, including R (https://cran.r-project.org/web/packages/stringi/, http://site.icu-project.org/related). So you could use the 'stringi' package for your text processing and get ICU locales and collation facilities.
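The case-mapping caveat mentioned above (toLower(toUpper(str)) != toLower(str)) can be reproduced with stringi's ICU-backed case functions. This is only a small sketch: it assumes stringi is already installed, and the sample word is my own choice, not something from the question.

```r
library(stringi)  # assumption: the stringi package is already installed

# "strasse" spelled with the German sharp s; the escape avoids
# depending on the encoding of the source file
s <- "stra\u00dfe"

# Upper-casing expands the sharp s to "SS", so the round trip changes the text
upper_then_lower <- stri_trans_tolower(stri_trans_toupper(s))

# Lower-casing alone leaves the sharp s untouched
lower_only <- stri_trans_tolower(s)

print(upper_then_lower)
print(lower_only)
print(upper_then_lower == lower_only)
```

The two results differ, which is why comparing strings by "just lower-casing both sides" is not a reliable substitute for a real collator.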
Update:
To use the ICU collation methods you need ICU4C (how you obtain it varies across OSes) and then the stringi package for R:
install.packages('stringi')
Then you should load it
library(stringi)
after which you can use the comparison functions (http://docs.rexamine.com/R-man/stringi/stri_compare.html). Any additional arguments at the end of these calls are passed to the collator that gets created (http://docs.rexamine.com/R-man/stringi/stri_opts_collator.html) and change how the comparison is performed.
stri_cmp_lt("WV", "WY", locale="lt_LT")
stri_cmp_lt("WV", "WY", locale="en_US")
stri_compare("WV", "WV", locale="en_US", strength=1)
For example, the 'strength' parameter above sets the so-called 'collation level' (http://unicode.org/reports/tr10/#Notation). The locale is given by language and country codes as described here (http://userguide.icu-project.org/locale). You can use these functions to implement a custom sorting function (such as a quicksort that uses them for comparison), because the built-in functions do not seem to provide any way to change the ordering predicate.
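To make the effect of 'strength' more concrete, here is a hedged sketch comparing pairs that differ only in accents or only in case. The sample strings are my own; stri_compare returns 0 when the collator treats the two strings as equal at the chosen level.

```r
library(stringi)  # assumption: the stringi package is already installed

plain    <- "resume"
accented <- "r\u00e9sum\u00e9"  # same word with acute accents on both e's

# Level 1 (primary): only base letters count, accents and case are ignored
accent_l1 <- stri_compare(plain, accented, locale = "en_US", strength = 1)
case_l1   <- stri_compare("hello", "HELLO", locale = "en_US", strength = 1)

# Level 2 (secondary): accents now count, case is still ignored
accent_l2 <- stri_compare(plain, accented, locale = "en_US", strength = 2)
case_l2   <- stri_compare("hello", "HELLO", locale = "en_US", strength = 2)

# Level 3 (tertiary): case differences count as well
case_l3   <- stri_compare("hello", "HELLO", locale = "en_US", strength = 3)

print(c(accent_l1 = accent_l1, case_l1 = case_l1))
print(c(accent_l2 = accent_l2, case_l2 = case_l2))
print(c(case_l3 = case_l3))
```

Picking a lower strength is the usual way to get accent-insensitive or case-insensitive comparison without touching the strings themselves.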
Update2: Or, even better than implementing your own sorting, just use the stri_sort function, which allows you to specify a custom ICU collator (http://docs.rexamine.com/R-man/stringi/stri_order.html), as follows:
stri_sort(state.abb, locale="en_US")
stri_sort(state.abb, locale="lt_LT")
[1] "AK" "AL" "AR" "AZ" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "IA" "ID" "IL" "IN"
[16] "KS" "KY" "LA" "MA" "MD" "ME" "MI" "MN" "MO" "MS" "MT" "NC" "ND" "NE" "NH"
[31] "NJ" "NM" "NV" "NY" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT" "VA"
[46] "VT" "WA" "WI" "WV" "WY"
[1] "AK" "AL" "AR" "AZ" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "IA" "ID" "IL" "IN"
[16] "KY" "KS" "LA" "MA" "MD" "ME" "MI" "MN" "MO" "MS" "MT" "NC" "ND" "NE" "NH"
[31] "NY" "NJ" "NM" "NV" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT" "VA"
[46] "VT" "WA" "WI" "WY" "WV"
Notice that WV and WY are in different positions for different locales now.
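If you sort in several places, it may be tidier to build the collator options once with stri_opts_collator and pass them via opts_collator. The sketch below also uses stri_order to reorder a data frame; the data frame and its 'id' column are made up for illustration, and stringi is assumed to be installed.

```r
library(stringi)  # assumption: the stringi package is already installed

abbr <- c("WV", "WY", "KS", "KY", "NJ", "NY")

# Build the collator options once; only the locale is changed here
lt <- stri_opts_collator(locale = "lt_LT")
en <- stri_opts_collator(locale = "en_US")

sorted_en <- stri_sort(abbr, opts_collator = en)
sorted_lt <- stri_sort(abbr, opts_collator = lt)

# stri_order returns a permutation, which is handy for reordering
# the rows of a data frame (hypothetical example data)
df <- data.frame(abbr = abbr, id = seq_along(abbr), stringsAsFactors = FALSE)
df_lt <- df[stri_order(df$abbr, opts_collator = lt), ]

print(sorted_en)
print(sorted_lt)
print(df_lt)
```

None of this changes the session locale, so the rest of your code keeps behaving as before.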
LC_ALL instead of LC_TIME? As you may surmise, LC_TIME only affects date/time related localisation. Another thing: eschew Windows character encodings (such as codepage 1252). Use UTF-8 exclusively. – Homicide
LC_ALL works:
> sort(LETTERS)
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O"
[16] "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
Thank you, Konrad Rudolph. Write your comment as an answer; I want to accept it so that it would be more helpful for others. – Sequester
stringi::stri_sort and pass in any locale you want via stri_opts_collator for language-dependent sorting w/o changing your environment. – Ceja