What are the R sorting rules of character vectors?
R sorts character vectors in a sequence which I describe as alphabetic, not ASCII.

For example:

sort(c("dog", "Cat", "Dog", "cat"))
[1] "cat" "Cat" "dog" "Dog"

Three questions:

  1. What is the technically correct terminology to describe this sort order?
  2. I cannot find any reference to this in the manuals on CRAN. Where can I find a description of the sorting rules in R?
  3. Is this any different from the sorting behaviour in other languages like C, Java, Perl or PHP?
Coiffeur answered 29/8, 2011 at 11:22 Comment(1)
Related to "Do not ignore case in sorting character strings". – Vagus

The Details section of help(sort) states:

 The sort order for character vectors will depend on the collating
 sequence of the locale in use: see ‘Comparison’.  The sort order
 for factors is the order of their levels (which is particularly
 appropriate for ordered factors).

and help(Comparison) then shows:

 Comparison of strings in character vectors is lexicographic within
 the strings using the collating sequence of the locale in use: see
 ‘locales’.  The collating sequence of locales such as ‘en_US’ is
 normally different from ‘C’ (which should use ASCII) and can be
 surprising.  Beware of making _any_ assumptions about the 
 collation order: e.g. in Estonian ‘Z’ comes between ‘S’ and ‘T’,
 and collation is not necessarily character-by-character - in
 Danish ‘aa’ sorts as a single letter, after ‘z’.  In Welsh ‘ng’
 may or may not be a single sorting unit: if it is it follows ‘g’.
 Some platforms may not respect the locale and always sort in
 numerical order of the bytes in an 8-bit locale, or in Unicode
 point order for a UTF-8 locale (and may not sort in the same order
 for the same language in different character sets).  Collation of
 non-letters (spaces, punctuation signs, hyphens, fractions and so
 on) is even more problematic.

So the sort order depends on your locale setting.
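
As a minimal sketch of that locale dependence, the snippet below sorts the question's vector under the session's collation and then again after switching `LC_COLLATE` to `C` with `Sys.setlocale()`. The variable names are illustrative only, and the first result varies by platform and locale, so no output is shown for it.

```r
x <- c("dog", "Cat", "Dog", "cat")

# Remember the collation locale currently in use so it can be restored.
old_collate <- Sys.getlocale("LC_COLLATE")

# Sort under whatever locale the session started with (locale-dependent result).
sorted_default <- sort(x)

# Switch collation to the C locale, which compares by byte value,
# so all uppercase ASCII letters sort before lowercase ones.
invisible(Sys.setlocale("LC_COLLATE", "C"))
sorted_c <- sort(x)
print(sorted_c)
# [1] "Cat" "Dog" "cat" "dog"

# Restore the original collation locale.
invisible(Sys.setlocale("LC_COLLATE", old_collate))
```

Changing the locale this way affects the whole session, which is why the sketch saves and restores the previous value.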

Demimondaine answered 29/8, 2011 at 11:24 Comment(5)
D'oh. I was trying to find this in cran.r-project.org/doc/manuals/R-ints.html. Thank you. – Coiffeur
I won't attempt to improve upon Dirk's and the help page's description, but outside of R one might find it described as lexicographic sorting, albeit case-insensitive. The collation rule is a serious consideration: naive text processing usually assumes English order, which works badly for some other languages. A good example is name sorting, which can look really weird to native speakers or to people who think only of 26 letters in strict A-Z order. – Labial
And I've just spent a long time discovering that space characters may or may not be ignored, and that this changed depending on whether I was running tests locally or under R CMD check. – Sounding
A better method is to use stringr::str_sort, where you can specify the locale so that the result is consistent. – Anatola
Only if you are willing or required to accept the heavier burden of stringr dependencies. – Demimondaine
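
For a result that does not depend on the session locale, two options can be sketched. `stringr::str_sort()` takes a `locale` argument, and base R's `sort(x, method = "radix")` is documented to collate character vectors in the C locale. The locale `"en"` is just an example choice, and the `requireNamespace()` guard is only there so the snippet also runs without stringr installed.

```r
x <- c("dog", "Cat", "Dog", "cat")

# Option 1: stringr with an explicit locale, independent of the session locale.
if (requireNamespace("stringr", quietly = TRUE)) {
  print(stringr::str_sort(x, locale = "en"))
}

# Option 2: base R only; the radix method always uses C-locale order
# for character vectors, regardless of the session's LC_COLLATE.
sorted_radix <- sort(x, method = "radix")
print(sorted_radix)
# [1] "Cat" "Dog" "cat" "dog"
```

The trade-off is the one raised in the comments: the stringr route gives language-aware ordering at the cost of extra dependencies, while the radix route has no dependencies but only offers byte order.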

Sorting depends on locale. My solution for that is the following...

I create a ~/.Renviron file containing the line LC_ALL=C:

cat ~/.Renviron
LC_ALL=C

Then, after restarting R, sorting uses the C locale:

x=c("A", "B", "d", "F", "g", "H")
sort(x)
#[1] "A" "B" "F" "H" "d" "g"
Smyrna answered 16/9, 2020 at 18:26 Comment(0)
