How can I sort an array of strings based on a non-standard alphabet?

Asked 3/3, 2016 at 16:53 Answered 4/3, 2016 at 5:32

I'm trying to sort an array of phrases in Esperanto by alphabetical order. Is there a way to use sort_by to accomplish this?

I'm checking each character of the string against its index in the Esperanto alphabet, with each increasing index being a step lower in sorting priority:

  esp_alph = " abcĉdefgĝhĥijĵklmnoprsŝtuŭvz"
  arr.sort_by {|string|  
    [esp_alph.index(string[0]),
     esp_alph.index(string[1]),
     esp_alph.index(string[2]),
     esp_alph.index(string[3])]}

However, this isn't a scalable solution, and it breaks if I have more conditions than I have characters in my string. It seems like I'm right at the cusp of a loop based on my string length, but I can't figure out how to implement it without syntax errors. Or is there a better way to go about solving this issue?

Remediosremedy answered 3/3, 2016 at 16:53 Comment(3)

Your code is invalid. What is end doing? – Bureaucratic 3/3, 2016 at 17:8

It is not clear what you mean by "if I have more conditions than I have characters in my string". – Bureaucratic 3/3, 2016 at 17:11

Thanks for pointing out that 'end'; it was left over from the document I copied this from. What I mean by more conditions than characters is this: in my sort_by block, each character is tested with a separate line of code to find its index relative to the esp_alph string. So if I have a string "abcd", and four lines describing each of those characters in terms of their location in the esp_alph string, the block works. However, if I run the block on the string "abc", it breaks because the line esp_alph.index(string[3]) is testing nil. Condition was not the right word, thanks. – Remediosremedy 3/3, 2016 at 18:39

Simply replace all characters in the Esperanto alphabet with some characters in the ASCII table so that the Esperanto alphabet order matches the ASCII order.

Suppose you have the Esperanto alphabet in the order you gave, which I assume is the correct order:

esp_alph = " abcĉdefgĝhĥijĵklmnoprsŝtuŭvz"

and take any run of consecutive characters of the same length from the ASCII table (note that \\ is a single character):

ascii = "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\"

or, written as a range for tr:

ascii = "@-\\"

Then, you can simply do:

arr.sort_by{|string| string.tr(esp_alph, ascii)}

Here, tr is faster than gsub, and I think it scales well enough.
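To make the mechanics concrete, here is a minimal runnable sketch of the tr approach. The constant and method names (ESPERANTO, STAND_IN, esperanto_sort) and the sample words are made up for illustration; the two strings are the ones given above. It assumes lowercase input with precomposed (single-codepoint) accented letters.

```ruby
# Hypothetical names; the alphabet string is the one from the question.
ESPERANTO = " abcĉdefgĝhĥijĵklmnoprsŝtuŭvz"
# A run of consecutive ASCII characters of the same length (29).
STAND_IN  = "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\"

def esperanto_sort(words)
  # tr maps the i-th alphabet character to the i-th stand-in character,
  # so plain string comparison of the keys follows Esperanto order.
  words.sort_by { |word| word.tr(ESPERANTO, STAND_IN) }
end

words = ["ĉu", "dek", "cent", "ĝi", "gusto", "ci"]
p esperanto_sort(words)
# => ["cent", "ci", "ĉu", "dek", "gusto", "ĝi"]
```

Characters that are not in the alphabet string are left untouched by tr, so uppercase letters in the input would collide with the stand-in characters. Downcasing first, or picking a different stand-in range, would be one way around that.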

Bureaucratic answered 3/3, 2016 at 17:20 Comment(7)

This fails on combining diacritics. – Trickish 3/3, 2016 at 17:22

@mudasobwa Aren't they expressed as single character each? In Ruby > 1.9, which handles multi-byte characters, I think it works. – Bureaucratic 3/3, 2016 at 17:23

No. "ĉ".codepoints #⇒ [265], while "c\u0302".codepoints #⇒ [99, 770]. I nevertheless like this approach. Combined diacritics should be turned into former latin1 or vice versa before processing. – Trickish 3/3, 2016 at 17:25

They are not simply multi-bytes. They are multi codepoints. – Trickish 3/3, 2016 at 17:26

@mudasobwa But doesn't the OP's example have single codepoint characters? – Bureaucratic 3/3, 2016 at 17:27

Well, technically we can’t distinguish "ĉ" and "c\u0302" by looking at them :) The OP’s example happens to have single-codepoint chars. Yes. But smart people might tune the keyboard (I did, to have both Spanish ñ and German ä accents simultaneously.) Anyway, I upvoted this solution since a) nobody save for me cares about these two different representations of accents and b) this solution might be easily updated to handle combined diacritics. – Trickish 3/3, 2016 at 17:31

Fantastic, thank you. I'm still getting my head around the mechanics of how this works, but it's something to go on, I appreciate it! – Remediosremedy 3/3, 2016 at 18:49

esp_alph = " abcĉĉdefgĝĝhĥĥijĵĵklmnoprsŝŝtuŭŭvz"

arr = ["abc\u0302a", "abĉa","abca" ]
p arr.sort_by {|string| string.chars.map{|c| esp_alph.index(c)}}
# => ["abca", "abĉa", "abĉa"]

For better performance the esp_alph string should be a Hash, probably.
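An alternative to listing both forms of each accented letter is to normalize every string to its composed form before looking characters up. The following is only a sketch under that assumption: it relies on String#unicode_normalize (Ruby 2.2 or later), and ESP_ALPHABET and normalized_key are hypothetical names.

```ruby
# Hypothetical names; the alphabet holds precomposed letters only.
ESP_ALPHABET = " abcĉdefgĝhĥijĵklmnoprsŝtuŭvz"

def normalized_key(string)
  # NFC composes "c" + U+0302 into the single code point U+0109 ("ĉ"),
  # so both spellings produce the same list of indices.
  # A character outside the alphabet yields nil here, which can make the
  # comparison inside sort_by fail.
  string.unicode_normalize(:nfc).each_char.map { |c| ESP_ALPHABET.index(c) }
end

arr = ["abc\u0302a", "ab\u0109a", "abca"]
p arr.map { |s| normalized_key(s) }
# => [[1, 2, 4, 1], [1, 2, 4, 1], [1, 2, 3, 1]]
p arr.sort_by { |s| normalized_key(s) }.first
# => "abca"
```

With this key the decomposed and precomposed spellings compare as equal, so their relative order in the result is not defined.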

Exorcist answered 3/3, 2016 at 17:10 Comment(5)

This fails on combined diacritics. It is not as easy. – Trickish 3/3, 2016 at 17:11

@mudasobwa do you have an example? – Exorcist 3/3, 2016 at 17:12

"abc\u0302", "abu\u0306" etc. – Trickish 3/3, 2016 at 17:21

@mudasobwa Thank you. Code adapted. – Exorcist 3/3, 2016 at 20:48

Rule 73: chars when followed by an Array method, each_char when followed by an Enumerable method, the latter to avoid the creation of an unneeded temporary array. – Sewel 4/3, 2016 at 6:29

ESP_ALPH = "abcĉdefgĝhĥijĵklmnoprsŝtuŭvz"

ESP_MAP  = ESP_ALPH.each_char.with_index.to_a.to_h
  #=> {"a"=> 0, "b"=> 1, "c"=> 2, "ĉ"=> 3, "d"=> 4, "e"=> 5, "f"=> 6,
  #    "g"=> 7, "ĝ"=> 8, "h"=> 9, "ĥ"=>10, "i"=>11, "j"=>12, "ĵ"=>13,
  #    "k"=>14, "l"=>15, "m"=>16, "n"=>17, "o"=>18, "p"=>19, "r"=>20,
  #    "s"=>21, "ŝ"=>22, "t"=>23, "u"=>24, "ŭ"=>25, "v"=>26, "z"=>27}

def sort_esp(str)
  str.each_char.sort_by { |c| ESP_MAP[c] }.join
end

str = ESP_ALPH.chars.shuffle.join
  #=> "hlbzŭvŝerĝoipjafntĵsmgĉdukĥc"

sort_esp(str) == ESP_ALPH
  #=> true
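Note that sort_esp orders the characters within a single string. For the array of phrases in the question, the same kind of lookup table can be used as a sort key instead. This is a rough sketch, not a tested drop-in: LETTER_RANK and sort_phrases are hypothetical names, the space is ranked before "a" as in the question's alphabet string, and ranking unknown characters last is an arbitrary choice of mine.

```ruby
# Hypothetical lookup table: character => rank, with the space first.
LETTER_RANK = " abcĉdefgĝhĥijĵklmnoprsŝtuŭvz".each_char.with_index.to_h

def sort_phrases(phrases)
  phrases.sort_by do |phrase|
    # One rank per character; arrays compare element by element, and a
    # shorter array that is a prefix of a longer one sorts first.
    # Characters missing from the table get a rank past the last letter.
    phrase.each_char.map { |c| LETTER_RANK.fetch(c, LETTER_RANK.size) }
  end
end

phrases = ["ĝis revido", "bonan tagon", "ĉu ne", "bonan nokton", "cent", "gusto"]
p sort_phrases(phrases)
# => ["bonan nokton", "bonan tagon", "cent", "ĉu ne", "gusto", "ĝis revido"]
```

Because the key is built from however many characters the phrase has, strings of different lengths (such as "abc" and "abcd") no longer need a fixed number of index lookups.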

Sewel answered 4/3, 2016 at 5:32 Comment(1)

@mudasobwa, in anticipation of your comment, I would think one could deal with diacritics by defining ESP_ALPH as an array of characters and changing the rest accordingly (e.g., remove both .each_char's and .chars). Yes? – Sewel 4/3, 2016 at 5:41
