How can I sort an array of strings based on a non standard alphabet?

I'm trying to sort an array of phrases in Esperanto by alphabetical order. Is there a way to use sort_by to accomplish this?

I'm checking each character of the string against its index in the Esperanto alphabet, with each increasing index being a step lower in sorting priority:

  esp_alph = " abcĉdefgĝhĥijĵklmnoprsŝtuŭvz"
  arr.sort_by {|string|  
    [esp_alph.index(string[0]),
     esp_alph.index(string[1]),
     esp_alph.index(string[2]),
     esp_alph.index(string[3])]}

However, this isn't a scalable solution, and it breaks if I have more conditions than I have characters in my string. It seems like I'm right at the cusp of a loop based on my string length, but I can't figure out how to implement it without syntax errors. Or is there a better way to go about solving this issue?

Remediosremedy answered 3/3, 2016 at 16:53 Comment(3)
Your code is invalid. What is end doing?Bureaucratic
It is not clear what you mean by "if I have more conditions than I have characters in my string".Bureaucratic
Thanks for pointing out that 'end'; it was left over from the document I copied this from. What I mean by more conditions than characters is this: in my sort_by block, each character is tested with a separate line of code to find its index relative to the esp_alph string. So if I have a string "abcd", and four lines describing each of those characters in terms of their location in the esp_alph string, the block works. However, if I run the block on the string "abc", it breaks because the line esp_alph.index(string[3]) is testing nil. "Condition" was not the right word, thanks.Remediosremedy

Simply replace all characters in the Esperanto alphabet with some characters in the ASCII table so that the Esperanto alphabet order matches the ASCII order.

Suppose you have the Esperanto alphabet in the order you gave, which I assume is the correct order:

esp_alph = " abcĉdefgĝhĥijĵklmnoprsŝtuŭvz"

and take any contiguous run of the ASCII character table of the same length (notice that \\ in a Ruby string literal is a single backslash character):

ascii = "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\"

or

ascii = "@-\\"

Then, you can simply do:

arr.sort_by{|string| string.tr(esp_alph, ascii)}

Here, tr is faster than gsub, and I think it scales well enough.
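As a quick check of the idea, here is a minimal sketch on a small word list. The sample words are made up for illustration and are not from the question. It assumes every string contains only spaces and lowercase letters from esp_alph, each stored as a single codepoint:

```ruby
# Esperanto alphabet, preceded by a space so that spaces sort first.
esp_alph = " abcĉdefgĝhĥijĵklmnoprsŝtuŭvz"
# 29 consecutive ASCII characters, "@" through "\", one per entry in esp_alph.
ascii    = "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\"

# Hypothetical sample data.
words = ["ŝafo", "sako", "ĉar", "cent", "dek"]

# The translated string is used only as the sort key; the original
# strings are what end up in the result.
sorted = words.sort_by { |w| w.tr(esp_alph, ascii) }
p sorted
# => ["cent", "ĉar", "dek", "sako", "ŝafo"]
```

A plain words.sort would instead place "ĉar" and "ŝafo" after all the unaccented words, because their codepoints lie above the ASCII range.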

Bureaucratic answered 3/3, 2016 at 17:20 Comment(7)
This fails on combining diacritics.Trickish
@mudasobwa Aren't they expressed as single character each? In Ruby > 1.9, which handles multi-byte characters, I think it works.Bureaucratic
No. "ĉ".codepoints #⇒ [265], while "c\u0302".codepoints #⇒ [99, 770]. I nevertheless like this approach. Combining diacritics should be converted to their precomposed single-codepoint form, or vice versa, before processing.Trickish
They are not simply multi-bytes. They are multi codepoints.Trickish
@mudasobwa But doesn't the OP's example have single codepoint characters?Bureaucratic
Well, technically we can’t distinguish "ĉ" and "c\u0302" by looking at them :) The OP’s example happens to have single-codepoint chars, yes. But smart people might tune their keyboard (I did, to have both the Spanish ñ and the German ä available simultaneously). Anyway, I upvoted this solution since a) nobody save for me cares about these two different representations of accents and b) this solution can easily be updated to handle combining diacritics.Trickish
Fantastic, thank you. I'm still getting my head around the mechanics of how this works, but it's something to go on, I appreciate it!Remediosremedy
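The combining-diacritics concern raised in the comments above can be handled by normalizing each string to NFC before translating it. This is a minimal sketch, assuming Ruby 2.2 or later (where String#unicode_normalize is available) and the same esp_alph and ascii strings as in the answer; the sample words are made up:

```ruby
# Normalizing the alphabet as well guards against a decomposed letter
# having slipped into the literal itself.
esp_alph = " abcĉdefgĝhĥijĵklmnoprsŝtuŭvz".unicode_normalize(:nfc)
ascii    = "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\"

# "c\u0302" is "c" followed by a combining circumflex (two codepoints);
# NFC composes the pair into the single codepoint U+0109 ("ĉ").
decomposed = "c\u0302ar"
p decomposed.codepoints                          # => [99, 770, 97, 114]
p decomposed.unicode_normalize(:nfc).codepoints  # => [265, 97, 114]

# Hypothetical sample data mixing both representations.
words  = ["dek", decomposed, "cent"]
sorted = words.sort_by { |w| w.unicode_normalize(:nfc).tr(esp_alph, ascii) }
p sorted.map { |w| w.unicode_normalize(:nfc) }
# => ["cent", "ĉar", "dek"]
```

Only the sort key is normalized here, so the strings in the result keep whatever representation they had in the input.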
esp_alph = " abcĉĉdefgĝĝhĥĥijĵĵklmnoprsŝŝtuŭŭvz"

arr = ["abc\u0302a", "abĉa","abca" ]
p arr.sort_by {|string| string.chars.map{|c| esp_alph.index(c)}}
# => ["abca", "abĉa", "abĉa"]

For better performance, the lookup should probably use a Hash (character => index) instead of esp_alph.index, which scans the string on every call.
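The Hash suggestion could look like the following minimal sketch. The name esp_rank is hypothetical, and the sample array is made up. The code assumes every character occurs in esp_alph as a single codepoint; fetch raises a KeyError for anything else, rather than silently producing nil:

```ruby
esp_alph = " abcĉdefgĝhĥijĵklmnoprsŝtuŭvz"

# Build the character => position table once, so each lookup is a
# Hash access instead of a scan of the alphabet string.
esp_rank = esp_alph.each_char.with_index.to_h

arr = ["abĉa", "abd", "abca", "ab"]

# Each string is keyed by the array of its letters' positions; arrays
# compare element by element, and a shorter prefix sorts first.
sorted = arr.sort_by { |string| string.each_char.map { |c| esp_rank.fetch(c) } }
p sorted
# => ["ab", "abca", "abĉa", "abd"]
```

Because the key is built from every character, this works for strings of any length, which addresses the fixed four-character limit in the question.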

Exorcist answered 3/3, 2016 at 17:10 Comment(5)
This fails on combined diacritics. It is not as easy.Trickish
@mudasobwa do you have an example?Exorcist
"abc\u0302", "abu\u0306" etc.Trickish
@mudasobwa Thank you. Code adapted.Exorcist
Rule 73: chars when followed by an Array method, each_char when followed by an Enumerable method, the latter to avoid the creation of an unneeded temporary array.Sewel
ESP_ALPH = "abcĉdefgĝhĥijĵklmnoprsŝtuŭvz"

ESP_MAP  = ESP_ALPH.each_char.with_index.to_a.to_h
  #=> {"a"=> 0, "b"=> 1, "c"=> 2, "ĉ"=> 3, "d"=> 4, "e"=> 5, "f"=> 6,
  #    "g"=> 7, "ĝ"=> 8, "h"=> 9, "ĥ"=>10, "i"=>11, "j"=>12, "ĵ"=>13,
  #    "k"=>14, "l"=>15, "m"=>16, "n"=>17, "o"=>18, "p"=>19, "r"=>20,
  #    "s"=>21, "ŝ"=>22, "t"=>23, "u"=>24, "ŭ"=>25, "v"=>26, "z"=>27}

def sort_esp(str)
  str.each_char.sort_by { |c| ESP_MAP[c] }.join
end

str = ESP_ALPH.chars.shuffle.join
  #=> "hlbzŭvŝerĝoipjafntĵsmgĉdukĥc"

sort_esp(str) == ESP_ALPH
  #=> true
Sewel answered 4/3, 2016 at 5:32 Comment(1)
@mudasobwa, in anticipation of your comment, I would think one could deal with diacritics by defining ESP_ALPH as an array of characters and changing the rest accordingly (e.g., remove both .each_char's and .chars). Yes?Sewel
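One way the array idea might be sketched is shown below. This is an illustration under stated assumptions, not the answerer's code: the names esp_letters and esp_map are hypothetical, sort_esp takes the map as an extra parameter, String#unicode_normalize requires Ruby 2.2 or later, and the regex \X is used to split a string into grapheme clusters so that a base letter stays attached to its combining mark:

```ruby
# Alphabet as an array of letters rather than one string, so that a letter
# may consist of more than one codepoint (base letter + combining mark).
esp_letters = %w[a b c ĉ d e f g ĝ h ĥ i j ĵ k l m n o p r s ŝ t u ŭ v z]

# Register each letter under both its composed (NFC) and decomposed (NFD) form.
esp_map = {}
esp_letters.each_with_index do |letter, i|
  esp_map[letter.unicode_normalize(:nfc)] = i
  esp_map[letter.unicode_normalize(:nfd)] = i
end

# Sorts the letters within one string, as in the answer above.
# \X matches one grapheme cluster, so "c\u0302" is treated as one letter.
def sort_esp(str, esp_map)
  str.scan(/\X/).sort_by { |g| esp_map.fetch(g) }.join
end

p sort_esp("\u015Dc\u0302ac", esp_map).codepoints
# => [97, 99, 99, 770, 349]   ("a", "c", "c" + combining circumflex, "ŝ")
```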
