Percent encoding in Ruby


Asked 9/7, 2015 at 12:55 Answered 9/7, 2015 at 13:27

In Ruby, I get the percent-encoding of 'ä' by

```
require 'cgi'
CGI.escape('ä')
=> "%C3%A4"
```

I get the same bytes with

```
'ä'.unpack('H2' * 'ä'.bytesize)
=> ["c3", "a4"]
```

I have two questions:

What is the reverse of the first operation? Shouldn't it be
```
["c3", "a4"].pack('H2' * 'ä'.bytesize)
=> "\xC3\xA4"
```
For my application I need 'ä' to be encoded as "%E4", which is the hex value of 'ä'.ord. Is there a Ruby method for that?
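
On the first question, a minimal sketch: `pack` does reverse `unpack`, but it returns a binary (ASCII-8BIT) string, which is why the result inspects as "\xC3\xA4". Relabelling those bytes as UTF-8 with `force_encoding` should give 'ä' back. The helper name `hex_to_string` is made up for illustration and is not part of Ruby.

```ruby
# hex_to_string is a hypothetical helper, not a Ruby built-in.
def hex_to_string(hex_pairs, encoding = Encoding::UTF_8)
  # pack returns a binary (ASCII-8BIT) string; force_encoding relabels
  # the same bytes without changing them.
  hex_pairs.pack('H2' * hex_pairs.size).force_encoding(encoding)
end

hex = 'ä'.unpack('H2' * 'ä'.bytesize)  # ["c3", "a4"]
p hex_to_string(hex) == 'ä'            # true
```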

Illyrian answered 9/7, 2015 at 12:55 Comment(5)

Why would you need "%e4"? I couldn't convert it back to that unicode char. – Parakeet 9/7, 2015 at 13:6

I think you need to understand that decimal values 128-255 do not uniquely represent a particular character, unless you also specify the encoding that goes along with it. It looks like you're using ISO 8859-1? – Isaiah 9/7, 2015 at 13:10

Perhaps helpful: "\xE4".encode('utf-8','iso-8859-1') produces "ä". "ä".encode('iso-8859-1').codepoints.first.to_s(16) returns "e4". – Isaiah 9/7, 2015 at 13:17

A shorter version of the above: URI::escape('ä'.encode('iso-8859-1')) (you need to require 'uri'). – Lundgren 9/7, 2015 at 13:21

I wanted "%E4" since this is what my email client displays correctly as an 'ä', while "%C3%A4" is displayed as "Ã¤" in the email address. My strings in Ruby are UTF-8. So conversion to ISO-8859-1 may be a solution for this particular email client, and I will have to test it with others. – Illyrian 9/7, 2015 at 13:45

As I mentioned in my comment, expecting the character ä to be the single byte 228 (0xE4) implies that you're dealing with the ISO 8859-1 character encoding. In UTF-8 the same character is the two bytes 0xC3 0xA4, which is where "%C3%A4" comes from.

So, you need to tell Ruby what encoding you want for your string.

```
str1 = "Hullo ängstrom" # uses whatever encoding is current, generally UTF-8
str2 = str1.encode('iso-8859-1')
```
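
To see what `encode` actually changes, here is a small sketch that compares the bytes of the same character in both encodings. `hex_bytes` is a hypothetical helper written for this example, and the snippet assumes the source file is UTF-8 (the default since Ruby 2.0).

```ruby
# hex_bytes is a hypothetical helper, not part of Ruby's standard library.
def hex_bytes(str)
  str.bytes.map { |b| format('%02x', b) }
end

utf8   = 'ä'                        # UTF-8 literal (assumed source encoding)
latin1 = utf8.encode('iso-8859-1')  # same character, different bytes

p hex_bytes(utf8)         # ["c3", "a4"]
p hex_bytes(latin1)       # ["e4"]
p utf8.ord == latin1.ord  # true, both are codepoint 228
```

The character and its codepoint stay the same; only the byte representation differs, and percent-encoding works on those bytes.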

Then you can encode it as you like:

```
require 'cgi'
s2c = CGI.escape str2
#=> "Hullo+%E4ngstrom"

require 'uri'
# Note: URI.escape was removed in Ruby 3.0; this call needs Ruby 2.x or earlier.
s2u = URI.escape str2
#=> "Hullo%20%E4ngstrom"
```
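
On Ruby 3.0 and later `URI.escape` is no longer available. One possible alternative, sketched here on the assumption that form-style escaping is acceptable, is `URI.encode_www_form_component`, whose optional second argument transcodes the string before escaping. The helper name `latin1_percent_encode` and its `plus_for_space` keyword are made up for this example.

```ruby
require 'uri'

# latin1_percent_encode is a hypothetical helper, not a Ruby built-in.
def latin1_percent_encode(str, plus_for_space: true)
  # The second argument asks URI to transcode to ISO-8859-1 before escaping.
  escaped = URI.encode_www_form_component(str, Encoding::ISO_8859_1)
  # A literal "+" in the input is escaped as %2B, so any "+" left is a space.
  plus_for_space ? escaped : escaped.gsub('+', '%20')
end

puts latin1_percent_encode('Hullo ängstrom')                         # Hullo+%E4ngstrom
puts latin1_percent_encode('Hullo ängstrom', plus_for_space: false)  # Hullo%20%E4ngstrom
```

Note that the set of characters left unescaped is not identical to what `CGI.escape` or the old `URI.escape` used, so it is worth checking against the consumer of the string.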

Then, to reverse it, you must first (a) unescape the value, and then (b) turn the encoding back into what you're used to (likely UTF-8), telling Ruby what character encoding it should interpret the codepoints as:

```
s3a = CGI.unescape(s2c)  #=> "Hullo \xE4ngstrom"
puts s3a.encode('utf-8','iso-8859-1')
# prints: Hullo ängstrom

# Note: URI.unescape was removed in Ruby 3.0, along with URI.escape.
s3b = URI.unescape(s2u)  #=> "Hullo \xE4ngstrom"
puts s3b.encode('utf-8','iso-8859-1')
# prints: Hullo ängstrom
```
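
The same two steps, (a) unescape and (b) transcode, can be sketched without `URI.unescape` for newer Rubies. This assumes `URI.decode_www_form_component` is an acceptable decoder; its second argument names the encoding the unescaped bytes should be tagged with. `latin1_percent_decode` is a hypothetical helper name.

```ruby
require 'uri'

# latin1_percent_decode is a hypothetical helper, not a Ruby built-in.
def latin1_percent_decode(escaped)
  # (a) Unescape and tag the resulting bytes as ISO-8859-1.
  latin1 = URI.decode_www_form_component(escaped, Encoding::ISO_8859_1)
  # (b) Transcode to UTF-8 so the string matches the rest of the program.
  latin1.encode('utf-8')
end

puts latin1_percent_decode('Hullo+%E4ngstrom')    # Hullo ängstrom
puts latin1_percent_decode('Hullo%20%E4ngstrom')  # Hullo ängstrom
```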

Isaiah answered 9/7, 2015 at 13:27 Comment(1)

Note that "\xE4" is not a string with 4 characters; it is a string with a single character whose codepoint is decimal 228 (hex E4). Ruby displays \xE4 in the inspect version of the string because that codepoint is not valid in UTF-8, which happens to be my default string encoding. – Isaiah 9/7, 2015 at 13:38
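
A quick way to check this yourself, as a sketch: the same single byte is invalid when tagged as UTF-8 but is a perfectly good one-character string once tagged as ISO-8859-1. `char_info` is a hypothetical helper written for this example.

```ruby
# char_info is a hypothetical helper for inspecting a string.
def char_info(str)
  { chars: str.length, bytes: str.bytesize, valid: str.valid_encoding? }
end

raw    = "\xE4"                               # tagged UTF-8, but not valid UTF-8
latin1 = raw.dup.force_encoding('iso-8859-1') # same byte, read as ISO-8859-1

p char_info(raw)                 # one character, one byte, valid is false
p char_info(latin1)              # one character, one byte, valid is true
p latin1.ord                     # 228
p latin1.encode('utf-8') == 'ä'  # true
```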
