Unescaping characters in a string with Ruby

Given a string in the following format (the Posterous API returns posts in this format):

s="\\u003Cp\\u003E"

How can I convert it to the actual ASCII characters, so that s == "<p>"?

On OS X, I successfully used Iconv.iconv('ascii', 'java', s), but once deployed to Heroku I receive an Iconv::IllegalSequence exception. I'm guessing that the system Heroku deploys to doesn't support the java encoding.


I am using HTTParty to make a request to the Posterous API. If I use curl to make the same request then I do not get the double slashes.

From the HTTParty GitHub page:

Automatic parsing of JSON and XML into ruby hashes based on response content-type

The Posterous API returns JSON (no double slashes) and HTTParty's JSON parsing is inserting the double slash.


Here is a simple example of the way I am using HTTParty to make the request.

class Posterous
  include HTTParty
  base_uri "http://www.posterous.com/api/2"
  basic_auth "username", "password"
  format :json
  def get_posts
    response = Posterous.get("/users/me/sites/9876/posts?api_token=1234")
    # snip, see below...
  end
end

With the obvious information (username, password, site_id, api_token) replaced with valid values.

At the point of snip, response.body contains a Ruby string that is in JSON format and response.parsed_response contains a Ruby hash object which HTTParty created by parsing the JSON response from the Posterous API.

In both cases the unicode sequences such as \u003C have been changed to \\u003C.

Ballflower answered 13/11, 2010 at 5:55 Comment(5)
Do you use the same version of Ruby on your system as is used on Heroku? - Sharleensharlene
It looks like they are both running 1.8.7. - Ballflower
HTTParty has a format command that lets you specify the format being returned and to be parsed. Do you have that set? - Toothless
Also, it'd help if you added some sample code showing how you're making your call. - Toothless
@Greg Thanks for the tip about HTTParty#format. I had been looking for something like that. Unfortunately, adding format :json doesn't affect the result at all. - Ballflower
B
1

I ran into this exact problem the other day. There is a bug in the json parser that HTTParty uses (Crack gem) - basically it uses a case-sensitive regexp for the Unicode sequences, so because Posterous puts out A-F instead of a-f, Crack isn't unescaping them. I submitted a pull request to fix this.
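To make the case-sensitivity point concrete, here is a minimal sketch. The two patterns and the helper name are hypothetical and written only for illustration; they are not Crack's actual source.

```ruby
# Hypothetical patterns for illustration only; not taken from the Crack gem.
CASE_SENSITIVE   = /\\u([0-9a-f]{4})/   # accepts only lowercase hex digits
CASE_INSENSITIVE = /\\u([0-9a-f]{4})/i  # accepts A-F as well as a-f

# Replaces each matched \uXXXX sequence with the character it encodes.
def unescape_with(pattern, text)
  text.gsub(pattern) { [$1.hex].pack("U") }
end

# Single quotes keep the backslash literal, as it appears in the raw JSON text.
raw = '\u003Cp\u003E'

puts unescape_with(CASE_SENSITIVE, raw)    # no match on 003C / 003E, text is unchanged
puts unescape_with(CASE_INSENSITIVE, raw)  # prints <p>
```

With uppercase hex digits in the input, the first pattern never matches, so the escapes pass through untouched.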

In the meantime HTTParty nicely lets you specify alternate parsers so you can do ::JSON.parse bypassing Crack entirely like this:

class JsonParser < HTTParty::Parser
  def json
    ::JSON.parse(body)
  end
end

class Posterous
   include HTTParty
   parser ::JsonParser

   #....
end
Bombast answered 28/4, 2011 at 3:25 Comment(1)
+1 I just noticed your answer, a year and a half later. Thanks for the information! - Ballflower
B
3

I've found a solution to this problem. I ran across this gist. elskwid had the identical problem and ran the string through a JSON parser:

# The escaped text is wrapped in a JSON array so that it parses as a JSON string.
s = ::JSON.parse("[\"\\u003Cp\\u003E\"]").first

Now, s = "<p>".
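The same idea can be wrapped in a small helper. This is only a sketch: the helper name is hypothetical, and it assumes the fragment contains no unescaped double quotes. The text handed to JSON.parse has to be a valid JSON value, so the fragment is placed inside a quoted string in a one-element array before parsing.

```ruby
require "json"

# Hypothetical helper: decodes JSON-style escapes (such as \u003C) in a fragment
# by embedding it as a string inside a one-element JSON array.
# Assumes the fragment contains no unescaped double quotes.
def unescape_json_fragment(fragment)
  JSON.parse(%Q(["#{fragment}"])).first
end

s = unescape_json_fragment('\u003Cp\u003E')
puts s  # prints <p>
```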

Ballflower answered 16/11, 2010 at 20:29 Comment(1)
I've edited the original question to clarify how I am making the request since it seems to be the reason for the double slashes. I'd love a better answer as to why this is happening. - Ballflower
B
1

You can also use pack:

"a\\u00e4\\u3042".gsub(/\\u(....)/){[$1.hex].pack("U")} # "aäあ"

Or to do the reverse:

"aäあ".gsub(/[^ -~\n]/){"\\u%04x"%$&.ord} # "a\\u00e4\\u3042"
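The two one-liners above can be paired into a round trip. The sketch below uses hypothetical helper names and tightens the match to exactly four hex digits; it handles only \uXXXX escapes, so characters beyond U+FFFF will not round-trip.

```ruby
# Hypothetical helper names; handles only four-digit \uXXXX escapes.

# Turns each \uXXXX sequence into the character with that code point.
def unescape_unicode(text)
  text.gsub(/\\u(\h{4})/) { [$1.hex].pack("U") }
end

# Turns every character outside printable ASCII (and newline) into \uXXXX.
def escape_unicode(text)
  text.gsub(/[^ -~\n]/) { |ch| format("\\u%04x", ch.ord) }
end

escaped = escape_unicode("aäあ")
puts escaped                    # prints a\u00e4\u3042
puts unescape_unicode(escaped)  # prints aäあ
```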
Bergson answered 5/12, 2015 at 17:29 Comment(1)
Wow, this is Samurai Ruby. - Saltandpepper
T
0

The doubled-backslashes almost look like a regular string being viewed in a debugger.

The string "\u003Cp\u003E" really is "<p>": \u003C is the Unicode escape for < and \u003E is the escape for >.

>> "\u003Cp\u003E"  #=> "<p>"

If you are truly getting the string with doubled backslashes then you could try stripping one of the pair.

As a test, see how long the string is:

>> "\\u003Cp\\u003E".size #=> 13
>> "\u003Cp\u003E".size #=> 3
>> "<p>".size #=> 3
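A related check, sketched here under the assumption of Ruby 1.9 or later, is to compare what inspect shows with what the string actually contains. The inspected form doubles every real backslash, so a doubled backslash on screen can still be a single character in the data.

```ruby
s = "\\u003Cp\\u003E"  # source literal: one real backslash before each u

puts s          # prints \u003Cp\u003E (the actual characters)
puts s.inspect  # prints "\\u003Cp\\u003E" (each real backslash shown doubled)

# Counting characters avoids being misled by how the string is displayed.
backslashes = s.chars.count("\\")
puts backslashes  # prints 2
puts s.size       # prints 13
```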

All the above was done using Ruby 1.9.2, which is Unicode aware. v1.8.7 wasn't. Here's what I get using 1.8.7's IRB for comparison:

>> "\u003Cp\u003E" #=> "u003Cpu003E"
Toothless answered 14/11, 2010 at 1:17 Comment(1)
I get the same behavior as above using the two different versions of Ruby. The question becomes, where are the double slashes coming from? I will continue to investigate. - Ballflower
