Ruby incompatible character encodings

I am writing a script that iterates over an input file and checks data on a website. If it finds the new data, it prints to the terminal that the check passed; if it doesn't, it reports a failure. The reverse applies for deleted data. It was working fine until I was given an input file that contains the "™" character. When Ruby gets to that line, it raises an error:

PDAPWeb.rb:73:in `include?': incompatible character encodings: UTF-8 and IBM437 (Encoding::CompatibilityError)

The offending line is a simple check to see if the text exists on the page.

if browser.text.include?(program_name)

Where the program_name variable is a parsed piece of information from the input file. In this instance, the program_name contains the 'TM' character mentioned before.

After some research I found that adding the line # encoding: utf-8 to the beginning of my script could help, but so far it has not made a difference.

I added this to my program_name variable to see if it would help (and it allowed my script to run without errors), but now it does not find the TM character when it should.

program_name = record[2].gsub("\n", '').force_encoding("utf-8").encode("IBM437", replace: nil)

This seemed to convert the TM character to this: Γäó

I thought maybe I had the IBM437 and utf-8 parts reversed, so I tried the opposite

program_name = record[2].gsub("\n", '').force_encoding("IBM437").encode("utf-8", replace: nil)

and am now receiving this error when attempting to run the script

PDAPWeb.rb:48:in `encode': U+2122 from UTF-8 to IBM437 (Encoding::UndefinedConversionError)

I am using ruby 1.9.3p392 (2013-02-22) and I'm not sure if I should upgrade as this is the standard version installed in my company.

Is my encoding incorrect and causing it to convert the TM character with errors?

Templetempler answered 16/5, 2015 at 20:19 Comment(9)
Is there a reason you are encoding to IBM437 ?Panda
I wasn't sure if I had the encoding order right, I was just going with what the error said initially (before I added any encoding)Templetempler
"I thought maybe i had IBM437 and utf-8 parts reversed, so I tried the opposite and am now receiving this error when attempting to run the script" Can you show this code. Something doesn't make sense.Panda
Using /\u{2122}/ to represent the TM symbol might help you. For example, try browser.goto 'http://graphemica.com/%E2%84%A2' and note that browser.h1(:text, /\u{2122}/).present? (the big TM symbol on the page) returns true.Featherstitch
@MartinKonecny updated the question with the part of the script i changed in my attempt to debug the problem. I switched the encoding around because I am not entirely sure which parts mean what. The weird thing is that notepad shows it as saving in ANSI encoding, but when i tried to do utf-8 from ansi, it said Ansi wasn't an encoding type.Templetempler
I think you may have your code and results/errors the wrong way round. You shouldn’t get an error from UTF-8 to IBM437 when you are encoding from IBM437 to UTF-8.Rovit
@Featherstitch your answer has helped me tremendously. I did some work in IRB and found that the string i need to check against was showing up as "Some text\u2122 some more text". I am having to do a manual find and replace in the input file, which isn't as straight forward as I had hoped, but progress is progress. :) thank youTempletempler
Glad it was helpful, @Todd J. We ran into a similar situation at work with the ® symbol and eventually discovered that solution. I hope you're able to get your code to work the way you want it to.Featherstitch
I also see this error when curly-quotes and other windows-specific non-ASCII characters make it into my Ruby source filesNaoma

Here’s what it looks like is going on. Your input file contains a ™ character, and it is in UTF-8 encoding. However, when you read it, since you don’t specify the encoding, Ruby assumes it is in your system’s default encoding of IBM437 (you must be on Windows).

This is basically the same as this:

>> input = "™"
=> "™"
>> input.encoding
=> #<Encoding:UTF-8>
>> input.force_encoding 'ibm437'
=> "\xE2\x84\xA2"

Note that force_encoding doesn’t change the actual string, just the label associated with it. This is the same outcome as in your case, only you arrive here via a different route (by reading the file).
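
To make the "label only" point concrete, here is a minimal sketch that compares the bytes before and after relabelling. The method name relabel_demo is made up for illustration; "\u2122" is just an escape for the ™ character so the snippet does not depend on the source file's own encoding.

```ruby
# Sketch: force_encoding swaps the encoding label and leaves the bytes alone.
# relabel_demo is a hypothetical helper, not part of the original script.
def relabel_demo
  original  = "\u2122"                               # the trademark sign, UTF-8
  relabeled = original.dup.force_encoding("IBM437")  # same bytes, new label
  {
    :bytes_before    => original.bytes.to_a,
    :bytes_after     => relabeled.bytes.to_a,
    :encoding_before => original.encoding,
    :encoding_after  => relabeled.encoding
  }
end

result = relabel_demo
p result[:bytes_before] == result[:bytes_after]  # the byte sequences are identical
p result[:encoding_before]
p result[:encoding_after]
```

Only String#encode produces different bytes; force_encoding never does.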

The web page also has a ™ symbol, and is also encoded as UTF-8, but in this case Ruby has the encoding correct (Watir probably uses the headers from the page):

>> web_page = '™'
=> "™"
>> web_page.encoding
=> #<Encoding:UTF-8>

Now when you try to compare these two strings you get the compatibility error, because they have different encodings:

>> web_page.include? input
Encoding::CompatibilityError: incompatible character encodings: UTF-8 and IBM437
    from (irb):11:in `include?'
    from (irb):11
    from /Users/matt/.rvm/rubies/ruby-2.2.1/bin/irb:11:in `<main>'

If either of the two strings contained only ASCII characters (i.e. code points less than 128), then this comparison would have worked. UTF-8 and IBM437 are both supersets of ASCII, and two such strings are only incompatible if they both contain characters outside of the ASCII range. This is why you only started seeing this behaviour when the input file had a ™.
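
A small sketch of that rule, assuming a made-up page text and a hypothetical helper safe_include? that turns the exception into a symbol so the three cases can be compared side by side:

```ruby
# Sketch: mixed encodings only clash when both strings contain non-ASCII.
# safe_include? and the sample strings are assumptions for illustration.
def safe_include?(haystack, needle)
  haystack.include?(needle)
rescue Encoding::CompatibilityError
  :incompatible
end

page         = "Widget\u2122 Pro"                         # UTF-8, contains the TM sign
ascii_needle = "Widget".dup.force_encoding("IBM437")      # mislabelled, but pure ASCII
tm_needle    = "\u2122".dup.force_encoding("IBM437")      # mislabelled and non-ASCII

p safe_include?(page, ascii_needle)   # works despite the different labels
p safe_include?(page, tm_needle)      # this is the case that raises in the question
p safe_include?(page, "\u2122")       # correctly labelled UTF-8 needle
```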

The fix is to inform Ruby what the actual encoding of the input file is. You can do this with the already loaded string:

>> input.force_encoding 'utf-8'
=> "™"

You can also do this when reading the file, e.g. (there are a few ways of reading files, they all should allow you to explicitly specify the encoding):

input = File.read("input_file.txt", :encoding => "utf-8")
# now input will be in the correct encoding

Note in both of these the string isn’t being changed, it still contains the same bytes, but Ruby now knows its correct encoding.
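
As a rough illustration of two of those ways, the sketch below writes a throwaway file and reads it back with an explicit encoding. The helper names read_utf8 and read_utf8_lines and the sample contents are assumptions; the temporary file simply stands in for your real input file.

```ruby
require "tempfile"

# Sketch: tell Ruby the file's encoding at read time instead of fixing it later.
# Both helpers are hypothetical names used only for this example.
def read_utf8(path)
  File.read(path, :encoding => "utf-8")
end

def read_utf8_lines(path)
  # "r:utf-8" sets the external encoding as part of the open mode.
  File.open(path, "r:utf-8") { |file| file.readlines.map(&:chomp) }
end

# Throwaway file standing in for the real input file.
sample = Tempfile.new("input_file")
sample.binmode
sample.write("Acme\u2122 Plan\nBasic Plan\n")
sample.close

p read_utf8(sample.path).encoding
p read_utf8_lines(sample.path).map { |line| line.encoding }
sample.unlink
```

Either form avoids relying on Encoding.default_external, which is what picked IBM437 in the first place.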

Now the comparison should work okay:

>> web_page.include? input
=> true

There is no need to encode the string. Here’s what happens if you do. First if you correct the encoding to UTF-8 then encode to IBM437:

>> input.force_encoding("utf-8").encode("IBM437", replace: nil)
Encoding::UndefinedConversionError: U+2122 from UTF-8 to IBM437
    from (irb):16:in `encode'
    from (irb):16
    from /Users/matt/.rvm/rubies/ruby-2.2.1/bin/irb:11:in `<main>'

IBM437 doesn’t include the ™ character, so you can’t encode a string containing it to this encoding without losing data. By default Ruby raises an exception when this happens. You can force the encoding by using the :undef option, but the symbol is lost:

>> input.force_encoding("utf-8").encode("IBM437", :undef => :replace)
=> "?"
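
For comparison, here is a hedged sketch of the three behaviours side by side: the default (raise), :undef => :replace, and a custom :replace string. The helper names and the "(TM)" substitute are invented for the example. Note that :replace on its own, as in replace: nil above, does not enable replacement; it only supplies the substitute text once :undef (or :invalid) is set to :replace.

```ruby
# Sketch: what String#encode can do when IBM437 has no slot for a character.
# strict_encode, lossy_encode and readable_encode are hypothetical helpers.

# Default behaviour raises; capture the message instead of crashing.
def strict_encode(text)
  text.encode("IBM437")
rescue Encoding::UndefinedConversionError => e
  "error: #{e.message}"
end

# Unrepresentable characters become "?".
def lossy_encode(text)
  text.encode("IBM437", :undef => :replace)
end

# Unrepresentable characters become a substitute of your choosing.
def readable_encode(text)
  text.encode("IBM437", :undef => :replace, :replace => "(TM)")
end

puts strict_encode("Acme\u2122")
puts lossy_encode("Acme\u2122")
puts readable_encode("Acme\u2122")
```

None of these help with the original comparison, since the web page still contains the real character; they only matter if you genuinely need IBM437 output.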

If you go the other way, first using force_encoding to IBM437 then encoding to UTF-8 you get the string Γäó:

>> input.force_encoding("IBM437").encode("utf-8", replace: nil)
=> "Γäó"

The string is already in IBM437 encoding as far as Ruby is concerned, so force_encoding doesn’t do anything. The UTF-8 representation of ™ is the three bytes 0xe2 0x84 0xa2, and when interpreted as IBM437 these bytes correspond to the three characters seen here which are then converted into their UTF-8 representations.
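
The byte-by-byte mapping can be checked with a short sketch like the one below. The helper names are hypothetical. The last helper shows that this particular garbling happens to be reversible, because every byte has an IBM437 character; that is an observation about this case, not a general repair technique.

```ruby
# Sketch: how the three UTF-8 bytes of the TM sign turn into three characters.
# All method names here are made up for illustration.

# Read each byte on its own as an IBM437 character, then convert to UTF-8.
def each_byte_as_ibm437(utf8_text)
  utf8_text.bytes.to_a.map do |byte|
    [byte].pack("C").force_encoding("IBM437").encode("UTF-8")
  end
end

# The same misreading applied to the whole string at once.
def misread_as_ibm437(utf8_text)
  utf8_text.dup.force_encoding("IBM437").encode("UTF-8")
end

# Undo it: encode back to the IBM437 bytes, then relabel those bytes as UTF-8.
def undo_misreading(garbled)
  garbled.encode("IBM437").force_encoding("UTF-8")
end

garbled = misread_as_ibm437("\u2122")
p each_byte_as_ibm437("\u2122")            # one character per original byte
p garbled.length                           # three characters
p undo_misreading(garbled) == "\u2122"     # round trip restores the TM sign
```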

(These two outcomes are the other way round from what you describe in the question, hence my comment above. I’m assuming that this is just a copy-and-paste error.)

Rovit answered 18/5, 2015 at 18:50 Comment(4)
One thing that i found out is that when i grab the string in the browser using watir using a simple line browser.p(:class => 'targeted_class').text it is returning 'Some text\u2122 some more text' so i actually do have to convert one of them so they can match each otherTempletempler
@ToddJ. I don’t think so. Ruby is printing \u2122 because the terminal doesn’t support printing the actual character (or Ruby doesn’t think it can). \u2122 is ™.Rovit
You were correct! I was able to do a test and everything is functioning as expected.Templetempler
I also see this error when curly-quotes and other windows-specific non-ASCII characters make it into my Ruby source filesNaoma
