Ruby encoding ASCII_8BIT and extended ASCII

About ASCII_8BIT

Encoding::ASCII_8BIT is a special encoding that is usually used for a byte string, not a character string. But as the name insists, its characters in the range of ASCII are considered as ASCII characters. This is useful when you use ASCII-8BIT characters with other ASCII compatible characters.

Source: ruby-doc.org/core-2.6.4

Context

I want to use ASCII_8BIT because I need to encode all characters between 0x00 (0d00) and 0xff (0d255), so ASCII (0-127) plus extended ASCII (128-255). ASCII (the encoding, US-ASCII) is a 7-bit encoding that recognizes only ASCII (the charset) characters (0-127). As the name states, I was expecting ASCII_8BIT to extend it to 8 bits and add support for 128-255.

Issue

When I use chr, the encoding is automatically set to ASCII_8BIT, but when I put a char between 128-255 (0x80-0xff) directly in a string and then ask for its encoding, I get UTF-8 instead, and if I try to convert it to ASCII_8BIT I get an error.

irb(main):014:0> 0x8f.chr
=> "\x8F"
irb(main):015:0> 0x8f.chr.encoding
=> #<Encoding:ASCII-8BIT>
irb(main):016:0> "\x8f".encode(Encoding::ASCII_8BIT)
Traceback (most recent call last):
        5: from /usr/bin/irb:23:in `<main>'
        4: from /usr/bin/irb:23:in `load'
        3: from /usr/lib/ruby/gems/2.6.0/gems/irb-1.0.0/exe/irb:11:in `<top (required)>'
        2: from (irb):16
        1: from (irb):16:in `encode'
Encoding::InvalidByteSequenceError ("\x8F" on UTF-8)
irb(main):021:0> "\x8F".encoding
=> #<Encoding:UTF-8>
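
A minimal sketch of why the two calls behave differently, assuming the default UTF-8 source encoding (the variable names are illustrative). The literal is only tagged as UTF-8; its single byte is not a valid UTF-8 sequence, so encode has nothing to convert from. String#b returns a copy of the same bytes tagged ASCII-8BIT instead of attempting a conversion.

```ruby
literal = "\x8f"                  # tagged with the source encoding (UTF-8 by default)
puts literal.encoding             # UTF-8 under the default source encoding
puts literal.valid_encoding?      # false in that case: 0x8F alone is not valid UTF-8

binary = literal.b                # copy of the same bytes, tagged ASCII-8BIT
puts binary.encoding              # ASCII-8BIT
puts binary.bytes.inspect         # [143]
puts binary == 0x8f.chr           # true: same bytes, same encoding
```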

Is there a bug in ruby core? I need to be able to encode everything between 8

The other name of ASCII_8BIT is BINARY because, as the previous quote states, it should be able to hold any byte.

irb(main):035:0> Encoding::ASCII_8BIT.names
=> ["ASCII-8BIT", "BINARY"]

Other encodings

Please note that telling me to use another encoding is not an answer to the question, unless it is an encoding that really maps every byte value from 0 to 255.

  • I don't want to use UTF-8 because the encoding is Multi-byte and not single-byte.
  • ISO/IEC 8859-1 (Latin1, 8bits) contains only 191 chars (ASCII + 63 chars)

    One notable way in which ISO character sets differ from code pages is that the character positions 128 to 159, corresponding to ASCII control characters with the high-order bit set, are specifically unused and undefined in the ISO standards, though they had often been used for printable characters in proprietary code pages, a breaking of ISO standards that was almost universal. Ref. Extended ASCII- ISO 8859 and proprietary adaptations

  • Windows-1252 (CP-1252, 8bits) doesn't contain all 255 chars and has different mappings than extended ASCII

Available encodings in ruby:

irb(main):036:0> Encoding.name_list
=> ["ASCII-8BIT", "UTF-8", "US-ASCII", "UTF-16BE", "UTF-16LE", "UTF-32BE", "UTF-32LE", "UTF-16", "UTF-32", "UTF8-MAC", "EUC-JP", "Windows-31J", "Big5", "Big5-HKSCS", "Big5-UAO", "CP949", "Emacs-Mule", "EUC-KR", "EUC-TW", "GB2312", "GB18030", "GBK", "ISO-8859-1", "ISO-8859-2", "ISO-8859-3", "ISO-8859-4", "ISO-8859-5", "ISO-8859-6", "ISO-8859-7", "ISO-8859-8", "ISO-8859-9", "ISO-8859-10", "ISO-8859-11", "ISO-8859-13", "ISO-8859-14", "ISO-8859-15", "ISO-8859-16", "KOI8-R", "KOI8-U", "Shift_JIS", "Windows-1250", "Windows-1251", "Windows-1252", "Windows-1253", "Windows-1254", "Windows-1257", "BINARY", "IBM437", "CP437", "IBM737", "CP737", "IBM775", "CP775", "CP850", "IBM850", "IBM852", "CP852", "IBM855", "CP855", "IBM857", "CP857", "IBM860", "CP860", "IBM861", "CP861", "IBM862", "CP862", "IBM863", "CP863", "IBM864", "CP864", "IBM865", "CP865", "IBM866", "CP866", "IBM869", "CP869", "Windows-1258", "CP1258", "GB1988", "macCentEuro", "macCroatian", "macCyrillic", "macGreek", "macIceland", "macRoman", "macRomania", "macThai", "macTurkish", "macUkraine", "CP950", "Big5-HKSCS:2008", "CP951", "IBM037", "ebcdic-cp-us", "stateless-ISO-2022-JP", "eucJP", "eucJP-ms", "euc-jp-ms", "CP51932", "EUC-JIS-2004", "EUC-JISX0213", "eucKR", "eucTW", "EUC-CN", "eucCN", "GB12345", "CP936", "ISO-2022-JP", "ISO2022-JP", "ISO-2022-JP-2", "ISO2022-JP2", "CP50220", "CP50221", "ISO8859-1", "ISO8859-2", "ISO8859-3", "ISO8859-4", "ISO8859-5", "ISO8859-6", "Windows-1256", "CP1256", "ISO8859-7", "ISO8859-8", "Windows-1255", "CP1255", "ISO8859-9", "ISO8859-10", "ISO8859-11", "TIS-620", "Windows-874", "CP874", "ISO8859-13", "ISO8859-14", "ISO8859-15", "ISO8859-16", "CP878", "MacJapanese", "MacJapan", "ASCII", "ANSI_X3.4-1968", "646", "UTF-7", "CP65000", "CP65001", "UTF-8-MAC", "UTF-8-HFS", "UCS-2BE", "UCS-4BE", "UCS-4LE", "CP932", "csWindows31J", "SJIS", "PCK", "CP1250", "CP1251", "CP1252", "CP1253", "CP1254", "CP1257", "UTF8-DoCoMo", "SJIS-DoCoMo", "UTF8-KDDI", "SJIS-KDDI", "ISO-2022-JP-KDDI", 
"stateless-ISO-2022-JP-KDDI", "UTF8-SoftBank", "SJIS-SoftBank", "locale", "external", "filesystem", "internal"]

For comparison, Python's encodings: https://docs.python.org/3/library/codecs.html#standard-encodings

Considerations

By reading Extended ASCII - Multi-byte character encodings, it seems that the only true extended ASCII encoding is UTF-8, but it is multi-byte. It seems that no true single-byte extended ASCII encoding exists either.

From a byte point of view I could use any 8-bit (single-byte) encoding, as said in Extended ASCII - Usage in computer-readable languages:

all ASCII bytes (0x00 to 0x7F) have the same meaning in all variants of extended ASCII,

But the problem is that implementations like ISO-8859-1 specifically leave some char ranges undefined, and so will end in errors.

irb(main):009:0> (0..255).map { |c| c.chr}.join.encode(Encoding::ISO_8859_1)
Traceback (most recent call last):
        6: from /usr/bin/irb:23:in `<main>'
        5: from /usr/bin/irb:23:in `load'
        4: from /usr/lib/ruby/gems/2.6.0/gems/irb-1.0.0/exe/irb:11:in `<top (required)>'
        3: from (irb):9
        2: from (irb):9:in `rescue in irb_binding'
        1: from (irb):9:in `encode'
Encoding::UndefinedConversionError ("\x80" to UTF-8 in conversion from ASCII-8BIT to UTF-8 to ISO-8859-1)
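
For contrast, a hedged sketch of what happens when the same 256 bytes are relabelled rather than transcoded (variable names are illustrative). The error above is raised while converting from ASCII-8BIT, where bytes 0x80-0xFF have no character meaning to convert. Once the string is tagged ISO-8859-1 with force_encoding, Ruby accepts every byte value and can transcode it to UTF-8 and back.

```ruby
all_bytes = (0..255).map(&:chr).join            # ASCII-8BIT string with every byte value

# Relabel the bytes instead of converting them
latin1 = all_bytes.dup.force_encoding(Encoding::ISO_8859_1)
puts latin1.valid_encoding?                     # true
puts latin1.bytes == (0..255).to_a              # true: bytes are unchanged

# Now a real conversion is possible, and it round-trips
utf8 = latin1.encode(Encoding::UTF_8)
puts utf8.length                                # 256 characters
puts utf8.encode(Encoding::ISO_8859_1).bytes == (0..255).to_a   # true
```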

Update - force_encoding

I found the string method force_encoding.

irb(main)> a = "\x8f"
=> "\x8F"
irb(main)> a.encoding
=> #<Encoding:UTF-8>
irb(main)> a.encode(Encoding::ASCII_8BIT)
Traceback (most recent call last):
        5: from /usr/bin/irb:23:in `<main>'
        4: from /usr/bin/irb:23:in `load'
        3: from /usr/lib/ruby/gems/2.6.0/gems/irb-1.0.0/exe/irb:11:in `<top (required)>'
        2: from (irb):42
        1: from (irb):42:in `encode'
Encoding::InvalidByteSequenceError ("\x8F" on UTF-8)
irb(main)> a.force_encoding(Encoding::ASCII_8BIT)
=> "\x8F"
irb(main):040:0> a.encoding
=> #<Encoding:ASCII-8BIT>

What is the danger of using force_encoding rather than encode? Is it just that if I accidentally pass a multi-byte char, it will be treated as multiple single-byte chars? If so, it is not dangerous as long as all characters passed to the application are in the extended ASCII range (single byte), but it is unsafe and will silently misinterpret the data if, for example, some UTF-8 chars are passed to the application.

irb(main):044:0> "\ud087".force_encoding(Encoding::ASCII_8BIT)
=> "\xED\x82\x87"
irb(main):045:0> "\ud087".bytes
=> [237, 130, 135]
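
A small sketch of the difference, using "\u00e9" as an arbitrary example character: force_encoding keeps the bytes and only changes how they are counted and interpreted, while encode produces different bytes for the same character.

```ruby
text = "\u00e9"                                   # "é": one character, two UTF-8 bytes
raw  = text.dup.force_encoding(Encoding::BINARY)  # same bytes, new label

puts text.bytes == raw.bytes                      # true: the bytes are untouched
puts text.length                                  # 1 character under UTF-8
puts raw.length                                   # 2 "characters" under ASCII-8BIT

# encode converts the character, so the bytes change
latin1 = text.encode(Encoding::ISO_8859_1)
puts latin1.bytes.inspect                         # [233]
```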

Update - Answer

What @mu-is-too-short's answer and @ForeverZer0's comment suggest is that I should rather use pack and unpack to deal with raw bytes.

So rather than using an encoding and working around it

pattern = 'A' * 2606 + "\x8F\x35\x4A\x5F" + 'C' * 390
pattern.force_encoding(Encoding::ASCII_8BIT)

I should use bytes directly

pattern = ['A'.ord] * 2606 + [0x8F, 0x35, 0x4A, 0x5F] + ['C'.ord] * 390
pattern = pattern.pack('C*')

Or this easier to read syntax

pattern = 'A'.bytes * 2606 + "\x8F\x35\x4A\x5F".bytes + 'C'.bytes * 390
pattern = pattern.pack('C*')
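
As a quick check (a sketch with illustrative variable names), both approaches should yield the same 3000-byte binary string:

```ruby
# Build the payload both ways
forced = ('A' * 2606 + "\x8F\x35\x4A\x5F" + 'C' * 390).force_encoding(Encoding::BINARY)
packed = ('A'.bytes * 2606 + [0x8F, 0x35, 0x4A, 0x5F] + 'C'.bytes * 390).pack('C*')

puts packed.encoding              # ASCII-8BIT
puts packed.bytesize              # 3000
puts packed.bytes[2606, 4].inspect  # [143, 53, 74, 95]
puts forced == packed             # true
```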
Cohbath answered 18/9, 2019 at 21:39 Comment(5)
What kind of data are you dealing with? ASCII_8BIT is not actually an encoding, it's more of a non-encoding, and there are no "extended ASCII characters"; they're not defined formally. DOS ANSI (codepage 437) is one of a multitude of 8-bit encodings, as are Latin-1, Windows-1252, etc. Which format is your source data in? If you're dealing with raw binary data the answer is BINARY, which translates to ASCII_8BIT by default, or in other words, it preserves the bytes and does no conversion. Bonney
@Bonney None of them. I'm manipulating a raw TCP socket for a network protocol and I want to send raw bytes. So I wanted a proper extended ASCII encoding to be sure that when I send 0x8f or anything else I'm really sending 0x8f and not multiple bytes, as can be the case when using UTF-8 (which is the default when I put extended ASCII chars in a string) or any other multi-byte encoding. Cohbath
force_encoding does nothing more than force Ruby to interpret the exact same data differently. The data all remains the same, it just gets looked at differently. encode actually converts the data and returns different data. Bertiebertila
@Bertiebertila So, so far, the best option seems to be to use .force_encoding(Encoding::ASCII) or force_encoding(Encoding::ASCII_8BIT) to be sure to send raw bytes and not converted multi-byte sequences, for example if the encoding would otherwise have been automatically set to UTF-8. Cohbath
@Cohbath The best way to accomplish that would just be to use pack and unpack to ensure you are getting raw binary data, and that Ruby is not trying to apply an encoding. Ruby using strings to represent raw data is usually convenient, but can make certain use cases such as yours a little more nuanced. Bertiebertila

String literals are (usually) UTF-8 encoded regardless of whether or not the bytes are valid UTF-8. Hence this:

"\x8f".encoding

saying UTF-8 even though the string isn't valid UTF-8. You should be safe using String#force_encoding, but if you really want to work with raw bytes, you might be better off working with arrays of integers and using Array#pack to mash them into strings:

[ 0x8f, 0x11, 0x06, 0x23, 0xff, 0x00 ].pack('C*')
# "\x8F\x11\x06#\xFF\x00" 
[ 0x8f, 0x11, 0x06, 0x23, 0xff, 0x00 ].pack('C*').encoding
# #<Encoding:ASCII-8BIT> 
[ 0x8f, 0x11, 0x06, 0x23, 0xff, 0x00 ].pack('C*').bytes
# [143, 17, 6, 35, 255, 0] 

The results should be the same but, IMO, this is explicitly working with binary data (i.e. raw bytes), makes your intent clear, and should avoid any encoding problems.

There's also String#unpack if there is a known structure to the bytes you're reading and you want to crack it open.
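
For instance, here is a sketch of unpacking a made-up frame layout (one type byte, a 16-bit big-endian length, then the payload). The layout and names are hypothetical and only serve to illustrate the directives:

```ruby
# Hypothetical frame: type (1 byte), length (16-bit big-endian), payload
frame = [0x01, 5].pack('Cn') + 'hello'.b

# 'C' = unsigned byte, 'n' = 16-bit big-endian, 'a*' = the remaining bytes
type, length, rest = frame.unpack('Cna*')
payload = rest.byteslice(0, length)

puts type       # 1
puts length     # 5
puts payload    # hello
```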

Connection answered 18/9, 2019 at 22:24 Comment(2)
So rather than using pattern = 'A' * 2606 + "\x8F\x35\x4A\x5F" + 'C' * 390; pattern.force_encoding(Encoding::ASCII) I should use pattern = ['A'.ord] * 2606 + [0x8F, 0x35, 0x4A, 0x5F] + ['C'.ord] * 390; pattern.pack('C*') Cohbath
It is mostly a matter of taste (but I'd say force_encoding('binary') just to be explicit), but if I was working with bytes then I'd use pack. Connection

If you're doing raw packet manipulation then everything must be in BINARY / ASCII_8BIT mode because it is not text and should not be treated as such. If you have any encoding other than that Ruby will try and convert it, which will in the best case severely mess up the binary data, and in the worst case crash because of conversion errors.

In Ruby terms, ASCII_8BIT is effectively a raw data buffer.
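
A rough sketch of that idea, using IO.pipe as a stand-in for a TCP socket (an assumption made only so the example is self-contained): the bytes written are the bytes read.

```ruby
reader, writer = IO.pipe
reader.binmode                    # treat both ends as raw byte streams
writer.binmode

payload = [0x8f, 0x35, 0x4a, 0x5f].pack('C*')   # ASCII-8BIT string
writer.write(payload)
writer.close

received = reader.read
reader.close

puts received.bytes.inspect       # [143, 53, 74, 95]
```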

The default encoding for strings in your code is UTF-8:

p "example".encoding
# => #<Encoding:UTF-8>

You can set the Ruby encoding for inline strings per-file with # encoding: BINARY:

# encoding: BINARY

p "example".encoding
# => #<Encoding:ASCII-8BIT>

It's typically better to express binary data using tools like pack, as mu points out, where you can't get it wrong and you're not really using strings in the first place. This is doubly important because 8-bit values are easy to handle, but 16 and 32-bit values must be properly endian encoded, so you'll often see a lot of this:

header = [qtype, qclass].pack('nn')

Where that's composing a DNS header that involves two 16-bit values.
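
To illustrate the endianness point, a short sketch (the values 1 and 1 are just example inputs, corresponding to an A record in the IN class):

```ruby
# 'n' is 16-bit big-endian (network order), 'v' is 16-bit little-endian
puts [0x1234].pack('n').bytes.inspect   # [18, 52]
puts [0x1234].pack('v').bytes.inspect   # [52, 18]

qtype  = 1    # example value
qclass = 1    # example value
fields = [qtype, qclass].pack('nn')

puts fields.bytes.inspect               # [0, 1, 0, 1]
puts fields.unpack('nn').inspect        # [1, 1]
```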

Bonney answered 19/9, 2019 at 5:20 Comment(0)
