In what 8-bit character set is 0x9d meaningful?

Asked 18/8, 2017 at 5:27 Answered 14/7, 2024 at 14:40

python string unicode utf-8 character-encoding

In what 8-bit ASCII-like character set for English is 0x9d meaningful? I'm cleaning up some old data files, and occasionally finding a 0x9d in otherwise-ASCII text. (No, it's not UTF-8.)

It's not valid in Windows-1252. The Python "latin-1" codec translates it to Unicode U+009D, which is "Operating System Command", a C1 control character. That makes little sense in English text. When rendered, you get a box with [009d]. (In Python, you can decode any byte string as Latin-1 without errors being raised, but that doesn't mean it's meaningful to do so.)
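A minimal sketch of those two observations, using only Python's built-in codecs (the variable names are illustrative):

```python
import unicodedata

raw = b"\x9d"

# Windows-1252 leaves 0x9D unassigned, so Python's cp1252 codec rejects it.
try:
    raw.decode("cp1252")
    cp1252_ok = True
except UnicodeDecodeError:
    cp1252_ok = False

# Latin-1 maps every byte to the code point with the same number,
# so 0x9D becomes U+009D, a C1 control character (category "Cc").
latin1_char = raw.decode("latin-1")
print(cp1252_ok, hex(ord(latin1_char)), unicodedata.category(latin1_char))
```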

Examples, with Python-type escapes, from a messy database I'm cleaning up that combines text from many sources:

Guitar Pro, JamPlay, RedBana\\\'s Audition,\x9d Doppleganger\x99s The Lounge\x9d or Heatwave Interactive\x99s Platinum Life Country,\\"

for example \\"I\\\'ve seen the bull run in Pamplona, Spain\x9d.\\" Everything

Netwise Depot is  a \\"One Stop Web Shop\\"\x9d that provides sustainable \\"green\\"\x9d living

are looking for a \\"Do It for Me\\"\x9d solution

From the context, I'd suspect ™ or ®. But what 8-bit code had those?

Biles answered 18/8, 2017 at 5:27 Comment(6)

0x99 is indeed ™ in Windows-1250 and 1252. – Apollo 18/8, 2017 at 5:39

Possibly related. Experience of copying text out of a PDF file superuser.com/questions/1146479/… – Apollo 18/8, 2017 at 5:59

All the examples above can be found in Company Details box on Crunchbase. Could be crunchbase specific. – Apollo 18/8, 2017 at 7:39

Yes, this is Crunchbase data. It's their 2013 historical snapshot, which is available if you ask. That 0x9D made it all the way through to their current web page, where it renders as invisible. – Biles 19/8, 2017 at 6:5

@DmitriChubarov 404 – Savagism 17/11, 2020 at 6:35

@SmartManoj web.archive.org/web/20161118112057/https://superuser.com/… – Apollo 17/11, 2020 at 8:53

Here's a completely wild hypothesis:

Some prior (really broken) system working on this data attempted to write each character as UTF-8, but actually only wrote the last byte of each sequence (maybe it had a weird one-byte-long buffer somewhere). Alternatively, it was in UTF-8 in the past, but somebody viewing it in a different encoding did a search-and-replace to remove bytes 0xE2 0x80 because they clearly "didn't belong" and didn't realize that the remaining "special character" wasn't the one they wanted either.

ASCII would, of course, be passed through unchanged, as its UTF-8 encoding is one byte long.

The 'RIGHT SINGLE QUOTATION MARK' (U+2019) ’ is encoded in UTF-8 with bytes 0xE2 0x80 0x99. The places where you have \x99s are what made me go down this path, since the apostrophe before an s is often converted to a right curly quotation mark by popular word processing software. If only the last byte of the character was saved, you'd just have the 0x99 there.

The 'RIGHT DOUBLE QUOTATION MARK' (U+201D) ” is encoded in UTF-8 with bytes 0xE2 0x80 0x9D. The 0x9D that you have in your text is often at the end of a double-quoted string. And, it's often right next to a regular straight " double-quote. I wonder if somebody had tried to do some sort of prior clean-up pass on the data, and managed to put back in the closing quote, but left the "weird" 0x9D in there.

As I said, it's a wild hypothesis, but if this is a conglomeration of data from a variety of old systems, it's hard to know what exactly may have happened to it. The last byte of UTF-8 was just the closest "normal" English encoding I could find that would have something reasonable in English text and included the bytes you were looking for.
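The hypothesis can be simulated in a few lines. This is only a sketch: `drop_lead_bytes` is a hypothetical stand-in for whatever broken step removed the 0xE2 0x80 prefix, and the sample string is modelled on one of the fragments in the question.

```python
# UTF-8 encodings of the two curly quotes named above.
right_single = "\u2019".encode("utf-8")  # RIGHT SINGLE QUOTATION MARK
right_double = "\u201d".encode("utf-8")  # RIGHT DOUBLE QUOTATION MARK


def drop_lead_bytes(data: bytes) -> bytes:
    """Hypothetical corruption: remove every 0xE2 0x80 pair, keep the rest."""
    return data.replace(b"\xe2\x80", b"")


# Clean text with curly quotes, as a word processor might have produced it.
original = "Doppleganger\u2019s The Lounge\u201d"
damaged = drop_lead_bytes(original.encode("utf-8"))

print(right_single, right_double)
print(damaged)
```

The damaged bytes contain a bare 0x99 and 0x9D in the same positions as in the question's data, which is consistent with the hypothesis but does not prove it.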

Odontalgia answered 18/8, 2017 at 17:25 Comment(1)

There is another field where something like that happened. There's a "normalized name" field, which is forced to lower case. But it was forced to lower case as if it were single-byte text, even when the data was UTF-8. That resulted in things like KACMAZLAR MEKANİK -> kacmazlar mekanä°k, Anita Calçados -> anita calã§ados, Felfria Resor för att Koh Lanta -> felfria resor fã¶r att koh lanta. But that doesn't seem to be the source of the 0x9d problem. Anyway, I've decided to just discard all 0x9d characters for everything that doesn't parse as UTF-8 or Windows-1252. – Biles 19/8, 2017 at 6:7
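The mangled names in that comment can be reproduced with a short sketch. It assumes the faulty step treated the UTF-8 bytes as Windows-1252 (or Latin-1) text before lowercasing; the function name is hypothetical and the real pipeline is unknown.

```python
def lowercase_as_single_byte(text: str) -> str:
    """Hypothetical reconstruction: read UTF-8 bytes as Windows-1252, then lowercase."""
    return text.encode("utf-8").decode("cp1252").lower()


samples = [
    "KACMAZLAR MEKAN\u0130K",               # U+0130 is the dotted capital I
    "Anita Cal\u00e7ados",                  # U+00E7 is c with cedilla
    "Felfria Resor f\u00f6r att Koh Lanta", # U+00F6 is o with diaeresis
]
mangled = [lowercase_as_single_byte(s) for s in samples]
for before, after in zip(samples, mangled):
    print(before, "->", after)
```

Each non-ASCII letter becomes two characters because its two UTF-8 bytes are lowercased separately.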

In Windows-1256, used for Arabic locales, \x99 is a trademark sign and \x9d is a zero width non-joiner. That would seem to be plausible in the listed positions, though likely redundant. There's certainly no shortage of character sets to try though.

One tool to attempt the guess automatically is chardet.
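Short of a detector, a byte can simply be decoded under a list of candidate code pages. The sketch below uses only Python's standard codecs (it does not use chardet); the candidate list and the `survey` helper are illustrative choices, not an exhaustive search.

```python
CANDIDATES = ["cp1252", "latin-1", "cp1250", "cp1256", "cp437", "cp850", "mac_roman"]


def survey(byte: bytes, codecs=CANDIDATES) -> dict:
    """Map each codec name to the decoded character, or None if the byte is unassigned."""
    results = {}
    for name in codecs:
        try:
            results[name] = byte.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


for name, char in survey(b"\x9d").items():
    label = "unassigned" if char is None else f"U+{ord(char):04X}"
    print(f"{name:10} {label}")
```

Calling `survey(b"\x99")` gives the same kind of table for the other suspicious byte.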

Alvar answered 18/8, 2017 at 6:19 Comment(0)

Maybe the data comes from a DOS file (CP850).

In my experience, in that case the character 0x9D (Ø in CP850) was used as a "diameter" sign when referring to pipes or tubes.

Congruity answered 18/8, 2017 at 6:37 Comment(0)

For me, the character was a heart emoji. That was the only suspect I had, and when I removed it, the UnicodeDecodeError I was getting while trying to read a file in Python went away. Opening the file with the UTF-8 encoding solved the issue instantly, though, and it loaded fine.
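One way this can happen, as a sketch: assuming the emoji was U+2764 (HEAVY BLACK HEART) and the file was being read with a Windows-1252 default, the middle byte of its UTF-8 encoding is exactly 0x9D. Both assumptions are guesses about this answer's setup.

```python
heart = "\u2764"                 # HEAVY BLACK HEART (assumed emoji)
encoded = heart.encode("utf-8")  # three bytes; the middle one is 0x9D

# Decoding the UTF-8 bytes as Windows-1252 trips over the unassigned 0x9D.
try:
    encoded.decode("cp1252")
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print(exc)

# Decoding with the correct encoding round-trips cleanly.
print(encoded, encoded.decode("utf-8") == heart)
```

When reading a file, the equivalent fix is passing `encoding="utf-8"` to `open()`.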

Wheelman answered 14/7, 2024 at 14:40 Comment(0)


I'm going to close this out, because, after asking in several places, it's clear that there's no common extended ASCII 8-bit data encoding that uses 0x9D in a way that makes sense here.

This may be the result of long-ago munging on the data. There are other Stack Overflow questions about Python charset conversions failing on 0x9D specifically, so it's not unique to this data. Somewhere, there's something that sticks in a 0x9D once in a while, usually after quotes. Maybe some old word processor. Thanks, everyone.
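The clean-up rule mentioned in the comments (discard 0x9d wherever a value parses as neither UTF-8 nor Windows-1252) could look roughly like this. It is a sketch, not the code actually used: `decode_field` is a hypothetical name, and replacing any other undecodable bytes with U+FFFD is an added assumption.

```python
def decode_field(raw: bytes) -> str:
    """Try UTF-8, then Windows-1252; as a last resort drop 0x9D and retry."""
    for encoding in ("utf-8", "cp1252"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            pass
    # Neither codec accepts the bytes: discard 0x9D, and replace anything
    # else that cp1252 cannot decode with U+FFFD (assumption).
    return raw.replace(b"\x9d", b"").decode("cp1252", errors="replace")


print(decode_field(b'are looking for a "Do It for Me"\x9d solution'))
print(decode_field(b"Doppleganger\x99s The Lounge\x9d"))
```

Note that a lone 0x99 still decodes as ™ under Windows-1252, so this rule leaves those characters in place.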

Biles answered 19/8, 2017 at 6:22 Comment(0)
