How to convert a CP949 RTF to a UTF-8 encoded RTF?

I want to write a Python script that converts a file's encoding from cp949 to utf8. The file is originally encoded in cp949. My script is as follows:

cpstr = open('terms.rtf').read()  
utfstr = cpstr.decode('cp949').encode('utf-8')  
tmp  = open('terms_utf.rtf', 'w')  
tmp.write(utfstr)  
tmp.close()

But this doesn't change the encoding as I intended.

Mafalda answered 24/12, 2013 at 2:36 Comment(4)

First, what do you mean "it still 'cp949'"? – Prelude 24/12, 2013 at 2:40

terms_utf.rtf is not encoded as utf-8 – Mafalda 24/12, 2013 at 2:42

That response has no more information than the original question. – Prelude 24/12, 2013 at 2:42

Even after the edit, what do you mean "this doesn't change the encoding as I intended"? If you can't explain it, give us an example: a very short RTF file, as it appears in your editor, and the actual bytes in it as, say, a hexdump, then the actual bytes your code produces, and what you expected it to produce instead. – Prelude 24/12, 2013 at 2:58

11

There are three kinds of RTF, and I have no idea which kind you have. You can tell by opening the file in a plain-text editor, or just using less/more/cat/type/whatever to print it out to your terminal.
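If you'd rather check from Python than from an editor, here is a minimal sketch that only looks at the first few bytes. The function names are made up for illustration, and a file that fails the check is not necessarily one of the other kinds; it just isn't plaintext RTF.

```python
def looks_like_plaintext_rtf(data: bytes) -> bool:
    # Plaintext RTF begins with the ASCII signature {\rtf
    return data.startswith(b"{\\rtf")


def file_looks_like_plaintext_rtf(path: str) -> bool:
    # Read in binary mode so no decoding happens before we look at the bytes.
    with open(path, "rb") as f:
        return looks_like_plaintext_rtf(f.read(16))
```

For example, file_looks_like_plaintext_rtf('terms.rtf') would tell you whether the file from the question is in the easy category.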

First, the easy cases: plaintext RTF.

A plaintext RTF file starts off with {\rtf, and all of the text within it is (as you'd expect) plain text, although sometimes a run of text will be broken up into separate runs with formatting commands (which start with \) in between them. Since all of the formatting commands are pure ASCII, if you convert a plaintext RTF from one charset to another (as long as both are supersets of ASCII, as cp949 and utf-8 both are), it should work fine.
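A rough sketch of that conversion, written for Python 3 (the script in the question is Python 2). It reads and writes in binary mode so nothing touches the bytes except the explicit decode and encode. The function names are hypothetical, and the file names in the usage note are the ones from the question.

```python
def transcode_rtf_bytes(data: bytes, src: str = "cp949", dst: str = "utf-8") -> bytes:
    # Decode the raw bytes with the source charset, then re-encode them.
    # The ASCII formatting commands come out byte-for-byte identical.
    return data.decode(src).encode(dst)


def convert_file(src_path: str, dst_path: str) -> None:
    # Binary mode avoids implicit newline translation and default encodings.
    with open(src_path, "rb") as f:
        data = f.read()
    with open(dst_path, "wb") as f:
        f.write(transcode_rtf_bytes(data))
```

Usage would be convert_file('terms.rtf', 'terms_utf.rtf'). If the source file contains bytes that are not valid cp949, decode raises UnicodeDecodeError, which is usually what you want to know about.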

However, the file may also have a formatting command that specifies what character set it's written in. This command looks like \ansicpg949. When an RTF editor like Wordpad opens your file, it will interpret all your nice UTF-8 data as cp949 data and mojibake the hell out of it unless you fix it.

The simplest way to fix it is to figure out what charset your editor wants to put there for UTF-8 files. Maybe it's \ansicpg65001, maybe it's \utf8, maybe it's something completely different. So just save a simple file as a UTF-8 RTF, then look at it in plain text, and see what it has in place of \ansicpg949, and replace the string in your file with the right one. (Note that code page 65001 is not really UTF-8, but it's close, and a lot of Microsoft code assumes they're the same…)
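A sketch of that replacement step, working on the raw bytes. The default \ansicpg65001 is only a guess taken from the possibilities above; pass whatever your editor actually writes. The function name is hypothetical.

```python
import re

# Matches the \ansicpgN control word, e.g. \ansicpg949
_ANSICPG = re.compile(rb"\\ansicpg\d+")


def replace_codepage(data: bytes, new_token: bytes = b"\\ansicpg65001") -> bytes:
    # Only the first occurrence is touched; it belongs to the RTF header.
    # A function is used as the replacement so the backslash in new_token
    # is not interpreted as a regex escape.
    return _ANSICPG.sub(lambda m: new_token, data, count=1)
```

If the file has no \ansicpg command at all, the bytes come back unchanged.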

Also, some RTF editors (like Apple's TextEdit) will escape any non-ASCII characters (so, e.g., an é is stored as \'e9), so there's nothing to convert.
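To see whether your file is in this situation, you can look for runs of \'hh escapes and decode them with the code page named in the header. This is a minimal sketch with a made-up function name, and it makes two assumptions: that both bytes of a double-byte character are escaped, and that a literal backslash is never immediately followed by an apostrophe. It is a diagnostic aid, not an RTF parser.

```python
import re

# One or more consecutive \'hh escapes, e.g. \'b0\'a1
_ESCAPE_RUN = re.compile(rb"(?:\\'[0-9a-fA-F]{2})+")


def escaped_text(data: bytes, codepage: str = "cp949") -> list:
    # Collect the text that each run of hex escapes stands for.
    found = []
    for match in _ESCAPE_RUN.finditer(data):
        hex_digits = match.group(0).replace(b"\\'", b"").decode("ascii")
        raw = bytes.fromhex(hex_digits)
        # errors="replace" keeps the sketch from crashing on partial characters.
        found.append(raw.decode(codepage, errors="replace"))
    return found
```

If this returns your Korean text, the non-ASCII content is already stored as ASCII escapes, and re-encoding the file will not change a single byte.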

Finally, Office Open XML includes an XML spec for something that's called RTF, but isn't really the same thing. I believe many RTF editors can handle this. Fortunately, you can treat this the same way as plaintext RTF—all of the XML tags have pure-ASCII names.

The almost-as-easy case is compressed plaintext RTF. This is the same thing, but compressed with, I believe, zlib. Or it can actually be RTFD (which can be plaintext RTF together with images and other things in separate files, or actual plain text with formatting runs stored in a separate file) in a .zip archive. Anyway, if you have one of these, the file command on most Unix systems should be able to detect it as "compressed RTF", at which point we can figure out what the specific format is and decompress it, and then you can edit it as plaintext RTF (or RTFD).

Needless to say, if you don't uncompress this first, you won't see any of your familiar text in the file—and you could easily end up breaking it so it can't be decompressed, or decompresses to garbage, by changing arbitrary bytes to different bytes.

Finally, the hard case: binary RTF.

The earliest versions of these were in an undocumented format, although they've been reverse-engineered. The later versions are public specs. Wikipedia has links to the specs. If you want to parse it manually you can, but it's going to be a substantial amount of code, and you're going to have to write it yourself.

A better solution would be to use one of the many libraries on PyPI that can convert RTF (including binary RTF) to other formats, which you can then edit easily.

Prelude answered 24/12, 2013 at 2:52 Comment(2)

file start with '{\rtf1\ansi\ansicpg949\ ... {*\generator Msftedit 5.41.21.2510;}' – Mafalda 24/12, 2013 at 2:58

@ArenaSon: Then it's a plaintext RTF, so you're in luck. Do you not understand my explanation? – Prelude 24/12, 2013 at 2:58

-1

import codecs
cpstr = codecs.open('terms.rtf','r','cp949').read()
u = cpstr.encode('cp949').decode('utf-8')
tmp  = open('terms_utf.rtf', 'w') 
tmp.write(u)  
tmp.close()

Halvah answered 24/12, 2013 at 2:55 Comment(1)

Why would you decode cp949 just to re-encode it as cp949 and then decode it as utf-8 to then implicitly re-encode it as whatever sys.getdefaultencoding() is? If that gives you anything but garbage, you're the luckiest man on earth. – Prelude 24/12, 2013 at 2:57
