How to decode and encode Hebrew strings?

I am trying to encode and decode the Hebrew string "שלום". However, after encoding, I get gibberish:

>>> word = "שלום"
>>> word = word.decode('UTF-8')
>>> word
u'\u05e9\u05dc\u05d5\u05dd'
>>> print word
שלום
>>> word = word.encode('UTF-8')
>>> word
'\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'
>>> print word
׳©׳׳•׳

How should I do it properly?

Phallus answered 24/4, 2015 at 15:2 Comment(8)
b'\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d' are the bytes that make up the UTF-8 string. When you print them as a string, it looks like gibberish in Python 2 (assuming the standard default encoding), but would look as in my comment in Python 3. If you then decode those bytes back using UTF-8, you will end up with the Unicode string you started from.Standard
What's the result of sys.getdefaultencoding() in your terminal?Satchel
I get the string 'ascii'.Phallus
Can you add the python version you are using, please!Apples
It's Python 2.7.3 and I'm using Pyscripter.Phallus
On 2.7.6, it works fine! Your code looks correct and there should be no major differences in that between the two. Have you tried running that directly through the Python interpreter?Apples
>>> word = "שלום"
>>> word
'\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'
>>> print word
שלום
>>> word = word.decode('UTF-8')
>>> word
u'\u05e9\u05dc\u05d5\u05dd'
>>> print word
שלום
>>> word = word.encode('UTF-8')
>>> word
'\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'
>>> print word
שלום
>>>
Harrie
What if I want to write both English and Hebrew to the same file? Which encoding do I use?Zenia

You'll have to make sure you have the right encoding in your environment (shell or script). If you're using a script, include the following:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

This declaration tells Python that the source file is encoded in UTF-8. Your shell or terminal may accept only ASCII, so make sure it supports UTF-8 as well.
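One way to see what your environment is actually using is to ask Python directly. This is only a small sketch, not part of the original answer; the variable names are illustrative, and the calls are standard-library functions available in both Python 2.7 and Python 3.

```python
from __future__ import print_function

import locale
import sys

# Encoding Python uses for implicit str <-> unicode conversions
# ('ascii' by default on Python 2, 'utf-8' on Python 3).
default_encoding = sys.getdefaultencoding()

# Encoding Python believes the console uses; this may be None
# when output is piped or when an IDE replaces sys.stdout.
stdout_encoding = getattr(sys.stdout, "encoding", None)

# Encoding suggested by the operating system's locale settings.
preferred_encoding = locale.getpreferredencoding()

print("default:  ", default_encoding)
print("stdout:   ", stdout_encoding)
print("preferred:", preferred_encoding)
```

If the console encoding is a single-byte code page rather than UTF-8, printing UTF-8 bytes will probably produce gibberish like the output shown in the question.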

>>> word = "שלום"
>>> word
'\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'
>>> print word
שלום
>>> word = word.decode('UTF-8')
>>> word
u'\u05e9\u05dc\u05d5\u05dd'
>>> print word
שלום
>>> word = word.encode('UTF-8')
>>> word
'\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'
>>> print word
שלום
>>>
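
The same round trip can be written as a script. The following is a minimal sketch, assuming the source file is saved as UTF-8; the file name `hebrew_example.txt` is hypothetical. It also touches on the question in the comments about mixing English and Hebrew in one file: UTF-8 can represent both, so a single file opened with `io.open(..., encoding="utf-8")` should be enough.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

import io
import os
import tempfile

# With unicode_literals this is a text (unicode) string on Python 2 and 3.
word = "שלום"

encoded = word.encode("utf-8")     # text -> UTF-8 bytes
decoded = encoded.decode("utf-8")  # UTF-8 bytes -> text

print(repr(encoded))    # the raw bytes, safe to show on any console
print(decoded == word)  # the round trip preserves the text

# Hypothetical file name, used only for this illustration.
path = os.path.join(tempfile.gettempdir(), "hebrew_example.txt")

# Write English and Hebrew text to the same UTF-8 file.
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write("hello " + word + "\n")

# Read it back as text; io.open decodes the bytes for us.
with io.open(path, "r", encoding="utf-8") as handle:
    line = handle.read()
```

The general idea is to keep text as unicode inside the program and to encode or decode only at the boundaries (files, sockets, the console).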
Harrie answered 24/4, 2015 at 16:32 Comment(0)
