How to fix: "UnicodeDecodeError: 'ascii' codec can't decode byte"
as3:~/ngokevin-site# nano content/blog/20140114_test-chinese.mkd
as3:~/ngokevin-site# wok
Traceback (most recent call last):
  File "/usr/local/bin/wok", line 4, in
    Engine()
  File "/usr/local/lib/python2.7/site-packages/wok/engine.py", line 104, in init
    self.load_pages()
  File "/usr/local/lib/python2.7/site-packages/wok/engine.py", line 238, in load_pages
    p = Page.from_file(os.path.join(root, f), self.options, self, renderer)
  File "/usr/local/lib/python2.7/site-packages/wok/page.py", line 111, in from_file
    page.meta['content'] = page.renderer.render(page.original)
  File "/usr/local/lib/python2.7/site-packages/wok/renderers.py", line 46, in render
    return markdown(plain, Markdown.plugins)
  File "/usr/local/lib/python2.7/site-packages/markdown/init.py", line 419, in markdown
    return md.convert(text)
  File "/usr/local/lib/python2.7/site-packages/markdown/init.py", line 281, in convert
    source = unicode(source)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 1: ordinal not in range(128). -- Note: Markdown only accepts unicode input!

How to fix it?

In some other Python-based static blog apps, Chinese posts can be published successfully, for example with this app: http://github.com/vrypan/bucket3. On my site http://bc3.brite.biz/, Chinese posts are published successfully.

Aardwolf answered 15/1, 2014 at 4:15 Comment(1)

tl;dr / quick fix

  • Don't decode/encode willy-nilly
  • Don't assume your strings are UTF-8 encoded
  • Try to convert strings to Unicode strings as soon as possible in your code
  • Fix your locale: How to solve UnicodeDecodeError in Python 3.6?
  • Don't be tempted to use quick reload hacks

Unicode Zen in Python 2.x - The Long Version

Without seeing the source it's difficult to know the root cause, so I'll have to speak generally.

UnicodeDecodeError: 'ascii' codec can't decode byte generally happens when you try to convert a Python 2.x str that contains non-ASCII to a Unicode string without specifying the encoding of the original string.

In brief, Unicode strings are an entirely separate type of Python string that does not contain any encoding. They hold only Unicode code points and therefore can hold any character from across the entire Unicode spectrum. Strings contain encoded text, be it UTF-8, UTF-16, ISO-8859-1, GBK, Big5 etc. Strings are decoded to Unicode and Unicodes are encoded to strings. Files and text data are always transferred in encoded strings.

The Markdown module authors probably use unicode() (where the exception is thrown) as a quality gate to the rest of the code - it will convert ASCII or re-wrap existing Unicode strings into a new Unicode string. The Markdown authors can't know the encoding of the incoming string, so they rely on you to decode strings to Unicode strings before passing them to Markdown.

Unicode strings can be declared in your code using the u prefix to strings. E.g.

>>> my_u = u'my ünicôdé strįng'
>>> type(my_u)
<type 'unicode'>

Unicode strings may also come from files, databases and network modules. When this happens, you don't need to worry about the encoding.

Gotchas

Conversion from str to Unicode can happen even when you don't explicitly call unicode().

The following scenarios cause UnicodeDecodeError exceptions:

# Explicit conversion without encoding
unicode('€')

# New style format string into Unicode string
# Python will try to convert value string to Unicode first
u"The currency is: {}".format('€')

# Old style format string into Unicode string
# Python will try to convert value string to Unicode first
u'The currency is: %s' % '€'

# Append string to Unicode
# Python will try to convert string to Unicode first
u'The currency is: ' + '€'         
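
For contrast, here is a minimal sketch of the same operations done safely in Python 2, decoding the byte string once before mixing it with Unicode (this assumes the source file is saved as UTF-8):

# -*- coding: utf-8 -*-
euro = '€'.decode('utf-8')                  # decode the byte string once, up front
print u"The currency is: {}".format(euro)   # Unicode into Unicode: no implicit decode
print u'The currency is: %s' % euro
print u'The currency is: ' + euro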

Examples

In the following diagram, you can see how the word café has been encoded in either "UTF-8" or "Cp1252" encoding depending on the terminal type. In both examples, caf is just regular ASCII. In UTF-8, é is encoded using two bytes. In "Cp1252", é is 0xE9 (which also happens to be the Unicode code point value - it's no coincidence). The correct decode() is invoked and conversion to a Python Unicode is successful:

[Diagram of a string being converted to a Python Unicode string]

In this diagram, decode() is called with ascii (which is the same as calling unicode() without an encoding given). As ASCII can't contain bytes greater than 0x7F, this will throw a UnicodeDecodeError exception:

[Diagram of a string being converted to a Python Unicode string with the wrong encoding]
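
The same behaviour can be sketched at a Python 2 prompt; the bytes below are café encoded as UTF-8:

>>> 'caf\xc3\xa9'.decode('utf-8')    # correct encoding given: works
u'caf\xe9'
>>> 'caf\xc3\xa9'.decode('ascii')    # wrong encoding: 0xC3 is greater than 0x7F
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)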

The Unicode Sandwich

It's good practice to form a Unicode sandwich in your code, where you decode all incoming data to Unicode strings, work with Unicodes, then encode to strs on the way out. This saves you from worrying about the encoding of strings in the middle of your code.
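
A minimal sketch of the sandwich, assuming UTF-8 files on both sides (the file names are placeholders):

import io

with io.open('input.txt', 'r', encoding='utf-8') as f:    # decode at the boundary
    text = f.read()                                       # Unicode inside

shouted = text.upper()                                    # work purely with Unicode

with io.open('output.txt', 'w', encoding='utf-8') as f:   # encode at the boundary
    f.write(shouted)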

Input / Decode

Source code

If you need to bake non-ASCII into your source code, just create Unicode strings by prefixing the string with a u. E.g.

u'Zürich'

To allow Python to decode your source code, you will need to add an encoding header to match the actual encoding of your file. For example, if your file was encoded as 'UTF-8', you would use:

# encoding: utf-8

This is only necessary when you have non-ASCII in your source code.
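
Putting the two together, a complete source file might look like this (the file itself must actually be saved as UTF-8 for the header to be true):

# encoding: utf-8
city = u'Zürich'
print city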

Files

Usually non-ASCII data is received from a file. The io module provides a TextIOWrapper that decodes your file on the fly, using a given encoding. You must use the correct encoding for the file - it can't be easily guessed. For example, for a UTF-8 file:

import io
with io.open("my_utf8_file.txt", "r", encoding="utf-8") as my_file:
    my_unicode_string = my_file.read()

my_unicode_string would then be suitable for passing to Markdown. If you get a UnicodeDecodeError from the read() line, then you've probably used the wrong encoding value.

CSV Files

The Python 2.7 CSV module does not support non-ASCII characters 😩. Help is at hand, however, with https://pypi.python.org/pypi/backports.csv.

Use it like above, but wrap it in a function and pass the opened file to it (yield is only valid inside a generator function):

from backports import csv
import io

def read_rows():
    with io.open("my_utf8_file.txt", "r", encoding="utf-8") as my_file:
        for row in csv.reader(my_file):
            yield row

Databases

Most Python database drivers can return data in Unicode, but usually require a little configuration. Always use Unicode strings for SQL queries.

MySQL

In the connection string add:

charset='utf8',
use_unicode=True

E.g.

>>> import MySQLdb
>>> db = MySQLdb.connect(host="localhost", user='root', passwd='passwd', db='sandbox', use_unicode=True, charset="utf8")

PostgreSQL

Add:

psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)
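
For example, a sketch with hypothetical connection details:

import psycopg2
import psycopg2.extensions

# Ask psycopg2 to return text columns as Unicode objects
psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)

conn = psycopg2.connect(dbname='sandbox')   # hypothetical database
cur = conn.cursor()
cur.execute(u"SELECT name FROM cafes")      # hypothetical table; note the Unicode query
print type(cur.fetchone()[0])               # <type 'unicode'>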

HTTP

Web pages can be encoded in just about any encoding. The Content-type header should contain a charset field to hint at the encoding. The content can then be decoded manually against this value. Alternatively, Python-Requests returns Unicodes in response.text.
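
For example, a short sketch using Requests (the URL is a placeholder):

import requests

r = requests.get('http://example.com')
print r.encoding      # charset taken from the Content-Type header, or guessed
print type(r.text)    # <type 'unicode'> - already decoded for you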

Manually

If you must decode strings manually, you can simply do my_string.decode(encoding), where encoding is the appropriate encoding. Python 2.x supported codecs are given here: Standard Encodings. Again, if you get UnicodeDecodeError then you've probably got the wrong encoding.
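
For example, assuming the bytes are UTF-8 encoded:

my_bytes = 'caf\xc3\xa9'                      # café encoded as UTF-8
my_unicode = my_bytes.decode('utf-8')         # u'caf\xe9'
# An explicit error handler avoids the exception, at the cost of losing data:
lossy = my_bytes.decode('ascii', 'replace')   # u'caf\ufffd\ufffd'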

The meat of the sandwich

Work with Unicodes as you would normal strs.

Output

stdout / printing

print writes through the stdout stream. Python tries to configure an encoder on stdout so that Unicodes are encoded to the console's encoding. For example, if a Linux shell's locale is en_GB.UTF-8, the output will be encoded to UTF-8. On Windows, you will be limited to an 8-bit code page.

An incorrectly configured console, such as a corrupt locale, can lead to unexpected print errors. The PYTHONIOENCODING environment variable can force the encoding for stdout.
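
For example, on a Un*x shell (my_script.py being a placeholder):

$ PYTHONIOENCODING=utf-8 python my_script.py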

Files

Just like input, io.open can be used to transparently convert Unicodes to encoded byte strings.

Database

The same configuration for reading will allow Unicodes to be written directly.

Python 3

Python 3 is no more Unicode capable than Python 2.x, but it is slightly less confused on the topic. E.g., the regular str is now a Unicode string and the old str is now bytes.

The default encoding is UTF-8, so if you .decode() a byte string without giving an encoding, Python 3 uses UTF-8 encoding. This probably fixes 50% of people's Unicode problems.

Further, open() operates in text mode by default, so returns decoded str (Unicode ones). The encoding is derived from your locale, which tends to be UTF-8 on Un*x systems or an 8-bit code page, such as windows-1251, on Windows boxes.
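
A short Python 3 sketch of the same ideas:

# Python 3: str is Unicode and bytes are a separate, explicit type
data = 'café'.encode('utf-8')    # bytes: b'caf\xc3\xa9'
text = data.decode()             # UTF-8 assumed when no encoding is given
with open('my_utf8_file.txt', encoding='utf-8') as f:   # be explicit anyway
    text = f.read()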

Why you shouldn't use sys.setdefaultencoding('utf8')

It's a nasty hack (there's a reason you have to use reload) that will only mask problems and hinder your migration to Python 3.x. Understand the problem, fix the root cause and enjoy Unicode zen. See Why should we NOT use sys.setdefaultencoding("utf-8") in a py script? for further details.

Grizzled answered 16/2, 2016 at 22:54 Comment(5)
For someone looking for Python 2 answers, a more useful TLDR: use io.open for reading/writing files, use from __future__ import unicode_literals, configure other data inputs/outputs (e.g., databases) to use unicode.Potbellied
sooo how do we fix it? lol this isn't an issue from writing a script - it's from installing oneBeechnut
@Matthew try setting PYTHONIOENCODING=utf-8. If that doesn't fix it you'll need to contact the script's author to fix their code.Grizzled
What a life-saver. I would have been all over the place trying to figure out what to change where. The issue was with 2 parts of my script (Python 3.x): opening a file, and configuring my OS (BSD) locale (for the print). Very well-written!Democrat
I'm still having this issue with docker and Python 3.10. For some reason, even though my LANG and LC_ALL were both set to C.UTF-8, and locale does not return an error, python open still doesn't open in UTF-8 by default. The best help I got was to refer to the official documentation. Eventually, I set PYTHONUTF8=1 (see here.) rather than PYTHONIOENCODING. I noticed, though, that with the latest Python (3.12), the documentation was again changed, so check...Chunky

Finally I got it:

as3:/usr/local/lib/python2.7/site-packages# cat sitecustomize.py
# encoding=utf8  
import sys  

reload(sys)  
sys.setdefaultencoding('utf8')

Let me check:

as3:~/ngokevin-site# python
Python 2.7.6 (default, Dec  6 2013, 14:49:02)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.getdefaultencoding()
'utf8'
>>>

The above shows that the default encoding of Python is utf8. The error is then gone.

Aardwolf answered 17/1, 2014 at 16:3 Comment(14)
I tried this but it couldn't change the encoding permanently. Once I quit the python console and start again, the encoding is still the sameElouise
Traceback (most recent call last): File "./foobar.py", line 8, in <module> sys.setdefaultencoding ('utf8') AttributeError: 'module' object has no attribute 'setdefaultencoding'Terrier
Traceback (most recent call last): File "./foobar.py", line 8, in <module> reload(sys) NameError: name 'reload' is not definedTerrier
Thanks! But why do we need to reload sys after importing it?Syrupy
@DmitryNarkevich, because of the Illusive setdefaultencoding function. It is deleted at Python startup since it should never have been a part of a proper release in the first place, apparently.Escalate
Uf, best answer ever. Until this moment I was forced to include "# -*- coding: utf-8 -*-" at the beginning of each document. This is way easier and works like a charmCeroplastics
I did this in the end of my settings.py Django configuration and it solved a lot of problems, you made my day!Strainer
It means that you haven't fixed the root cause. You've just patched over any implied conversionGrizzled
@AlastairMcCormack patching is hardly the right word. This solution simply changes Python's default encoding from almost ancient 7-bit ascii to today's utf-8 world. Something that is long overdue IMHO.Nipa
@Nipa Python 3's default encoding is UTF-8 with Unicode strings as the default str, so it's not overdue there. In Python 2.x, Unicode was in a state of transition, so would've been dangerous to assume an encoding when converting bytes to Unicodes. Therefore, Py2's default encoding of ASCII was deliberate choice and why changing the default encoding requires the deliberate hack of reloading sys. The correct way to banish encoding errors in Py2 is to unambiguously decode and encode (byte) strings to Unicode, when conversions are necessary - not just assume strings are UTF-8 encoded.Grizzled
@Nipa also see: anonbadger.wordpress.com/2015/06/16/…Grizzled
Thanks man, this does work; however I had to set the PYTHONSTARTUP env variable to point to the sitecustomize.py file, after which python always knows the encoding is utf-8.Starknaked
It says "NameError: name 'reload' is not defined"Liege
Didn't work for me.. mine is already utf-8 but I still get this error.Tarantella

This is the classic "unicode issue". I believe that completely explaining what is happening is beyond the scope of a StackOverflow answer.

It is well explained here.

In very brief summary, you have passed something that is being interpreted as a string of bytes to something that needs to decode it into Unicode characters, but the default codec (ascii) is failing.

The presentation I pointed you to provides advice for avoiding this. Make your code a "unicode sandwich". In Python 2, the use of from __future__ import unicode_literals helps.
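
For example, a minimal sketch of what unicode_literals changes (the file must be saved as UTF-8 for the coding header to be true):

# -*- coding: utf-8 -*-
from __future__ import unicode_literals

s = 'café'        # now a unicode literal, no u prefix needed
print(type(s))    # <type 'unicode'>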

Update: how the code can be fixed:

OK - in your variable "source" you have some bytes. It is not clear from your question how they got in there - maybe you read them from a web form? In any case, they are not encoded with ASCII, but Python is trying to convert them to Unicode assuming that they are. You need to explicitly tell it what the encoding is. This means that you need to know what the encoding is! That is not always easy, and it depends entirely on where this string came from. You could experiment with some common encodings - for example, UTF-8. You tell unicode() the encoding as a second parameter:

source = unicode(source, 'utf-8')
Verdun answered 15/1, 2014 at 5:4 Comment(10)
it's still a headache. Mr GreenAsJade, can you give me a concrete solution?Aardwolf
Are you asking "how can I as a user of this blog avoid this problem?". Or is your question "how can I fix the code so this problem doesn't happen"?Verdun
Hi Mr GreenAsJade: I'm asking "how can I fix the code so this problem doesn't happen"?Aardwolf
I added some words about this.Verdun
Mr GreenAsJade: where should I put "source = unicode(source, 'utf-8')"?Aardwolf
In place of source = unicode(source), in File "/usr/local/lib/python2.7/site-packages/markdown/init.py", line 281Verdun
@Aardwolf I hope you got this working. At StackOverflow, the standard practice is to accept an answer if it helped you (by clicking the big tick next to the answer).Verdun
Weird ... after positive feedback for over a year, suddenly two negative votes...Huh?Verdun
use currentFile = open(filename, 'rt', encoding='latin1') or currentFile = open(filename, 'rt', encoding='utf-8') - see here: https://mcmap.net/q/75033/-switching-to-python-3-causing-unicodedecodeerrorSalerno
Nice video, thanks for sharing!Termitarium

In some cases, when you check your default encoding (print sys.getdefaultencoding()), it returns ASCII. If you change to UTF-8, it doesn't work, depending on the content of your variable. I found another way:

import sys
reload(sys)  
sys.setdefaultencoding('Cp1252')
Duston answered 5/11, 2014 at 22:1 Comment(4)
ty, this worked for my problem with python throwing UnicodeDecodeError on var = u"""vary large string"""Luigiluigino
AttributeError: module 'sys' has no attribute 'setdefaultencoding'Liege
and reload(sys) is used for that particular reason.Cardio
Worked for me ! THANKS !Langill

I was searching to solve the following error message:

unicodedecodeerror: 'ascii' codec can't decode byte 0xe2 in position 5454: ordinal not in range(128)

I finally got it fixed by specifying 'encoding':

f = open('../glove/glove.6B.100d.txt', encoding="utf-8")

I hope it helps you too.

Abnegate answered 6/3, 2018 at 12:57 Comment(2)
this solved the error for me when reading/writing .csv files, didn't need any of the other stuff listed in the other answersPassover
I don't understand why the other answers provide so much details... but forget about this simple solution. +10!Gilligan
"UnicodeDecodeError: 'ascii' codec can't decode byte"

Cause of this error: input_string must be unicode but str was given

"TypeError: Decoding Unicode is not supported"

Cause of this error: trying to convert unicode input_string into unicode


So first check that your input_string is str and convert to unicode if necessary:

if isinstance(input_string, str):
    input_string = unicode(input_string, 'utf-8')

Secondly, the above just changes the type but does not remove non-ASCII characters. If you want to remove non-ASCII characters:

if isinstance(input_string, str):
    input_string = input_string.decode('ascii', 'ignore').encode('ascii')  # note: this removes the characters and encodes back to a byte string

elif isinstance(input_string, unicode):
    input_string = input_string.encode('ascii', 'ignore')
Ochone answered 16/8, 2017 at 21:7 Comment(0)

To resolve this at the operating-system level in an Ubuntu installation, check the following:

$ locale charmap

If you get

locale: Cannot set LC_CTYPE to default locale: No such file or directory

instead of

UTF-8

then set LC_CTYPE and LC_ALL like this:

$ export LC_ALL="en_US.UTF-8"
$ export LC_CTYPE="en_US.UTF-8"
Arther answered 20/2, 2019 at 15:53 Comment(0)

I got the same error, and this solved it. Thanks! Python 2 and Python 3 differ in unicode handling, which makes pickled files quite incompatible to load across versions, so use the encoding argument of Python's pickle. The link below helped me solve a similar problem when I was trying to open pickled data from Python 3.7 that was originally saved in Python 2.x: https://blog.modest-destiny.com/posts/python-2-and-3-compatible-pickle-save-and-load/ I copied the load_pickle function into my script and called load_pickle(pickle_file) while loading my input_data like this:

input_data = load_pickle("my_dataset.pkl")

The load_pickle function is here:

import pickle

def load_pickle(pickle_file):
    try:
        with open(pickle_file, 'rb') as f:
            pickle_data = pickle.load(f)
    except UnicodeDecodeError as e:
        with open(pickle_file, 'rb') as f:
            pickle_data = pickle.load(f, encoding='latin1')
    except Exception as e:
        print('Unable to load data ', pickle_file, ':', e)
        raise
    return pickle_data
See answered 29/5, 2019 at 7:20 Comment(1)
it is better to include definition of load_pickle function in your answer.Nimbus

I find it best to always convert to unicode - but this is difficult to achieve, because in practice you'd have to check and convert every argument of every function and method you ever write that includes some form of string processing.

So I came up with the following approach, which guarantees either unicodes or byte strings from either input. In short, include and use the following lambdas:

# guarantee unicode string
_u = lambda t: t.decode('UTF-8', 'replace') if isinstance(t, str) else t
_uu = lambda *tt: tuple(_u(t) for t in tt) 
# guarantee byte string in UTF8 encoding
_u8 = lambda t: t.encode('UTF-8', 'replace') if isinstance(t, unicode) else t
_uu8 = lambda *tt: tuple(_u8(t) for t in tt)

Examples:

text='Some string with codes > 127, like Zürich'
utext=u'Some string with codes > 127, like Zürich'
print "==> with _u, _uu"
print _u(text), type(_u(text))
print _u(utext), type(_u(utext))
print _uu(text, utext), type(_uu(text, utext))
print "==> with u8, uu8"
print _u8(text), type(_u8(text))
print _u8(utext), type(_u8(utext))
print _uu8(text, utext), type(_uu8(text, utext))
# with % formatting, always use _u() and _uu()
print "Some unknown input %s" % _u(text)
print "Multiple inputs %s, %s" % _uu(text, text)
# but with string.format be sure to always work with unicode strings
print u"Also works with formats: {}".format(_u(text))
print u"Also works with formats: {},{}".format(*_uu(text, text))
# ... or use _u8 and _uu8, because string.format expects byte strings
print "Also works with formats: {}".format(_u8(text))
print "Also works with formats: {},{}".format(*_uu8(text, text))

Here's some more reasoning about this.

Nipa answered 2/1, 2015 at 17:20 Comment(4)
Hi, in Python 3 the function _u isn't working with this value 'Ita£'.Caliper
Ok, where to start on your "reasoning"? print unicode(u'Zürich', encoding="UTF-8") and then complain "But amazingly you can't encode unicode ext into UTF8". unicode() does not encode; it decodes and you can't decode a Unicode - it's decoded already!Grizzled
@AlastairMcCormack You are most welcome to improve the post. If however you prefer to sprinkle your alleged superiority over everyone else who does not share your opinion and insight, I'm quite frankly not interested. Thank you.Nipa
@Nipa I'm sorry, I didn't mean to come across like a jerk. Worrying about decoding and encoding every time you use a string in your code is just unnecessary.Grizzled

This worked for me:

    file = open('docs/my_messy_doc.pdf', 'rb')
Nubile answered 14/6, 2019 at 8:55 Comment(1)
If you are trying to upload a binary file such as image/video/audio, encoding is not needed; you should read bytes "as-is", which is why mode='rb' (read binary) does not require any encoding parameter.Amative

encode() converts a unicode object into a string object, but here you are trying to encode a string object. First convert your result into a unicode object, and then encode that unicode object with 'utf-8'. For example:

    result = yourFunction()
    result.decode().encode('utf-8')
Lang answered 18/9, 2017 at 14:56 Comment(0)

I had the same error with URLs containing non-ascii chars (bytes with values > 128). My solution:

url = url.decode('utf8').encode('utf-8')

Note: utf-8 and utf8 are simply aliases. Using either 'utf8' or 'utf-8' should work in the same way.

In my case, this worked in Python 2.7. I suppose the assignment changed 'something' in the str internal representation - i.e., it forced the right decoding of the backing byte sequence in url and finally put the string into a utf-8 str with all the magic in the right place. Unicode in Python is black magic for me. Hope this is useful.

Palumbo answered 20/7, 2018 at 21:0 Comment(3)
Why a dash in one and not the other?Micro
Python accepts aliases for encoding names; I tried it now, and it performed the same... I simply had not noticed that I wrote them differently; note addedPalumbo
This solved my problem as well.Christman

I had the same problem, but that fix didn't work for Python 3. I followed this and it solved my problem:

enc = sys.getdefaultencoding()
file = open(menu, "r", encoding = enc)

You have to set the encoding when you are reading/writing the file.

Philippeville answered 16/8, 2017 at 20:12 Comment(0)

I got the same problem with the string "Pastelería Mallorca" and I solved with:

unicode("Pastelería Mallorca", 'latin-1')
Bethink answered 29/5, 2017 at 8:47 Comment(0)

In short, to ensure proper unicode handling in Python 2:

  • use io.open for reading/writing files
  • use from __future__ import unicode_literals
  • configure other data inputs/outputs (e.g., databases, network) to use unicode
  • if you cannot configure outputs to utf-8, convert your output for them: print(text.encode('ascii', 'replace').decode())

For explanations, see @Alastair McCormack's detailed answer.

Potbellied answered 22/2, 2018 at 17:43 Comment(1)
• use io.open(path, 'r', encoding='utf-8') to read utf-8-encoded files.Yuk

In a Django (1.9.10)/Python 2.7.5 project I have frequent UnicodeDecodeError exceptions, mainly when I try to feed unicode strings to logging. I made a helper function for arbitrary objects that basically formats them to 8-bit ascii strings, replacing any characters not in the table with '?'. I think it's not the best solution, but since the default encoding is ascii (and I don't want to change it), it will do:

from collections import Iterable

def encode_for_logging(c, encoding='ascii'):
    if isinstance(c, basestring):
        return c.encode(encoding, 'replace')
    elif isinstance(c, Iterable):
        c_ = []
        for v in c:
            c_.append(encode_for_logging(v, encoding))
        return c_
    else:
        return encode_for_logging(unicode(c))
Scorpio answered 13/1, 2017 at 9:44 Comment(0)

This error occurs when there are non-ASCII characters in our string and we perform operations on that string without proper decoding. This helped me solve my problem. I am reading a CSV file with columns ID and Text, and decoding the characters in it as below:

train_df = pd.read_csv("Example.csv")
train_data = train_df.values
for i in train_data:
    print("ID :" + i[0])
    text = i[1].decode("utf-8",errors="ignore").strip().lower()
    print("Text: " + text)
Shaefer answered 26/7, 2018 at 6:47 Comment(0)

Specify # encoding: utf-8 at the top of your Python file; it should fix the issue.

Barbed answered 4/2, 2019 at 5:24 Comment(0)

I experienced this error with Python 2.7. It happened to me while trying to run many Python programs, but I managed to reproduce it with this simple script:

#!/usr/bin/env python

import subprocess
import sys

result = subprocess.Popen([u'svn', u'info'])
if not callable(getattr(result, "__enter__", None)) and not callable(getattr(result, "__exit__", None)):
    print("foo")
print("bar")

On success, it should print out 'foo' and 'bar', and probably an error message if you're not in an svn folder.

On failure, it should print 'UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 39: ordinal not in range(128)'.

After trying to regenerate my locales and many other solutions posted in this question, I learned the error was happening because I had a special character (ĺ) encoded in my PATH environment variable. After fixing the PATH in '~/.bashrc' and exiting my session and entering again (apparently sourcing '~/.bashrc' didn't work), the issue was gone.

Sunglass answered 25/1, 2021 at 14:23 Comment(0)

Here is my solution: just add the encoding, with open(file, encoding='utf8') as f.

And because reading a glove file will take a long time, I recommend converting the glove file to a numpy file. The next time you read the embedding weights, it will save you time.

import numpy as np
from tqdm import tqdm


def load_glove(file):
    """Loads GloVe vectors in numpy array.
    Args:
        file (str): a path to a glove file.
    Return:
        dict: a dict of numpy arrays.
    """
    embeddings_index = {}
    with open(file, encoding='utf8') as f:
        for i, line in tqdm(enumerate(f)):
            values = line.split()
            word = ''.join(values[:-300])
            coefs = np.asarray(values[-300:], dtype='float32')
            embeddings_index[word] = coefs

    return embeddings_index

# EMBEDDING_PATH = '../embedding_weights/glove.840B.300d.txt'
EMBEDDING_PATH = 'glove.840B.300d.txt'
embeddings = load_glove(EMBEDDING_PATH)

np.save('glove_embeddings.npy', embeddings) 

Gist link: https://gist.github.com/BrambleXu/634a844cdd3cd04bb2e3ba3c83aef227

Longoria answered 11/9, 2018 at 6:6 Comment(0)
