UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte, while reading csv file in pandas

B

5

42

I know similar questions have been asked already. I have seen all of them and tried their suggestions, but they were of little help. I am using OS X 10.11 El Capitan and Python 3.6 in a virtual environment (I also tried without it). I am using Jupyter Notebook and Spyder 3.

I am new to Python but know basic ML, and I am following a post to learn how to solve Kaggle challenges: Link to Blog, Link to Data Set

I am stuck at the first few lines of code:

import pandas as pd

destinations = pd.read_csv("destinations.csv")
test = pd.read_csv("test.csv")
train = pd.read_csv("train.csv")

and it gives me this error:

UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-19-a928a98eb1ff> in <module>()
      1 import pandas as pd
----> 2 df = pd.read_csv('destinations.csv', compression='infer',date_parser=True, usecols=([0,1,3]))
      3 df.head()

/usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    653                     skip_blank_lines=skip_blank_lines)
    654 
--> 655         return _read(filepath_or_buffer, kwds)
    656 
    657     parser_f.__name__ = name

/usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    403 
    404     # Create the parser.
--> 405     parser = TextFileReader(filepath_or_buffer, **kwds)
    406 
    407     if chunksize or iterator:

/usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    762             self.options['has_index_names'] = kwds['has_index_names']
    763 
--> 764         self._make_engine(self.engine)
    765 
    766     def close(self):

/usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
    983     def _make_engine(self, engine='c'):
    984         if engine == 'c':
--> 985             self._engine = CParserWrapper(self.f, **self.options)
    986         else:
    987             if engine == 'python':

/usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
   1603         kwds['allow_leading_cols'] = self.index_col is not False
   1604 
-> 1605         self._reader = parsers.TextReader(src, **kwds)
   1606 
   1607         # XXX

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.__cinit__ (pandas/_libs/parsers.c:6175)()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._get_header (pandas/_libs/parsers.c:9691)()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Some answers on Stack Overflow suggested that this happens because the file is gzipped, but Chrome downloaded a .csv file. There was no .csv.gz anywhere, and trying to read one returned a file-not-found error.

I then read somewhere that I should use encoding='latin1', but after doing this I get a parser error:

---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
<ipython-input-21-f9c451f864a2> in <module>()
      1 import pandas as pd
      2 
----> 3 destinations = pd.read_csv("destinations.csv",encoding='latin1')
      4 test = pd.read_csv("test.csv")
      5 train = pd.read_csv("train.csv")

/usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    653                     skip_blank_lines=skip_blank_lines)
    654 
--> 655         return _read(filepath_or_buffer, kwds)
    656 
    657     parser_f.__name__ = name

/usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    409 
    410     try:
--> 411         data = parser.read(nrows)
    412     finally:
    413         parser.close()

/usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in read(self, nrows)
   1003                 raise ValueError('skipfooter not supported for iteration')
   1004 
-> 1005         ret = self._engine.read(nrows)
   1006 
   1007         if self.options.get('as_recarray'):

/usr/local/lib/python3.6/site-packages/pandas/io/parsers.py in read(self, nrows)
   1746     def read(self, nrows=None):
   1747         try:
-> 1748             data = self._reader.read(nrows)
   1749         except StopIteration:
   1750             if self._first_chunk:

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read (pandas/_libs/parsers.c:10862)()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory (pandas/_libs/parsers.c:11138)()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows (pandas/_libs/parsers.c:11884)()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows (pandas/_libs/parsers.c:11755)()

pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error (pandas/_libs/parsers.c:28765)()

ParserError: Error tokenizing data. C error: Expected 2 fields in line 11, saw 3

I have spent hours debugging this. I tried to open the CSV files in Atom (no other app could open them) and in online web apps (some crashed), but nothing helped. I have also tried using the kernels of other people who have solved the problem, but that did not help either.

Buchner answered 20/6, 2017 at 17:47 Comment(2)

What's the separator? – Tris 20/6, 2017 at 17:54

I don't know. I am new to all of this. I just downloaded the dataset as given in the post and tried to execute the lines, but got an error. I don't know how to find out the separator. I have put the link at the top; maybe you can find it there. Thanks – Buchner 20/6, 2017 at 17:57

S

85

It's still most likely gzipped data. gzip's magic number is 0x1f 0x8b, which is consistent with the UnicodeDecodeError you get.

You could try decompressing the data on the fly:

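import gzip
import pandas as pd

# Open the raw bytes and let GzipFile decompress them as pandas reads.
with open('destinations.csv', 'rb') as fd:
    gzip_fd = gzip.GzipFile(fileobj=fd)
    destinations = pd.read_csv(gzip_fd)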

Or use pandas' built-in gzip support:

destinations = pd.read_csv('destinations.csv', compression='gzip')
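
If it is unclear whether a given file is compressed, the magic number mentioned above can be checked directly. The following is only a minimal sketch: the helper name `compression_for` and the throwaway sample files are made up for illustration and are not part of pandas.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def compression_for(path):
    """Return 'gzip' if the file starts with the gzip magic number, else None."""
    with open(path, "rb") as fd:
        return "gzip" if fd.read(2) == GZIP_MAGIC else None


# Self-check with two throwaway files (hypothetical data).
tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, "plain.csv")
zipped_path = os.path.join(tmp_dir, "zipped.csv")  # gzipped despite the .csv name

with open(plain_path, "wb") as fd:
    fd.write(b"a,b\n1,2\n")
with gzip.open(zipped_path, "wb") as fd:
    fd.write(b"a,b\n1,2\n")

print(compression_for(plain_path))   # None
print(compression_for(zipped_path))  # gzip
```

The return value is meant to be passed on to pandas, for example pd.read_csv(path, compression=compression_for(path)), since compression=None tells pandas not to decompress.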

Sector answered 20/6, 2017 at 18:4 Comment(5)

Thanks! It worked. But I would like to know why this happened to me when other people who did this didn't get the error. For example, see this submission: kaggle.com/benjaminabel/pandas-version-of-most-popular-hotels – Buchner 20/6, 2017 at 18:15

I would assume that submission only works on input files that are already unzipped. – Sector 20/6, 2017 at 18:29

But my file shows a .csv extension, the same as what Chrome downloaded. So how can it be zipped? Shouldn't it be .csv.gz? – Buchner 20/6, 2017 at 18:47

Look, I don't know about the specifics of your or that other guy's browser. The thing that matters here is that if the file is gzipped, you need to decompress it before you feed it to pandas. – Sector 21/6, 2017 at 9:8

If you came here thanks to the "gzip" and "0x8b" keywords but are trying to import data into Postgres: just set the metadata encoding to gzip :) (see https://mcmap.net/q/391512/-ingesting-gzip-file-from-s3-to-postgres-invalid-byte-sequence-for-encoding-quot-utf8-quot) – Bracer 20/7, 2022 at 10:35

L

6

Try including this encoding while reading the csv file

pd.read_csv('csv_file', encoding='ISO-8859-1')
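
For context on what this encoding does: Latin-1 assigns a character to every possible byte value, so decoding with it never raises a UnicodeDecodeError. That does not mean the decoded text is meaningful. A small sketch with made-up sample data, showing that gzip-compressed bytes decode "successfully" as Latin-1 but are still not the CSV text:

```python
import gzip

csv_bytes = b"a,b\n1,2\n3,4\n"  # made-up sample rows
compressed = gzip.compress(csv_bytes)

# The byte at position 1 is 0x8b, which is not a valid UTF-8 start byte.
try:
    compressed.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Latin-1 maps every byte value 0-255 to a character, so it never fails ...
as_latin1 = compressed.decode("latin1")

# ... but the result is still compressed data, not the original CSV text.
print(utf8_ok)                                   # False
print(as_latin1 == csv_bytes.decode("utf-8"))    # False
print(gzip.decompress(compressed) == csv_bytes)  # True
```

So if the file is actually compressed, changing the encoding only replaces the decode error with a parsing error further down the line.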

Lampblack answered 21/1, 2021 at 20:37 Comment(1)

Did not work for my case. Any explanation on the encoding value used? – Religious 9/3, 2022 at 8:51

T

3

Can you try using codecs?

import codecs
import pandas as pd

# errors='ignore' silently drops any bytes that are not valid UTF-8.
with codecs.open("destinations.csv", "r", encoding="utf-8", errors="ignore") as file_data:
    destinations = pd.read_csv(file_data)

Trinidad answered 20/6, 2017 at 18:0 Comment(1)

I tried, but I am getting this error: ParserError: Error tokenizing data. C error: Expected 2 fields in line 11, saw 3 – Buchner 20/6, 2017 at 18:8

B

0

I had the same issue, although in my case the problem was with line endings. The file had Windows line endings (CRLF) but was being manipulated on a Linux machine. Using VSCode to set the correct line endings fixed the issue.
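
The same conversion can also be done without an editor. This is only a sketch of one way to do it in Python; the function name `normalize_newlines` is made up for illustration, and it is applied here to in-memory sample bytes rather than to a real file.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert Windows (CRLF) line endings to Unix (LF)."""
    return data.replace(b"\r\n", b"\n")


sample = b"a,b\r\n1,2\r\n"  # made-up CSV content with CRLF endings
print(normalize_newlines(sample))  # b'a,b\n1,2\n'
```

For a file on disk, the bytes would be read in 'rb' mode, passed through the function, and written back in 'wb' mode.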

Bekah answered 17/5, 2023 at 12:33 Comment(0)

I

0

I am learning Python and had the same problem while reading a CSV file with pandas. In my case, using the latin1 encoding fixed the issue:

pd.read_csv('Dataset.csv', encoding='latin1')

While looking for the solution I also learned that ISO 8859-1 is the same encoding as latin1. Every character is encoded as a single byte, and there are 191 characters in total. This worked as well:

pd.read_csv('Dataset.csv', encoding='ISO 8859-1')
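
That the two names refer to one codec can be checked with the standard library. A short sketch, assuming only the `codecs` module; note that Python's implementation of the codec accepts all 256 byte values, including the control codes that are not counted among the printable characters.

```python
import codecs

# Different spellings resolve to the same underlying codec.
names = {codecs.lookup(alias).name for alias in ("latin1", "ISO-8859-1", "iso8859-1")}
print(names)  # {'iso8859-1'}

# Every byte value decodes to exactly one character and round-trips unchanged.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin1")
print(len(decoded))                         # 256
print(decoded.encode("latin1") == all_bytes)  # True
```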

Implicate answered 23/9, 2023 at 22:39 Comment(0)
