Pandas importing CSV and Excel file error
Asked Answered
T

1

8

I am trying to use Python Pandas to import a CSV file. The example data in this file is as follows where the first row is the column names separated by commas.

End Customer Organization ID,End Customer Organization Name,End Customer Top Parent Organization ID,End Customer Top Parent Organization Name,Reseller Top Parent ID,Reseller Top Parent Name,Business,Rev Sum Division,Rev Sum Category,Product Family,Version,Pricing Level,Summary Pricing Level,Detail Pricing Level,MS Sales Amount,MS Sales Licenses,Fiscal Year,Sales Date 
11027676,Baroda Western Uttar Pradesh Gramin Bankgfhgfnjgfnmjmhgmghmghmghmnghnmghnmhgnmghnghngh,4078446,Bank Of Barodadfhhgfjyjtkyukujkyujkuhykluiluilui;iooi';po'fserwefvegwegf,1809012,"Hcl Infosystems Ltd - Partnerdghftrutyhb frhywer5y5tyu6ui7iukluyj,lgjmfgnhfrgweffw",Server & CALsdgrgrfgtrhytrnhjdgthjtyjkukmhjmghmbhmgfngdfbndfhtgh,SQL Server & CALdfhtrhtrgbhrghrye5y45y45yu56juhydsgfaefwe,SQL CALdhdfthtrutrjurhjethfdehrerfgwerweqeadfawrqwerwegtrhyjuytjhyj,SQL CALdtrye45y3t434tjkabcjkasdhfhasdjkcbaksmjcbfuigkjasbcjkasbkdfhiwh,2005,Openfkvgjesropiguwe90fujklascnioawfy98eyfuiasdbcvjkxsbhg,Open Lklbjdfoigueroigbjvwioergyuiowerhgosdhvgfoisdhyguiserhguisrh,"Open Stddfm,vdnoghioerivnsdflierohgushdfovhsiodghuiohdbvgsjdhgouiwerho",125.85,1,FY07,12/28/2006
12835756,Uttam Strips Pvt Ltd,12835756,Uttam Strips Pvt Ltd,12565538,Redington C/O Fortis Financial Services Ltd,MBS,Dynamics ERP,Dynamics NAV,Dynamics NAV Business Essentials,Non-specific,Other,MBS SA,MBS New Customer Enhanc. Def,0,0,FY09,9/15/2008
12233135,Bhagwan Singh Tondon,12233135,Bhagwan Singh Tondon,2652941,H B S Systems Pvt Ltd,Server & CAL,SQL Server & CAL,SQL CAL,SQL CAL,Non-specific,Open,Open L&SA,Deferred Open L&SA - New,0,0,FY09,9/15/2008
11602305,Maya Academy Of Advanced Cinematics,9750934,Maya Entertainment Ltd,336146,Embee Software Pvt Ltd,Server & CAL,Windows Server & CAL,Windows Server HPC,Windows Compute Cluster Server,Non-specific,Open,Open V/MYO - Rec,OLV Perpet L&SA Recur-Def,0,0,FY09,9/25/2008
13336009,Remiel Softech Solution Pvt Ltd,13336009,Remiel Softech Solution Pvt Ltd,13335482,Redington C/O Remiel Softech Solutions Pvt Ltd,MBS,Dynamics ERP,Dynamics NAV,Dynamics NAV Business Essentials,Non-specific,Other,MBS SA,MBS New Customer Enhanc. Def,0,0,FY09,12/23/2008

I am using the below code to import:

import pandas as pd

df=pd.read_csv('file path.csv',sep=',')

It gave the following error:

Traceback (most recent call last):
  File "<pyshell#25>", line 1, in <module>
    df=pd.read_csv(filename,sep=',')
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 400, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 205, in _read
    return parser.read()
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 608, in read
    ret = self._engine.read(nrows)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 1028, in read
    data = self._reader.read(nrows)
  File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas\parser.c:6745)
  File "parser.pyx", line 728, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:6964)
  File "parser.pyx", line 804, in pandas.parser.TextReader._read_rows (pandas\parser.c:7780)
  File "parser.pyx", line 890, in pandas.parser.TextReader._convert_column_data (pandas\parser.c:8793)
  File "parser.pyx", line 950, in pandas.parser.TextReader._convert_tokens (pandas\parser.c:9484)
  File "parser.pyx", line 1026, in pandas.parser.TextReader._convert_with_dtype (pandas\parser.c:10642)
  File "parser.pyx", line 1046, in pandas.parser.TextReader._string_convert (pandas\parser.c:10853)
  File "parser.pyx", line 1278, in pandas.parser._string_box_utf8 (pandas\parser.c:15657)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc1 in position 90: invalid start byte

Since it looked like a Unicode error I ran with encoding changed this time:

df=pd.read_csv(filename,encoding='utf-16',sep=',')

It gave the following error:

Traceback (most recent call last):
  File "<pyshell#26>", line 1, in <module>
    df=pd.read_csv(filename,encoding='utf-16',sep=',')
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 400, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 198, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 479, in __init__
    self._make_engine(self.engine)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 586, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "C:\Python33\lib\site-packages\pandas\io\parsers.py", line 957, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "parser.pyx", line 477, in pandas.parser.TextReader.__cinit__ (pandas\parser.c:4434)
  File "parser.pyx", line 592, in pandas.parser.TextReader._get_header (pandas\parser.c:5660)
  File "parser.pyx", line 768, in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:7451)
  File "parser.pyx", line 1661, in pandas.parser.raise_parser_error (pandas\parser.c:18744)
pandas.parser.CParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.

Not sure why is this happening? Even tried converting the CSV file into Excel with Text to Columns and used read_excel function of Pandas. That too gave the error(below):

Traceback (most recent call last):
  File "<pyshell#30>", line 1, in <module>
    df=pd.read_excel('J:\dmqp on 192.168.1.41\MS Sales Dump (FY09)xls','MS Sales Dump (FY09)')
  File "C:\Python33\lib\site-packages\pandas\io\excel.py", line 52, in read_excel
    return ExcelFile(path_or_buf,kind=kind).parse(sheetname=sheetname,
  File "C:\Python33\lib\site-packages\pandas\io\excel.py", line 68, in __init__
    import xlrd # throw an ImportError if we need to
ImportError: No module named 'xlrd'

Can someone help with the above errors and what is wrong here while importing both as CSV and as Excel.

I tried this code with encoding changed:

df=pd.read_csv(filename,encoding='iso-8859-1',sep=',')

It didn't give any error but imported as one single column rather than breaking it to separate columns.

>>>df
<class 'pandas.core.frame.DataFrame'>
Int64Index: 263244 entries, 0 to 263243
Data columns (total 1 columns):
End Customer Organization ID,End Customer Organization Name,End Customer Top Parent Organization ID,End Customer Top Parent Organization Name,Reseller Top Parent ID,Reseller Top Parent Name,Business,Rev Sum Division,Rev Sum Category,Product Family,Version,Pricing Level,Summary Pricing Level,Detail Pricing Level,MS Sales Amount,MS Sales Licenses,Fiscal Year,Sales Date    263244  non-null values
dtypes: object(1)

After just checking for the above example data by storing it in a text file and then importing this is the output I got:

>>> df =pd.read_csv(r'J:\Data.txt')
>>> print(df)
   End Customer Organization ID  \
0                      11027676   
1                      12835756   
2                      12233135   
3                      11602305   
4                      13336009   

                      End Customer Organization Name  \
0  Baroda Western Uttar Pradesh Gramin Bankgfhgfn...   
1                               Uttam Strips Pvt Ltd   
2                               Bhagwan Singh Tondon   
3                Maya Academy Of Advanced Cinematics   
4                    Remiel Softech Solution Pvt Ltd   

   End Customer Top Parent Organization ID  \
0                                  4078446   
1                                 12835756   
2                                 12233135   
3                                  9750934   
4                                 13336009   

           End Customer Top Parent Organization Name  Reseller Top Parent ID  \
0  Bank Of Barodadfhhgfjyjtkyukujkyujkuhykluiluil...                 1809012   
1                               Uttam Strips Pvt Ltd                12565538   
2                               Bhagwan Singh Tondon                 2652941   
3                             Maya Entertainment Ltd                  336146   
4                    Remiel Softech Solution Pvt Ltd                13335482   

                            Reseller Top Parent Name  \
0  Hcl Infosystems Ltd - Partnerdghftrutyhb frhyw...   
1        Redington C/O Fortis Financial Services Ltd   
2                              H B S Systems Pvt Ltd   
3                             Embee Software Pvt Ltd   
4     Redington C/O Remiel Softech Solutions Pvt Ltd   

                                            Business  \
0  Server & CALsdgrgrfgtrhytrnhjdgthjtyjkukmhjmgh...   
1                                                MBS   
2                                       Server & CAL   
3                                       Server & CAL   
4                                                MBS   

                                    Rev Sum Division  \
0  SQL Server & CALdfhtrhtrgbhrghrye5y45y45yu56ju...   
1                                       Dynamics ERP   
2                                   SQL Server & CAL   
3                               Windows Server & CAL   
4                                       Dynamics ERP   

                                    Rev Sum Category  \
0  SQL CALdhdfthtrutrjurhjethfdehrerfgwerweqeadfa...   
1                                       Dynamics NAV   
2                                            SQL CAL   
3                                 Windows Server HPC   
4                                       Dynamics NAV   

                                      Product Family       Version  \
0  SQL CALdtrye45y3t434tjkabcjkasdhfhasdjkcbaksmj...          2005   
1                   Dynamics NAV Business Essentials  Non-specific   
2                                            SQL CAL  Non-specific   
3                     Windows Compute Cluster Server  Non-specific   
4                   Dynamics NAV Business Essentials  Non-specific   

                                       Pricing Level  \
0  Openfkvgjesropiguwe90fujklascnioawfy98eyfuiasd...   
1                                              Other   
2                                               Open   
3                                               Open   
4                                              Other   

                               Summary Pricing Level  \
0  Open Lklbjdfoigueroigbjvwioergyuiowerhgosdhvgf...   
1                                             MBS SA   
2                                          Open L&SA   
3                                   Open V/MYO - Rec   
4                                             MBS SA   

                                Detail Pricing Level  MS Sales Amount  \
0  Open Stddfm,vdnoghioerivnsdflierohgushdfovhsio...           125.85   
1                       MBS New Customer Enhanc. Def             0.00   
2                           Deferred Open L&SA - New             0.00   
3                          OLV Perpet L&SA Recur-Def             0.00   
4                       MBS New Customer Enhanc. Def             0.00   

   MS Sales Licenses Fiscal Year Sales Date   
0                  1        FY07  12/28/2006  
1                  0        FY09   9/15/2008  
2                  0        FY09   9/15/2008  
3                  0        FY09   9/25/2008  
4                  0        FY09  12/23/2008  
>>> 

This is adding '\' after each column and column names are not one after the other. Instead they seem to be on new line after each column is imported.

Tingly answered 10/10, 2013 at 10:29 Comment(14)
What version of Pandas are you using, I copied your data into a text and the following worked with no errors: df=pd.read_csv(r'c:\temp\data.txt') so used no parameters and it worked, this was on Python 2.7 64-bitDitto
@edChum I am using Python 3.3.2 on Windown 64 bit machine with Pandas 0.12.0 installed.Tingly
Worked for me also on Python 3.3.2.1 with Pandas 0.12.0, there is possibly some encoding issues in your data that is not shown here in your sample data. What happens when you copy the data from this post into a text file and use read_csv again, does it work?Ditto
I used the above data. It imports but the output is of this format: End Customer Organization ID \ 0 11027676 1 12835756 2 12233135 3 11602305 4 13336009 End Customer Organization Name \ 0 Baroda Western Uttar Pradesh Gramin Bankgfhgfn... 1 Uttam Strips Pvt Ltd 2 Bhagwan Singh Tondon 3 Maya Academy Of Advanced CinematicsTingly
@Ditto I have added the output in the question above. please have a look. It gives '\' and in different lines.Tingly
are you complaining about the output? This is a result of Pandas, by default if the column width or display width is exceeded then the output is spanned across multiple lines. The dataframe is still correct. You can change the output settings pd.set_option('display.width', 1000) and pd.set_option('display.max_colwidth', 100), you can modify the numbers as you see fit, if you are in IPython then tab completion on set_option will display the docstring and if you press tab 3 times it will display the current settingsDitto
Thanks Ed. Just a last couple of questions.1). This seems to work fine for .txt file. But why isn't working fine for .csv or .xls file as mentioned in the above question? Secondly is there a way where we can store a data frame in Python as a table (data set ) and see all the variables if correctly imported rather than seeing the output getting displayed in the shell which doesn't show the whole output properly formatted. Like in SAS or other tools if we import a flat file it creates a new data set for it which is saved on a physical space rather than memory like a data frame. Is that possible?Tingly
1. There is no difference between txt and csv, I would guess there is a problem with encoding, if you changed the extension from .txt to .csv it would still work, for excel .xls you do not have xlrd module installed is the principal error so you need that installed. 2. I'm not sure what you mean, Pandas stores and reads the data correctly (so long as it can deduce the data types), you can look at pytables as alternative, you can modify the display options using set_option like I said, or you can use h5py to store and retrieve data. There is nothing stopping you from loading csv with exclDitto
Thanks. Any idea why it gives the above two errors while importing cvs file with UTF8 and UTF16 encoding? For UTF16 it says error with C parser while with UTF8 it talks about encoding issue as highlighted above. For txt it didn't give any error.Tingly
unfortunately I don't know. It should work as your version of Pandas is not that old see this, it may be worth investigating some of the answers there to see if it helpsDitto
@Ditto One Problem I see in the output for dataframe in the console even after maxing out rows and column width is that its not properly formatted. The column values are not properly aligned under the same column and is just like a free flowing text. That makes it difficult to look at the data. Any idea how we overcome that?Tingly
unfortunately I don't know, you could try turning on HTML table representation set_option('display.notebook_repr_html', True) but this could be unwieldly due to the size of the data, beyond that I don't know.Ditto
@Ditto Thanks. The problem which I see is that the same complete file with comma delimited value throws the UnicodeDecode error as mentioned above. When I use the encoding='CP1252' in read_csv function it doesnt throw an error but reads all columns as one single column. Interestingly I have a similar file which is tab delimited instead of comma delimited. It was giving the same UnicodeDecode error when importing using read_csv. But when used the same encoding='CP1252',sep='\t' it imports correctly without any errors. So for tab del this encoding works but not for comma del. Any idea on this?Tingly
unfortunately I don't it is worth adding this information to your question rather than it living only in the comments but I don't know if people more knowledgeable will be look at this question again. It seems to me that this is an encoding error so you could post a new question, it would help to post a few lines of data so that others can look at it.Ditto
B
3

I guess your main problem has to do with encoding. I have suffered the pain of dealing with weird encodings in csv files. What helped me in those cases was to try to detect the real encoding of the file and load it correctly with pandas.

give this next code a try:

from chardet.universaldetector import UniversalDetector

def test_encoding(file_name):
    detector = UniversalDetector()
    with open(file_name, 'rb') as f:
        for line in f:
            detector.feed(line)
            if detector.done:
                 break
        detector.close()
    r = detector.result
    return "Detected encoding %s with confidence %s" % (r['encoding'], r['confidence'])

# pass the file path in the function to see result
test_encoding('C:\Users\..\file.csv')

Output:

'Detected encoding UTF-16 with confidence 1.0'

This will try to infer the encoding of your file and than you can try to load it correctly using pandas. Hope it helps...

Banjo answered 6/5, 2016 at 13:29 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.