How do I read an Excel file into Python using xlrd? Can it read newer Office formats?
Asked Answered
I

5

11

My issue is below but would be interested comments from anyone with experience with xlrd.

I just found xlrd and it looks like the perfect solution but I'm having a little problem getting started. I am attempting to extract data programatically from an Excel file I pulled from Dow Jones with current components of the Dow Jones Industrial Average (link: http://www.djindexes.com/mdsidx/?event=showAverages)

When I open the file unmodified I get a nasty BIFF error (binary format not recognized)

However you can see in this screenshot that Excel 2008 for Mac thinks it is in 'Excel 1997-2004' format (screenshot: http://skitch.com/alok/ssa3/componentreport-dji.xls-properties)

If I instead open it in Excel manually and save as 'Excel 1997-2004' format explicitly, then open in python usig xlrd, everything is wonderful. Remember, Office thinks the file is already in 'Excel 1997-2004' format. All files are .xls

Here is a pastebin of an ipython session replicating the issue: http://pastie.textmate.org/private/jbawdtrvlrruh88mzueqdq

Any thoughts on: How to trick xlrd into recognizing the file so I can extract data? How to use python to automate the explicit 'save as' format to one that xlrd will accept? Plan B?

Illusory answered 23/9, 2008 at 0:58 Comment(0)
S
26

FWIW, I'm the author of xlrd, and the maintainer of xlwt (a fork of pyExcelerator). A few points:

  1. The file ComponentReport-DJI.xls is misnamed; it is not an XLS file, it is a tab-separated-values file. Open it with a text editor (e.g. Notepad) and you'll see what I mean. You can also look at the not-very-raw raw bytes with Python:

    >>> open('ComponentReport-DJI.xls', 'rb').read(200)
    'COMPANY NAME\tPRIMARY EXCHANGE\tTICKER\tSTYLE\tICB SUBSECTOR\tMARKET CAP RANGE\
    tWEIGHT PCT\tUSD CLOSE\t\r\n3M Co.\tNew York SE\tMMM\tN/A\tDiversified Industria
    ls\tBroad\t5.15676229508\t50.33\t\r\nAlcoa Inc.\tNew York SE\tA'
    

    You can read this file using Python's csv module ... just use delimiter="\t" in your call to csv.reader().

  2. xlrd can read any file that pyExcelerator can, and read them better—dates don't come out as floats, and the full story on Excel dates is in the xlrd documentation.

  3. pyExcelerator is abandonware—xlrd and xlwt are alive and well. Check out http://groups.google.com/group/python-excel

HTH John

Sizing answered 23/9, 2008 at 0:58 Comment(0)
V
3

xlrd support for Office 2007/2008 (OpenXML) format is in alpha test - see the following post in the python-excel newsgroup: http://groups.google.com/group/python-excel/msg/0c5f15ad122bf24b?hl=en

Vertebrate answered 23/9, 2008 at 0:58 Comment(2)
John Machin has released this publicly (to the mailing list) and has expressed concrete plans to have this integrated into a new full release of xlrd around the end of the year.Selfconsistent
This has been released a few days ago in v0.8: groups.google.com/forum/?fromgroups#!topic/python-excel/…Porch
C
1

More info on pyExcelerator: To read a file, do this:

import pyExcelerator
book = pyExcelerator.parse_xls(filename)

where filename is a string that is the filename to read (not a file-like object). This will give you a data structure representing the workbook: a list of pairs, where the first element of the pair is the worksheet name and the second element is the worksheet data.

The worksheet data is a dictionary, where the keys are (row, col) pairs (starting with 0) and the values are the cell contents -- generally int, float, or string. So, for instance, in the simple case of all the data being on the first worksheet:

data = book[0][1]
print 'Cell A1 of worksheet %s is: %s' % (book[0][0], repr(data[(0, 0)]))

If the cell is empty, you'll get a KeyError. If you're dealing with dates, they may (I forget) come through as integers or floats; if this is the case, you'll need to convert. Basically the rule is: datetime.datetime(1899, 12, 31) + datetime.timedelta(days=n) but that might be off by 1 or 2 (because Excel treats 1900 as a leap-year for compatibility with Lotus, and because I can't remember if 1900-1-1 is 0 or 1), so do some trial-and-error to check. Datetimes are stored as floats, I think (days and fractions of a day).

I think there is partial support for forumulas, but I wouldn't guarantee anything.

Calv answered 24/9, 2008 at 2:5 Comment(1)
pyExcelerator has a number of known bugs - the 'xlrd' fork is compatible and maintained.Senhor
S
0

Well here is some code that I did: (look down the bottom): here

Not sure about the newer formats - if xlrd can't read it, xlrd needs to have a new version released !

Smug answered 23/9, 2008 at 1:14 Comment(0)
C
-1

Do you have to use xlrd? I just downloaded 'UPDATED - Dow Jones Industrial Average Movers - 2008' from that website and had no trouble reading it with pyExcelerator.

import pyExcelerator
book = pyExcelerator.parse_xls('DJIAMovers.xls')
Calv answered 23/9, 2008 at 2:33 Comment(1)
There doesn't seem to be much documentation available for pyExcelerator. Could you maybe show me the basics of opening a file and extracting data?Illusory

© 2022 - 2024 — McMap. All rights reserved.