How to open an Excel file fast in Python?

I am currently using PyExcelerator to read Excel files, but it is extremely slow. Since I regularly need to open Excel files larger than 100 MB, it takes more than twenty minutes just to load a single file.

The functionality I need is:

  • Open Excel files, select specific tables, and load them into a dict or list object.
  • Sometimes: select specific columns and only load whole rows whose selected columns contain specific values.
  • Read password-protected Excel files.

And the code I am using now is:

import pyExcelerator
from collections import defaultdict
from types import StringType, UnicodeType

book = pyExcelerator.parse_xls(filepath)
parsed_dictionary = defaultdict(lambda: '', book[0][1])
number_of_columns = 44
result_list = []
number_of_rows = 500000
for i in range(0, number_of_rows):
    ok = False
    result_list.append([])
    for h in range(0, number_of_columns):
        item = parsed_dictionary[i,h]
        if type(item) is StringType or type(item) is UnicodeType:
            item = item.replace("\t","").strip()
        result_list[i].append(item)
        if item != '':
            ok = True
    if not ok:
        break

Any suggestions?

Prototype answered 3/5, 2011 at 4:12 Comment(6)
Have you tried other libraries yet? (I have no technical knowledge about this subject, I'm just interested)Toxicology
Yes, I tried, but those always lack functionality for writing xls. After reading the big xls files I have to do some calculations and save the results to a small xls too.Prototype
@FelixYan: Ok good to know, hope you get some good answers!Toxicology
For the writing part you could use xlwt or, if you're just writing values, you could use CSV format (which can be easily imported into Excel).Umeh
Does the 20 minutes include just the pyExcelerator.parse_xls() or are you counting your own subsequent code?Fehr
Only the pyExcelerator.parse_xls() takes more than 20 minutes.Prototype

pyExcelerator appears not to be maintained. To write xls files, use xlwt, which is a fork of pyExcelerator with bug fixes and many enhancements. The (very basic) xls reading capability of pyExcelerator was eradicated from xlwt. To read xls files, use xlrd.
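For example, here is a minimal xlrd sketch covering the first two requirements in the question; the file path, sheet name, and the filter column/value are placeholders, not anything from the question:

import xlrd

# on_demand=True avoids loading every sheet of a large multi-sheet workbook up front
book = xlrd.open_workbook("big_file.xls", on_demand=True)   # placeholder path
sheet = book.sheet_by_name("Sheet1")                         # or book.sheet_by_index(0)

wanted_column = 3      # placeholder: keep only rows where this column ...
wanted_value = "OK"    # ... contains this value

rows = []
for r in range(sheet.nrows):
    row = sheet.row_values(r)          # one list of cell values per row
    if row[wanted_column] == wanted_value:
        rows.append(row)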

If it's taking 20 minutes to load a 100MB xls file, you must be using one or more of: a slow computer, a computer with very little available memory, or an older version of Python.

Neither pyExcelerator nor xlrd read password-protected files.

Here's a link that covers xlrd and xlwt.

Disclaimer: I'm the author of xlrd and maintainer of xlwt.

Fehr answered 3/5, 2011 at 4:34 Comment(1)
Thank you, I'll try these two. In fact I am using an AMD Phenom II X4 945 with 4 GB RAM (2 GB or more of it free), an SSD, and Python 2.7 on an x86_64 Linux OS. The reading process can be even slower on other machines.Prototype

xlrd is pretty good for reading files and xlwt is pretty good for writing. Both are superior to pyExcelerator in my experience.
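Since the question also needs to write small result files, here is a minimal xlwt sketch; the file name, sheet name, and values are made up for illustration:

import xlwt

wb = xlwt.Workbook()
ws = wb.add_sheet("results")     # placeholder sheet name

# write(row, col, value)
ws.write(0, 0, "total")
ws.write(0, 1, 12345)

wb.save("results.xls")           # placeholder output path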

Amphichroic answered 3/5, 2011 at 4:19 Comment(0)

You could try to pre-allocate the list to its full size in a single statement, instead of appending one item at a time, like this (one large memory allocation should be faster than many small ones):

book = pyExcelerator.parse_xls(filepath)
parsed_dictionary = defaultdict(lambda: '', book[0][1])
number_of_columns = 44
number_of_rows = 500000
result_list = [[] for _ in range(number_of_rows)]   # pre-allocate one empty row list per row
for i in range(0, number_of_rows):
    ok = False
    #result_list.append([])
    for h in range(0, number_of_columns):
        item = parsed_dictionary[i,h]
        if type(item) is StringType or type(item) is UnicodeType:
            item = item.replace("\t","").strip()
        result_list[i].append(item)
        if item != '':
            ok = True
    if not ok:
        break

If doing this gives an appreciable performance increase, you could also try to preallocate each row with the number of columns and then assign cells by index rather than appending one value at a time. Here's a snippet that creates a 10x10 two-dimensional list in a single statement with an initial value of 0:

L = [[0] * 10 for i in range(10)]

So folded into your code, it might work something like this:

book = pyExcelerator.parse_xls(filepath)
parsed_dictionary = defaultdict(lambda: '', book[0][1])
number_of_columns = 44
number_of_rows = 500000
result_list = [[''] * number_of_columns for x in range(number_of_rows)]
for i in range(0, number_of_rows):
    ok = False
    #result_list.append([])
    for h in range(0, number_of_columns):
        item = parsed_dictionary[i,h]
        if type(item) is StringType or type(item) is UnicodeType:
            item = item.replace("\t","").strip()
        result_list[i][h] = item
        if item != '':
            ok = True
    if not ok:
        break
Memphian answered 3/5, 2011 at 4:33 Comment(7)
The problem is, I don't know the size of the xls file, so the number_of_rows variable is only my guess at the maximum size. So... will the pre-allocation take too much memory?Prototype
You know the number of columns but not the rows? Is the column count fixed? In any case, it might be worth a try. Do a performance comparison of subsets with the two different algorithms, on say 1000 rows. You can gauge from there.Memphian
Thank you. It's worth a try of course :P. And you are right, the column count is fixed, and I don't know the number of rows.Prototype
You could establish the number of rows by iterating first and just incrementing a counter until you hit a blank row. It would be an extra step before the assignments, but it would establish the upper bound.Memphian
Another thing you could do is look at your process with a Performance Monitoring app. Watch for memory growth and especially page faults. Page faults will smother a nested loop like yours and preallocation is a good way to get around them.Memphian
Hmm... if I get the idea right, you mean iterating twice is faster than allocating many times? Thank you, I'll try that!Prototype
Yes, but remember that the initial iteration should be very fast: it is only meant to loop and count the rows, and not perform any operation except incrementing an integer.Memphian

Unrelated to your question: if you're trying to check that none of the columns are empty strings, then set ok = True initially and do this instead in the inner loop: ok = ok and item != ''. Also, you can just use isinstance(item, basestring) to test whether a variable is a string.

Revised version

for i in range(0, number_of_rows):
    ok = True
    result_list.append([])
    for h in range(0, number_of_columns):
        item = parsed_dictionary[i,h]
        if isinstance(item, basestring):
            item = item.replace("\t","").strip()
        result_list[i].append(item)
        ok = ok and item != ''

    if not ok:
        break
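If the intent is instead the original behaviour of stopping at the first completely blank row, a variant without the running flag could build each row first and test it in one pass. This is an untested sketch of that idea (basestring is Python 2, matching the question):

for i in range(0, number_of_rows):
    # build the whole row first
    row = [parsed_dictionary[i, h] for h in range(number_of_columns)]
    # clean up string cells
    row = [v.replace("\t", "").strip() if isinstance(v, basestring) else v
           for v in row]
    if all(v == '' for v in row):   # completely blank row: stop reading
        break
    result_list.append(row)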
Coextend answered 3/5, 2011 at 4:41 Comment(1)
Thank you! I have not been comfortable with the type(item) is StringType or type(item) is UnicodeType thing for so long! But I don't find ok = ok and item != '' easy to read afterwards; it feels a little hacky :)Prototype

I built a library recently that may be of interest: https://github.com/ktr/sxl. It essentially tries to "stream" Excel files the way Python streams normal files, and it is therefore very fast when you only need a subset of the data (especially if it is near the beginning of the file).
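Based on the project README, usage looks roughly like this; treat the path, sheet index, and iteration details as assumptions and check the repository for the exact API:

from sxl import Workbook

wb = Workbook("big_file.xlsx")       # placeholder path
ws = wb.sheets[1]                    # first sheet (1-based), or wb.sheets['Sheet1']

# Rows are streamed, so you can stop early once you have what you need.
for i, row in enumerate(ws.rows):
    print(row)
    if i >= 9:                       # just the first 10 rows
        break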

Pushing answered 14/8, 2018 at 22:8 Comment(1)
Interesting library. I have a case in a project where I need to load a lot of Excel files into a table. I will try to use it and write the results here to help the community.Kindless

For my library/script tabxlsx I have compared read/write times with openpyxl and with my own code, which works directly on the XML data inside the Excel zip. One million numbers are stored in a 6 MB xlsx file; converting it to JSON took 40 and 55 seconds respectively. That shows the problem is in the Excel format itself, which requires a lot of zlib decompression and XML element parsing before you can even convert the numbers to native Python types.

On the other hand, this lets us estimate a lower bound of about 5 minutes for reading 100 MB of Excel data, because mine is close to the minimal code needed to decode the Excel format. You could only get faster than that with an Excel parser implemented fully as a C module, and even then I think the speedup would be minimal, as CPython's zlib and XML readers are already written in C.
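To see where the time goes, you can measure the raw unzip-and-parse cost with nothing but the standard library. The worksheet path below assumes the usual xlsx layout and the file name is a placeholder:

import time
import zipfile
import xml.etree.ElementTree as ET

start = time.time()
with zipfile.ZipFile("big_file.xlsx") as zf:        # placeholder path
    raw = zf.read("xl/worksheets/sheet1.xml")       # usual location of the first sheet
print("unzip: %.1fs, %d bytes of XML" % (time.time() - start, len(raw)))

start = time.time()
root = ET.fromstring(raw)                           # parse the sheet XML
print("parse: %.1fs, %d elements" % (time.time() - start, len(list(root.iter()))))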

Here's my script for further testing; it is probably the fastest solution currently available for loading xlsx data into native Python data types: https://pypi.org/project/tabxlsx/

Cortney answered 5/9 at 7:11 Comment(0)
