Faster way to read Excel files to pandas dataframe

I have a 14MB Excel file with five worksheets that I'm reading into a Pandas dataframe, and although the code below works, it takes 9 minutes!

Does anyone have suggestions for speeding it up?

import pandas as pd

def OTT_read(xl,site_name):
    df = pd.read_excel(xl.io, sheet_name=site_name, skiprows=2, parse_dates=[0],
                       index_col=0, usecols=[0, 1, 2], header=None,
                       names=['date_time', '%s_depth' % site_name, '%s_temp' % site_name])
    return df

def make_OTT_df(FILEDIR,OTT_FILE):
    xl = pd.ExcelFile(FILEDIR + OTT_FILE)
    site_names = xl.sheet_names
    df_list = [OTT_read(xl,site_name) for site_name in site_names]
    return site_names,df_list

FILEDIR='c:/downloads/'
OTT_FILE='OTT_Data_All_stations.xlsx'
site_names_OTT,df_list_OTT = make_OTT_df(FILEDIR,OTT_FILE)
Ernaernald answered 27/2, 2015 at 13:21 Comment(3)
Could you try saving as CSV and loading that? It's possible the Excel reader is not as fast as the CSV one. – Prepositive
It has multiple worksheets though, so won't that not work? – Ernaernald
You should still be able to save each sheet; unfortunately the pain here is that you have to save each sheet separately. 14MB is not a large size, and the CSV reader will eat this up very quickly. Another option may be to try ExcelFile.parse. – Prepositive
49

As others have suggested, CSV reading is faster. So if you are on Windows and have Excel, you could call a VBScript to convert the Excel file to CSV and then read the CSV. I tried the script below and it took about 30 seconds.

import pandas as pd
from subprocess import call

# create a list with the sheet numbers you want to process (1-based)
sheets = map(str, range(1, 6))

# convert each sheet to csv and then read it using read_csv
df = {}
excel = 'C:\\Users\\rsignell\\OTT_Data_All_stations.xlsx'
for sheet in sheets:
    csv = 'C:\\Users\\rsignell\\test' + sheet + '.csv'
    call(['cscript.exe', 'C:\\Users\\rsignell\\ExcelToCsv.vbs', excel, csv, sheet])
    df[sheet] = pd.read_csv(csv)

Here's a little snippet of Python to create the ExcelToCsv.vbs script:

#write vbscript to file
vbscript="""if WScript.Arguments.Count < 3 Then
    WScript.Echo "Please specify the source and the destination files. Usage: ExcelToCsv <xls/xlsx source file> <csv destination file> <worksheet number (starts at 1)>"
    Wscript.Quit
End If

csv_format = 6

Set objFSO = CreateObject("Scripting.FileSystemObject")

src_file = objFSO.GetAbsolutePathName(Wscript.Arguments.Item(0))
dest_file = objFSO.GetAbsolutePathName(WScript.Arguments.Item(1))
worksheet_number = CInt(WScript.Arguments.Item(2))

Dim oExcel
Set oExcel = CreateObject("Excel.Application")

Dim oBook
Set oBook = oExcel.Workbooks.Open(src_file)
oBook.Worksheets(worksheet_number).Activate

oBook.SaveAs dest_file, csv_format

oBook.Close False
oExcel.Quit
""";

with open('ExcelToCsv.vbs', 'w') as f:
    f.write(vbscript)

This answer benefited from "Convert XLS to CSV on command line" and "csv & xlsx files import to pandas data frame: speed issue".

Wyrick answered 27/2, 2015 at 16:15 Comment(3)
If on Linux, what is the solution? – Carlynne
It was giving the error "TypeError: write() argument must be str, not bytes", so I changed it to f = open('ExcelToCsv.vbs','wb'). Thanks. – Prebo
Be careful converting your file to CSV. Excel's CSV converter does things to the data. For instance, it'll remove leading zeroes from text: it might convert a Boston zip code from 02108 to 2108. It can't handle values with mixes of commas and quotes well. It'll screw with your number formatting, possibly losing data. – Cara
21

I used xlsx2csv to convert the Excel file to CSV in memory, and this helped cut the read time to about half.

from io import StringIO

import pandas as pd
from xlsx2csv import Xlsx2csv


def read_excel(path: str, sheet_id: int = 1) -> pd.DataFrame:
    # convert one sheet (1-based id) to CSV in an in-memory buffer, then parse it
    buffer = StringIO()
    Xlsx2csv(path, outputencoding="utf-8").convert(buffer, sheetid=sheet_id)
    buffer.seek(0)
    return pd.read_csv(buffer)
Golem answered 27/7, 2021 at 21:43 Comment(3)
I tried using this code but I think it might be outdated. For anyone looking to use it, pass the sheet id to convert() instead: Xlsx2csv(path, outputencoding="utf-8").convert(buffer, sheetid=sheet_index), then buffer.seek(0) and pd.read_csv(buffer, low_memory=False). – Greyback
In my tests, the performance difference between pd.read_excel(excel_file, sheet_name=None) and the Xlsx2csv version is minimal, with Xlsx2csv sometimes being slower. My file has 5 worksheets with two columns and ~9k rows each. – Nearby
In my practice, Xlsx2csv saved approximately 40% of the time. – Congius
9

If you have fewer than 65,536 rows in each sheet, you can try xls instead of xlsx. In my experience, xls is faster than xlsx. It is difficult to compare against CSV because that depends on the number of sheets.

Although this is not an ideal solution (xls is an old proprietary binary format), I have found it useful if you are working with many sheets, internal formulas with values that are often updated, or if for whatever reason you would like to keep the Excel multi-sheet functionality (instead of separate CSV files).
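
For illustration, a minimal sketch, assuming a hypothetical data.xls saved from the workbook and the xlrd package installed (which pandas uses for the legacy .xls format):

import pandas as pd

# .xls caps each sheet at 65,536 rows; sheet_name=None loads every sheet
# into a dict of DataFrames keyed by sheet name ("data.xls" is a placeholder)
df_dict = pd.read_excel("data.xls", sheet_name=None)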

Bohi answered 27/7, 2017 at 12:5 Comment(0)
5

In my experience, Pandas read_excel() works fine with Excel files that have multiple sheets. As suggested in Using Pandas to read multiple worksheets, if you set sheet_name to None it will automatically read every sheet and output a dictionary of DataFrames keyed by sheet name.

But the reason it takes time is the text parsing in your code. A 14MB Excel file with 5 sheets is not that much. I have a 20.1MB Excel file with 46 sheets, each with more than 6,000 rows and 17 columns, and using read_excel it took the time shown below:

import time
import datetime as dt

import pandas as pd

t0 = time.time()

def parse(datestr):
    y, m, d = datestr.split("/")
    return dt.date(int(y), int(m), int(d))

# note: read_excel does not take an encoding argument
data = pd.read_excel("DATA (1).xlsx", sheet_name=None, skiprows=1,
                     header=0, parse_dates=[1], date_parser=parse)

t1 = time.time()

print(t1 - t0)
## result: 37.54169297218323 seconds

In the code above, data is a dictionary of 46 DataFrames.

As others suggested, using read_csv() can help because reading a .csv file is faster. But consider that since .xlsx files are compressed, the .csv files might be larger and hence slower to read. If you want to convert your file to comma-separated values using Python (VB code is offered by Rich Signell), see: Convert xlsx to csv

Unwinking answered 9/9, 2019 at 10:56 Comment(0)
4

I encourage you to do the comparison yourself and see which approach is appropriate in your situation.

For instance, if you are processing a lot of XLSX files and are only ever going to read each one once, you may not want to worry about the CSV conversion. However, if you are going to read the same data over and over again, then I would highly recommend saving each of the worksheets in the workbook to a CSV once, then reading them repeatedly with pd.read_csv().

Below is a simple script that will let you compare Importing XLSX Directly, Converting XLSX to CSV in memory, and Importing CSV. It is based on Jing Xue's answer.

Spoiler alert: If you are going to read the file(s) multiple times, it's going to be faster to convert the XLSX to CSV.

I did some testing with some files I'm working on, and here are my results:

5,874 KB xlsx file (29,415 rows, 58 columns)
Elapsed time for [Import XLSX with Pandas]:    0:00:31.75
Elapsed time for [Convert XLSX to CSV in mem]: 0:00:22.19
Elapsed time for [Import CSV file]:            0:00:00.21

********************
202,782 KB xlsx file (990,832 rows, 58 columns)
Elapsed time for [Import XLSX with Pandas]:    0:17:04.31
Elapsed time for [Convert XLSX to CSV in mem]: 0:12:11.74
Elapsed time for [Import CSV file]:            0:00:07.11

YES! The 202MB file really did take only 7 seconds as CSV, compared to 17 minutes for the XLSX!

If you're ready to set up your own test, just open your XLSX in Excel and save one of the worksheets to CSV. For a final solution, you would obviously need to loop through the worksheets to process each one.

You will also need to pip install rich pandas xlsx2csv.

from datetime import datetime
from io import StringIO

import pandas as pd
from rich import print
from xlsx2csv import Xlsx2csv

def timer(name, startTime=None):
    if startTime:
        print(f"Timer: Elapsed time for [{name}]: {datetime.now() - startTime}")
    else:
        startTime = datetime.now()
        print(f"Timer: Starting [{name}] at {startTime}")
        return startTime


def read_excel(path: str, sheet_id: int = 1) -> pd.DataFrame:
    # convert one sheet (1-based id) to CSV in an in-memory buffer, then parse it
    buffer = StringIO()
    Xlsx2csv(path, outputencoding="utf-8").convert(buffer, sheetid=sheet_id)
    buffer.seek(0)
    return pd.read_csv(buffer)


xlsxFileName = "MyBig.xlsx"
sheetName = "Sheet1"
csvFileName = "MyBig.csv"

startTime = timer(name="Import XLSX with Pandas")
df = pd.read_excel(xlsxFileName, sheet_name=sheetName)
timer("Import XLSX with Pandas", startTime)

startTime = timer(name="Convert XLSX to CSV first")
df = read_excel(path=xlsxFileName, sheet_name=sheetName)
timer("Convert XLSX to CSV first", startTime)

startTime = timer(name="Import CSV")
df = pd.read_csv(csvFileName)
timer("Import CSV", startTime)
Rusk answered 25/3, 2022 at 23:48 Comment(0)
4

python-calamine is about 10x faster than read_excel with openpyxl. As of pandas 2.2 it is also natively supported by pandas (see pandas doc).

Below is a sample with a 5.2MB Excel file:

file_path = "C:\Users\...\export_20240312_104943.xlsx"

start = time.time()
df = pd.read_excel(file_path, engine='openpyxl')
logging.info(f"Reading with openpyxl took {time.time() - start:.2f} seconds.")

start = time.time()
df = pd.read_excel(file_path, engine='calamine')
logging.info(f"Reading with calamine took {time.time() - start:.2f} seconds.")

2024-03-15 10:24:47 - INFO - Reading with openpyxl took 14.18 seconds.
2024-03-15 10:24:48 - INFO - Reading with calamine took 1.16 seconds.

And if pandas is older than 2.2, the fast standalone python_calamine package is still available:

import pandas as pd
from python_calamine import CalamineWorkbook

XLS_FILE = "test.xlsx"
SHEET_NAME = "Sheet1"

wb = CalamineWorkbook.from_path(XLS_FILE)
# to_python() returns the sheet as a list of rows; the first row holds the headers
row_list = wb.get_sheet_by_name(SHEET_NAME).to_python()
df = pd.DataFrame(row_list[1:], columns=row_list[0])
Nihon answered 15/3 at 9:26 Comment(0)
3

I know this is old, but in case anyone else is looking for an answer that doesn't involve VB: Pandas read_csv() is faster, but you don't need a VB script to get a CSV file.

Open your Excel file and save it as *.csv (comma-separated values) format.

Under Tools you can select Web Options, and under the Encoding tab you can change the encoding to whatever works for your data. I ended up using Windows (Western European) because Windows UTF encoding is "special", but there are lots of ways to accomplish the same thing. Then use the encoding argument in pd.read_csv() to specify your encoding.

Encoding options are listed here
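
For example, a minimal sketch, assuming the file was exported with the Windows Western European code page (cp1252) under a hypothetical name:

import pandas as pd

# cp1252 is the Windows Western European code page mentioned above
df = pd.read_csv("exported_from_excel.csv", encoding="cp1252")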

Pretended answered 15/1, 2019 at 18:29 Comment(0)
2

Had a similar problem recently, and solved it by using the library xlwings:

import xlwings as xw
import pandas as pd


def xw_excel_to_df(file_path, sheet_name=None, sheet_range=None):
    # start a hidden Excel instance so no window pops up
    app = xw.App(visible=False)
    book = xw.Book(file_path, read_only=True)

    if sheet_name is None:
        sheet_name = book.sheets[0].name

    if sheet_range is None:
        data = book.sheets[sheet_name]["A1"].expand().value
    else:
        data = book.sheets[sheet_name][sheet_range].value

    df = pd.DataFrame(data=data[1:], columns=data[0])

    book.close()
    app.quit()

    return df

Basically it opens an Excel instance locally, reads the data, loads it into a DataFrame, and closes Excel.

You must have Excel installed on your PC, and you may have to sort out some dependencies to use this approach on a remote machine.

Using it for some local analysis in a Jupyter notebook, by the way, worked perfectly for me! :)

Jamima answered 15/6, 2023 at 22:46 Comment(0)
0

If you can install LibreOffice in your environment, you can use it to convert Excel files to CSV from Python.

The sample code is below:

import os
import subprocess

def convert_excel_to_csv(input_file):
    exe_path = 'C:/Program Files/LibreOffice/program/soffice.exe'

    command = [
        exe_path,
        '--headless',                           # run LibreOffice without its GUI
        '--convert-to', 'csv',
        '--outdir', os.path.dirname(__file__),  # write the CSV next to this script
        input_file
    ]
    subprocess.run(command, check=True)
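
A hypothetical usage sketch: LibreOffice names the CSV after the workbook and, by default, converts only the first sheet.

import pandas as pd

convert_excel_to_csv('OTT_Data_All_stations.xlsx')
df = pd.read_csv('OTT_Data_All_stations.csv')  # output lands in this script's directory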
Peach answered 25/2 at 13:52 Comment(0)
-3

There's no reason to open Excel if you're willing to deal with the slow conversion once.

  1. Read the data into a DataFrame with pd.read_excel()
  2. Dump it to a CSV right away with DataFrame.to_csv() (see the sketch below)
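
A minimal sketch of that one-time conversion, reusing the file name from the question:

import pandas as pd

# read every sheet once (sheet_name=None returns a dict of DataFrames),
# then dump each to its own CSV for fast repeated reads later
sheets = pd.read_excel("OTT_Data_All_stations.xlsx", sheet_name=None)
for name, df in sheets.items():
    df.to_csv(f"{name}.csv", index=False)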

Avoid both Excel- and Windows-specific calls. In my case the one-time hit was worth the hassle. I got a ☕.

Townsley answered 2/9, 2019 at 14:36 Comment(0)
