I encourage you to do the comparison yourself and see which approach is appropriate in your situation.
For instance, if you are processing a lot of XLSX files and are only going to ever read each one once, you may not want to worry about the CSV conversion. However, if you are going to read the CSVs over and over again, then I would highly recommend saving each of the worksheets in the workbook to a csv once, then read them repeatedly using pd.read_csv()
.
Below is a simple script that will let you compare Importing XLSX Directly
, Converting XLSX to CSV in memory
, and Importing CSV
. It is based on Jing Xue's answer.
Spoiler alert: If you are going to read the file(s) multiple times, it's going to be faster to convert the XLSX to CSV.
I did some testing with some files I'm working on are here are my results:
5,874 KB xlsx file (29,415 rows, 58 columns)
Elapsed time for [Import XLSX with Pandas]: 0:00:31.75
Elapsed time for [Convert XLSX to CSV in mem]: 0:00:22.19
Elapsed time for [Import CSV file]: 0:00:00.21
********************
202,782 KB xlsx file (990,832 rows, 58 columns)
Elapsed time for [Import XLSX with Pandas]: 0:17:04.31
Elapsed time for [Convert XLSX to CSV in mem]: 0:12:11.74
Elapsed time for [Import CSV file]: 0:00:07.11
YES! the 202MB file really did take only 7 seconds compared to 17 minutes for the XLSX!!!
If you're ready to set up your own test, just open you XLSX in Excel and save one of the worksheets to CSV. For a final solution, you would obviously need to loop through the worksheets to process each one.
You will also need to pip install rich pandas xlsx2csv
.
from rich import print
import pandas as pd
from datetime import datetime
from xlsx2csv import Xlsx2csv
from io import StringIO
def timer(name, startTime = None):
if startTime:
print(f"Timer: Elapsed time for [{name}]: {datetime.now() - startTime}")
else:
startTime = datetime.now()
print(f"Timer: Starting [{name}] at {startTime}")
return startTime
def read_excel(path: str, sheet_name: str) -> pd.DataFrame:
buffer = StringIO()
Xlsx2csv(path, outputencoding="utf-8", sheet_name=sheet_name).convert(buffer)
buffer.seek(0)
df = pd.read_csv(buffer)
return df
xlsxFileName = "MyBig.xlsx"
sheetName = "Sheet1"
csvFileName = "MyBig.csv"
startTime = timer(name="Import XLSX with Pandas")
df = pd.read_excel(xlsxFileName, sheet_name=sheetName)
timer("Import XLSX with Pandas", startTime)
startTime = timer(name="Convert XLSX to CSV first")
df = read_excel(path=xlsxFileName, sheet_name=sheetName)
timer("Convert XLSX to CSV first", startTime)
startTime = timer(name="Import CSV")
df = pd.read_csv(csvFileName)
timer("Import CSV", startTime)
ExcelFile.parse
– Prepositive