how can I quickly convert in python an xlsx file into a csv file?
I have a 140 MB Excel file I need to analyze using pandas. The problem is that if I open this file as xlsx, it takes Python 5 minutes simply to read it. I tried manually saving this file as csv, and then it takes Python about a second to open and read it! There are various 2012-2014 solutions around, but they don't really work for me on Python 3.

Can somebody suggest how to convert very quickly file 'C:\master_file.xlsx' to 'C:\master_file.csv'?

Ohl answered 7/12, 2017 at 21:4 Comment(1)
github.com/dilshod/xlsx2csv ?Hypognathous

There is a project called "rows" that aims to be very pythonic about dealing with data. It relies on "openpyxl" for xlsx, though. I don't know whether this will be faster than Pandas, but anyway:

$ pip install rows openpyxl

And:

import rows
data = rows.import_from_xlsx("my_file.xlsx")
rows.export_to_csv(data, open("my_file.csv", "wb"))
Musket answered 7/12, 2017 at 21:38 Comment(3)
Loading everything into memory isn't really advisable here but since you went LGPL I can't code review.Simeon
Sorry, you just "can't code review LGPL" if it is against your religion or something like that. Nothing legally stops you from reviewing or contributing LGPL code; that is pure FUD. What LGPL obliges you to do is: if you use the software in a private project and modify it (not the linked parts, just the software proper), you have to publish back your modifications. "Code reviewing on GitHub" is already "published", so LGPL makes no difference. That said, the project is not mine.Musket
Yes, it is against my religion.Simeon

I faced the same problem as you. Pandas and openpyxl didn't work for me.

I came across this solution, and it worked great for me:

import win32com.client

xl = win32com.client.Dispatch("Excel.Application")
xl.DisplayAlerts = False
xl.Workbooks.Open(Filename=your_file_path, ReadOnly=1)
wb = xl.Workbooks(1)
wb.SaveAs(Filename='new_file.csv', FileFormat=6)  # 6 = xlCSV
wb.Close(False)
xl.Application.Quit()
wb = None
xl = None

Here you convert the file to csv by means of Excel itself. All the other ways I tried refused to work.

Antoniettaantonin answered 21/3, 2019 at 18:24 Comment(0)

Use read-only mode in openpyxl. Something like the following should work.

import csv
from openpyxl import load_workbook

wb = load_workbook("myfile.xlsx", read_only=True)
ws = wb['sheetname']
with open("myfile.csv", "w", newline="") as out:  # "w" with newline="", not "wb", on Python 3
    writer = csv.writer(out)
    for row in ws:
        values = (cell.value for cell in row)
        writer.writerow(values)
Simeon answered 8/12, 2017 at 8:28 Comment(12)
I don't know why, but the output of this code gives me a completely empty 0 KB csv fileOhl
Did you change the example "sheetname" accordingly, and does it point to a valid sheet? (Note that this kind of detailing and working around is exactly what "rows" aims to take off the way)Musket
Andrea, you should probably add some debugging code such as a counter to see whether the sheet has any rows. Please update the code in your initial question with as close as possible to what you are using. We cannot really help you otherwise.Simeon
Absolutely. So I have an Excel file called heavy_file2017-12-08.xlsx, and inside it has a sheet called "Vol_Summary". This is the only sheet I need to save as a new csv file. So I don't have an already existing csv file; it should be generated by the code.Ohl
wb = load_workbook("c:\\heavy_file2017-12-08.xlsx", read_only=True) ws = wb['Vol_Summary'] with open("c:\\heavy_file2017-12-08.csv", "wb") as out: writer = csv.writer(out) for row in ws: values = (cell.value for cell in row) writer.writerow(values)Ohl
I suspect the parameter "out" in writer = csv.writer might be what is causing the code not to work, but I'm not 100% sure.Ohl
I actually think the problem might be that in the xlsx file the data is in cells D15-BB30, so the first rows and leftmost columns are empty. Could that be the issue?Ohl
As noted previously, without more information we can't really help more than this.Simeon
I can tell you that I put a print inside the for loop and it doesn't print anything, meaning it doesn't loop through the rows.Ohl
also if I add print('\n max row: ',ws.max_row) after ws = wb['sheetname'] it shows 1...Ohl
Please update your question with your code. It makes no sense to include snippets of it in the comments. I have provided you with the bones of an answer but you continue to ask questions in a way that no one can really help.Simeon
you're right Charlie and I apologize for not using the right format. I'm a newbie here and I can't see how i can upload the code precisely as you did aboveOhl
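A minimal sketch of the situation discussed in the comments above: data sitting in a range such as D15:BB30 rather than starting at A1, exported with openpyxl's read-only mode via iter_rows. The file name, sheet name, and the small D15:E16 demo range are illustrative; for the real file you would widen min_row/min_col/max_row/max_col to cover the populated block.

```python
import csv
from openpyxl import Workbook, load_workbook

# Build a tiny demo workbook so the sketch is self-contained:
# data deliberately starts at D15, not A1.
wb = Workbook()
ws = wb.active
ws.title = "Vol_Summary"
ws["D15"] = "header"
ws["E15"] = 1
ws["D16"] = "row2"
ws["E16"] = 2
wb.save("demo.xlsx")

# Reopen lazily and export only the populated block.
wb = load_workbook("demo.xlsx", read_only=True)
ws = wb["Vol_Summary"]
with open("demo.csv", "w", newline="") as out:
    writer = csv.writer(out)
    # min_col=4 is column D; widen these bounds for a real D15:BB30 range.
    for row in ws.iter_rows(min_row=15, max_row=16, min_col=4, max_col=5):
        writer.writerow(cell.value for cell in row)
wb.close()
```

Iterating the whole sheet (`for row in ws`) starts at row 1, so a sheet whose leading rows are empty can look blank; bounding the iteration with iter_rows avoids that.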

Fastest way that pops to mind:

  1. pandas.read_excel
  2. pandas.DataFrame.to_csv

As an added benefit, you'll be able to do cleanup of the data before saving it to csv.

import pandas as pd

df = pd.read_excel(r'C:\master_file.xlsx', header=0)  # add sheet_name='<your sheet>' to pick one sheet
df.to_csv(r'C:\master_file.csv', index=False, quotechar="'")

At some point, dealing with lots of data will take lots of time. Just a fact of life. Good to look for options if it's a problem, though.

Symptom answered 7/12, 2017 at 21:8 Comment(6)
thank you man, appreciated your answer. Unfortunately, that's what I currently do, and the pd.read_excel really takes forever.Ohl
What are your system specs? Where are you pulling the data from (hdd, ssd, network file system, etc)? How many rows are there in your dataset? The code in my answer, on my system, processes 1.17 gb of data with around 10 million records in about 5 minutes. Since you're using the same approach, if I had to guess, I'd think that your bottleneck might be something besides the python code.Symptom
The problem with using Pandas to do this is that a dataframe is column-based wheareas both Excel and CSV are row-based. This means that all values must be loaded into memory before a conversion can happen and, hence, that Pandas is unsuitable for this task.Simeon
@CharlieClark thank you. I don't need to use pandas in this case. do you think there's a different python solution to perform the conversion?Ohl
@Symptom thank you - I have a batch file, size 140MB, with 6 sheets inside, each with 8 rows and 300 columns. How can I check what else could be the bottleneck?Ohl
@Andrea, especially with that setup, I think Charlie is right. It's been my observation that any time you have to work with sheets, though, it really kills performance.Symptom
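Since the file discussed in the comments above has several sheets, a hedged sketch of the multi-sheet case may help: passing sheet_name=None to pd.read_excel returns a dict of DataFrames, one per sheet, each of which can be written to its own CSV. The file and sheet names below are illustrative, and the demo builds its own small workbook so it is self-contained.

```python
import pandas as pd

# Build a small two-sheet demo file (stand-in for the real workbook).
with pd.ExcelWriter("demo_multi.xlsx") as xl:
    pd.DataFrame({"a": [1, 2]}).to_excel(xl, sheet_name="one", index=False)
    pd.DataFrame({"b": [3]}).to_excel(xl, sheet_name="two", index=False)

# sheet_name=None reads every sheet into a dict of {sheet name: DataFrame}.
sheets = pd.read_excel("demo_multi.xlsx", sheet_name=None)
for name, df in sheets.items():
    df.to_csv(f"demo_{name}.csv", index=False)
```

This writes one CSV per sheet (demo_one.csv, demo_two.csv), which matches the common expectation that each CSV holds a single table.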

© 2022 - 2024 — McMap. All rights reserved.