how can I quickly convert in python an xlsx file into a csv file?
I have a 140 MB Excel file I need to analyze using pandas. The problem is that if I open this file as xlsx, it takes Python 5 minutes simply to read it. I tried manually saving this file as csv, and then it takes Python about a second to open and read it! There are various 2012-2014 solutions around, but they don't really work for me on Python 3.

Can somebody suggest how to convert very quickly file 'C:\master_file.xlsx' to 'C:\master_file.csv'?

Ohl answered 7/12, 2017 at 21:4 Comment(1)
github.com/dilshod/xlsx2csv ?Hypognathous

There is a project called "rows" that aims to be very pythonic about dealing with data. It relies on "openpyxl" for xlsx, though. I don't know whether this will be faster than Pandas, but anyway:

$ pip install rows openpyxl

And:

import rows
data = rows.import_from_xlsx("my_file.xlsx")
rows.export_to_csv(data, open("my_file.csv", "wb"))
Musket answered 7/12, 2017 at 21:38 Comment(3)
Loading everything into memory isn't really advisable here but since you went LGPL I can't code review.Simeon
Sorry, you just "can't code review LGPL" if it is against your religion or something like that. Nothing legally stops you from reviewing or contributing LGPL code; that is pure FUD. What LGPL obliges you to do is: if you use the software in a private project and modify it (not the linked parts, just the software proper), you have to publish back your modifications. "Code reviewing on GitHub" is already "published", so LGPL makes no difference. That said, the project is not mine.Musket
Yes, it is against my religion.Simeon

I faced the same problem as you. Pandas and openpyxl didn't work for me.

I came across this solution, and it worked great for me:

import win32com.client

xl = win32com.client.Dispatch("Excel.Application")
xl.DisplayAlerts = False
xl.Workbooks.Open(Filename=your_file_path, ReadOnly=1)
wb = xl.Workbooks(1)
wb.SaveAs(Filename='new_file.csv', FileFormat=6)  # 6 = xlCSV
wb.Close(False)
xl.Application.Quit()
wb = None
xl = None

Here you convert the file to csv by means of Excel itself. All the other ways I tried refused to work.

Antoniettaantonin answered 21/3, 2019 at 18:24 Comment(0)

Use read-only mode in openpyxl. Something like the following should work.

import csv
from openpyxl import load_workbook

wb = load_workbook("myfile.xlsx", read_only=True)
ws = wb['sheetname']
with open("myfile.csv", "w", newline="") as out:  # "w" with newline="", not "wb", on Python 3
    writer = csv.writer(out)
    for row in ws:
        values = (cell.value for cell in row)
        writer.writerow(values)
Simeon answered 8/12, 2017 at 8:28 Comment(12)
I don't know why, but the output of this code gives me a completely empty 0 KB csv fileOhl
Did you change the example "sheetname" accordingly, and does it point to a valid sheet? (Note that this kind of detailing and working around is exactly what "rows" aims to take off the way)Musket
Andrea, you should probably add some debugging code such as a counter to see whether the sheet has any rows. Please update the code in your initial question with as close as possible to what you are using. We cannot really help you otherwise.Simeon
Absolutely. So I have an Excel file called heavy_file2017-12-08.xlsx, and inside it has a sheet called "Vol_Summary". This is the only sheet I need to save as a new csv file. So I don't have an already existing csv file; it should be generated by the code.Ohl
wb = load_workbook("c:\\heavy_file2017-12-08.xlsx", read_only=True) ws = wb['Vol_Summary'] with open("c:\\heavy_file2017-12-08.csv", "wb") as out: writer = csv.writer(out) for row in ws: values = (cell.value for cell in row) writer.writerow(values)Ohl
I suspect the parameter "out" in writer = csv.writer might be what is causing the code not to work, but I'm not 100% sure.Ohl
I actually think the problem might be that in the xlsx file the data is in cells D15-BB30, so the first rows and leftmost columns are empty. Could that be the issue?Ohl
As noted previously, without more information we can't really help more than this.Simeon
I can tell you that I put a print inside the for loop and it doesn't print anything, meaning it doesn't loop through the rows.Ohl
also if I add print('\n max row: ',ws.max_row) after ws = wb['sheetname'] it shows 1...Ohl
Please update your question with your code. It makes no sense to include snippets of it in the comments. I have provided you with the bones of an answer but you continue to ask questions in a way that no one can really help.Simeon
you're right Charlie and I apologize for not using the right format. I'm a newbie here and I can't see how i can upload the code precisely as you did aboveOhl
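A minimal sketch of the situation discussed in the comments above: data sitting in a range such as D15:BB30 rather than starting at A1, exported with openpyxl's read-only mode via iter_rows. The file name, sheet name, and the small D15:E16 demo range are illustrative; for the real file you would widen min_row/min_col/max_row/max_col to cover the populated block.

```python
import csv
from openpyxl import Workbook, load_workbook

# Build a tiny demo workbook so the sketch is self-contained:
# data deliberately starts at D15, not A1.
wb = Workbook()
ws = wb.active
ws.title = "Vol_Summary"
ws["D15"] = "header"
ws["E15"] = 1
ws["D16"] = "row2"
ws["E16"] = 2
wb.save("demo.xlsx")

# Reopen lazily and export only the populated block.
wb = load_workbook("demo.xlsx", read_only=True)
ws = wb["Vol_Summary"]
with open("demo.csv", "w", newline="") as out:
    writer = csv.writer(out)
    # min_col=4 is column D; widen these bounds for a real D15:BB30 range.
    for row in ws.iter_rows(min_row=15, max_row=16, min_col=4, max_col=5):
        writer.writerow(cell.value for cell in row)
wb.close()
```

Iterating the whole sheet (`for row in ws`) starts at row 1, so a sheet whose leading rows are empty can look blank; bounding the iteration with iter_rows avoids that.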

Fastest way that pops to mind:

  1. pandas.read_excel
  2. pandas.DataFrame.to_csv

As an added benefit, you'll be able to do cleanup of the data before saving it to csv.

import pandas as pd

df = pd.read_excel(r'C:\master_file.xlsx', header=0)  # add sheet_name='<your sheet>' to pick one sheet
df.to_csv(r'C:\master_file.csv', index=False, quotechar="'")

At some point, dealing with lots of data will take lots of time. Just a fact of life. Good to look for options if it's a problem, though.

Symptom answered 7/12, 2017 at 21:8 Comment(6)
thank you man, appreciated your answer. Unfortunately, that's what I currently do, and the pd.read_excel really takes forever.Ohl
What are your system specs? Where are you pulling the data from (hdd, ssd, network file system, etc)? How many rows are there in your dataset? The code in my answer, on my system, processes 1.17 gb of data with around 10 million records in about 5 minutes. Since you're using the same approach, if I had to guess, I'd think that your bottleneck might be something besides the python code.Symptom
The problem with using Pandas to do this is that a dataframe is column-based wheareas both Excel and CSV are row-based. This means that all values must be loaded into memory before a conversion can happen and, hence, that Pandas is unsuitable for this task.Simeon
@CharlieClark thank you. I don't need to use pandas in this case. do you think there's a different python solution to perform the conversion?Ohl
@Symptom thank you - I have a batch file, size 140MB, with 6 sheets inside, each with 8 rows and 300 columns. How can I check what else could be the bottleneck?Ohl
@Andrea, especially with that setup, I think Charlie is right. It's been my observation that any time you have to work with sheets, though, it really kills performance.Symptom
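Since the file discussed in the comments above has several sheets, a hedged sketch of the multi-sheet case may help: passing sheet_name=None to pd.read_excel returns a dict of DataFrames, one per sheet, each of which can be written to its own CSV. The file and sheet names below are illustrative, and the demo builds its own small workbook so it is self-contained.

```python
import pandas as pd

# Build a small two-sheet demo file (stand-in for the real workbook).
with pd.ExcelWriter("demo_multi.xlsx") as xl:
    pd.DataFrame({"a": [1, 2]}).to_excel(xl, sheet_name="one", index=False)
    pd.DataFrame({"b": [3]}).to_excel(xl, sheet_name="two", index=False)

# sheet_name=None reads every sheet into a dict of {sheet name: DataFrame}.
sheets = pd.read_excel("demo_multi.xlsx", sheet_name=None)
for name, df in sheets.items():
    df.to_csv(f"demo_{name}.csv", index=False)
```

This writes one CSV per sheet (demo_one.csv, demo_two.csv), which matches the common expectation that each CSV holds a single table.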

© 2022 - 2024 — McMap. All rights reserved.