Export Pandas data frame with text column containg utf-8 text and URLs to Excel
Asked Answered
P

2

5

My Pandas data frame consists of Tweets and meta data of each tweet (300.000 rows). Some of my colleagues need to work with this data in Excel which is why I need to export it.

I wanted to use either .to_csv or .to_excel which are both provided by Pandas but I can't get it to work properly.

When I use .to_csv my problem is that it keeps failing in the text part of the data frame. I've played around with different separators but the file is never 100% aligned. The text column seems to contain tabs, pipe characters etc. which confuses Excel.

df.to_csv('test.csv', sep='\t', encoding='utf-8')

When I try to use .to_excel together with the xlsxwriter engine I'm confronted with a different problem, which is that my text column contains to many URLs (I think). xlswriter tries to make special clickable links of these URLs instead of just handling them as strings. I've found some information on how to circumvent this but, again, I can't get it to work.

The following bit of code should be used to disable the function that I think is causing trouble:

workbook = xlsxwriter.Workbook(filename, {'strings_to_urls': False})

However, when using to_excel I can't seem to adjust this setting of the Workbook object before I load the data frame into the Excel file.

In short how do I export a column with wildly varying text from a Pandas data frame to something that Excel understands?

edit: example:

@geertwilderspvv @telegraaf ach Wilders toch, nep-voorzitter van een nep-partij met maar één lid, \nzeur niet over nep-premier of parlement!

So in this case It is obviously a line brake that is my data. I will try to find some more examples.

edit2:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<recoveryLog xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"><logFileName>error047600_01.xml</logFileName><summary>Er zijn fouten aangetroffen in bestand C:\Users\Guy Mahieu\Desktop\Vu ipython notebook\pandas_simple.xlsx</summary><removedRecords summary="Hier volgt een lijst van verwijderde records:"><removedRecord>Verwijderde records: Formule van het onderdeel /xl/worksheets/sheet1.xml</removedRecord></removedRecords></recoveryLog>

Translation of Dutch stuff:

Errors were found in "file". Here follows a list of removed records: removed records: formula of the part /xl/worksheets/sheet1.xml

Palma answered 23/11, 2015 at 15:57 Comment(2)
can you give us a sample of the df that doesn't work?Miliaria
@JamesTobin I've added an example, I will add more if needed.Palma
G
8

I don't think it is currently possible to pass XlsxWriter constructor options via the Pandas API but you can workaround the strings_to_url issue as follows:

import pandas as pd

df = pd.DataFrame({'Data': ['http://python.org']})

# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('pandas_simple.xlsx', engine='xlsxwriter')

# Don't convert url-like strings to urls.
writer.book.strings_to_urls = False

# Convert the dataframe to an XlsxWriter Excel object.
df.to_excel(writer, sheet_name='Sheet1')

# Close the Pandas Excel writer and output the Excel file.
writer.save()

Update: In recent version of Pandas you can pass XlsxWriter constructor options to ExcelWriter() directly and you do not need to set writer.book.strings_to_urls indirectly:

writer = pd.ExcelWriter('pandas_example.xlsx',
                        engine='xlsxwriter',
                        engine_kwargs={'options': {'strings_to_urls': False}})

See also Passing XlsxWriter constructor options to Pandas

Garges answered 23/11, 2015 at 16:36 Comment(7)
The code works so that is a great start. When I open the Excel file I get a warning that it needs to be recovered. I've put the log file in my post for better formatting.Palma
You get a warning when you open the file created by my example?Garges
Yes, that is what I mean.Palma
I don't see any warning when I open the file and there really isn't any reason that there should be for such a simple test case. Also in relation to the warning you added above there isn't any "formula" part in /xl/worksheets/sheet1.xml. Did you add a formula or other data to the dataframe in the example?Garges
Well I was testing it now on my full data frame. I've not added any formulas. To me the log warning seems empty. Apparently it is a known issue/bug: xlsxwriter.readthedocs.org/bugs.html I will look into it further. You have been a great help!!Palma
Perhaps there is some data in the dataframe that is being interpreted as a formula. Try adding the following at the same point as the other option to see if it makes a difference: writer.book.strings_to_formulas= False.Garges
As of Pandas 2.1.1 it should be with pd.ExcelWriter(full_path, engine='xlsxwriter', engine_kwargs={'options': {'strings_to_urls': False}}) as writer: outdf.to_excel(writer, sheet_name='monthly_data', index=False, header=True) Chewning
B
-1
 writer = pd.ExcelWriter(report_file, engine='xlsxwriter', options={'strings_to_urls': False,
                                                                       'strings_to_formulas': False})
Blowfly answered 30/10, 2017 at 13:14 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.