My Pandas data frame consists of Tweets and meta data of each tweet (300.000 rows). Some of my colleagues need to work with this data in Excel which is why I need to export it.
I wanted to use either .to_csv
or .to_excel
which are both provided by Pandas but I can't get it to work properly.
When I use .to_csv
my problem is that it keeps failing in the text part of the data frame. I've played around with different separators but the file is never 100% aligned. The text column seems to contain tabs, pipe characters etc. which confuses Excel.
df.to_csv('test.csv', sep='\t', encoding='utf-8')
When I try to use .to_excel
together with the xlsxwriter
engine I'm confronted with a different problem, which is that my text column contains to many URLs (I think). xlswriter
tries to make special clickable links of these URLs instead of just handling them as strings. I've found some information on how to circumvent this but, again, I can't get it to work.
The following bit of code should be used to disable the function that I think is causing trouble:
workbook = xlsxwriter.Workbook(filename, {'strings_to_urls': False})
However, when using to_excel
I can't seem to adjust this setting of the Workbook object before I load the data frame into the Excel file.
In short how do I export a column with wildly varying text from a Pandas data frame to something that Excel understands?
edit: example:
@geertwilderspvv @telegraaf ach Wilders toch, nep-voorzitter van een nep-partij met maar één lid, \nzeur niet over nep-premier of parlement!
So in this case It is obviously a line brake that is my data. I will try to find some more examples.
edit2:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<recoveryLog xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"><logFileName>error047600_01.xml</logFileName><summary>Er zijn fouten aangetroffen in bestand C:\Users\Guy Mahieu\Desktop\Vu ipython notebook\pandas_simple.xlsx</summary><removedRecords summary="Hier volgt een lijst van verwijderde records:"><removedRecord>Verwijderde records: Formule van het onderdeel /xl/worksheets/sheet1.xml</removedRecord></removedRecords></recoveryLog>
Translation of Dutch stuff:
Errors were found in "file". Here follows a list of removed records: removed records: formula of the part /xl/worksheets/sheet1.xml