How to convert OpenDocument spreadsheets to a pandas DataFrame?
Asked Answered
A

12

92

The Python library can read Excel spreadsheets and convert them to a pandas.DataFrame with pandas.read_excel(file) command. Under the hood, it uses xlrd library which does not support ods files.

Is there an equivalent of pandas.read_excel for ods files? If not, how can I do the same for an Open Document Formatted spreadsheet (ods file)? ODF is used by LibreOffice and OpenOffice.

Aurlie answered 24/7, 2013 at 13:9 Comment(3)
There has been a feature request on pandas github since 2012 here. I don't know why they haven't implemented it yet!Bounden
it is implemented as of August 2019 :)Orris
The highest voted answer perfectly solves the problem. Perhaps you should accept it?Saville
V
111

This is available natively in pandas 0.25. So long as you have odfpy installed (conda install odfpy OR pip install odfpy) you can do

pd.read_excel("the_document.ods", engine="odf")
Veilleux answered 4/7, 2019 at 3:7 Comment(6)
Don't know why this solution does not have an higher number of votes. Before to use pandas you just need to install odfpy with pip install odfpy.Epidiascope
Installed it using pip install odfpy Result is ImportError: Missing optional dependency 'odf'. Use pip or conda to install odf. Conda does not have this package. I believe the problems is that the package is using Python 2.7, which is near it's end of life.Endorse
It can also be installed with conda install odfpyImbalance
This works great for me after pip install odfpy. Is there a separate engine needed to write back to that Calc spreadsheet?Helicopter
@Helicopter the ability to write odf files was added later in pandas 1.1 ; see github.com/pandas-dev/pandas/pull/32911Veilleux
In more current pandas versions, there is no need to define the engine; it is automatically detected. pd.read_excel("file.ods") is enough.Suchlike
R
21

You can read ODF (Open Document Format .ods) documents in Python using the following modules:

Using ezodf, a simple ODS-to-DataFrame converter could look like this:

import pandas as pd
import ezodf

doc = ezodf.opendoc('some_odf_spreadsheet.ods')

print("Spreadsheet contains %d sheet(s)." % len(doc.sheets))
for sheet in doc.sheets:
    print("-"*40)
    print("   Sheet name : '%s'" % sheet.name)
    print("Size of Sheet : (rows=%d, cols=%d)" % (sheet.nrows(), sheet.ncols()) )

# convert the first sheet to a pandas.DataFrame
sheet = doc.sheets[0]
df_dict = {}
for i, row in enumerate(sheet.rows()):
    # row is a list of cells
    # assume the header is on the first row
    if i == 0:
        # columns as lists in a dictionary
        df_dict = {cell.value:[] for cell in row}
        # create index for the column headers
        col_index = {j:cell.value for j, cell in enumerate(row)}
        continue
    for j, cell in enumerate(row):
        # use header instead of column index
        df_dict[col_index[j]].append(cell.value)
# and convert to a DataFrame
df = pd.DataFrame(df_dict)

P.S.

  • ODF spreadsheet (*.ods files) support has been requested on the pandas issue tracker: https://github.com/pydata/pandas/issues/2311, but it is still not implemented.

  • ezodf was used in the unfinished PR9070 to implement ODF support in pandas. That PR is now closed (read the PR for a technical discussion), but it is still available as an experimental feature in this pandas fork.

  • there are also some brute force methods to read directly from the XML code (here)
Rubin answered 23/3, 2016 at 14:21 Comment(5)
Works very well. You should provide something like this as an external package (that depends on both ezodf and pandas) so that users can finally have a read_ods() function !Rosemonde
with this code python return an error ValueError: arrays must all be same length on line 37 df = pd.DataFrame(df_dict)Wriggle
I suppose your spreadsheet has merged cells which causes this error.Farmhand
@Ossarotte, I encountered the same error. Turned out it was because two header columns had the same name (so it counted the rows twice). I renamed one of the columns and the error was gone and column-row lenght are all the same.Aurthur
@Aurthur tanks.Wriggle
G
12

Here is a quick and dirty hack which uses ezodf module:

import pandas as pd
import ezodf

def read_ods(filename, sheet_no=0, header=0):
    tab = ezodf.opendoc(filename=filename).sheets[sheet_no]
    return pd.DataFrame({col[header].value:[x.value for x in col[header+1:]]
                         for col in tab.columns()})

Test:

In [92]: df = read_ods(filename='fn.ods')

In [93]: df
Out[93]:
     a    b    c
0  1.0  2.0  3.0
1  4.0  5.0  6.0
2  7.0  8.0  9.0

NOTES:

  • all other useful parameters like header, skiprows, index_col, parse_cols are NOT implemented in this function - feel free to update this question if you want to implement them
  • ezodf depends on lxml make sure you have it installed
Guardant answered 19/2, 2017 at 18:16 Comment(1)
I tried your code. It is awesome, except it adds two NaN rows at the end of dataframe. have you had similar experience?Bounden
H
7

pandas now supports .ods files. you must install the odfpy module first. then it will work like a normal .xls file.

conda install -c conda-forge odfpy

then

pd.read_excel('FILE_NAME.ods', engine='odf')
Herpetology answered 9/1, 2021 at 17:39 Comment(1)
or even just df = pd.read_excel ('sample.ods')Reconnaissance
Y
5

Edit: Happily, this answer below is now out of date, if you can update to a recent Pandas version. If you'd still like to work from a Pandas version of your data, and update it from ODS only when needed, read on.


It seems the answer is No! And I would characterize the tools to read in ODS still ragged. If you're on POSIX, maybe the strategy of exporting to xlsx on the fly before using Pandas' very nice importing tools for xlsx is an option:

unoconv -f xlsx -o tmp.xlsx myODSfile.ods 

Altogether, my code looks like:

import pandas as pd
import os
if fileOlderThan('tmp.xlsx','myODSfile.ods'):
    os.system('unoconv -f xlsx -o tmp.xlsx myODSfile.ods ')
xl_file = pd.ExcelFile('tmp.xlsx')
dfs = {sheet_name: xl_file.parse(sheet_name) 
          for sheet_name in xl_file.sheet_names}
df=dfs['Sheet1']

Here fileOlderThan() is a function (see http://github.com/cpbl/cpblUtilities) which returns true if tmp.xlsx does not exist or is older than the .ods file.

Yanyanaton answered 7/3, 2015 at 20:56 Comment(0)
A
3

Another option: read-ods-with-odfpy. This module takes an OpenDocument Spreadsheet as input, and returns a list, out of which a DataFrame can be created.

Aurlie answered 24/7, 2013 at 17:42 Comment(0)
H
3

If you only have a few .ods files to read, I would just open it in openoffice and save it as an excel file. If you have a lot of files, you could use the unoconv command in Linux to convert the .ods files to .xls programmatically (with bash)

Then it's really easy to read it in with pd.read_excel('filename.xls')

Harbaugh answered 1/8, 2017 at 19:51 Comment(0)
E
2

I've had good luck with pandas read_clipboard. Selecting cells and then copy from excel or opendocument. In python run the following.

import pandas as pd
data = pd.read_clipboard()

Pandas will do a good job based on the cells copied.

Expressive answered 24/8, 2018 at 3:58 Comment(0)
M
2

Some responses have pointed out that odfpy or other external packages are needed to get this functionality, but note that in recent versions of Pandas (current is 1.1, August-2020) there is support for ODS format in functions like pd.ExcelWriter() and pd.read_excel(). You only need to specify the propper engine "odf" to be able of working with OpenDocument file formats (.odf, .ods, .odt).

Mylonite answered 2/8, 2020 at 22:29 Comment(0)
A
1

Based heavily on the answer by davidovitch (thank you), I have put together a package that reads in a .ods file and returns a DataFrame. It's not a full implementation in pandas itself, such as his PR, but it provides a simple read_ods function that does the job.

You can install it with pip install pandas_ods_reader. It's also possible to specify whether the file contains a header row or not, and to specify custom column names.

Argo answered 31/1, 2019 at 22:41 Comment(0)
N
0

There is support for reading Excel files in Pandas (both xls and xlsx), see the read_excel command. You can use OpenOffice to save the spreadsheet as xlsx. The conversion can also be done automatically on the command line, apparently, using the convert-to command line parameter.

Reading the data from xlsx avoids some of the issues (date formats, number formats, unicode) that you may run into when you convert to CSV first.

Neckband answered 9/1, 2015 at 16:37 Comment(0)
S
-1

If possible, save as CSV from the spreadsheet application and then use pandas.read_csv(). IIRC, an 'ods' spreadsheet file actually is an XML file which also contains quite some formatting information. So, if it's about tabular data, extract this raw data first to an intermediate file (CSV, in this case), which you can then parse with other programs, such as Python/pandas.

Somerset answered 24/7, 2013 at 13:33 Comment(3)
Thanks. Would be good if there were something more direct, but I guess that's one possibility.Aurlie
There is nothing more direct than a file containing the raw data only. Such files must be in a certain file format. There are binary formats for this (such as NetCDF or HDF5), and ascii formats, such as CSV. Unfortunately, CSV is not a real standard. Nevertheless, CSV is pretty straight-forward to deal with in most situations.Somerset
It would be nice to have the .csv get updated if you modify the .ods but that's probably outside of the scope of the answer already...Georg

© 2022 - 2024 — McMap. All rights reserved.