What format to export pandas dataframe while retaining data types? Not CSV; Sqlite? Parquet?

My workflow typically involves loading some data, typically from CSV files, into a pandas dataframe, cleansing it, defining what the right data type for each column is, then exporting it to a SQL server.

For those situations when a SQL server is not available, what are good alternatives to store the cleansed data and the explicit definition of the data type for each column?

  • The only real solution I have tested is to export to a sqlite .db file, using the answer here to make sure dates are read back as dates (a sketch of that round trip is below, after this list).
  • How about Feather, HDF5, Parquet? Pandas supports them, but I don't know much about these formats. I have read that Feather is not recommended for long-term storage (because the API may change? It's not clear to me).

  • I am not sure about using pickle: I understand it's not a secure format, and the API keeps changing and breaking backwards compatibility

  • CSV is not really an option because inferring data types on my data is often a nightmare; when reading the data back into pandas, I'd need to explicitly declare the formats, including the date format, otherwise:

    • pandas can create columns where one row is dd-mm-yyyy and another row is mm-dd-yyyy (see here).
    • Plus, I have many text columns where the first 10k rows seem to be numbers and the next 100 are text, so most software will infer the column is numeric and then fail on the import. Maybe I'd need to create a function which exports an ancillary file with all the data type definitions, date formats, etc.? Feasible but cumbersome.
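
For reference, this is roughly the sqlite round trip I have been doing; a minimal sketch, where 'cleansed.db', 'my_table' and 'my_date_column' are placeholder names, and the date handling relies on sqlite3's detect_types (as in the linked answer) plus parse_dates in read_sql as a fallback:

# export: write the cleansed dataframe to a sqlite .db file
import sqlite3
import pandas as pd

con = sqlite3.connect('cleansed.db', detect_types=sqlite3.PARSE_DECLTYPES)
df.to_sql('my_table', con, index=False, if_exists='replace')
con.close()

# import: read it back; parse_dates makes sure the date columns come back as datetime64
con = sqlite3.connect('cleansed.db', detect_types=sqlite3.PARSE_DECLTYPES)
df2 = pd.read_sql('SELECT * FROM my_table', con, parse_dates=['my_date_column'])
con.close()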

UPDATE: This is an interesting comparison, according to which HDF5 was the fastest format: https://medium.com/@bobhaffner/gist-to-medium-test-db3d51b8ba7b

I seem to understand that another difference between HDF5 and Parquet is that datetime64 has no direct equivalent in HDF5. Most people seem to store their dates in HDF5 as ISO-date-formatted (yyyy-mm-dd) strings.
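
By contrast, a Parquet round trip seems to keep the pandas dtypes, including datetime64, with no ancillary file; a minimal sketch, assuming pyarrow is installed and 'data.parquet' as a placeholder file name:

# export: the pandas dtypes (including datetime64) are stored in the Parquet metadata
import pandas as pd

df.to_parquet('data.parquet', engine='pyarrow')

# import: the dtypes come back without re-declaring them
df2 = pd.read_parquet('data.parquet', engine='pyarrow')
print(df2.dtypes)
# note: object columns that mix numbers and text may need to be cast to str first,
# as pyarrow can refuse to write mixed-type columns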

Pothook answered 25/3, 2019 at 17:27 Comment(4)
How big is your data?Beetner
Not "big data" territory! In 80-85% of the cases I deal with tables which are not huge: 10 to 100MBs. In 15-20% of the case I deal with tables in the 100 MB to 1 GB range. I have, so far, never dealt with tables > 1 GB. I'm talking about the sze of uncompressed CSVs.Pothook
I think you will have the best luck with xlsx, since it will mostly retain the data types in the broad sense of numeric, text and dates. But to be 100% sure, you would need a SQL server, which is not an option for you.Beetner
Apart from the fact that reading xlsx is much slower than reading most other formats into pandas, what would be the advantages of using xlsx over a sqlite .db file, HDF5 or Parquet? You cannot define data types in Excel, which is a deal breaker for me. I have long numbers (> 16 digits) which Excel cannot handle, so it chops off the last digits and converts them to zeros. Gene names are reformatted as dates (look it up). All these things are dealbreakers and make xlsx unacceptable for me.Pothook

If you really want to avoid pickle and saving a CSV (I don't fully agree with your statements about those not being feasible options) then you could run a local database server to save the data in and do a dump/restore process when the SQL server is available again. Otherwise:

Use the to_pickle method of the DataFrame object.

Or, save a JSON file with your data types, and specify your date format when saving the CSV:

# export
import json
import pandas as pd

# save the dtype of each column as a JSON "key" file alongside the CSV
data_types = df.dtypes.astype(str).to_dict()
with open('data_type_key.json', 'w') as f:
    json.dump(data_types, f)
df.to_csv('data.csv', date_format='%Y%m%d')

# import
with open('data_type_key.json') as f:
    data_types = json.load(f)
# datetime columns can't be passed via dtype; hand them to parse_dates instead
date_cols = [c for c, t in data_types.items() if t.startswith('datetime')]
dtypes = {c: t for c, t in data_types.items() if not t.startswith('datetime')}
data_frame = pd.read_csv('data.csv', dtype=dtypes, parse_dates=date_cols)
Nacreous answered 25/3, 2019 at 18:0 Comment(7)
Did you read the question? He says CSV is not an option since he has to specify the dtypes manually.Beetner
Like I said very clearly above, I am not convinced by pickle because it may be potentially unsafe but mostly because it is not recommended for long-term storage, as backward-compatibility is not recommended.Pothook
It seems you didn't really read my question. I also explained why using CSVs (at least in the way you describe) wouldn't work for mePothook
@Pythonistaanonymous pickle is only unsafe if you are loading data from untrusted sources, because it can run arbitrary Python code. If that isn't an issue, you may as well say "I can't use Python source code because it is unsafe". The major compatibility issues with pickle are mostly Python 2 vs 3 issues. But you can always explicitly fix a pickle protocol, and as long as you aren't trying to make it compatible across Python 2 and 3, there shouldn't be an issue.Loughlin
OK, fair enough, that part was stupid of me. But the fact remains that pickle is not recommended for long-term storage, and that there have been cases of incompatibility (dataframes exported with one version of pandas that couldn't be read by another version).Pothook
@Pythonistaanonymous you can use DataFrame.dtypes to create a series of data types and save that as JSON as a key for reloading a CSV, if you are concerned about there being too many columns to explicitly state the dtypes.Dinger
@chet, your example is actually quite interesting - thanks. I hadn't thought of that. I'd need to do some testing to compare the performance vs sqlite and parquet on my data.Pothook

If your data is a 2-dimensional table and is meant for big-data processing like Apache Spark, use Parquet. HDF5 is not good at handling date/time, as you mentioned.

If your data has 3 or more dimensions, HDF5 will be a good choice - especially for long-term archiving, portability, and sharing.

Feather (built on Apache Arrow) is the fastest if performance matters.
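
For completeness, the Feather round trip is just as short; a minimal sketch, assuming pyarrow is installed and 'data.feather' as a placeholder file name:

# export: Feather also keeps the pandas dtypes, including datetime64
import pandas as pd

df.to_feather('data.feather')  # older pandas versions require a default RangeIndex

# import
df2 = pd.read_feather('data.feather')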

Pinstripe answered 3/4, 2019 at 4:25 Comment(3)
I have seen many comments that Parquet would be better than feather for long-term storage, but it's not really clear to me why. #48083905Pothook
Also, with none of these formats is it particularly easy to read the data in a Windows app, or to import it into a SQL server skipping Python altogether: #50933929Pothook
Parquet takes up a third to a half of the space of the equivalent Feather file. That's the only difference I've noticed related to storage.Simultaneous
