Python pandas persistent cache

Is there an implementation for Python pandas that caches the data on disk so I can avoid reproducing it every time?

In particular, is there a caching method for get_yahoo_data for financial data?

A big plus would be:

  • very few lines of code to write
  • possibility to integrate the persisted series when new data is downloaded for the same source
Frasco answered 8/7, 2018 at 19:30 Comment(0)

There are many ways to achieve this, but probably the easiest is to use the built-in methods for writing and reading Python pickles. You can use pandas.DataFrame.to_pickle to store the DataFrame on disk and pandas.read_pickle to read the stored DataFrame back from disk.

An example for a pandas.DataFrame:

import pandas

# Store your DataFrame
df.to_pickle('cached_dataframe.pkl') # will be stored in current directory

# Read your DataFrame
df = pandas.read_pickle('cached_dataframe.pkl') # read from current directory

The same methods also work for pandas.Series:

# Store your Series
series.to_pickle('cached_series.pkl') # will be stored in current directory

# Read your Series
series = pandas.read_pickle('cached_series.pkl') # read from current directory
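
A small caching wrapper built on these two calls might look like this (the cache file name and the fetch function are placeholders, not part of pandas):

import os
import pandas

def get_data(fetch, cache_file='cached_dataframe.pkl'):
    # Return the cached DataFrame if present, otherwise fetch and cache it
    if os.path.exists(cache_file):
        return pandas.read_pickle(cache_file)
    df = fetch()  # e.g. your Yahoo Finance download
    df.to_pickle(cache_file)
    return df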
Jennine answered 8/7, 2018 at 21:25 Comment(0)

You could use the Data cache package.

from data_cache import pandas_cache

@pandas_cache
def foo():
    ...
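
A minimal sketch of how this could look (the function and data are invented for illustration; the decorator persists the returned DataFrame to disk and reuses it on subsequent calls with the same arguments):

import time
import pandas as pd
from data_cache import pandas_cache

@pandas_cache
def slow_query():
    time.sleep(5)  # stand-in for an expensive download or computation
    return pd.DataFrame({'price': [1.0, 2.0, 3.0]})

df = slow_query()  # first call: runs the function and caches the result
df = slow_query()  # later calls: read the cached DataFrame from disk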
Legerdemain answered 1/7, 2020 at 13:9 Comment(1)
Yeeeeeeeeeeeeeeeeees <3Embowel

Depending on your requirements, there are a dozen methods to do this, writing and reading in CSV, Excel, JSON, Python pickle format, HDF5, and even SQL with a database, etc.

In terms of code, writing and reading most of these formats is just one line in each direction. Python and pandas already make the code as clean as possible, so you can worry less about that.

I think there is no single solution to fit all requirements; it really is case by case (one-line examples for a few of these formats follow the list):

  • for human readability of saved data: CSV, Excel
  • for binary Python object serialization: Pickle
  • for data interchange: JSON
  • for long-term storage and incremental updates: SQL
  • etc.
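
For illustration, the one-line write/read pairs for a few of these formats might look like this (file names are arbitrary, and to_hdf/read_hdf additionally require the `tables` package):

import pandas as pd

df = pd.DataFrame({'close': [101.2, 102.5]})

df.to_csv('cache.csv', index=False)   # human-readable
df.to_json('cache.json')              # data interchange
df.to_hdf('cache.h5', key='prices')   # binary, fast for large frames

df = pd.read_csv('cache.csv')
df = pd.read_json('cache.json')
df = pd.read_hdf('cache.h5', key='prices')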

And if you want to update the stock prices daily and keep them for later use, I prefer pandas with SQL queries; of course, this adds a few lines of code to set up the DB connection:

import pandas as pd
from sqlalchemy import create_engine

new_data = getting_daily_price()  # your function that fetches the latest prices
# You can also choose other db drivers instead of `sqlalchemy`
engine = create_engine('sqlite:///cached_data.db')  # a file-backed DB persists across runs
with engine.connect() as conn:
    new_data.to_sql('table_name', conn)  # To Write
    df = pd.read_sql_table('table_name', conn)  # To Read
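
To address the asker's wish to integrate newly downloaded data into the persisted series, one possible pattern (the 'date' column and table name are illustrative assumptions) is to append the fresh rows and then de-duplicate:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///cached_data.db')  # file-backed, persists across runs

def update_cache(new_data, table='table_name'):
    # Append the fresh rows, then return the de-duplicated full series
    new_data.to_sql(table, engine, if_exists='append', index=False)
    df = pd.read_sql_table(table, engine)
    return df.drop_duplicates(subset='date', keep='last')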
Maxantia answered 9/7, 2018 at 0:53 Comment(3)
I use Python to avoid writing a lot of code; which solution would be the best and cheapest in terms of lines of code?Frasco
@LucaC. I just rewrote the answer according to your updated question. Hope this can be more helpful.Maxantia
Here is a comparison of speeds for several serialization methods. I think that JSON is the best choice here due to readability, speed, and security. benfrederickson.com/dont-pickle-your-dataCrossways
