I'm working with IoT data from a source that sends many gigabytes of sparse sensor readings. To take snapshots for analysis, I try to export them to a small file and read them back later as a sparse pandas DataFrame to keep memory usage low.
With pandas v1.0 things changed a lot and I'm stuck :-(
Below is my test workflow:
- Create some dummy sparse IoT data with different data types (int, float, string as a category, string, date)
- Convert it to a pandas DataFrame (works fine)
- Save it to disk, to compare against the sparse version --> errors
- Convert the DataFrame to a sparse DataFrame --> error; this worked before v1.0
- Save that to disk as well --> errors
Questions:
- Who knows what is wrong here, and how do I fix it?
- Is my approach to building a sparse DataFrame correct in the new pandas way?
Thanks a lot!
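For reference, here is the minimal version of what I mean by "coming to a sparse DataFrame" (toy frame; the column name and fill value are just illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame: one mostly-empty float column (name 'D' is illustrative).
df = pd.DataFrame({"D": [1.0, np.nan, np.nan, 4.0]})

# Convert to a sparse column; NaN is the fill value that isn't stored.
sparse = df.astype(pd.SparseDtype("float", np.nan))

print(sparse.dtypes["D"])     # Sparse[float64, nan]
print(sparse.sparse.density)  # 0.5 -> half the cells hold real values
```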
First, create some sparse input data
import datetime
import sys
import time
import random
import pandas as pd
from IPython.display import display, HTML
input_data = list()
# create the input for a sparse matrix with different dtypes
columns = list(map(chr, range(ord('F'), ord('Z') - 1)))
category = ['Category A', 'Category B', 'Category C']
random.seed('dsgsdf')
chunk_size = 100 * 1000 # for testing bigger or smaller data sets
for row in range(1 * chunk_size):
    r = dict()
    r['A'] = row
    if random.randint(0, 9) >= 3:
        r['B'] = str(datetime.datetime.now())  # make the datetime conversion a bit harder
    if random.randint(0, 9) >= 5:
        r['C'] = category[random.randint(0, len(category) - 1)]
    if random.randint(0, 9) >= 9:
        r['D'] = random.randint(0, 1000)
    if random.randint(0, 9) >= 9:
        r['E'] = pd.util.testing.rands(4)  # random string = no category
    r[columns[random.randint(0, len(columns) - 1)]] = float(random.randint(0, 1000) / 2)
    input_data.append(r)
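To sanity-check that the generated rows really are sparse, I look at the overall fill rate. A self-contained toy version (the three dicts stand in for input_data):

```python
import pandas as pd

# Three rows with mostly-missing fields, standing in for input_data.
rows = [{"A": 0, "D": 5.0}, {"A": 1, "C": "x"}, {"A": 2}]
df = pd.DataFrame(rows)

# Fraction of populated cells across the whole frame.
fill_rate = df.notna().mean().mean()
print(round(float(fill_rate), 2))  # 0.56 -> most cells are NaN
```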
Convert to a dense pandas DataFrame
dense_df = pd.DataFrame(input_data)
dense_df = dense_df.reindex(sorted(dense_df.columns), axis=1) # sort by column name
dense_df['B'] = pd.to_datetime(dense_df['B'], format='%Y-%m-%d %H:%M:%S.%f')
dense_df['C'] = dense_df['C'].astype('category') # strings as category
dense_df = dense_df.convert_dtypes()  # the export works without this line, but with it memory usage is lower
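What convert_dtypes does here, shown on a toy frame (column names are illustrative): float columns holding only integral values become the nullable Int64 dtype, and object columns of strings become the string dtype:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"D": [1.0, np.nan, 3.0], "E": ["ab", None, "cd"]})

converted = df.convert_dtypes()
print(df.dtypes.to_dict())         # {'D': float64, 'E': object}
print(converted.dtypes.to_dict())  # {'D': Int64, 'E': string}
```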
Save the dense DataFrame to disk in different formats. This fails once .convert_dtypes() has been applied:
dense_df.to_hdf("data/dense-data-c0.h5", key='my', complevel=0, mode='w', format='table')
dense_df.to_hdf("data/dense-data-c9.h5", key='my', complevel=9, mode='w', format='table')
dense_df.to_pickle("data/dense-data.pkl.zip")
dense_df.to_parquet("data/dense-data.parquet.gzip", compression="gzip", allow_truncated_timestamps=True)
dense_df.to_parquet("data/dense-data.parquet", allow_truncated_timestamps=True)
dense_df.to_pickle("data/dense-data.pkl")
dense_df.to_json("data/dense-data.json")
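To narrow down which writer chokes on which dtype, I probe them one by one on a tiny frame using the nullable dtypes that convert_dtypes produces (frame and paths are illustrative; parquet/HDF need pyarrow/tables installed, and missing libraries also show up as a caught exception):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    "A": pd.array([1, None], dtype="Int64"),
    "E": pd.array(["x", None], dtype="string"),
})

with tempfile.TemporaryDirectory() as tmp:
    writers = [
        ("pickle", lambda p: df.to_pickle(p)),
        ("parquet", lambda p: df.to_parquet(p)),
        ("hdf", lambda p: df.to_hdf(p, key="my", format="table")),
    ]
    for name, write in writers:
        try:
            write(os.path.join(tmp, name))
            print(name, "ok")
        except Exception as exc:
            print(name, "failed:", type(exc).__name__)
```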
Convert to a sparse DataFrame to save memory. This worked before pandas v1.0:
import numpy as np
start = time.time()
sparse_df = dense_df.copy()
# define some sparse dtypes
dtype_float = pd.SparseDtype('float')
dtype_int = pd.SparseDtype('int')
dtype_str = pd.SparseDtype('string')
dtype_datetime = pd.SparseDtype('datetime64')
sparse_df['B'] = sparse_df['B'].astype(dtype_datetime)
sparse_df['E'] = sparse_df['E'].astype(dtype_str)
sparse_df['D'] = sparse_df['D'].astype(dtype_int)
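The memory win that motivates all this, shown on a toy Series (exact byte counts depend on the pandas version, so only the ratio matters):

```python
import numpy as np
import pandas as pd

# 1000 values, only two of them real.
s = pd.Series([np.nan] * 998 + [1.0, 2.0])

dense_bytes = s.memory_usage(deep=True)
sparse_bytes = s.astype(pd.SparseDtype("float", np.nan)).memory_usage(deep=True)

print(dense_bytes, sparse_bytes)  # sparse is a small fraction of dense
```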
Save the sparse DataFrame to disk in different formats. None of these work:
sparse_df.to_hdf("data/sparse-data-c0.h5", key='my', complevel=0, mode='w', format='table')
sparse_df.to_hdf("data/sparse-data-c9.h5", key='my', complevel=9, mode='w', format='table')
sparse_df.to_pickle("data/sparse-data.pkl.zip")
sparse_df.to_parquet("data/sparse-data.parquet.gzip", compression="gzip", allow_truncated_timestamps=True)
sparse_df.to_parquet("data/sparse-data.parquet", allow_truncated_timestamps=True)
sparse_df.to_pickle("data/sparse-data.pkl")
sparse_df.to_json("data/sparse-data.json")
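For completeness: the one writer I would expect to survive the sparse extension dtype is pickle, since it serializes the Python objects directly. A sketch (toy frame; not a claim that the other formats can be made to work):

```python
import os
import tempfile

import pandas as pd

# Toy sparse column with fill value 0.
df = pd.DataFrame({"D": [1, 0, 0, 3]}).astype(pd.SparseDtype("int", 0))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sparse.pkl")
    df.to_pickle(path)
    back = pd.read_pickle(path)

print(back.dtypes["D"])    # Sparse[int64, 0]
print(back["D"].tolist())  # [1, 0, 0, 3]
```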