Efficiently create sparse pivot tables in pandas?

Asked 27/7, 2015 at 19:26 Answered 12/9, 2023 at 14:8

Solved python pandas scipy scikit-learn sparse-matrix

I'm working turning a list of records with two columns (A and B) into a matrix representation. I have been using the pivot function within pandas, but the result ends up being fairly large. Does pandas support pivoting into a sparse format? I know I can pivot it and then turn it into some kind of sparse representation, but isn't as elegant as I would like. My end goal is to use it as the input for a predictive model.

Alternatively, is there some kind of sparse pivot capability outside of pandas?

edit: here is an example of a non-sparse pivot

import pandas as pd
frame=pd.DataFrame()
frame['person']=['me','you','him','you','him','me']
frame['thing']=['a','a','b','c','d','d']
frame['count']=[1,1,1,1,1,1]

frame

  person thing  count
0     me     a      1
1    you     a      1
2    him     b      1
3    you     c      1
4    him     d      1
5     me     d      1

frame.pivot('person','thing')

        count            
thing       a   b   c   d
person                   
him       NaN   1 NaN   1
me          1 NaN NaN   1
you         1 NaN   1 NaN

This creates a matrix that could contain all possible combinations of persons and things, but it is not sparse.

http://docs.scipy.org/doc/scipy/reference/sparse.html

Sparse matrices take up less space because they can imply things like NaN or 0. If I have a very large data set, this pivoting function can generate a matrix that should be sparse due to the large number of NaNs or 0s. I was hoping that I could save a lot of space/memory by generating something that was sparse right off the bat rather than creating a dense matrix and then converting it to sparse.

Ornery answered 27/7, 2015 at 19:26 Comment(5)

Could you provide some sample input, output, code ? – Stamp 27/7, 2015 at 21:11

what does sparse mean? – Cottonwood 27/7, 2015 at 21:14

@Cottonwood It's a mathematical term en.m.wikipedia.org/wiki/Sparse_matrix – Accident 27/7, 2015 at 22:1

Just added an example and an explanation. Thanks! – Ornery 28/7, 2015 at 12:22

Pivot tables are just ways to view your original data, which is already sparse (other than converting person and thing to integers) – Samples 29/7, 2015 at 16:33

The answer posted previously by @khammel was useful, but unfortunately no longer works due to changes in pandas and Python. The following should produce the same output:

from scipy.sparse import csr_matrix
from pandas.api.types import CategoricalDtype

person_c = CategoricalDtype(sorted(frame.person.unique()), ordered=True)
thing_c = CategoricalDtype(sorted(frame.thing.unique()), ordered=True)

row = frame.person.astype(person_c).cat.codes
col = frame.thing.astype(thing_c).cat.codes
sparse_matrix = csr_matrix((frame["count"], (row, col)), \
                           shape=(person_c.categories.size, thing_c.categories.size))

>>> sparse_matrix
<3x4 sparse matrix of type '<class 'numpy.int64'>'
     with 6 stored elements in Compressed Sparse Row format>

>>> sparse_matrix.todense()
matrix([[0, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 0, 1, 0]], dtype=int64)


dfs = pd.SparseDataFrame(sparse_matrix, \
                         index=person_c.categories, \
                         columns=thing_c.categories, \
                         default_fill_value=0)
>>> dfs
        a   b   c   d
 him    0   1   0   1
  me    1   0   0   1
 you    1   0   1   0

The main changes were:

.astype() no longer accepts "categorical". You have to create a CategoricalDtype object.
sort() doesn't work anymore

Other changes were more superficial:

using the category sizes instead of a length of the uniqued Series objects, just because I didn't want to make another object unnecessarily
the data input for the csr_matrix (frame["count"]) doesn't need to be a list object
pandas SparseDataFrame accepts a scipy.sparse object directly now

Sneaky answered 10/11, 2018 at 0:50 Comment(2)

Since Pandas 1.0.1 you need to replace pd.SparseDataFrame() with pd.DataFrame.sparse.from_spmatrix(). See: pandas.pydata.org/pandas-docs/stable/user_guide/… – Casiecasilda 3/3, 2020 at 12:52

This one was nice! Thank you very much. – Worthless 15/11, 2020 at 19:0

Here is a method that creates a sparse scipy matrix based on data and indices of person and thing. person_u and thing_u are lists representing the unique entries for your rows and columns of pivot you want to create. Note: this assumes that your count column already has the value you want in it.

from scipy.sparse import csr_matrix

person_u = list(sort(frame.person.unique()))
thing_u = list(sort(frame.thing.unique()))

data = frame['count'].tolist()
row = frame.person.astype('category', categories=person_u).cat.codes
col = frame.thing.astype('category', categories=thing_u).cat.codes
sparse_matrix = csr_matrix((data, (row, col)), shape=(len(person_u), len(thing_u)))

>>> sparse_matrix 
<3x4 sparse matrix of type '<type 'numpy.int64'>'
    with 6 stored elements in Compressed Sparse Row format>

>>> sparse_matrix.todense()

matrix([[0, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 0, 1, 0]])

Based on your original question, the scipy sparse matrix should be sufficient for your needs, but should you wish to have a sparse dataframe you can do the following:

dfs=pd.SparseDataFrame([ pd.SparseSeries(sparse_matrix[i].toarray().ravel(), fill_value=0) 
                              for i in np.arange(sparse_matrix.shape[0]) ], index=person_u, columns=thing_u, default_fill_value=0)

>>> dfs
     a  b  c  d
him  0  1  0  1
me   1  0  0  1
you  1  0  1  0

>>> type(dfs)
pandas.sparse.frame.SparseDataFrame

Goingson answered 28/7, 2015 at 14:33 Comment(4)

Thanks! I was really hoping to avoid creating a dense matrix and then using to_sparse() because doing so will still require the amount of memory needed for the dense matrix at some point or another. I feel like there are other Pandas functions that can output sparse data, but maybe I'm wrong or maybe I have to look elsewhere. – Ornery 28/7, 2015 at 14:53

Very interesting. My plan was to try something like this if there wasn't a solution out there, but I would have needed to learn a bit more about scipy sparse matrices first. Now I can learn from your code. Thanks! – Ornery 28/7, 2015 at 18:7

why do you sort the list, e.g. person_u = list(sort(frame.person.unique()))..it seems that the final matrix (sparse_matrix) not corresponds to the dataframe – Smithy 30/11, 2017 at 9:10

pandas.DataFrame.astype does not take categories= argument anymore! Your answer needs to be updated based on new version – Mammiemammiferous 27/1, 2023 at 14:6

The answer posted previously by @khammel was useful, but unfortunately no longer works due to changes in pandas and Python. The following should produce the same output:

from scipy.sparse import csr_matrix
from pandas.api.types import CategoricalDtype

person_c = CategoricalDtype(sorted(frame.person.unique()), ordered=True)
thing_c = CategoricalDtype(sorted(frame.thing.unique()), ordered=True)

row = frame.person.astype(person_c).cat.codes
col = frame.thing.astype(thing_c).cat.codes
sparse_matrix = csr_matrix((frame["count"], (row, col)), \
                           shape=(person_c.categories.size, thing_c.categories.size))

>>> sparse_matrix
<3x4 sparse matrix of type '<class 'numpy.int64'>'
     with 6 stored elements in Compressed Sparse Row format>

>>> sparse_matrix.todense()
matrix([[0, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 0, 1, 0]], dtype=int64)


dfs = pd.SparseDataFrame(sparse_matrix, \
                         index=person_c.categories, \
                         columns=thing_c.categories, \
                         default_fill_value=0)
>>> dfs
        a   b   c   d
 him    0   1   0   1
  me    1   0   0   1
 you    1   0   1   0

The main changes were:

.astype() no longer accepts "categorical". You have to create a CategoricalDtype object.
sort() doesn't work anymore

Other changes were more superficial:

using the category sizes instead of a length of the uniqued Series objects, just because I didn't want to make another object unnecessarily
the data input for the csr_matrix (frame["count"]) doesn't need to be a list object
pandas SparseDataFrame accepts a scipy.sparse object directly now

Sneaky answered 10/11, 2018 at 0:50 Comment(2)

Since Pandas 1.0.1 you need to replace pd.SparseDataFrame() with pd.DataFrame.sparse.from_spmatrix(). See: pandas.pydata.org/pandas-docs/stable/user_guide/… – Casiecasilda 3/3, 2020 at 12:52

This one was nice! Thank you very much. – Worthless 15/11, 2020 at 19:0

I had a similar problem and I stumbled over this post. The only difference was that that I had two columns in the DataFrame that define the "row dimension" (i) of the output matrix. I thought this might be an interesting generalisation, I used the grouper:

# function
import pandas as pd

from scipy.sparse import csr_matrix

def df_to_sm(data, vars_i, vars_j):
    grpr_i = data.groupby(vars_i).grouper

    idx_i = grpr_i.group_info[0]

    grpr_j = data.groupby(vars_j).grouper

    idx_j = grpr_j.group_info[0]

    data_sm = csr_matrix((data['val'].values, (idx_i, idx_j)),
                         shape=(grpr_i.ngroups, grpr_j.ngroups))

    return data_sm, grpr_i, grpr_j


# example
data = pd.DataFrame({'var_i_1' : ['a1', 'a1', 'a1', 'a2', 'a2', 'a3'],
                     'var_i_2' : ['b2', 'b1', 'b1', 'b1', 'b1', 'b4'],
                     'var_j_1' : ['c2', 'c3', 'c2', 'c1', 'c2', 'c3'],
                     'val' : [1, 2, 3, 4, 5, 6]})

data_sm, _, _ = df_to_sm(data, ['var_i_1', 'var_i_2'], ['var_j_1'])

data_sm.todense()

Silicone answered 23/7, 2016 at 21:18 Comment(1)

Nice! I'm not using sparse pivots at the moment, but I'll be sure to check this out. Thanks for contributing! – Ornery 25/7, 2016 at 15:13

Here is an answer that updates the approach in the answer by @Alnilam to use up-to-date pandas libraries which no longer contain all of the functions in that answer.

from scipy.sparse import csr_matrix
from pandas.api.types import CategoricalDtype

rcLabel, vLabel = ('person', 'thing'), 'count'
rcCat = [CategoricalDtype(sorted(frame[col].unique()), ordered=True) for col in rcLabel]
rc = [frame[column].astype(aType).cat.codes for column, aType in zip(rcLabel, rcCat)]
mat = csr_matrix((frame[vLabel], rc), shape=tuple(cat.categories.size for cat in rcCat))
dfPivot = ( pd.DataFrame.sparse.from_spmatrix(
    mat, index=rcCat[0].categories, columns=rcCat[1].categories) )

Byrle answered 19/3, 2023 at 21:1 Comment(0)

I don't know when it changed, but in Pandas v2.1.0 if your values column is a pd.SparseDtype, then pivot() and pivot_table() will generate sparse columns.

Slavish answered 12/9, 2023 at 14:8 Comment(3)

This is incorrect. After assigning values to pd.SparseDtype, though the pivoted table dtypes will also be pd.SparseDtype, the pivoted table itself will be dense. Continuing with the original example, frame["count"] = frame["count"].astype(pd.SparseDtype(dtype=int, fill_value=0)) frame.pivot(index='person',columns='thing',values="count").sparse.density >>> 1.0 So, the pivoted table is fully dense (populated with NaN values) and is not sparse despite its dtypes being pd.SparseDtype. – Gonium 30/5, 2024 at 0:13

That sounds like a bug, maybe open an issue on the pandas github? – Slavish 30/5, 2024 at 13:29

I don't think it's a bug. I could not find any documentation in Panda's implementation of pivot that it can preserve sparsity when using dataframes with sparse values. It could be requested as a feature in future versions of Pandas. – Gonium 30/5, 2024 at 19:15

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags