Efficiently create sparse pivot tables in pandas?
Asked Answered
O

5

41

I'm working turning a list of records with two columns (A and B) into a matrix representation. I have been using the pivot function within pandas, but the result ends up being fairly large. Does pandas support pivoting into a sparse format? I know I can pivot it and then turn it into some kind of sparse representation, but isn't as elegant as I would like. My end goal is to use it as the input for a predictive model.

Alternatively, is there some kind of sparse pivot capability outside of pandas?

edit: here is an example of a non-sparse pivot

import pandas as pd
frame=pd.DataFrame()
frame['person']=['me','you','him','you','him','me']
frame['thing']=['a','a','b','c','d','d']
frame['count']=[1,1,1,1,1,1]

frame

  person thing  count
0     me     a      1
1    you     a      1
2    him     b      1
3    you     c      1
4    him     d      1
5     me     d      1

frame.pivot('person','thing')

        count            
thing       a   b   c   d
person                   
him       NaN   1 NaN   1
me          1 NaN NaN   1
you         1 NaN   1 NaN

This creates a matrix that could contain all possible combinations of persons and things, but it is not sparse.

http://docs.scipy.org/doc/scipy/reference/sparse.html

Sparse matrices take up less space because they can imply things like NaN or 0. If I have a very large data set, this pivoting function can generate a matrix that should be sparse due to the large number of NaNs or 0s. I was hoping that I could save a lot of space/memory by generating something that was sparse right off the bat rather than creating a dense matrix and then converting it to sparse.

Ornery answered 27/7, 2015 at 19:26 Comment(5)
Could you provide some sample input, output, code ?Stamp
what does sparse mean?Cottonwood
@Cottonwood It's a mathematical term en.m.wikipedia.org/wiki/Sparse_matrixAccident
Just added an example and an explanation. Thanks!Ornery
Pivot tables are just ways to view your original data, which is already sparse (other than converting person and thing to integers)Samples
S
31

The answer posted previously by @khammel was useful, but unfortunately no longer works due to changes in pandas and Python. The following should produce the same output:

from scipy.sparse import csr_matrix
from pandas.api.types import CategoricalDtype

person_c = CategoricalDtype(sorted(frame.person.unique()), ordered=True)
thing_c = CategoricalDtype(sorted(frame.thing.unique()), ordered=True)

row = frame.person.astype(person_c).cat.codes
col = frame.thing.astype(thing_c).cat.codes
sparse_matrix = csr_matrix((frame["count"], (row, col)), \
                           shape=(person_c.categories.size, thing_c.categories.size))

>>> sparse_matrix
<3x4 sparse matrix of type '<class 'numpy.int64'>'
     with 6 stored elements in Compressed Sparse Row format>

>>> sparse_matrix.todense()
matrix([[0, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 0, 1, 0]], dtype=int64)


dfs = pd.SparseDataFrame(sparse_matrix, \
                         index=person_c.categories, \
                         columns=thing_c.categories, \
                         default_fill_value=0)
>>> dfs
        a   b   c   d
 him    0   1   0   1
  me    1   0   0   1
 you    1   0   1   0

The main changes were:

  • .astype() no longer accepts "categorical". You have to create a CategoricalDtype object.
  • sort() doesn't work anymore

Other changes were more superficial:

  • using the category sizes instead of a length of the uniqued Series objects, just because I didn't want to make another object unnecessarily
  • the data input for the csr_matrix (frame["count"]) doesn't need to be a list object
  • pandas SparseDataFrame accepts a scipy.sparse object directly now
Sneaky answered 10/11, 2018 at 0:50 Comment(2)
Since Pandas 1.0.1 you need to replace pd.SparseDataFrame() with pd.DataFrame.sparse.from_spmatrix(). See: pandas.pydata.org/pandas-docs/stable/user_guide/…Casiecasilda
This one was nice! Thank you very much.Worthless
G
35

Here is a method that creates a sparse scipy matrix based on data and indices of person and thing. person_u and thing_u are lists representing the unique entries for your rows and columns of pivot you want to create. Note: this assumes that your count column already has the value you want in it.

from scipy.sparse import csr_matrix

person_u = list(sort(frame.person.unique()))
thing_u = list(sort(frame.thing.unique()))

data = frame['count'].tolist()
row = frame.person.astype('category', categories=person_u).cat.codes
col = frame.thing.astype('category', categories=thing_u).cat.codes
sparse_matrix = csr_matrix((data, (row, col)), shape=(len(person_u), len(thing_u)))

>>> sparse_matrix 
<3x4 sparse matrix of type '<type 'numpy.int64'>'
    with 6 stored elements in Compressed Sparse Row format>

>>> sparse_matrix.todense()

matrix([[0, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 0, 1, 0]])

Based on your original question, the scipy sparse matrix should be sufficient for your needs, but should you wish to have a sparse dataframe you can do the following:

dfs=pd.SparseDataFrame([ pd.SparseSeries(sparse_matrix[i].toarray().ravel(), fill_value=0) 
                              for i in np.arange(sparse_matrix.shape[0]) ], index=person_u, columns=thing_u, default_fill_value=0)

>>> dfs
     a  b  c  d
him  0  1  0  1
me   1  0  0  1
you  1  0  1  0

>>> type(dfs)
pandas.sparse.frame.SparseDataFrame
Goingson answered 28/7, 2015 at 14:33 Comment(4)
Thanks! I was really hoping to avoid creating a dense matrix and then using to_sparse() because doing so will still require the amount of memory needed for the dense matrix at some point or another. I feel like there are other Pandas functions that can output sparse data, but maybe I'm wrong or maybe I have to look elsewhere.Ornery
Very interesting. My plan was to try something like this if there wasn't a solution out there, but I would have needed to learn a bit more about scipy sparse matrices first. Now I can learn from your code. Thanks!Ornery
why do you sort the list, e.g. person_u = list(sort(frame.person.unique()))..it seems that the final matrix (sparse_matrix) not corresponds to the dataframeSmithy
pandas.DataFrame.astype does not take categories= argument anymore! Your answer needs to be updated based on new versionMammiemammiferous
S
31

The answer posted previously by @khammel was useful, but unfortunately no longer works due to changes in pandas and Python. The following should produce the same output:

from scipy.sparse import csr_matrix
from pandas.api.types import CategoricalDtype

person_c = CategoricalDtype(sorted(frame.person.unique()), ordered=True)
thing_c = CategoricalDtype(sorted(frame.thing.unique()), ordered=True)

row = frame.person.astype(person_c).cat.codes
col = frame.thing.astype(thing_c).cat.codes
sparse_matrix = csr_matrix((frame["count"], (row, col)), \
                           shape=(person_c.categories.size, thing_c.categories.size))

>>> sparse_matrix
<3x4 sparse matrix of type '<class 'numpy.int64'>'
     with 6 stored elements in Compressed Sparse Row format>

>>> sparse_matrix.todense()
matrix([[0, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 0, 1, 0]], dtype=int64)


dfs = pd.SparseDataFrame(sparse_matrix, \
                         index=person_c.categories, \
                         columns=thing_c.categories, \
                         default_fill_value=0)
>>> dfs
        a   b   c   d
 him    0   1   0   1
  me    1   0   0   1
 you    1   0   1   0

The main changes were:

  • .astype() no longer accepts "categorical". You have to create a CategoricalDtype object.
  • sort() doesn't work anymore

Other changes were more superficial:

  • using the category sizes instead of a length of the uniqued Series objects, just because I didn't want to make another object unnecessarily
  • the data input for the csr_matrix (frame["count"]) doesn't need to be a list object
  • pandas SparseDataFrame accepts a scipy.sparse object directly now
Sneaky answered 10/11, 2018 at 0:50 Comment(2)
Since Pandas 1.0.1 you need to replace pd.SparseDataFrame() with pd.DataFrame.sparse.from_spmatrix(). See: pandas.pydata.org/pandas-docs/stable/user_guide/…Casiecasilda
This one was nice! Thank you very much.Worthless
S
4

I had a similar problem and I stumbled over this post. The only difference was that that I had two columns in the DataFrame that define the "row dimension" (i) of the output matrix. I thought this might be an interesting generalisation, I used the grouper:

# function
import pandas as pd

from scipy.sparse import csr_matrix

def df_to_sm(data, vars_i, vars_j):
    grpr_i = data.groupby(vars_i).grouper

    idx_i = grpr_i.group_info[0]

    grpr_j = data.groupby(vars_j).grouper

    idx_j = grpr_j.group_info[0]

    data_sm = csr_matrix((data['val'].values, (idx_i, idx_j)),
                         shape=(grpr_i.ngroups, grpr_j.ngroups))

    return data_sm, grpr_i, grpr_j


# example
data = pd.DataFrame({'var_i_1' : ['a1', 'a1', 'a1', 'a2', 'a2', 'a3'],
                     'var_i_2' : ['b2', 'b1', 'b1', 'b1', 'b1', 'b4'],
                     'var_j_1' : ['c2', 'c3', 'c2', 'c1', 'c2', 'c3'],
                     'val' : [1, 2, 3, 4, 5, 6]})

data_sm, _, _ = df_to_sm(data, ['var_i_1', 'var_i_2'], ['var_j_1'])

data_sm.todense()
Silicone answered 23/7, 2016 at 21:18 Comment(1)
Nice! I'm not using sparse pivots at the moment, but I'll be sure to check this out. Thanks for contributing!Ornery
B
1

Here is an answer that updates the approach in the answer by @Alnilam to use up-to-date pandas libraries which no longer contain all of the functions in that answer.

from scipy.sparse import csr_matrix
from pandas.api.types import CategoricalDtype

rcLabel, vLabel = ('person', 'thing'), 'count'
rcCat = [CategoricalDtype(sorted(frame[col].unique()), ordered=True) for col in rcLabel]
rc = [frame[column].astype(aType).cat.codes for column, aType in zip(rcLabel, rcCat)]
mat = csr_matrix((frame[vLabel], rc), shape=tuple(cat.categories.size for cat in rcCat))
dfPivot = ( pd.DataFrame.sparse.from_spmatrix(
    mat, index=rcCat[0].categories, columns=rcCat[1].categories) )
Byrle answered 19/3, 2023 at 21:1 Comment(0)
S
0

I don't know when it changed, but in Pandas v2.1.0 if your values column is a pd.SparseDtype, then pivot() and pivot_table() will generate sparse columns.

Slavish answered 12/9, 2023 at 14:8 Comment(3)
This is incorrect. After assigning values to pd.SparseDtype, though the pivoted table dtypes will also be pd.SparseDtype, the pivoted table itself will be dense. Continuing with the original example, frame["count"] = frame["count"].astype(pd.SparseDtype(dtype=int, fill_value=0)) frame.pivot(index='person',columns='thing',values="count").sparse.density >>> 1.0 So, the pivoted table is fully dense (populated with NaN values) and is not sparse despite its dtypes being pd.SparseDtype.Gonium
That sounds like a bug, maybe open an issue on the pandas github?Slavish
I don't think it's a bug. I could not find any documentation in Panda's implementation of pivot that it can preserve sparsity when using dataframes with sparse values. It could be requested as a feature in future versions of Pandas.Gonium

© 2022 - 2025 — McMap. All rights reserved.