How to create lazy_evaluated dataframe columns in Pandas

Asked 26/10, 2013 at 10:20 Answered 5/10, 2023 at 14:11

I often have a big dataframe df holding the basic data, and need to create many more columns holding derived data calculated from the basic columns.

I can do that in Pandas like:

df['derivative_col1'] = df['basic_col1'] + df['basic_col2']
df['derivative_col2'] = df['basic_col1'] * df['basic_col2']
....
df['derivative_coln'] = func(list_of_basic_cols)

etc. Pandas calculates and allocates the memory for every derivative column immediately, at assignment time.

What I want now is a lazy evaluation mechanism that postpones the calculation and memory allocation of derivative columns until the moment they are actually needed. Something like defining the lazy-evaluated columns as:

df['derivative_col1'] = pandas.lazy_eval(df['basic_col1'] + df['basic_col2'])
df['derivative_col2'] = pandas.lazy_eval(df['basic_col1'] * df['basic_col2'])

That would save time and memory, like a Python 'yield' generator: issuing df['derivative_col2'] would trigger only that specific calculation and memory allocation.

So how can I do lazy_eval() in Pandas? Any tips, thoughts, or references are welcome.

Pipestone answered 26/10, 2013 at 10:20 Comment(1)

Great question. Don't know if pandas has such a thing, though. The idea reminds me of SQL computed columns in views. – Deka 26/10, 2013 at 15:5

Starting in pandas 0.13 (to be released very soon), you can do something like this, using a generator to evaluate formulas dynamically. In-line assignment via eval will be an additional feature in 0.13.

In [19]: df = DataFrame(randn(5, 2), columns=['a', 'b'])

In [20]: df
Out[20]: 
          a         b
0 -1.949107 -0.763762
1 -0.382173 -0.970349
2  0.202116  0.094344
3 -1.225579 -0.447545
4  1.739508 -0.400829

In [21]: formulas = [ ('c','a+b'), ('d', 'a*c')]

Create a generator that evaluates each formula using eval, assigns the result to the named column, then yields the updated frame.

In [22]: def lazy(x, formulas):
   ....:     for col, f in formulas:
   ....:         x[col] = x.eval(f)
   ....:         yield x
   ....:

In action

In [23]: gen = lazy(df,formulas)

In [24]: next(gen)
Out[24]: 
          a         b         c
0 -1.949107 -0.763762 -2.712869
1 -0.382173 -0.970349 -1.352522
2  0.202116  0.094344  0.296459
3 -1.225579 -0.447545 -1.673123
4  1.739508 -0.400829  1.338679

In [25]: next(gen)
Out[25]: 
          a         b         c         d
0 -1.949107 -0.763762 -2.712869  5.287670
1 -0.382173 -0.970349 -1.352522  0.516897
2  0.202116  0.094344  0.296459  0.059919
3 -1.225579 -0.447545 -1.673123  2.050545
4  1.739508 -0.400829  1.338679  2.328644

So the evaluation order is determined by the user (it is not on-demand). In theory numba is going to support this, so pandas could possibly support it as a backend for eval (which currently uses numexpr for immediate evaluation).
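For reference, the in-line assignment via eval mentioned above looks roughly like this in recent pandas versions. This is a minimal sketch, assuming a pandas release in which DataFrame.eval defaults to inplace=False; the column names and the seeded random data are only illustrative. Note that eval still computes the column immediately, so this is not lazy by itself.

```python
import numpy as np
import pandas as pd

# Illustrative data: two random columns, seeded so the run is repeatable.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((5, 2)), columns=["a", "b"])

# With the default inplace=False, eval returns a new frame and leaves df untouched.
with_c = df.eval("c = a + b")

# With inplace=True, the new column is added to df itself.
df.eval("d = a * b", inplace=True)

print(list(with_c.columns))  # columns of the returned frame
print(list(df.columns))      # columns of the original frame after the in-place call
```

Combined with the generator above, each formula string is only evaluated when the generator is advanced.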

my 2c.

Lazy evaluation is nice, but it can easily be achieved using Python's own continuation/generator features, so building it into pandas, while possible, is quite tricky and would need a really nice use case to be generally useful.

Shermanshermie answered 26/10, 2013 at 20:39 Comment(1)

It is nice to have the 'formula' and eval feature in the coming version update. I would also like to know more about how to use the df['lazy_eval_col_x'] syntax to trigger the on-demand calculation. – Pipestone 27/10, 2013 at 10:17

You could subclass DataFrame, and add the column as a property. For example,

import pandas as pd

class LazyFrame(pd.DataFrame):
    @property
    def derivative_col1(self):
        self['derivative_col1'] = result = self['basic_col1'] + self['basic_col2']
        return result

x = LazyFrame({'basic_col1':[1,2,3],
               'basic_col2':[4,5,6]})
print(x)
#    basic_col1  basic_col2
# 0           1           4
# 1           2           5
# 2           3           6

Accessing the property (via x.derivative_col1, below) calls the derivative_col1 function defined in LazyFrame. This function computes the result and adds the derived column to the LazyFrame instance:

print(x.derivative_col1)
# 0    5
# 1    7
# 2    9

print(x)
#    basic_col1  basic_col2  derivative_col1
# 0           1           4                5
# 1           2           5                7
# 2           3           6                9

Note that if you modify a basic column:

x['basic_col1'] *= 10

the derived column is not automatically updated:

print(x['derivative_col1'])
# 0    5
# 1    7
# 2    9

But if you access the property, the values are recomputed:

print(x.derivative_col1)
# 0    14
# 1    25
# 2    36

print(x)
#    basic_col1  basic_col2  derivative_col1
# 0          10           4               14
# 1          20           5               25
# 2          30           6               36
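If there are many derived columns, writing one property per column gets repetitive. One possible way around that, sketched here under the same assumptions as the class above, is a small factory that builds the property from a column name and a function. The helper name lazy_column is hypothetical, not part of pandas.

```python
import pandas as pd


def lazy_column(name, func):
    """Build a property that computes column `name` on access and stores it."""
    def getter(self):
        # Compute from the current basic columns, then cache as a real column.
        self[name] = result = func(self)
        return result
    return property(getter)


class LazyFrame(pd.DataFrame):
    # Each derived column is declared once, as a formula over the frame.
    derivative_col1 = lazy_column(
        "derivative_col1", lambda d: d["basic_col1"] + d["basic_col2"])
    derivative_col2 = lazy_column(
        "derivative_col2", lambda d: d["basic_col1"] * d["basic_col2"])


x = LazyFrame({"basic_col1": [1, 2, 3],
               "basic_col2": [4, 5, 6]})

# Nothing is computed until a property is accessed.
columns_before = list(x.columns)
product = x.derivative_col2
columns_after = list(x.columns)
```

As with the hand-written property, attribute access (x.derivative_col2) recomputes the values each time, while item access (x['derivative_col2']) returns whatever was stored last.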

Brost answered 5/2, 2014 at 11:35 Comment(0)

Before modifying or extending pandas' data structures, consider using a library designed to lighten the processing load, such as Dask or Apache Spark.

Pandas' documentation suggests some alternatives for extending its data-structures:

There are some easier alternatives before considering subclassing pandas data structures.

1. Extensible method chains with pipe
2. Use composition.
3. Extending by registering an accessor
4. Extending by extension type
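As an illustration of the accessor route, here is a minimal sketch of how lazy columns might look without subclassing. pd.api.extensions.register_dataframe_accessor is the real pandas decorator; the accessor name "lazy", the class LazyAccessor, and its register and get methods are hypothetical names chosen for this example.

```python
import pandas as pd


@pd.api.extensions.register_dataframe_accessor("lazy")  # "lazy" is a hypothetical name
class LazyAccessor:
    def __init__(self, pandas_obj):
        self._obj = pandas_obj
        self._formulas = {}  # column name -> function of the frame

    def register(self, name, func):
        # Only remember the recipe; nothing is computed yet.
        self._formulas[name] = func

    def get(self, name):
        # Compute and store the column the first time it is requested.
        if name not in self._obj.columns:
            self._obj[name] = self._formulas[name](self._obj)
        return self._obj[name]


df = pd.DataFrame({"basic_col1": [1, 2, 3], "basic_col2": [4, 5, 6]})

# Keep a reference to the accessor so the registered formulas stay with it.
lazy = df.lazy
lazy.register("derivative_col1", lambda d: d["basic_col1"] + d["basic_col2"])

computed_before = "derivative_col1" in df.columns
values = lazy.get("derivative_col1").tolist()
```

The trade-off is that the column is requested through the accessor rather than through df['derivative_col1'], and the registered formulas are tied to this one frame object, so they do not carry over to copies or slices.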

Subclassing pandas.DataFrame

Use

Let's assume you already have a DataFrame named df, built from the CustomDataFrame class defined below and already populated with data (for the sake of this example):

def column1_column2_sum(data):
    return data['column1'] + data['column2']

df[column1_column2_sum.__name__] = column1_column2_sum

Class definition

The all_columns method assumes only one column level, for simplicity.

from typing import Callable, Dict
import pandas as pd


class CustomDataFrame(pd.DataFrame):
    _metadata = ["_lazy_columns"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._lazy_columns = {}

    @property
    def _constructor(self):
        return CustomDataFrame

    @property
    def all_columns(self):
        return list(super().columns) + list(self._lazy_columns.keys())

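    def register_lazy_series(self, columns: Dict[str, Callable]):
        # Register several lazy columns at once: {column name: function of the frame}.
        self._lazy_columns.update(columns)

    def __setitem__(self, key, value) -> None:
        # A callable is stored as a pending computation, not as column data.
        if callable(value):
            self._lazy_columns[key] = value
            return
        super().__setitem__(key, value)

    def __getitem__(self, key):
        try:
            is_lazy = key in self._lazy_columns
        except TypeError:
            # Unhashable keys (a list of columns, a boolean mask) are never lazy.
            is_lazy = False
        if is_lazy:
            # Compute on first access, store the result as a real column,
            # then drop the pending computation.
            compute = self._lazy_columns[key]
            self[key] = compute(self)
            self._lazy_columns.pop(key)

        return super().__getitem__(key)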

Feel free to suggest any modification to the above code in the comments.

Tombaugh answered 5/10, 2023 at 14:11 Comment(0)
