Vectorize integration of pandas.DataFrame

I have a DataFrame of force-displacement data. The displacement array has been set to the DataFrame index, and the columns are my various force curves for different tests.

How do I calculate the work done (which is "the area under the curve")?

I have looked at numpy.trapz, which seems to do what I need, but as far as I can tell it means looping over each column like this:

import numpy as np
import pandas as pd 

forces = pd.read_csv(...)
work_done = {}

for col in forces.columns:
    work_done[col] = np.trapz(forces.loc[col], forces.index))

I was hoping to create a new DataFrame of the areas under the curves rather than a dict, and thought that DataFrame.apply() or something might be appropriate but don't know where to start looking.

In short:

  1. Can I avoid the looping?
  2. Can I create a DataFrame of work done directly?

Thanks in advance for any help.

Dendro answered 31/12, 2015 at 15:53 Comment(0)

You could vectorize this by passing the whole DataFrame to np.trapz and specifying the axis= argument, e.g.:

import numpy as np
import pandas as pd

# some random input data
gen = np.random.RandomState(0)
x = gen.randn(100, 10)
names = [chr(97 + i) for i in range(10)]
forces = pd.DataFrame(x, columns=names)

# vectorized version
wrk = np.trapz(forces, x=forces.index, axis=0)
work_done = pd.DataFrame(wrk[None, :], columns=forces.columns)

# non-vectorized version for comparison
work_done2 = {}
for col in forces.columns:
    work_done2[col] = np.trapz(forces.loc[:, col], forces.index)

These give the following output:

from pprint import pprint

pprint(work_done.T)
#            0
# a -24.331560
# b -10.347663
# c   4.662212
# d -12.536040
# e -10.276861
# f   3.406740
# g  -3.712674
# h  -9.508454
# i  -1.044931
# j  15.165782

pprint(work_done2)
# {'a': -24.331559643023006,
#  'b': -10.347663159421426,
#  'c': 4.6622123535050459,
#  'd': -12.536039649161403,
#  'e': -10.276861220217308,
#  'f': 3.4067399176289994,
#  'g': -3.7126739591045541,
#  'h': -9.5084536839888187,
#  'i': -1.0449311137294459,
#  'j': 15.165781517623724}

There are a couple of other problems with your original example. col is a column name rather than a row index, so it needs to index the second dimension of your dataframe (i.e. .loc[:, col] rather than .loc[col]). Also, you have an extra trailing parenthesis on the last line.


Edit:

You could also generate the output DataFrame directly by .applying np.trapz to each column, e.g.:

work_done = forces.apply(np.trapz, axis=0, args=(forces.index,))

However, this isn't really 'proper' vectorization - you are still calling np.trapz separately on each column. You can see this by comparing the speed of the .apply version against calling np.trapz directly:

In [1]: %timeit forces.apply(np.trapz, axis=0, args=(forces.index,))
1000 loops, best of 3: 582 µs per loop

In [2]: %timeit np.trapz(forces, x=forces.index, axis=0)
The slowest run took 6.04 times longer than the fastest. This could mean that an
intermediate result is being cached 
10000 loops, best of 3: 53.4 µs per loop

This isn't an entirely fair comparison, since the second version excludes the extra time taken to construct the DataFrame from the output numpy array, but this should still be smaller than the difference in time taken to perform the actual integration.
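For a slightly fairer check you could include the DataFrame construction in the expression you time, e.g. (a sketch reusing the forces defined above; no new timings are claimed here):

# expression to hand to %timeit: the integration plus wrapping the
# result back into a one-row DataFrame, as in the answer above
pd.DataFrame(np.trapz(forces, x=forces.index, axis=0)[None, :],
             columns=forces.columns)

Even with that overhead included, the directly vectorized call should stay well ahead of the .apply version.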

Flexuosity answered 31/12, 2015 at 17:46 Comment(4)
Excellent answer, thanks. I'll test it at some point over the weekend and tick the answer if it works.Dendro
If I use the .apply option, the result of my integration is a Timedelta. Is there a way to avoid this?Photolysis
@rubenbaetens Presumably that's because your columns contain timedelta values. What do you want the type of the result to be? You could, for example, use [total_seconds() ](pandas.pydata.org/pandas-docs/stable/generated/…) to convert the timedelta columns to floats.Flexuosity
@Flexuosity my x-axis is a series created as pd.to_datetime(time1, unit='s'). In fact, this sometimes even results in the wrong value for integration with the .apply option.Photolysis
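Following up on the comments above about datetime axes, here is a minimal sketch of the total_seconds() conversion; the data and variable names below are purely illustrative and not taken from the question:

import numpy as np
import pandas as pd

# illustrative force trace sampled at datetime timestamps
t = pd.to_datetime([0, 1, 2, 4], unit='s')
force = pd.Series([0.0, 1.0, 2.0, 4.0], index=t)

# convert the datetime index to float seconds before integrating,
# so np.trapz returns a plain float rather than a Timedelta
x_seconds = (force.index - force.index[0]).total_seconds()
work = np.trapz(force.values, x=x_seconds)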

Here's how to get the cumulative integral along a DataFrame column using the trapezoidal rule. The following attaches an integrate method to pandas.Series, so you can pick the trapezoidal, cumulative trapezoidal, Simpson's, or Romberg's rule (source):

import pandas as pd
from scipy import integrate
import numpy as np

#%% Setup Functions

def integrate_method(self, how='trapz', unit='s'):
    '''Numerically integrate the time series.

    @param how: the method to use (trapz by default)
    @param unit: currently unused
    @return: the integral (a scalar, or an array for cumtrapz)

    Available methods:
     * trapz - trapezoidal
     * cumtrapz - cumulative trapezoidal
     * simps - Simpson's rule
     * romb - Romberg's rule

    See http://docs.scipy.org/doc/scipy/reference/integrate.html for the method details.
    or the source code
    https://github.com/scipy/scipy/blob/master/scipy/integrate/quadrature.py
    '''
    available_rules = set(['trapz', 'cumtrapz', 'simps', 'romb'])
    if how in available_rules:
        rule = getattr(integrate, how)
    else:
        print('Unsupported integration rule: %s' % (how))
        print('Expecting one of these sample-based integration rules: %s' % (str(list(available_rules))))
        raise AttributeError

    if how == 'cumtrapz':
        result = rule(self.values)
        result = np.insert(result, 0, 0, axis=0)        
    else: 
        result = rule(self.values)
    return result

pd.Series.integrate = integrate_method

#%% Setup (random) data
gen = np.random.RandomState(0)
x = gen.randn(100, 10)
names = [chr(97 + i) for i in range(10)]
df = pd.DataFrame(x, columns=names)


#%% Cumulative Integral
df_cumulative_integral = df.apply(lambda x: x.integrate('cumtrapz'))
df_integral = df.apply(lambda x: x.integrate('trapz'))

df_do_they_match = df_cumulative_integral.tail(1).round(3) == df_integral.round(3)

if df_do_they_match.all().all():
    print("Trapz produces the last row of cumtrapz")
Griefstricken answered 22/1, 2018 at 23:34 Comment(1)
copy of nbviewer.jupyter.org/gist/metakermit/5720498, unless you are @metakermitSanctum
