"Reduce" function for Series

Is there an analog for reduce for a pandas Series?

For example, the analog for map is pd.Series.apply, but I can't find any analog for reduce.
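
(For reference, functools.reduce does accept a Series, since a Series is iterable; there is just no dedicated Series method for it. A minimal sketch with made-up data:)

from functools import reduce
import pandas as pd

s = pd.Series([1, 2, 3])
doubled = s.apply(lambda x: x * 2)     # the map analog: element-wise
total = reduce(lambda a, b: a + b, s)  # a reduce analog: folds the values -> 6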


My application: I have a pandas Series of lists:

>>> business["categories"].head()

0                      ['Doctors', 'Health & Medical']
1                                        ['Nightlife']
2                 ['Active Life', 'Mini Golf', 'Golf']
3    ['Shopping', 'Home Services', 'Internet Servic...
4    ['Bars', 'American (New)', 'Nightlife', 'Loung...
Name: categories, dtype: object

I'd like to merge the Series of lists together using reduce, like so:

from functools import reduce
categories = reduce(lambda l1, l2: l1 + l2, categories)

but this takes a horrifically long time: concatenating two lists with + copies both operands, so folding n lists this way is O(n²) overall. I'm hoping that pd.Series has a vectorized way to perform this faster.
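
(For context, a sketch of why the repeated + is quadratic: each step copies the whole accumulator before appending the next list.)

acc = []
for lst in list_of_lists:   # list_of_lists stands in for the Series' values
    acc = acc + lst         # copies len(acc) + len(lst) elements every iteration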

Conventionalism answered 26/1, 2016 at 0:18 Comment(0)

With itertools.chain() on the values

This could be faster:

from itertools import chain
categories = list(chain.from_iterable(categories.values))

Performance

from functools import reduce
from itertools import chain

categories = pd.Series([['a', 'b'], ['c', 'd', 'e']] * 1000)

%timeit list(chain.from_iterable(categories.values))
1000 loops, best of 3: 231 µs per loop

%timeit list(chain(*categories.values.flat))
1000 loops, best of 3: 237 µs per loop

%timeit reduce(lambda l1, l2: l1 + l2, categories)
100 loops, best of 3: 15.8 ms per loop

For this data set the chaining is about 68x faster.

Vectorization?

Vectorization works when you have native NumPy data types (pandas uses NumPy for its data, after all). Since we already have lists in the Series and want a list as the result, it is rather unlikely that vectorization will speed things up. The conversion between standard Python objects and pandas/NumPy data types would likely eat up any performance gained from the vectorization. I made one attempt to vectorize the algorithm in another answer.
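
As an aside, pandas 0.25+ also ships Series.explode(), which flattens a Series of lists into one element per row while staying in pandas. Given the conversion costs described above it is unlikely to beat chain(), but it keeps the result as a Series (a sketch, assuming pandas >= 0.25):

flat = categories.explode()   # one row per list element, index repeated per source row
flat_list = flat.tolist()     # back to a plain Python list if needed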

Jibe answered 26/1, 2016 at 0:34 Comment(5)
Interesting. I'd be interested in how those optimizations for chain are implemented under the hood. – Conventionalism
The reduce builds a lot of intermediate lists that all require memory allocation, and allocating memory is slow. Using chain significantly reduces the number of memory allocations. – Shriner
It works, but I was hoping for a more vectorized approach. For now, I'll abstain from choosing this as the answer, even though it's very good. – Conventionalism
I added a vectorized solution in another answer, but it is much slower. See the explanation above for why. – Shriner
I just ran the performance metrics, and on my machine the second algorithm was consistently ~30 µs faster. Maybe you can run them again and update the answer? Some Python performance characteristics may have changed. – Schwerin

Vectorized but slow

You can use NumPy's concatenate:

import numpy as np

list(np.concatenate(categories.values))

Performance

But we already have lists, i.e. Python objects. The vectorization has to switch back and forth between Python objects and NumPy data types, and this makes things slow:

categories = pd.Series([['a', 'b'], ['c', 'd', 'e']] * 1000)

%timeit list(np.concatenate(categories.values))
100 loops, best of 3: 7.66 ms per loop

%timeit np.concatenate(categories.values)
100 loops, best of 3: 5.33 ms per loop

%timeit list(chain.from_iterable(categories.values))
1000 loops, best of 3: 231 µs per loop
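
For comparison, a sketch of the case the conversion argument does not apply to: if the data were NumPy arrays from the start (hypothetical input), np.concatenate never has to box per-element Python objects:

import numpy as np

# Hypothetical: the same data stored as NumPy string arrays from the start
arrays = [np.array(['a', 'b']), np.array(['c', 'd', 'e'])] * 1000
flat = np.concatenate(arrays)   # stays inside NumPy end to end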
Jibe answered 28/1, 2016 at 7:42 Comment(1)
If the input were given in NumPy, this would have been the faster one, correct? – Somnifacient

You can try your luck with business["categories"].str.join(''), but I am guessing that pandas uses Python's string functions. I doubt you can do better than what Python already offers you.
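
Note that Series.str.join() works element-wise: it joins the items inside each list, it does not merge the rows (a quick sketch):

import pandas as pd

s = pd.Series([['a', 'b'], ['c', 'd', 'e']])
s.str.join('')
# 0     ab
# 1    cde
# dtype: object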

Rahal answered 28/1, 2016 at 8:23 Comment(0)

I used "".join(business["categories"])

It is much faster than business["categories"].str.join('') but still about four times slower than the itertools.chain method. I preferred it because it is more readable and requires no import.
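
For reference, a sketch of when this applies: "".join() concatenates string elements, so it assumes the column holds strings (e.g. pre-joined category text), not Python lists, and raises TypeError otherwise:

import pandas as pd

s = pd.Series(['Doctors, Health & Medical', 'Nightlife'])
merged = "".join(s)   # one long string; TypeError if the elements are lists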

Reactivate answered 26/3, 2020 at 10:55 Comment(0)

If you have None values, you can filter them out first:

from itertools import chain

# Drop None items inside each list, then flatten
my_series = my_df.apply(lambda x: [j for j in x if j is not None])
list(chain.from_iterable(my_series.values))
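
An equivalent single pass that filters and flattens at once (a sketch; s is a hypothetical name for the unfiltered Series of lists):

flat = [j for lst in s for j in lst if j is not None]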
Immingle answered 14/11, 2023 at 14:55 Comment(0)
