Averaging Data in Bins
I have two lists: one is a depth list and the other is a chlorophyll list, and they correspond element by element. I want to average the chlorophyll data in 0.5 m depth bins.

chl  = [0.4,0.1,0.04,0.05,0.4,0.2,0.6,0.09,0.23,0.43,0.65,0.22,0.12,0.2,0.33]
depth = [0.1,0.3,0.31,0.44,0.49,1.1,1.145,1.33,1.49,1.53,1.67,1.79,1.87,2.1,2.3]

The depth bins are not always equal in length and do not always start on 0.0 or 0.5 boundaries, but each chlorophyll value always corresponds to a depth value. The chlorophyll averages also cannot be rearranged into ascending order; they need to stay in the correct order according to depth. The depth and chlorophyll lists are very long, so I can't do this by hand.

How would I make 0.5 m depth bins with averaged chlorophyll data in them?

Goal:

depth = [0.5,1.0,1.5,2.0,2.5]
chlorophyll = [avg1,avg2,avg3,avg4,avg5]

For example:

avg1 = np.mean([0.4, 0.1, 0.04, 0.05, 0.4])
Sutphin answered 15/4, 2018 at 16:54 Comment(6)
Would you like to use pandas? – Kinsler
Is depth = [0.5,1.0,1.5,2.0,2.5] given, or is it to be computed? – Irtysh
Depth can be made with linspace. And yeah, I could use pandas. – Sutphin
Only looking for numpy/pandas solutions, or "normal" Python as well? – Cenotaph
Looking for a numpy solution. – Sutphin
@Sutphin You say the depth and chlorophyll lists are very long, so performance might be of interest. Can you time the different approaches posted so far on the actual data? Given that NumPy, pandas, and scipy based solutions have been posted, it would be interesting to see how they stack up. – Irtysh

One way is to use numpy.digitize to bin your categories.

Then use a dictionary or list comprehension to calculate results.

import numpy as np

chl  = np.array([0.4,0.1,0.04,0.05,0.4,0.2,0.6,0.09,0.23,0.43,0.65,0.22,0.12,0.2,0.33])
depth = np.array([0.1,0.3,0.31,0.44,0.49,1.1,1.145,1.33,1.49,1.53,1.67,1.79,1.87,2.1,2.3])

bins = np.array([0,0.5,1.0,1.5,2.0,2.5])

A = np.vstack((np.digitize(depth, bins), chl)).T

res = {bins[int(i)]: np.mean(A[A[:, 0] == i, 1]) for i in np.unique(A[:, 0])}

# {0.5: 0.198, 1.5: 0.28, 2.0: 0.355, 2.5: 0.265}

Or for the precise format you are after:

res_lst = [np.mean(A[A[:, 0] == i, 1]) for i in range(len(bins))]

# [nan, 0.198, nan, 0.28, 0.355, 0.265]
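The list comprehension above also emits a leading NaN for digitize's index 0, which is reserved for values below the first edge. A small variant of the same idea (a sketch, not part of the original answer) yields exactly one average per bin, guarding empty bins explicitly so no runtime warning is raised:

```python
import numpy as np

chl = np.array([0.4, 0.1, 0.04, 0.05, 0.4, 0.2, 0.6, 0.09, 0.23,
                0.43, 0.65, 0.22, 0.12, 0.2, 0.33])
depth = np.array([0.1, 0.3, 0.31, 0.44, 0.49, 1.1, 1.145, 1.33, 1.49,
                  1.53, 1.67, 1.79, 1.87, 2.1, 2.3])
bins = np.array([0, 0.5, 1.0, 1.5, 2.0, 2.5])

idx = np.digitize(depth, bins)  # bin index 1..5 for each sample
# One entry per bin, NaN where a bin is empty, depth order preserved
res = [chl[idx == i].mean() if np.any(idx == i) else np.nan
       for i in range(1, len(bins))]
# [0.198, nan, 0.28, 0.355, 0.265]
```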
Adam answered 15/4, 2018 at 17:9 Comment(1)
The one change I made was bins = np.arange(0.0, 50.0, 0.5), because it gives me more control, but otherwise this worked well. – Sutphin

I'm surprised that scipy.stats.binned_statistic hasn't been mentioned yet. You can calculate the mean directly with it, and specify the bins with optional parameters.

from scipy.stats import binned_statistic

mean_stat = binned_statistic(depth, chl, 
                             statistic='mean', 
                             bins=5, 
                             range=(0, 2.5))

mean_stat.statistic
# array([0.198,   nan, 0.28 , 0.355, 0.265])
mean_stat.bin_edges
# array([0. , 0.5, 1. , 1.5, 2. , 2.5])
mean_stat.binnumber
# array([1, 1, 1, ..., 4, 5, 5])
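The statistic parameter also accepts any callable, so a NaN-aware reduction is possible by passing np.nanmean directly. A sketch (not in the original answer) using the question's arrays, with one chlorophyll value replaced by NaN for illustration:

```python
import numpy as np
from scipy.stats import binned_statistic

depth = np.array([0.1, 0.3, 0.31, 0.44, 0.49, 1.1, 1.145, 1.33, 1.49,
                  1.53, 1.67, 1.79, 1.87, 2.1, 2.3])
# chl with one NaN injected to show the effect
chl = np.array([0.4, 0.1, 0.04, np.nan, 0.4, 0.2, 0.6, 0.09, 0.23,
                0.43, 0.65, 0.22, 0.12, 0.2, 0.33])

# np.nanmean skips the NaN inside its bin instead of poisoning the average
nan_stat = binned_statistic(depth, chl, statistic=np.nanmean,
                            bins=5, range=(0, 2.5))
# first bin: mean of [0.4, 0.1, 0.04, 0.4] = 0.235; empty bins stay NaN
```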
Mercorr answered 15/4, 2018 at 17:27 Comment(1)
I find this answer very useful. Do you know if there's a way to compute the statistic while ignoring NaN values, as with np.nanmean but using binned_statistic? – Applesauce

Here's a vectorized NumPy solution using np.searchsorted for getting the bin shifts (indices) and np.add.reduceat for the binned summations -

import numpy as np

def bin_data(chl, depth, bin_start=0, bin_length=0.5):
    # depth is assumed sorted in ascending order (required by searchsorted)
    # Get number of intervals and hence the bin-length-spaced depth array
    n = int(np.ceil(depth[-1]/bin_length))
    depthl = np.linspace(start=bin_start, stop=bin_length*n, num=n+1)

    # Indices along depth array where the intervaled array would have bin shifts
    idx = np.searchsorted(depth, depthl)

    # Number of elements in each bin (bin-lengths)
    lens = np.diff(idx)

    # Get summations for each bins & divide by bin lengths for binned avg o/p
    # For bins with lengths==0, set them as some invalid specifier, say NaN
    return np.where(lens==0, np.nan, np.add.reduceat(chl, idx[:-1])/lens)

Sample run -

In [83]: chl
Out[83]: 
array([0.4 , 0.1 , 0.04, 0.05, 0.4 , 0.2 , 0.6 , 0.09, 0.23, 0.43, 0.65,
       0.22, 0.12, 0.2 , 0.33])

In [84]: depth
Out[84]: 
array([0.1  , 0.3  , 0.31 , 0.44 , 0.49 , 1.1  , 1.145, 1.33 , 1.49 ,
       1.53 , 1.67 , 1.79 , 1.87 , 2.1  , 2.3  ])

In [85]: bin_data(chl, depth, bin_start=0, bin_length= 0.5)
Out[85]: array([0.198,   nan, 0.28 , 0.355, 0.265])
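For comparison, the same binned averages can also be had from np.bincount (an alternative sketch, not part of the answer above): sum the values per bin, count the samples per bin, and divide, letting 0/0 mark the empty bins as NaN:

```python
import numpy as np

chl = np.array([0.4, 0.1, 0.04, 0.05, 0.4, 0.2, 0.6, 0.09, 0.23,
                0.43, 0.65, 0.22, 0.12, 0.2, 0.33])
depth = np.array([0.1, 0.3, 0.31, 0.44, 0.49, 1.1, 1.145, 1.33, 1.49,
                  1.53, 1.67, 1.79, 1.87, 2.1, 2.3])

edges = np.arange(0, 3.0, 0.5)        # [0, 0.5, 1.0, 1.5, 2.0, 2.5]
idx = np.digitize(depth, edges)       # bin index 1..5 per sample
sums = np.bincount(idx, weights=chl, minlength=len(edges))[1:]
counts = np.bincount(idx, minlength=len(edges))[1:]
with np.errstate(invalid='ignore'):   # silence the 0/0 warning
    avgs = sums / counts              # empty bins come out as NaN
# array([0.198, nan, 0.28, 0.355, 0.265])
```

This variant avoids any Python-level loop over bins, which may matter for the very long lists the question mentions.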
Irtysh answered 15/4, 2018 at 17:26 Comment(0)

Here is one way using pandas.cut with groupby:

import pandas as pd

df = pd.DataFrame({'chl': chl, 'depth': depth})
df.groupby(pd.cut(df.depth, bins=[0, 0.5, 1, 1.5, 2, 2.5])).chl.mean()
Out[456]: 
depth
(0.0, 0.5]    0.198
(0.5, 1.0]      NaN
(1.0, 1.5]    0.280
(1.5, 2.0]    0.355
(2.0, 2.5]    0.265
Name: chl, dtype: float64
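To recover the two flat lists from the question's goal, the interval index can be unpacked afterwards. A sketch (the names depth_edges and chl_means are introduced here for illustration; observed=False is passed explicitly so the empty bin is kept as a NaN row):

```python
import numpy as np
import pandas as pd

chl = [0.4, 0.1, 0.04, 0.05, 0.4, 0.2, 0.6, 0.09, 0.23,
       0.43, 0.65, 0.22, 0.12, 0.2, 0.33]
depth = [0.1, 0.3, 0.31, 0.44, 0.49, 1.1, 1.145, 1.33, 1.49,
         1.53, 1.67, 1.79, 1.87, 2.1, 2.3]

df = pd.DataFrame({'chl': chl, 'depth': depth})
out = df.groupby(pd.cut(df.depth, bins=[0, 0.5, 1, 1.5, 2, 2.5]),
                 observed=False).chl.mean()

# Upper edge of each interval and the matching averages, in depth order
depth_edges = [iv.right for iv in out.index]  # [0.5, 1.0, 1.5, 2.0, 2.5]
chl_means = out.tolist()
```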
Kinsler answered 15/4, 2018 at 17:12 Comment(0)