NumPy or SciPy to calculate weighted median

I'm trying to automate a process that JMP does (Analyze->Distribution, entering column A as the "Y value", using subsequent columns as the "weight" value). In JMP you have to do this one column at a time - I'd like to use Python to loop through all of the columns and create an array showing, say, the weighted median of each column.

For example, if the mass array is [0, 10, 20, 30], and the weight array for column 1 is [30, 191, 9, 0], the weighted median of the mass array should be 10. However, I'm not sure how to arrive at this answer.

So far I've

  1. imported the csv showing the weights as an array, masking values of 0, and
  2. created an array of the "Y value" the same shape and size as the weights array (113x32). I'm not entirely sure I need to do this, but thought it would be easier than a for loop for the purpose of weighting.

I'm not sure exactly where to go from here. Basically the "Y value" is a range of masses, and all of the columns in the array represent the number of data points found for each mass. I need to find the median mass, based on the frequency with which they were reported.

I'm not an expert in Python or statistics, so if I've omitted any details that would be useful let me know!

Update: here's some code for what I've done so far:

# Boilerplate & import files
import numpy as np
from numpy import genfromtxt

inputFile = '/Users/cl/prov.csv'
nArray = genfromtxt(inputFile, delimiter=",")  # genfromtxt already returns an ndarray
fieldLength, rowLength = nArray.shape

# Mask values == 0
maTest = np.ma.masked_equal(nArray, 0)

# Create an array of masses the same shape as the weights (nArray):
# each column runs 0, 10, 20, ... down the rows
massArr = []
for i in range(rowLength):
    massArr.append(np.arange(0, fieldLength * 10, 10))
nmassArr = np.array(massArr).transpose()
Woolpack answered 16/12, 2013 at 0:52 Comment(1)
Some example input/output data would be helpful; also try to show the code for how far you've come so far. – Barbate

Since this is the top hit on Google for weighted median in NumPy, I will add my minimal function to select the weighted median from two arrays without changing their contents, and with no assumptions about the order of the values (on the off-chance that anyone else comes here looking for a quick recipe for the same exact pre-conditions).

import numpy as np

def weighted_median(values, weights):
    # Sort by value, then walk the cumulative weights to the 50% point.
    i = np.argsort(values)
    c = np.cumsum(weights[i])
    return values[i[np.searchsorted(c, 0.5 * c[-1])]]

Using argsort lets us maintain the alignment between the two arrays without changing or copying their contents. It should be straightforward to extend this to an arbitrary number of arbitrary quantiles.
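
For instance, applying it to the mass/weight example from the question gives the expected result:

>>> weighted_median(np.array([0, 10, 20, 30]), np.array([30, 191, 9, 0]))
10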

Update

Since it may not be fully obvious at first blush exactly how easy it is to extend to arbitrary quantiles, here is the code:

def weighted_quantiles(values, weights, quantiles=0.5):
    i = np.argsort(values)
    c = np.cumsum(weights[i])
    return values[i[np.searchsorted(c, np.array(quantiles) * c[-1])]]

This defaults to the median, but you can pass in any quantile, or a list of quantiles. The return type is equivalent to what you pass in as quantiles, with lists promoted to NumPy arrays. With enough uniformly distributed values, you can indeed recover the requested quantiles quite accurately:

>>> weighted_quantiles(np.random.rand(10000), np.random.rand(10000), [0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99])
array([0.01235101, 0.05341077, 0.25355715, 0.50678338, 0.75697424, 0.94962936, 0.98980785])
>>> weighted_quantiles(np.random.rand(10000), np.random.rand(10000), 0.5)
0.5036283072043176
>>> weighted_quantiles(np.random.rand(10000), np.random.rand(10000), [0.5])
array([0.49851076])

Update 2

In small data sets where the median/quantile is not actually observed, it may be important to be able to interpolate a point between two observations. This can be fairly easily added by calculating the midpoint between two values in the case where the weight mass is divided equally (or quantile/(1-quantile)) between them. Due to the need for a conditional, this function always returns a NumPy array, even when quantiles is a single scalar. The inputs also need to be NumPy arrays now (except quantiles, which may still be a single number).

def weighted_quantiles_interpolate(values, weights, quantiles=0.5):
    i = np.argsort(values)
    c = np.cumsum(weights[i])
    # Index of the first observation whose cumulative weight reaches the quantile.
    q = np.searchsorted(c, quantiles * c[-1])
    # If we land exactly on the quantile boundary, average the two neighbours.
    return np.where(c[q] / c[-1] == quantiles,
                    0.5 * (values[i[q]] + values[i[q + 1]]),
                    values[i[q]])

This function will fail on arrays with fewer than 2 elements (the original handled any non-empty array).

>>> weighted_quantiles_interpolate(np.array([2, 1]), np.array([1, 1]), 0.5)
array(1.5)

Note that this extension is fairly unlikely to be needed when working with actual data sets, where we typically have (a) large samples, and (b) real-valued weights that make the odds of landing exactly on a quantile edge very long (and when it does happen, it is probably due to rounding errors). Including it for completeness nonetheless.

Sarraceniaceous answered 30/9, 2022 at 7:49 Comment(7)
It is probably OK for large samples, but not quite accurate for small samples: weighted_median([1, 2, 3, 4], [1, 1, 1, 1]) == 2 instead of the correct value 2.5. Looking at the wquantiles module provided above, to get a centered value you need to: 1) use np.interp instead of np.searchsorted, and 2) subtract half of each weight from the cumulative weights. – Forehand
But to be fair, as I was trying to reproduce the results of this paper, your function worked like a charm (suggesting that, correct or not -- and probably correct -- the authors of that scientific paper used a similar formulation to yours). – Forehand
PS: they cited mitpress.mit.edu/9780262046305/introduction-to-algorithms as a reference. – Forehand
@Mahé That is correct, there is no interpolation in the case where the median is not observed. I'll have a think about whether it would still be a minimal solution with e.g. linear interpolation, or if that's the point where you're better off adding a dependency on a third-party library. I'll update the answer to make this precondition explicit, though. Thanks! – Sarraceniaceous
@Mahé It was actually a fairly small update, so enjoy the minor extension :-) – Sarraceniaceous
Sounds good. I also posted my own, for completeness, as a synthesis between your original approach and @muzzle's wquantiles function. I find it more convenient to have one function, but that is a matter of preference. I recommend the use of np.interp, which is clearer and faster (I assume np.where eats up the time). – Forehand
@Mahé Awesome, the more the merrier! :-) Running timeit with larger datasets (10k random numbers) and more quantiles gives a slight advantage to my implementation (on my computer), which just goes to show how important it is to profile your own specific use case! – Sarraceniaceous

If I understood your problem correctly, what we can do is sum up the observations; dividing by 2 gives us the observation number corresponding to the median. From there, we need to figure out which observation this number corresponds to.

One trick here is to calculate the observation sums with np.cumsum, which gives us a running cumulative sum.

Example:

np.cumsum([1,2,3,4]) -> [ 1, 3, 6, 10]

Each element is the sum of all previous elements and itself. We have 10 observations here, so the median is the 5th observation. (We get 5 by dividing the last element by 2.) Looking at the cumsum result, we can easily see that it must be the observation between the second and third elements (cumulative counts 3 and 6).

So all we need to do is figure out the index where the median (5) would fit.
np.searchsorted does exactly what we need: it finds the index at which to insert an element into an array so that it stays sorted.
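
A quick check with the cumulative sums from above confirms this:

>>> import numpy as np
>>> np.searchsorted(np.cumsum([1, 2, 3, 4]), 5)
2

Index 2 is the third element, i.e. the bin containing the median.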

The full code to do it looks like this:

import numpy as np

# my test data
freq_count = np.array([[30, 191, 9, 0],
                       [10, 20, 300, 10],
                       [10, 20, 30, 40],
                       [100, 10, 10, 10],
                       [1, 1, 1, 100]])

c = np.cumsum(freq_count, axis=1)
indices = [np.searchsorted(row, row[-1] / 2.0) for row in c]
masses = [i * 10 for i in indices]  # correct if the masses are indeed 0, 10, 20, ...

# This is just for explanation.
print("median masses is:", masses)
print(freq_count)
print(np.hstack((c, c[:, -1, np.newaxis] / 2.0)))

Output will be:

median masses is: [10, 20, 20, 0, 30]
[[ 30 191   9   0]  <- The test data
 [ 10  20 300  10]  
 [ 10  20  30  40]  
 [100  10  10  10]  
 [  1   1   1 100]]  
[[  30.   221.   230.   230.   115. ]  <- cumsum results with median added to the end.
 [  10.    30.   330.   340.   170. ]     you can see from this where they fit in.
 [  10.    30.    60.   100.    50. ]  
 [ 100.   110.   120.   130.    65. ]  
 [   1.     2.     3.   103.    51.5]]  
Barbate answered 16/12, 2013 at 2:1 Comment(5)
Thanks so much for your explanation! I'm getting close but am not quite there yet. I don't think I articulated my problem quite right - basically, the median should always be a number within the range of masses - the frequencies of [30, 191, 9, 0] correspond with masses [0, 10, 20, 30], respectively (i.e. mass in range 0-10 showed up 30 times, mass of 10-20 showed up 191 times, etc.). With your answer above it looks like I'm getting the median of the frequency count instead, right? – Woolpack
Yes, it finds the median of the frequency count, and then relates that to the masses, using the fact that the ranges of masses are directly related to the elements of the frequency count. Do you need it to find the true median, or the range that contains the median? This will find the range containing the median. – Barbate
Could you try to either give more examples of inputs and outputs, or check the "test data" I used and say what the output should be for them? – Barbate
Ideally I'd find the true median, but the range would also be fine. Using your test data, I found medians of [20, 25, 25, 25, 25], respectively. Here's some actual data [30, 191, 30, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 99, 256, 254, 82, 5, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 65, 205, 189, 249, 120, 72, 40, 2, 0, 0, 0], [0, 0, 0, 0, 0, 1, 59, 192, 324, 204, 188, 127, 104, 29]. These correspond with masses from 0-130, counting by 10s. The medians using JMP: [10, 30, 65, 90]. – Woolpack
The medians using your edits are [125.5, 348, 471, and 614]. This looks like it's getting there - they're getting consecutively larger, which follows the same pattern as JMP. I'll tinker around with it to see if there's a small tweak that will get it the rest of the way, but would appreciate any more input you've got! At a glance it may be something with the indices formula - instead of 0-130 by 10's, I'm getting 0, 10, 50, 80 as the output (after modifying it to (i-1)*10 to start at 0). – Woolpack

wquantiles is a small Python package that does exactly what you need. It just uses np.cumsum() and np.interp() under the hood.
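
For illustration, here is a minimal usage sketch on the question's example data (the weighted.median call follows the package's documented interface; verify the exact names against the wquantiles README):

import numpy as np
import weighted  # pip install wquantiles

masses = np.array([0, 10, 20, 30])
counts = np.array([30, 191, 9, 0])

# interpolated weighted median of the masses, weighted by frequency
print(weighted.median(masses, counts))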

Fabrianna answered 18/3, 2020 at 12:42 Comment(1)
I've had a proper implementation of weighted introselect on my back burner for a couple of years now :( – Ereshkigal

I ended up writing this function based on @muzzle's and @maesers' replies:

import numpy as np

def weighted_quantiles(values, weights, quantiles=0.5, interpolate=False):
    # Sort the values and reorder the weights to match.
    i = values.argsort()
    sorted_weights = weights[i]
    sorted_values = values[i]
    Sn = sorted_weights.cumsum()

    if interpolate:
        # Centre each observation within its weight mass, then interpolate.
        Pn = (Sn - sorted_weights / 2) / Sn[-1]
        return np.interp(quantiles, Pn, sorted_values)
    else:
        return sorted_values[np.searchsorted(Sn, quantiles * Sn[-1])]

The difference between interpolate=True and interpolate=False is as follows:

weighted_quantiles(np.array([1, 2, 3, 4]), np.ones(4))
> 2 
weighted_quantiles(np.array([1, 2, 3, 4]), np.ones(4), interpolate=True)
> 2.5

(there is no difference for odd-length arrays such as [1, 2, 3, 4, 5])

Speed tests show it is just as performant as @maesers' function in the uninterpolated case, and it is twice as performant in the interpolated case.


Vitelline answered 2/2, 2023 at 9:59 Comment(0)

Sharing some code that I got some help with. This allows you to run stats on each column of an Excel spreadsheet.

import xlrd
import itertools
import numpy as np

# note: xlrd >= 2.0 dropped .xlsx support; for newer installs use xlrd < 2
# or switch to openpyxl for .xlsx files
book = xlrd.open_workbook('/filepath/workbook.xlsx')
sh = book.sheet_by_name("Sheet1")
ofile = '/outputfilepath/workbook.csv'

masses = sh.col_values(0, start_rowx=1)  # first column has mass
ages = sh.row_values(0, start_colx=1)    # first row has age ranges

# collect one frequency column per age range
age = []
count = 1
for a in ages:
    age.append(sh.col_values(count, start_rowx=1))
    count += 1

stats = []
count = 0
for a in ages:
    # pair each mass with its frequency for this age range
    age_mass = zip(masses, age[count])
    count += 1

    # replicate each mass by its frequency
    expanded = list(list(itertools.repeat(am[0], int(am[1]))) for am in age_mass)

    # flatten into one big list
    medianlist = [x for t in expanded for x in t]

    # convert to array and mask out zeroes
    npa = np.array(medianlist)
    npa = np.ma.masked_equal(npa, 0)

    # use np.ma.median so the mask is respected (np.median ignores it)
    median = np.ma.median(npa)
    meanMass = np.average(npa)
    maxMass = np.max(npa)
    minMass = np.min(npa)
    stdev = np.std(npa)

    stats1 = [median, meanMass, maxMass, minMass, stdev]
    print(stats1)

    stats.append(stats1)

np.savetxt(ofile, stats, fmt="%d")
Woolpack answered 24/3, 2014 at 9:11 Comment(0)
