I have a census frequency distribution and want to calculate the median, please.
import pandas as pd
import math
import numpy as np
geo_code              1     2     3     4     5     6     7
0                   815  1026   735  1344   569  2688   741
1228801 - 2457600   305   104    74   177    84    10    40
153601 - 307200    2028  2330  2341  1720  1757   585  1695
19201 - 38400       408   642   505  2002   377  2495   747
1 - 4800             28    38    31   288    54   553    51
2500000             129    67    81    85    69    10    43
307201 - 614400    2044  1903  1775  1611  1833   262  1272
38401 - 76800       613  1202   944  1706   729  1499   862
4801 - 9600          52    56    60   328    43   848    92
614401- 1228800    1254   627   528   773   702    58   229
76801 - 153600     1305  1943  1741  1516  1264   771  1132
9601 - 19200        167   401   237  1048   248  1762   425
00                    2     1     0     1     0     0     0
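For reproducibility, a frame shaped like the one above could be built roughly as follows. This is only a sketch: it uses four of the classes and three of the geo_code columns from the table, and the construction itself is an assumption, since the original loading code is not shown.

```python
import pandas as pd

# Four classes and three geo_code columns copied from the table above.
# The index labels are kept as strings so that .str.split() works on them.
df = pd.DataFrame(
    {
        1: [815, 305, 2028, 129],
        2: [1026, 104, 2330, 67],
        3: [735, 74, 2341, 81],
    },
    index=["0", "1228801 - 2457600", "153601 - 307200", "2500000"],
)
df.columns.name = "geo_code"
print(df)
```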
df['new'] = df.index
# The first part of each label is the lower class bound, the second the upper
df[['Lower', 'Upper']] = df['new'].str.split('-', expand=True)
# Labels without a '-' (e.g. '0') have no second part; treat it as 0
df['Upper'] = df['Upper'].fillna(0)
# Class midpoint
df['Xi'] = (df['Lower'].astype(float) + df['Upper'].astype(float)) / 2
print(df.head(3))
geo_code              1     2     3     4     5     6     7                new    Lower    Upper         Xi
0                   815  1026   735  1344   569  2688   741                  0        0        0        0.0
1228801 - 2457600   305   104    74   177    84    10    40  1228801 - 2457600  1228801  2457600  1843200.5
153601 - 307200    2028  2330  2341  1720  1757   585  1695    153601 - 307200   153601   307200   230400.5
Now the function to calculate the median would be:
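def median_(val, freq):
    # Work on plain NumPy arrays so that all indexing is positional
    val, freq = np.asarray(val), np.asarray(freq)
    order = np.argsort(val)        # positions that sort the class midpoints
    cdf = np.cumsum(freq[order])   # cumulative frequency in sorted order
    # Midpoint of the class that contains the middle observation
    return val[order][np.searchsorted(cdf, cdf[-1] // 2)]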
where val is df.Xi and freq is one of the geo_code columns (1 through 116, culled here for a minimal working example).
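To make the mechanics concrete, here is a small walk-through of the same steps on four classes from the table, using the geo_code 1 frequencies. The midpoints are typed in by hand, and the names `half` and `pos` are introduced here only for illustration. Note that this approach returns the midpoint of the median class rather than an interpolated value.

```python
import numpy as np

# Midpoints and geo_code 1 frequencies for the classes
# 19201 - 38400, 38401 - 76800, 76801 - 153600 and 9601 - 19200,
# in the same (unsorted) order as the index above
val = np.array([28800.5, 57600.5, 115200.5, 14400.5])
freq = np.array([408, 613, 1305, 167])

order = np.argsort(val)            # positions that sort the midpoints
cdf = np.cumsum(freq[order])       # running totals: 167, 575, 1188, 2493
half = cdf[-1] // 2                # rank of the middle observation: 1246
pos = np.searchsorted(cdf, half)   # first class whose running total reaches it
median = val[order][pos]           # midpoint of that class: 115200.5
print(cdf, half, median)
```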
How do I pass this to df.apply() so that the result becomes a new row? Possibly something like df.loc['median'] = df.apply(...), with each median under its respective column?
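One way this might be wired up, offered as a sketch rather than a definitive answer: apply median_ to the frequency columns only, passing df['Xi'] as val, and assign the resulting Series to df.loc['median']. The small frame below uses four classes and two geo_code columns from the table, with midpoints typed in by hand; `freq_cols` is a name introduced here for illustration, and in the full frame the helper columns (new, Upper, Lower, Xi) would all need to be excluded in the same way.

```python
import numpy as np
import pandas as pd

def median_(val, freq):
    # Positional indexing on plain arrays
    val, freq = np.asarray(val, dtype=float), np.asarray(freq)
    order = np.argsort(val)
    cdf = np.cumsum(freq[order])
    return val[order][np.searchsorted(cdf, cdf[-1] // 2)]

# Four classes and two geo_code columns copied from the table
df = pd.DataFrame(
    {1: [408, 613, 1305, 167], 6: [2495, 1499, 771, 1762]},
    index=["19201 - 38400", "38401 - 76800", "76801 - 153600", "9601 - 19200"],
)
df["Xi"] = [28800.5, 57600.5, 115200.5, 14400.5]  # class midpoints

# Apply column by column to the frequency columns only
freq_cols = df.columns.drop("Xi")
medians = df[freq_cols].apply(lambda col: median_(df["Xi"], col))

# Setting a new label with .loc appends a row; columns missing from
# `medians` (here Xi) are filled with NaN
df.loc["median"] = medians
print(df)
```

Because the new row holds floats, the frequency columns are upcast to float; keeping the medians in a separate Series avoids that if integer counts matter.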