Median from frequency table with `df.apply()`
Asked Answered
M

1

0

I have a census frequency distribution and want to calculate the median please.

import pandas as pd
import math
import numpy as np

geo_code                1     2     3     4     5     6     7
0                     815  1026   735  1344   569  2688   741
 1228801 -  2457600   305   104    74   177    84    10    40
 153601 -  307200    2028  2330  2341  1720  1757   585  1695
 19201 -  38400       408   642   505  2002   377  2495   747
 1 -  4800             28    38    31   288    54   553    51
2500000               129    67    81    85    69    10    43
 307201 -  614400    2044  1903  1775  1611  1833   262  1272
 38401 -   76800      613  1202   944  1706   729  1499   862
 4801 -  9600          52    56    60   328    43   848    92
 614401-  1228800    1254   627   528   773   702    58   229
 76801 -  153600     1305  1943  1741  1516  1264   771  1132
 9601 -  19200        167   401   237  1048   248  1762   425
00                      2     1     0     1     0     0     0
df['new'] = df.index
df[['Upper', 'Lower']] = df['new'].str.split('-', expand=True)
df["Lower"] = df["Lower"].fillna(0)
df['Xi'] = (df['Upper'].astype(float) + df['Lower'].astype(float))/2

print(df.head(2))
geo_code                1     2     3     4     5     6     7            new          Upper          Lower              Xi  
0                     815  1026   735  1344   569  2688   741              0          0                0                0.0
 1228801 -  2457600   305   104    74   177    84    10    40  1228801 -  2457600     1228801     2457600         1843200.5 
 153601 -  307200    2028  2330  2341  1720  1757   585  1695   153601 -  307200       153601      307200          230400.5  

Now the function to calculate the median would be:

def median_(val, freq):
    ord = np.argsort(val)
    cdf = np.cumsum(freq[ord])
    return val[ord][np.searchsorted(cdf, cdf[-1] // 2)]

where val is df.Xi and freq is a column (1 through 116. culled here for minimum working example)

How do I parse this to df.apply() so that the result would be a new row? Possibly something like: df.loc['median'] = df.apply(... ) with each median under its respective column?

Meade answered 12/9, 2023 at 19:44 Comment(0)
H
0

You can count mean of splitted values in index, so possible count new row by all columns:

s = df.index.to_series().str.split('\s*-\s*', expand=True).astype(float).mean(axis=1)

def median_(val, freq):
    ord = np.argsort(val)
    cdf = np.cumsum(freq[ord])
    return val[ord][np.searchsorted(cdf, cdf[-1] // 2)]

df.loc['median'] = df.apply(lambda col: median_(s, col))

print (df)
                         1         2         3        4         5        6  \
geo_code                                                                      
0                     815.0    1026.0     735.0   1344.0     569.0   2688.0   
1228801 - 2457600     305.0     104.0      74.0    177.0      84.0     10.0   
153601 - 307200      2028.0    2330.0    2341.0   1720.0    1757.0    585.0   
19201 - 38400         408.0     642.0     505.0   2002.0     377.0   2495.0   
1 - 4800               28.0      38.0      31.0    288.0      54.0    553.0   
2500000               129.0      67.0      81.0     85.0      69.0     10.0   
307201 - 614400      2044.0    1903.0    1775.0   1611.0    1833.0    262.0   
38401 - 76800         613.0    1202.0     944.0   1706.0     729.0   1499.0   
4801 - 9600            52.0      56.0      60.0    328.0      43.0    848.0   
614401- 1228800      1254.0     627.0     528.0    773.0     702.0     58.0   
76801 - 153600       1305.0    1943.0    1741.0   1516.0    1264.0    771.0   
9601 - 19200          167.0     401.0     237.0   1048.0     248.0   1762.0   
00                      2.0       1.0       0.0      1.0       0.0      0.0   
median             230400.5  115200.5  230400.5  57600.5  230400.5  14400.5   

                          7  
geo_code                     
0                     741.0  
1228801 - 2457600      40.0  
153601 - 307200      1695.0  
19201 - 38400         747.0  
1 - 4800               51.0  
2500000                43.0  
307201 - 614400      1272.0  
38401 - 76800         862.0  
4801 - 9600            92.0  
614401- 1228800       229.0  
76801 - 153600       1132.0  
9601 - 19200          425.0  
00                      0.0  
median             115200.5  

If need new column Xi:

df['Xi']=df.index.to_series().str.split('\s*-\s*', expand=True).astype(float).mean(axis=1)

def median_(val, freq):
    ord = np.argsort(val)
    cdf = np.cumsum(freq[ord])
    return val[ord][np.searchsorted(cdf, cdf[-1] // 2)]

df.loc['median'] = df.apply(lambda col: median_(df['Xi'], col))

print (df)
                          1         2         3        4         5        6  \
geo_code                                                                      
0                     815.0    1026.0     735.0   1344.0     569.0   2688.0   
1228801 - 2457600     305.0     104.0      74.0    177.0      84.0     10.0   
153601 - 307200      2028.0    2330.0    2341.0   1720.0    1757.0    585.0   
19201 - 38400         408.0     642.0     505.0   2002.0     377.0   2495.0   
1 - 4800               28.0      38.0      31.0    288.0      54.0    553.0   
2500000               129.0      67.0      81.0     85.0      69.0     10.0   
307201 - 614400      2044.0    1903.0    1775.0   1611.0    1833.0    262.0   
38401 - 76800         613.0    1202.0     944.0   1706.0     729.0   1499.0   
4801 - 9600            52.0      56.0      60.0    328.0      43.0    848.0   
614401- 1228800      1254.0     627.0     528.0    773.0     702.0     58.0   
76801 - 153600       1305.0    1943.0    1741.0   1516.0    1264.0    771.0   
9601 - 19200          167.0     401.0     237.0   1048.0     248.0   1762.0   
00                      2.0       1.0       0.0      1.0       0.0      0.0   
median             230400.5  115200.5  230400.5  57600.5  230400.5  14400.5   

                          7         Xi  
geo_code                                
0                     741.0        0.0  
1228801 - 2457600      40.0  1843200.5  
153601 - 307200      1695.0   230400.5  
19201 - 38400         747.0    28800.5  
1 - 4800               51.0     2400.5  
2500000                43.0  2500000.0  
307201 - 614400      1272.0   460800.5  
38401 - 76800         862.0    57600.5  
4801 - 9600            92.0     7200.5  
614401- 1228800       229.0   921600.5  
76801 - 153600       1132.0   115200.5  
9601 - 19200          425.0    14400.5  
00                      0.0        0.0  
median             115200.5  1843200.5  
Hazing answered 13/9, 2023 at 7:46 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.