I'm fairly new to Python and pandas (coming from SAS as my workhorse analytical platform), so I apologize in advance if this has already been asked and answered. (I've searched the documentation and this site for an answer and haven't found one yet.)
I've got a dataframe (called resp) containing respondent-level survey data. I want to compute some basic descriptive statistics on one of the fields (called anninc [short for annual income]).
resp["anninc"].describe()
Which gives me the basic stats:
count 76310.000000
mean 43455.874862
std 33154.848314
min 0.000000
25% 20140.000000
50% 34980.000000
75% 56710.000000
max 152884.330000
dtype: float64
But there's a catch. Because of how the sample was built, the respondent data had to be weight-adjusted so that not every respondent counts equally in the analysis. I have another column in the dataframe (called tufnwgrp) that holds the weight to apply to each record during the analysis.
In my prior SAS life, most procs have an option to process data with weights like this. For example, a standard proc univariate that gives the same unweighted results would look something like this:
proc univariate data=resp;
var anninc;
output out=resp_univars mean=mean median=50pct q1=25pct q3=75pct min=min max=max n=count;
run;
And the same analysis using weighted data would look something like this:
proc univariate data=resp;
var anninc;
weight tufnwgrp;
output out=resp_univars mean=mean median=50pct q1=25pct q3=75pct min=min max=max n=count;
run;
Is there a similar weighting option available in pandas for methods like describe()?
(df['anninc'] * df['tufnwgrp']).describe()
would do the trick. You may have to convert the dtypes at some point. – Eupatorium
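For comparison, here is a minimal sketch of what weighted descriptive statistics could look like with NumPy. As far as I know, describe() itself takes no weight argument, and multiplying anninc by tufnwgrp summarizes the products rather than a weighted income distribution. The function name weighted_describe, the sum_weights entry, and the demo data are all made up for illustration. The quantile rule used here (first value whose cumulative weight share reaches q) and the variance divisor (sum of weights) are assumptions and may not match SAS's defaults.

```python
import numpy as np
import pandas as pd


def weighted_describe(values, weights, quantiles=(0.25, 0.5, 0.75)):
    """Weighted analogue of Series.describe(); a hypothetical helper, not a pandas API."""
    v = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)

    # Drop records where either the value or the weight is missing.
    keep = ~(np.isnan(v) | np.isnan(w))
    v, w = v[keep], w[keep]

    # Weighted mean, and weighted variance divided by the sum of weights
    # (assumption: other tools may use a different divisor).
    mean = np.average(v, weights=w)
    var = np.average((v - mean) ** 2, weights=w)

    # Weighted quantiles: sort by value, accumulate the weight share, and
    # take the first value whose cumulative share reaches q.
    order = np.argsort(v)
    v_sorted, w_sorted = v[order], w[order]
    cum_share = np.cumsum(w_sorted) / w_sorted.sum()
    idx = np.searchsorted(cum_share, quantiles, side="left")
    idx = np.clip(idx, 0, len(v_sorted) - 1)  # guard against rounding at the top end

    stats = {
        "count": float(len(v)),
        "sum_weights": w.sum(),
        "mean": mean,
        "std": np.sqrt(var),
        "min": v.min(),
    }
    for q, i in zip(quantiles, idx):
        stats[f"{q:.0%}"] = v_sorted[i]
    stats["max"] = v.max()
    return pd.Series(stats)


# Made-up demo data reusing the column names from the question.
demo = pd.DataFrame({"anninc": [10.0, 20.0, 30.0], "tufnwgrp": [1.0, 1.0, 2.0]})
result = weighted_describe(demo["anninc"], demo["tufnwgrp"])
print(result)
```

On the real data this would be called as weighted_describe(resp["anninc"], resp["tufnwgrp"]). The statsmodels package also has a DescrStatsW class for weighted statistics, which may be worth checking before relying on a hand-rolled helper.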