How to do data analysis (like counts, ucounts, frequency) with pandas?

Asked 30/5, 2021 at 4:13 Answered 6/6, 2021 at 9:33

python pandas pandas-groupby data-analysis

I have DataFrame like below:

df = pd.DataFrame([
    ("i", 1, 'GlIrbixGsmCL'),
    ("i", 1, 'GlIrbixGsmCL'),
    ("i", 1, '3IMR1UteQA'),
    ("c", 1, 'GlIrbixGsmCL'),
    ("i", 2, 'GlIrbixGsmCL'),
], columns=['type', 'cid', 'userid'])

Expected output like:

For more details:

i_counts, c_counts      => df.groupby(["cid","type"]).size()
i_ucounts, c_ucounts    => df.groupby(["cid","type"])["userid"].nunique()
i_frequency,u_frequency => df.groupby(["cid","type"])["userid"].value_counts()

Looks it's a little complex for me, how to do with pandas to get the expected result?

The related screenshots:

Decamp answered 30/5, 2021 at 4:13 Comment(2)

Have a look at agg function – Semblable 30/5, 2021 at 4:15

I also tried code df.groupby(["cid","type"]).agg(counts=("userid", np.size), ucounts=("userid", "nunique")).reset_index() , but don't know how to do the next to get what I want as expect – Decamp 30/5, 2021 at 4:30

This is how I would approach this:

aggfuncs= {
    'counts': ('userid', 'count'), 
    'ucounts': ('userid', 'nunique'),
    'frequency': ('userid', lambda S: S.value_counts().to_dict()),
}

output = df.groupby(['cid', 'type']).agg(**aggfuncs).unstack()
output.columns = output.columns.map(lambda tup: '_'.join(tup[::-1]))

output:

     c_counts  i_counts  c_ucounts  i_ucounts          c_frequency                           i_frequency
cid
1         1.0       3.0        1.0        2.0  {'GlIrbixGsmCL': 1}  {'GlIrbixGsmCL': 2, '3IMR1UteQA': 1}
2         NaN       1.0        NaN        1.0                  NaN                   {'GlIrbixGsmCL': 1}

I think that is the core of what you want. You will need some cosmetic amendments to get the output exactly as in your example (e.g. fillna etc.).

Nedranedrah answered 6/6, 2021 at 9:33 Comment(0)

STEPS:

Extract the id_numbers from user_id and convert them to int type.
Use groupby and agg to evaluate count/ucount / `frequency.
use pivot to restructure the table.
flatten the columns and reset_index if required.

df['userid'] = df.userid.str.extract(r'(\d+)').astype(int)
k = df.groupby(["type", 'cid']).agg(count=('userid', 'count'), ucount=(
    'userid', 'nunique'), frequency=('userid', lambda x: x.value_counts().to_dict())).reset_index()
k = k.pivot(index=[k.index, 'cid'], columns='type').fillna(0)

OUTPUT:

      count      ucount      frequency              
type      c    i      c    i         c             i
  cid                                               
0 1     1.0  0.0    1.0  0.0    {1: 1}             0
1 1     0.0  3.0    0.0  2.0         0  {1: 2, 2: 1}
2 2     0.0  1.0    0.0  1.0         0        {1: 1}

Then to transform columns:

k.columns = k.columns.map(lambda x: '_'.join(x[::-1]))

OUTPUT:

       c_count  i_count  c_ucount  i_ucount c_frequency   i_frequency
  cid                                                                
0 1        1.0      0.0       1.0       0.0      {1: 1}             0
1 1        0.0      3.0       0.0       2.0           0  {1: 2, 2: 1}
2 2        0.0      1.0       0.0       1.0           0        {1: 1}

updated answer (according to your edited question):

k = df.groupby(["type" , 'cid']).agg(count = ('userid' ,'count') , ucount = ('userid', 'nunique') , frequency=('userid', lambda x: x.value_counts().to_dict())).reset_index()
k = k.pivot(index=['cid'], columns ='type').fillna(0)

OUTPUT:

    count   ucount  frequency
type    c   i   c   i   c   i
cid                     
1   1.0 3.0 1.0 2.0 {'userid001': 1}    {'userid001': 2, 'userid002': 1}
2   0.0 1.0 0.0 1.0 0   {'userid001': 1}

NOTE: use df.userid = df.userid.factorize()[0] to encode userid if required.

Sagesagebrush answered 30/5, 2021 at 4:56 Comment(6)

I tried your code, looks I got Error like with df.assign code => KeyError: "[('userid', '<lambda>')] not in index" – Decamp 30/5, 2021 at 5:10

But is this what you need? or not? @SilenceHe. – Sagesagebrush 30/5, 2021 at 5:12

By the way, the userid is somthing like random user cookie, so they are not regular string like {userid + number}, so maybe df.userid.str.split('userid').str[1].astype(int) is not suit for this scenario. – Decamp 30/5, 2021 at 5:36

yes, it's what I what, looks I have something wrong with the output result, will update the result for this question – Decamp 30/5, 2021 at 5:42

Made few more changes @SilenceHe – Sagesagebrush 30/5, 2021 at 5:49

Sorry for the wrong output result before, already updated the question and show correctly output, row index 0 and 1 should be merged to the 1 row but not 2 row, and the real userid string like "GlIrbixGsmCL" for cookie id used, which means it can't covert to int type – Decamp 30/5, 2021 at 5:57

STEPS:

OUTPUT:

updated answer (according to your edited question):

OUTPUT:

Recommended topics

Hot tags