Groupby with sum on Julia Dataframe
Asked Answered
O

2

8

I am trying to make a groupby + sum on a Julia Dataframe with Int and String values

For instance, df :

│ Row │ A      │ B      │ C     │ D      │
│     │ String │ String │ Int64 │ String │
├─────┼────────┼────────┼───────┼────────┤
│ 1   │ x1     │ a      │ 12    │ green  │
│ 2   │ x2     │ a      │ 7     │ blue   │
│ 3   │ x1     │ b      │ 5     │ red    │
│ 4   │ x2     │ a      │ 4     │ blue   │
│ 5   │ x1     │ b      │ 9     │ yellow │

To do this in Python, the command could be :

df_group = df.groupby(['A', 'B']).sum().reset_index()

I will obtain the following output result with the initial column labels :

    A  B   C
0  x1  a  12
1  x1  b  14
2  x2  a  11

I would like to do the same thing in Julia. I tried this way, unsuccessfully :

df_group = aggregate(df, ["A", "B"], sum)

MethodError: no method matching +(::String, ::String)

Have you any idea of a way to do this in Julia ?

Oodles answered 6/10, 2020 at 13:29 Comment(0)
M
7

Try (actually instead of non-string columns, probably you want columns that are numeric):

numcols = names(df, findall(x -> eltype(x) <: Number, eachcol(df)))
combine(groupby(df, ["A", "B"]), numcols .=> sum .=> numcols)

and if you want to allow missing values (and skip them when doing a summation) then:

numcols = names(df, findall(x -> eltype(x) <: Union{Missing,Number}, eachcol(df)))
combine(groupby(df, ["A", "B"]), numcols .=> sum∘skipmissing .=> numcols)
Meridethmeridian answered 6/10, 2020 at 14:12 Comment(2)
Thanks for the answer. In practice my real dataframe has many columns to sum (+100). So I would like to find a more generic way to do. Is this possible to make the sum on all non string columns ?Oodles
It is possible on master (so it will be available in 0.22 release). I will update the answer with the recommendation that can be used now.Lohengrin
R
5

Julia DataFrames support split-apply-combine logic, similar to pandas, so aggregation looks like

using DataFrames

df = DataFrame(:A => ["x1", "x2", "x1", "x2", "x1"], 
               :B => ["a", "a", "b", "a", "b"],
               :C => [12, 7, 5, 4, 9],
               :D => ["green", "blue", "red", "blue", "yellow"])

gdf = groupby(df, [:A, :B])
combine(gdf, :C => sum)

with the result

julia> combine(gdf, :C => sum)
3×3 DataFrame
│ Row │ A      │ B      │ C_sum │
│     │ String │ String │ Int64 │
├─────┼────────┼────────┼───────┤
│ 1   │ x1     │ a      │ 12    │
│ 2   │ x2     │ a      │ 11    │
│ 3   │ x1     │ b      │ 14    │

You can skip the creation of gdf with the help of Pipe.jl or Underscores.jl

using Underscores

@_ groupby(df, [:A, :B]) |> combine(__, :C => sum)

You can give name to the new column with the following syntax

julia> @_ groupby(df, [:A, :B]) |> combine(__, :C => sum => :C)
3×3 DataFrame
│ Row │ A      │ B      │ C     │
│     │ String │ String │ Int64 │
├─────┼────────┼────────┼───────┤
│ 1   │ x1     │ a      │ 12    │
│ 2   │ x2     │ a      │ 11    │
│ 3   │ x1     │ b      │ 14    │
Rotgut answered 6/10, 2020 at 14:13 Comment(2)
Thanks for your answer.Oodles
what if I want to do a groupby sum and count in the same line?Deadly

© 2022 - 2024 — McMap. All rights reserved.