Is there a diff() function in Julia DataFrames like pandas?
Asked Answered
U

2

8

I have a DataFrame in Julia and I want to create a new column that represents the difference between consecutive rows in a specific column. In python pandas, I would simply use df.series.diff(). Is there a Julia equivelant?

For example:

data
1
2
4
6
7

# in pandas

df['diff_data'] = df.data.diff()

data   diff_data
1        NaN 
2          1
4          2
6          2
7          1
Uzia answered 9/6, 2021 at 12:59 Comment(2)
diff(A; dims), and for a DataFrame, diff.(eachcol(df))Unscientific
@Unscientific I get this error: ``` MethodError: no method matching diff(::SentinelArrays.ChainedVector{Int64,Array{Int64,1}}, ::Int64) Closest candidates are: diff(::AbstractArray{T,1} where T) at multidimensional.jl:809 diff(::AbstractArray{T,N}; dims) where {T, N} at multidimensional.jl:841 ```Uzia
B
12

You can use ShiftedArrays.jl like this.

Declarative style:

julia> using DataFrames

julia> using ShiftedArrays: lag

julia> df = DataFrame(data=[1, 2, 4, 6, 7])
5×1 DataFrame
 Row │ data
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     4
   4 │     6
   5 │     7

julia> transform(df, :data => (x -> x - lag(x)) => :data_diff)
5×2 DataFrame
 Row │ data   data_diff
     │ Int64  Int64?
─────┼──────────────────
   1 │     1    missing
   2 │     2          1
   3 │     4          2
   4 │     6          2
   5 │     7          1

Imperative style (in place):

julia> df = DataFrame(data=[1, 2, 4, 6, 7])
5×1 DataFrame
 Row │ data
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     4
   4 │     6
   5 │     7

julia> df.data_diff = df.data - lag(df.data)
5-element Vector{Union{Missing, Int64}}:
  missing
 1
 2
 2
 1

julia> df
5×2 DataFrame
 Row │ data   data_diff
     │ Int64  Int64?
─────┼──────────────────
   1 │     1    missing
   2 │     2          1
   3 │     4          2
   4 │     6          2
   5 │     7          1

with diff you do not need extra packages and can do similarly the following:

julia> df.data_diff = [missing; diff(df.data)]
5-element Vector{Union{Missing, Int64}}:
  missing
 1
 2
 2
 1

(the issue is that diff is a general purpose function that does change the length of vector from n to n-1 so you have to add missing manually in front)

Brenneman answered 9/6, 2021 at 13:41 Comment(4)
Adding to this awesome answer by Bogumil, you may also be interested in the TimeSeries.jl package which has "dataframes" with a special column containing times. They have a lag function as well: juliastats.org/TimeSeries.jl/stable/apply/#lag-1Georgiannegeorgic
Indeed - time delta aware lagging is a common and very important requirement and it is not covered by my answer.Hesterhesther
I get an error that "lag" is not defined. Has it been removed from ShiftedArrays?Yardstick
I have updated the answer. It is defined but not exported.Hesterhesther
S
1

Pandas df.diff() does it to the whole data frame at once and allows you to specify row-wise or column-wise. There might be a better way but this is what I used before (I like chaining or piping like in dplyr):

# using chain.jl

@chain df begin
    eachcol()
    diff.()
    DataFrame(:auto)
    rename!(names(df))
end

# OR base pipe

df |>
    x -> eachcol(x) |>
    x -> diff.(x) |>
    x -> DataFrame(x, :auto) |>
    x -> rename!(x, names(df)[2:end])


# OR without piping 

rename!(DataFrame(diff.(eachcol(df)), :auto), names(df))

You might need to insert the starting row, which will now have missing values.

Sciurine answered 15/1, 2023 at 5:49 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.