Manipulating data in DataFrame: how to calculate the square of a column
Asked Answered
S

2

6

I would like to calculate the square of a column A 1,2,3,4, process it with other calculation store it in column C

using CSV, DataFrames
df = DataFrame(A = 1:4, B = ["M", "F", "F", "M"])
df.C = ((((df.A./2).^2).*3.14)./1000)

Is there an easier way to write it?

Soupspoon answered 9/7, 2020 at 14:57 Comment(0)
G
4

I am not sure how much shorter you would want the formula to be, but you can write:

df.C = @. (df.A / 2) ^ 2 * 3.14 / 1000

to avoid having to write . everywhere.

Or you can use transform!, but it is not shorter (its benefit is that you can uset it in a processing pipeline, e.g. using Pipe.jl):

transform!(df, :A => ByRow(a -> (a / 2) ^ 2 * 3.14 / 1000) => :C)
Gloucester answered 9/7, 2020 at 15:33 Comment(1)
Many thanks to both of you @Bogumił Kamiński and Przemyslaw Szufel! As both your answers are equally good, I upvote both of themSoupspoon
N
3

Try this:

df.D = .5df.A .^2 * 0.00314

Explanation:

  • not so many parentheses needed
  • multiplying scalar by vector is here as good as the vectorization for short vectors (up two something like 100 elements)

A simple benchmark using BenchmarkTools:

julia> @btime $df.E = .5*$df.A .^2 * 0.00314;
  592.085 ns (9 allocations: 496 bytes)

julia> @btime $df.F = @. ($df.A / 2) ^ 2 * 0.00314;
  875.490 ns (11 allocations: 448 bytes)

The fastest is however a longer version where you provide the type information @. (df.A::Vector{Int} / 2) ^ 2 * 0.00314 (again this matters rather for short DataFrames and note that here the Z column must exist so we create it here):

julia> @btime begin $df.Z = Vector{Float64}(undef, nrow(df));@. $df.Z = ($df.A::Vector{Int} / 2.0) ^ 2.0 * 0.00314; end;
  162.564 ns (3 allocations: 208 bytes)
Neutrino answered 9/7, 2020 at 15:32 Comment(5)
Not doing broadcasting, as you propose is actually much slower than broadcasting everything, as you are allocating 3 times not 1 time.Arrearage
Actually it depends on the length of the vector (number of rows of the DataFrame) for short vectors algebraic multiplication is faster. The break-even is around 100 rows. I edited the answer.Neutrino
But this is only due to type instability of the code in the examples you benchmark. If you made the operation type stable broadcasting everything is faster.Arrearage
I added the type stable version to the benchmarks and then indeed for such short vector this is many times faster but the code is kind of long :-)Neutrino
Many thanks to both of you @Przemyslaw Szufel and Bogumił Kamiński. As both your answers are equally good, I upvote both of themSoupspoon

© 2022 - 2024 — McMap. All rights reserved.