julia create an empty dataframe and append rows to it
I am trying out the Julia DataFrames module. I am interested in it so I can use it to plot simple simulations in Gadfly. I want to be able to iteratively add rows to the dataframe and I want to initialize it as empty.

The tutorials and documentation on how to do this are sparse (most documentation describes how to analyse imported data).

To append to a nonempty dataframe is straightforward:

df = DataFrame(A = [1, 2], B = [4, 5])
push!(df, [3 6])

This returns:

3x2 DataFrame
| Row | A | B |
|-----|---|---|
| 1   | 1 | 4 |
| 2   | 2 | 5 |
| 3   | 3 | 6 |

But for an empty init I get errors.

df = DataFrame(A = [], B = [])
push!(df, [3, 6])

Error message:

ArgumentError("Error adding 3 to column :A. Possible type mis-match.")
while loading In[220], in expression starting on line 2

What is the best way to initialize an empty Julia DataFrame such that you can iteratively add items to it later in a for loop?

Georgetown answered 5/10, 2014 at 8:39 Comment(1)
I could not reproduce this error message in DataFrames v. 0.7.4 on Julia 0.4.5. - Choler
P
48

A zero length array defined using only [] will lack sufficient type information.

julia> typeof([])
Array{None,1}

To avoid that problem, simply indicate the element type.

julia> typeof(Int64[])
Array{Int64,1}

You can apply that to your DataFrame problem:

julia> df = DataFrame(A = Int64[], B = Int64[])
0x2 DataFrame

julia> push!(df, [3  6])

julia> df
1x2 DataFrame
| Row | A | B |
|-----|---|---|
| 1   | 3 | 6 |
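
The session above is from an early Julia release. As a minimal sketch, assuming Julia 1.x and DataFrames 1.x, the same idea looks like this in a loop. On Julia 1.x `typeof([])` is `Vector{Any}` rather than `Array{None,1}`, so an untyped `[]` column no longer errors on `push!`, but typed empty columns are still the better starting point. The column names and values here are only illustrative.

```julia
using DataFrames

# On Julia 1.x an untyped empty literal has element type Any.
untyped = typeof([])          # Vector{Any}

# Start with typed, zero-length columns.
df = DataFrame(A = Int64[], B = Float64[])

# Append one row per iteration; a NamedTuple matches values to columns by name.
for i in 1:3
    push!(df, (A = i, B = i / 2))
end
```

Pushing a NamedTuple rather than a plain vector keeps each value tied to its column name, which helps when the columns have different types.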
Printable answered 5/10, 2014 at 8:59 Comment(2)
What about the situation where you know how big the DataFrame will wind up? Is it more efficient to initialize it up front, for example with DataFrame(Int64, 1000, 10)? Then what's the syntax for populating each row as you iterate? - Censer
@Censer I was wondering the same thing. Eventually found out and added an answer :) - Llanes
M
5

The answer from @waTeim already answers the initial question. But what if I want to dynamically create an empty DataFrame and append rows to it? For example, what if I don't want hard-coded column names?

In this case, df = DataFrame(A = Int64[], B = Int64[]) is not sufficient. The NamedTuple (A = Int64[], B = Int64[]) needs to be created dynamically.

Let's assume we have a vector of column names col_names and a vector of column types col_types from which to create an empty DataFrame.

col_names = [:A, :B] # needs to be a vector of Symbols
col_types = [Int64, Float64]
# Create a NamedTuple (A = Int64[], ...) by pairing each name with an empty typed vector
named_tuple = (; zip(col_names, (T[] for T in col_types))...)

df = DataFrame(named_tuple) # 0×2 DataFrame

Alternatively, the NamedTuple could be created with

named_tuple = NamedTuple{Tuple(col_names)}(T[] for T in col_types)
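
Another way to build the same empty frame, sketched here under the assumption of DataFrames 1.x, skips the NamedTuple and passes a vector of `name => column` pairs straight to the constructor. The names `col_names` and `col_types` are the ones from this answer; the pushed row is only for illustration.

```julia
using DataFrames

col_names = [:A, :B]
col_types = [Int64, Float64]

# One `name => empty typed vector` pair per column.
df = DataFrame([name => T[] for (name, T) in zip(col_names, col_types)])

# Rows can then be appended as usual.
push!(df, (1, 2.5))
```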
Mercer answered 28/7, 2021 at 9:39 Comment(0)
L
4
using Pkg, CSV, DataFrames

iris = CSV.read(joinpath(Pkg.dir("DataFrames"), "test/data/iris.csv"))

new_iris = similar(iris, nrow(iris))

head(new_iris, 2)
# 2×5 DataFrame
# │ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │
# ├─────┼─────────────┼────────────┼─────────────┼────────────┼─────────┤
# │ 1   │ missing     │ missing    │ missing     │ missing    │ missing │
# │ 2   │ missing     │ missing    │ missing     │ missing    │ missing │

for (i, row) in enumerate(eachrow(iris))
    new_iris[i, :] = row[:]
end

head(new_iris, 2)

# 2×5 DataFrame
# │ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │
# ├─────┼─────────────┼────────────┼─────────────┼────────────┼─────────┤
# │ 1   │ 5.1         │ 3.5        │ 1.4         │ 0.2        │ setosa  │
# │ 2   │ 4.9         │ 3.0        │ 1.4         │ 0.2        │ setosa  │
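
The snippet above relies on older APIs (`Pkg.dir`, `head`, `CSV.read` without a sink type). A self-contained sketch of the same preallocate-then-fill pattern, assuming DataFrames 1.x and using made-up column names `step` and `value` instead of the iris file, might look like this:

```julia
using DataFrames

n = 5

# Preallocate n rows with typed but uninitialized columns.
df = DataFrame(step = Vector{Int}(undef, n), value = Vector{Float64}(undef, n))

# Fill the frame in place, one cell at a time.
for i in 1:n
    df[i, :step] = i
    df[i, :value] = i^2 / 2
end
```

Every cell should be written before it is read, because `undef` storage holds arbitrary values until assigned.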
Llanes answered 24/9, 2018 at 19:48 Comment(0)
I
2

I think that, at least in the latest version of Julia, you can achieve this by creating the columns from pairs without specifying a type:

df = DataFrame("A" => [], "B" => [])
push!(df, [5,'f'])

1×2 DataFrame
 Row │ A    B   
     │ Any  Any 
─────┼──────────
   1 │ 5    f

As seen in this post by @Bogumił Kamiński, where multiple columns are needed, something like this can be done:

entries = ["A", "B", "C", "D"]
df = DataFrame([name => [] for name in entries])
push!(df, [4, 5, 'r', 'p'])
1×4 DataFrame
 Row │ A    B    C    D   
     │ Any  Any  Any  Any 
─────┼────────────────────
   1 │ 4    5    r    p

Or, as pointed out by @Antonello below, if you know the column type you can do:

df = DataFrame([name => Int[] for name in entries])

which is also in @Bogumił Kamiński's original post.
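
To make the difference between the two variants concrete, here is a small sketch, assuming DataFrames 1.x. An `Any` column accepts whatever is pushed, while an `Int` column rejects a value it cannot convert. The single-column frames are only for illustration.

```julia
using DataFrames

untyped = DataFrame(["A" => []])      # column element type is Any
typed   = DataFrame(["A" => Int[]])   # column element type is Int

# The Any column takes a string without complaint.
push!(untyped, ["not a number"])

# The Int column cannot convert the string, so push! throws.
threw = false
try
    push!(typed, ["not a number"])
catch
    threw = true
end
```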

Involuntary answered 17/10, 2022 at 21:18 Comment(2)
Still, having the columns be of type Any is not the best for performance. If you know the type the columns will hold, use this information. - Kith
Got it! Thanks for pointing that out! - Involuntary
