Breaking change on vcat when columns are missing

Asked 26/7, 2018 at 17:17 Answered 2/5, 2023 at 13:13

With Julia 0.5 I used to do this:

A = DataFrame(ID = [20,40], Name = ["John Doe", "Jane Doe"])
B = DataFrame(ID = [60,80], Job = ["Sailor", "Sommelier"])
C = DataFrame(Year = [1978, 1982], Test = ["Something", "Somewhere"])
vcat(A,B,C)

Now I am trying to replicate the same with v0.6.4 and I get an error instead:

ArgumentError: column(s) Job, Year and Test are missing from argument(s) 1, column(s) Name, Year and Test are missing from argument(s) 2, and column(s) ID, Name and Job are missing from argument(s) 3

I tried to get to the bottom of this by reading the documentation, with no luck. Can anyone clarify this for me, please?

Cazzie answered 26/7, 2018 at 17:17 Comment(0)

vcat in DataFrames now requires that all concatenated data frames contain the same columns.

If you read the help for vcat after loading the DataFrames package, you will find:

Column names in all passed data frames must be the same, but they can have different order. In such cases the order of names in the first passed DataFrame is used.

The fix is to add the missing columns to every data frame. Here is how you can do it in place for your example (note that missing is now used to indicate a missing value):

julia> for n in unique([names(A); names(B); names(C)]), df in [A,B,C]
       n in names(df) || (df[n] = missing)
       end

julia> [A; B; C]
6×5 DataFrames.DataFrame
│ Row │ ID      │ Name     │ Job       │ Year    │ Test      │
├─────┼─────────┼──────────┼───────────┼─────────┼───────────┤
│ 1   │ 20      │ John Doe │ missing   │ missing │ missing   │
│ 2   │ 40      │ Jane Doe │ missing   │ missing │ missing   │
│ 3   │ 60      │ missing  │ Sailor    │ missing │ missing   │
│ 4   │ 80      │ missing  │ Sommelier │ missing │ missing   │
│ 5   │ missing │ missing  │ missing   │ 1978    │ Something │
│ 6   │ missing │ missing  │ missing   │ 1982    │ Somewhere │

If you want to avoid modifying the original DataFrames, you should copy them first.
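A minimal sketch of the copy-first variant, assuming a recent DataFrames release (1.x), where the df[n] = ... form above is no longer accepted and df[!, n] .= ... is used to create the column instead. The names frames, allnames and result are only illustrative.

```julia
using DataFrames

A = DataFrame(ID = [20, 40], Name = ["John Doe", "Jane Doe"])
B = DataFrame(ID = [60, 80], Job = ["Sailor", "Sommelier"])
C = DataFrame(Year = [1978, 1982], Test = ["Something", "Somewhere"])

# Work on copies so that A, B and C keep their original columns.
frames = [copy(df) for df in (A, B, C)]

# Collect every column name that occurs in any of the data frames.
allnames = unique(reduce(vcat, names.(frames)))

# Add each absent column to the copies, filled with missing.
for df in frames, n in allnames
    n in names(df) || (df[!, n] .= missing)
end

result = vcat(frames...)
```

After this runs, result holds all six rows, while A, B and C are unchanged.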

Overmatter answered 26/7, 2018 at 17:51 Comment(2)

Thanks a lot @Bogumil Kaminski, it works great and it is also very clean! However, I have read that many people are concerned about the performance of vcat, and I may have to aggregate millions of rows across ~20 columns. Do you think I should be concerned? Are there better ways to do it? Thanks a lot – Cazzie 26/7, 2018 at 18:33

It should be fast, as vcat allocates the result only once using the appropriate target size and then uses copyto!. The only issue you might hit is when you have thousands of DataFrames to vcat, because splatting is not efficient in that case (with a moderate number of DataFrames, even large ones, all should be fine). If you are in that situation you can use DataFrames._vcat, which accepts a vector as an argument. It is a bit of an abuse, because it is not exported, so it might be changed or removed in the future without warning. You can inspect the source of _vcat to see how it works. – Lycanthropy 26/7, 2018 at 18:45
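For readers on a recent DataFrames release (1.x is assumed here), a public alternative to the internal _vcat is reduce(vcat, dfs), which also takes a vector and so avoids splatting. The sketch below is only an illustration: dfs is a hypothetical stand-in for a long list of data frames, and cols = :union is used so that differing columns are filled with missing.

```julia
using DataFrames

# Hypothetical stand-in for a long list of data frames with differing columns.
dfs = [DataFrame(ID = [i], Name = ["row $i"]) for i in 1:3]
push!(dfs, DataFrame(ID = [4], Job = ["Sailor"]))

# reduce(vcat, ...) receives the whole vector at once, so nothing is splatted.
combined = reduce(vcat, dfs; cols = :union)
```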

It seems that the answer provided by @bogumił-kamiński does not work with recent Julia versions. See this reproducible example run with Julia v1.8.5:

A = DataFrame(ID = [20,40], Name = ["John Doe", "Jane Doe"])
B = DataFrame(ID = [60,80], Job = ["Sailor", "Sommelier"])
C = DataFrame(Year = [1978, 1982], Test = ["Something", "Somewhere"])

for n in unique([names(A); names(B); names(C)]), df in [A,B,C]
    n in names(df) || (df[n] = missing)
end

[A; B; C]

Produces the following error:

ERROR: ArgumentError: syntax df[column] is not supported use df[!, column] instead

With recent Julia versions (at least with v1.8.5), we need to replace (df[n] = missing) with (df[:, n] .= missing). This follows the direction of the error message, which asks for an explicit row selector (it proposes df[!, column]), and adds a dot before the equals sign so that missing is broadcast over every row of the new column:

A = DataFrame(ID = [20,40], Name = ["John Doe", "Jane Doe"])
B = DataFrame(ID = [60,80], Job = ["Sailor", "Sommelier"])
C = DataFrame(Year = [1978, 1982], Test = ["Something", "Somewhere"])

for n in unique([names(A); names(B); names(C)]), df in [A,B,C]
    n in names(df) || (df[:, n] .= missing)
end

[A; B; C]

Which produces the following result:

6×5 DataFrame
 Row │ ID       Name      Job        Year     Test
     │ Int64?   String?   String?    Int64?   String?
─────┼──────────────────────────────────────────────────
   1 │      20  John Doe  missing    missing  missing
   2 │      40  Jane Doe  missing    missing  missing
   3 │      60  missing   Sailor     missing  missing
   4 │      80  missing   Sommelier  missing  missing
   5 │ missing  missing   missing       1978  Something
   6 │ missing  missing   missing       1982  Somewhere
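As a possible shortcut, recent DataFrames releases (1.x is assumed here) let vcat fill in absent columns itself through the cols keyword, so the loop may not be needed at all. A minimal sketch, where combined is just an illustrative name:

```julia
using DataFrames

A = DataFrame(ID = [20, 40], Name = ["John Doe", "Jane Doe"])
B = DataFrame(ID = [60, 80], Job = ["Sailor", "Sommelier"])
C = DataFrame(Year = [1978, 1982], Test = ["Something", "Somewhere"])

# cols = :union keeps every column found in any argument and fills the gaps with missing.
combined = vcat(A, B, C; cols = :union)
```

Unlike the loop, this leaves A, B and C untouched.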

Rubi answered 2/5, 2023 at 13:13 Comment(1)

This isn't a change in Julia. This is a change in DataFrames. I believe this changed in DataFrames 1.0. – Briant 2/5, 2023 at 15:56
