julia DataFrame select rows based values of one column belonging to a set
Asked Answered
P

1

6

Using a DataFrame in Julia, I want to select rows on the basis of the value taken in a column.

With the following example

using DataFrames, DataFramesMeta
DT = DataFrame(ID = [1, 1, 2,2,3,3, 4,4], x1 = rand(8))

I want to extract the rows with ID taking the values 1 and 4. For the moment, I came out with that solution.

@where(DT, findall(x -> (x==4 || x==1), DT.ID))

When using only two values, it is manageable.

However, I want to make it applicable to a case with many rows and a large set of value for the ID to be selected. Therefore, this solution is unrealistic if I need to write down all the value to be selected

Any fancier solution to make this selection generic?

Damien

Prolegomenon answered 3/10, 2019 at 13:25 Comment(0)
T
10

Here is a way to do it using standard DataFrames.jl indexing and using @where from DataFramesMeta.jl:

julia> DT
8×2 DataFrame
│ Row │ ID    │ x1        │
│     │ Int64 │ Float64   │
├─────┼───────┼───────────┤
│ 1   │ 1     │ 0.433397  │
│ 2   │ 1     │ 0.963775  │
│ 3   │ 2     │ 0.365919  │
│ 4   │ 2     │ 0.325169  │
│ 5   │ 3     │ 0.0495252 │
│ 6   │ 3     │ 0.637568  │
│ 7   │ 4     │ 0.391051  │
│ 8   │ 4     │ 0.436209  │

julia> DT[in([1,4]).(DT.ID), :]
4×2 DataFrame
│ Row │ ID    │ x1       │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 1     │ 0.433397 │
│ 2   │ 1     │ 0.963775 │
│ 3   │ 4     │ 0.391051 │
│ 4   │ 4     │ 0.436209 │

julia> @where(DT, in([1,4]).(:ID))
4×2 DataFrame
│ Row │ ID    │ x1       │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 1     │ 0.433397 │
│ 2   │ 1     │ 0.963775 │
│ 3   │ 4     │ 0.391051 │
│ 4   │ 4     │ 0.436209 │

In non performance critical code you can also use filter, which is - at least for me a bit simpler to digest (but it has a drawback, that it is slower than the methods discussed above):

julia> filter(row -> row.ID in [1,4], DT)
4×2 DataFrame
│ Row │ ID    │ x1       │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 1     │ 0.433397 │
│ 2   │ 1     │ 0.963775 │
│ 3   │ 4     │ 0.391051 │
│ 4   │ 4     │ 0.436209 │

Note that in the approach you mention in your question you could omit DT in front of ID like this:

julia> @where(DT, findall(x -> (x==4 || x==1), :ID))
4×2 DataFrame
│ Row │ ID    │ x1       │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 1     │ 0.433397 │
│ 2   │ 1     │ 0.963775 │
│ 3   │ 4     │ 0.391051 │
│ 4   │ 4     │ 0.436209 │

(this is a beauty of DataFramesMeta.jl that it knows the context of the DataFrame you want to refer to)

Tiemannite answered 3/10, 2019 at 14:8 Comment(2)
Thank you for these different options. Is it me or the help documentation for "in" does not mention this use ?Prolegomenon
It mentions it by saying "in(x) Create a function that checks whether its argument is in x, i.e. a function equivalent to y -> y in x.". But I agree that it is non obvious that next you can use it in broadcasting. Soon a documentation of DataFrames.jl will be updated with such an example, see github.com/JuliaData/DataFrames.jl/pull/1971.Irrational

© 2022 - 2024 — McMap. All rights reserved.