Splitting datasets into train and test in julia

Asked 5/2, 2021 at 7:18 Answered 17/2, 2021 at 5:57

I am trying to split the dataset into train and test subsets in Julia. So far, I have tried using MLDataUtils.jl package for this operation, however, the results are not up to the expectations. Below are my findings and issues:

Code

# the inputs are

a = DataFrame(A = [1, 2, 3, 4,5, 6, 7, 8, 9, 10],
              B = [1, 2, 3, 4,5, 6, 7, 8, 9, 10],
              C = [1, 2, 3, 4,5, 6, 7, 8, 9, 10]
             )
b = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

using MLDataUtils
(x1, y1), (x2, y2) = stratifiedobs((a,b), p=0.7)

#Output of this operation is: (which is not the expectation)
println("x1 is: $x1")
x1 is:
10×3 DataFrame
│ Row │ A     │ B     │ C     │
│     │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┤
│ 1   │ 1     │ 1     │ 1     │
│ 2   │ 2     │ 2     │ 2     │
│ 3   │ 3     │ 3     │ 3     │
│ 4   │ 4     │ 4     │ 4     │
│ 5   │ 5     │ 5     │ 5     │
│ 6   │ 6     │ 6     │ 6     │
│ 7   │ 7     │ 7     │ 7     │
│ 8   │ 8     │ 8     │ 8     │
│ 9   │ 9     │ 9     │ 9     │
│ 10  │ 10    │ 10    │ 10    │

println("y1 is: $y1")
y1 is:
10-element Array{Int64,1}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10

# but x2 is printed as 
(0×3 SubDataFrame, Float64[]) 

# while y2 as 
0-element view(::Array{Float64,1}, Int64[]) with eltype Float64)

However, I would like this dataset to be split in 2 parts with 70% data in train and 30% in test. Please suggest a better approach to perform this operation in julia. Thanks in advance.

Pentylenetetrazol answered 5/2, 2021 at 7:18 Comment(1)

I do not have it installed but it seems mldatautilsjl.readthedocs.io/en/latest/data/… explains in detail how to use MLJ.jl to perform the split. – Kozlowski 5/2, 2021 at 8:12

Probably MLJ.jl developers can show you how to do it using the general ecosystem. Here is a solution using DataFrames.jl only:

julia> using DataFrames, Random

julia> a = DataFrame(A = [1, 2, 3, 4,5, 6, 7, 8, 9, 10],
                     B = [1, 2, 3, 4,5, 6, 7, 8, 9, 10],
                     C = [1, 2, 3, 4,5, 6, 7, 8, 9, 10]
                    )
10×3 DataFrame
 Row │ A      B      C     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1      1      1
   2 │     2      2      2
   3 │     3      3      3
   4 │     4      4      4
   5 │     5      5      5
   6 │     6      6      6
   7 │     7      7      7
   8 │     8      8      8
   9 │     9      9      9
  10 │    10     10     10

julia> function splitdf(df, pct)
           @assert 0 <= pct <= 1
           ids = collect(axes(df, 1))
           shuffle!(ids)
           sel = ids .<= nrow(df) .* pct
           return view(df, sel, :), view(df, .!sel, :)
       end
splitdf (generic function with 1 method)

julia> splitdf(a, 0.7)
(7×3 SubDataFrame
 Row │ A      B      C     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     3      3      3
   2 │     4      4      4
   3 │     6      6      6
   4 │     7      7      7
   5 │     8      8      8
   6 │     9      9      9
   7 │    10     10     10, 3×3 SubDataFrame
 Row │ A      B      C     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1      1      1
   2 │     2      2      2
   3 │     5      5      5)

I am using views to save memory, but alternatively you could just materialize train and test data frames if you prefer this.

Mordy answered 5/2, 2021 at 8:10 Comment(1)

Kaminski , Thanks for the suggestion, it works and surprisingly it is quite fast compared to sklearn version in python. Great Work!! – Pentylenetetrazol 9/2, 2021 at 5:13

using Pkg Pkg.add("Lathe") using Lathe.preprocess: TrainTestSplit train, test = TrainTestSplit(df) There is also a positional argument, at in the second position that takes a percentage to split at.

Cluny answered 17/2, 2021 at 5:57 Comment(0)

This is how I did implement it for generic arrays in the Beta Machine Learning Toolkit:

"""
    partition(data,parts;shuffle=true)
Partition (by rows) one or more matrices according to the shares in `parts`.
# Parameters
* `data`: A matrix/vector or a vector of matrices/vectors
* `parts`: A vector of the required shares (must sum to 1)
* `shufle`: Wheter to randomly shuffle the matrices (preserving the relative order between matrices)
 """
function partition(data::AbstractArray{T,1},parts::AbstractArray{Float64,1};shuffle=true) where T <: AbstractArray
        n = size(data[1],1)
        if !all(size.(data,1) .== n)
            @error "All matrices passed to `partition` must have the same number of rows"
        end
        ridx = shuffle ? Random.shuffle(1:n) : collect(1:n)
        return partition.(data,Ref(parts);shuffle=shuffle, fixedRIdx = ridx)
end

function partition(data::AbstractArray{T,N} where N, parts::AbstractArray{Float64,1};shuffle=true,fixedRIdx=Int64[]) where T
    n = size(data,1)
    nParts = size(parts)
    toReturn = []
    if !(sum(parts) ≈ 1)
        @error "The sum of `parts` in `partition` should total to 1."
    end
    ridx = fixedRIdx
    if (isempty(ridx))
       ridx = shuffle ? Random.shuffle(1:n) : collect(1:n)
    end
    current = 1
    cumPart = 0.0
    for (i,p) in enumerate(parts)
        cumPart += parts[i]
        final = i == nParts ? n : Int64(round(cumPart*n))
        push!(toReturn,data[ridx[current:final],:])
        current = (final +=1)
    end
    return toReturn
end

Use it with:

julia> x = [1:10 11:20]
julia> y = collect(31:40)
julia> ((xtrain,xtest),(ytrain,ytest)) = partition([x,y],[0.7,0.3])

Ore that you can partition also in three or more parts, and the number of arrays to partition also is variable.

By default they are also shuffled, but you can avoid it with the parameter shuffle...

Pimpernel answered 5/2, 2021 at 20:18 Comment(0)

Recommended topics

Hot tags