How to correctly read a column from a file when the first element is empty

Asked 5/12, 2022 at 9:39 Answered 5/12, 2022 at 11:44

I have a data file data.txt

  a  
5 b 
3 c 7

which I would like to load and have as

julia> loaded_data
3×3 Matrix{Any}:
  ""  "a"   ""
 5    "b"   ""
 3    "c"  7

but it is unclear to me how to do this. Trying readdlm

julia> using DelimitedFiles

julia> readdlm("data.txt")
3×3 Matrix{Any}:
  "a"  ""    ""
 5     "b"   ""
 3     "c"  7

does not identify the first element of the first column as empty, and instead reads "a" as the first element (which, of course, makes sense). The closest I think I've gotten to what I want is using readlines

julia> readlines("data.txt")
3-element Vector{String}:
 "  a  "
 "5 b "
 "3 c 7"

but from here I'm not sure how to proceed. I can grab one of the rows that has all the columns and split it, but I'm not sure how that helps me identify the empty elements in the other rows.

Norse answered 5/12, 2022 at 9:39 Comment(0)

This problem may have many edge cases that need clarifying.

Here is a longer option than the other answer, but perhaps one that is easier to tweak for those edge cases:

function splittable(d)
    # find all non-space locations
    t = sort(union(findall.(!isspace, d)...))
    # find initial indices of fields
    tt = t[vcat(1,findall(diff(t).!=1).+1)]
    # prepare ranges to extract fields
    tr = [tt[i]:tt[i+1]-1 for i in 1:length(tt)-1]
    # extract substrings
    vs = map(s -> strip.(vcat([s[intersect(r,eachindex(s))] for r in tr],
                              tt[end]<=length(s) ? s[tt[end]:end] : "")), d)
    # fit substrings into matrix
    L = maximum(length.(vs))
    String.([j <= length(vs[i]) ? vs[i][j] : "" 
      for i in 1:length(vs), j in 1:L])
end

And:

julia> d = readlines("data.txt")
3-element Vector{String}:
 "  a  "
 "5 b "
 "3 c 7"

julia> dd = splittable(d)
3×3 Matrix{String}:
 ""   "a"  ""
 "5"  "b"  ""
 "3"  "c"  "7"

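To see how splittable locates the columns, here is a minimal sketch that runs only its first three steps on the sample rows (inlined here instead of read from data.txt, which is an assumption made so the snippet is self-contained). It assumes the fields are aligned at fixed character positions and that the input is ASCII, since the positions are string indices.

```julia
# The same rows as in data.txt, inlined so the snippet runs on its own
d = ["  a  ", "5 b ", "3 c 7"]

# Character positions that hold a non-space character in at least one row
t = sort(union(findall.(!isspace, d)...))

# A field starts wherever the preceding position is not occupied in any row
tt = t[vcat(1, findall(diff(t) .!= 1) .+ 1)]

# Every field except the last runs up to the start of the next field
tr = [tt[i]:tt[i+1]-1 for i in 1:length(tt)-1]

println(t)   # occupied positions
println(tt)  # field start positions
println(tr)  # index ranges of all fields but the last
```

For this input the occupied positions and the field starts are both [1, 3, 5], and the ranges are [1:2, 3:4]; the last field is taken from position 5 to the end of each row, when the row is long enough.
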
To get the partial parsing effect (integers where parsing succeeds, strings otherwise):

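function parsewhatmay(m)
    # try to parse every entry as an Int; entries that fail become `nothing`
    M = tryparse.(Int, m)
    # keep the parsed value where parsing succeeded, otherwise the original string
    map((x,y)->isnothing(x) ? y : x, M, m)
end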

and now:

julia> parsewhatmay(dd)
3×3 Matrix{Any}:
  ""  "a"   ""
 5    "b"   ""
 3    "c"  7

Disable answered 5/12, 2022 at 10:19 Comment(1)

I like the compactness of the other solutions, but this one was the only one that worked "out of the box" for my more complex inputs. Thanks! Edit: your splittable function was the key to my problem! – Norse 5/12, 2022 at 23:57

Here's a possibility:


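# convert all-digit strings to Int and leave everything else untouched
cnv(s) = (length(s) > 0 && all(isdigit, s)) ? parse(Int, s) : s

# collapse double spaces, split each line on single spaces, stack the rows into a matrix
# (stack requires Julia 1.9 or later)
cnv.(stack(split.(replace.(eachline("data.txt"),"  "=>" "), " "), dims=1))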
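
Broken into steps, the one-liner above does roughly the following. This is a sketch under two assumptions that are not in the original: the rows are inlined rather than read with eachline, and permutedims(reduce(hcat, ...)) is swapped in for stack so that it also runs on Julia versions before 1.9. It relies on every field being exactly one character wide with a single space between fields, so that an empty field next to a separator shows up as a two-space run.

```julia
# Sample rows inlined in place of eachline("data.txt")
lines = ["  a  ", "5 b ", "3 c 7"]

# 1. Collapse each two-space run (empty field plus separator) into one separator
collapsed = replace.(lines, "  " => " ")

# 2. Split on single spaces; with an explicit delimiter, empty fields are kept
fields = split.(collapsed, " ")

# 3. Put one row per line into a matrix (stand-in for stack(...; dims=1))
table = permutedims(reduce(hcat, fields))

# 4. Convert all-digit entries to Int and leave the rest as strings
cnv(s) = (length(s) > 0 && all(isdigit, s)) ? parse(Int, s) : s
result = cnv.(table)

display(result)
```

Because step 1 depends on the exact spacing, wider fields or several adjacent empty fields would need a different replacement rule.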

Staphylo answered 5/12, 2022 at 10:1 Comment(1)

I really like this solution because it's a very compact way to solve the mwe I provided. But for bigger input files that may contain multiple adjacent empty elements per row, it's a bit non-trivial to generalize this (different spacing requirements per input). – Norse 5/12, 2022 at 23:54

If the contents of the columns are sufficiently distinguishable to make the parsing uniquely defined, I'd use a regex on each line:

julia> lines
3-element Vector{String}:
 "  a  "
 "5 b "
 "3 c 7"

julia> [match(r"\s*(\d*)\s*([a-z]*)\s*(\d*)", s).captures for s in lines]
3-element Vector{Vector{Union{Nothing, SubString{String}}}}:
 ["", "a", ""]
 ["5", "b", ""]
 ["3", "c", "7"]

You can then proceed to parse and concatenate as you wish, e.g.

julia> mapreduce(vcat, lines) do line
           x, y, z = match(r"\s*(\d*)\s*([a-z]*)\s*(\d*)", line).captures
           [tryparse(Int, x) y tryparse(Int, z)]
       end
3×3 Matrix{Any}:
  nothing  "a"   nothing
 5         "b"   nothing
 3         "c"  7

In Julia 1.9 and later, where stack is available, I think you should be able to write this as

stack(lines; dims=1) do line
    x, y, z = match(r"\s*(\d*)\s*([a-z]*)\s*(\d*)", line).captures
    (tryparse(Int, x), y, tryparse(Int, z))
end
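
Both variants leave nothing where a numeric field is empty, whereas the matrix asked for in the question has "" in those places. One possible adjustment, sketched here with the sample rows inlined (an assumption made so the snippet is self-contained), is to fall back to the captured string with something, which returns its first argument that is not nothing:

```julia
# Sample rows inlined; in practice these could come from readlines("data.txt")
lines = ["  a  ", "5 b ", "3 c 7"]

rows = map(lines) do line
    x, y, z = match(r"\s*(\d*)\s*([a-z]*)\s*(\d*)", line).captures
    # use the parsed Int when parsing succeeds, otherwise keep the captured string
    [something(tryparse(Int, x), x) y something(tryparse(Int, z), z)]
end

# concatenate the 1×3 rows vertically into a 3×3 matrix
result = reduce(vcat, rows)

display(result)
```

The regex is the one from the answer, so the same caveat applies: it only fits rows of the form digits, lowercase letters, digits.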

Khartoum answered 5/12, 2022 at 11:44 Comment(1)

I really thought I was going to get this answer to work but I just spent too much time figuring out how to generate the appropriate regex for more general inputs with a broader set of elements (to no avail). I'm convinced with enough time I could get it down, but I'm simply not too experienced with regex. I definitely want to spend more time generalizing this solution. – Norse 5/12, 2022 at 23:56