Efficient handling of sparsely missing data in Haskell

I am trying to use Haskell for data analysis. Because my datasets are reasonably large (hundreds of thousands and potentially millions of observations), I would ideally like to use an unboxed data structure for efficiency, say Data.Vector.Unboxed.

The problem is that the data contain some missing values. I want to avoid coding these as "99" or similar because that's just an ugly hack and a potential source of bugs. From my Haskell newbie point of view, I can think of the following options:

  1. A boxed vector of unpacked Maybe values. Something like (please correct if wrong):
    data MyMaybe a = MyNothing | MyJust {-# UNPACK #-} !a
    (Note that GHC only honours an UNPACK pragma once the field type is concrete, e.g. after specialisation; on a bare type variable it is ignored.)
  2. An unboxed vector of (unboxable) tuples, with a boolean element indicating missingness:
    newtype instance Data.Vector.Unboxed.Vector (MyDatum a) = V_MyDatum (Data.Vector.Unboxed.Vector (Bool, a))
    This may be the same approach as chosen by the OP of this question (modulo Int for Bool), but the only answer doesn't seem to explicitly address the issue of missing values/sparsity (instead focusing on how to represent an entire array unboxed, rather than as a boxed vector of unboxed vectors).
  3. A tuple of unboxed vectors, one with the values, the other with the indices at which missing values are to be injected, or the run lengths of non-missing values, or some equivalent information. This might be preferable to option 2 if missing values are sparse? (See the sketch after this list.)
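To make option 3 concrete, here is a minimal sketch; the Sparse type, its field names, and fromMaybes are illustrative inventions, not from any library:

    import qualified Data.Vector.Unboxed as U

    -- Option 3: non-missing values packed densely, plus a sorted
    -- vector of the original indices at which values were missing.
    data Sparse a = Sparse
      { values  :: U.Vector a    -- observations with missings removed
      , missing :: U.Vector Int  -- ascending indices of missing observations
      }

    -- Build the pair from a list of Maybes.
    fromMaybes :: U.Unbox a => [Maybe a] -> Sparse a
    fromMaybes xs = Sparse
      { values  = U.fromList [x | Just x <- xs]
      , missing = U.fromList [i | (i, Nothing) <- zip [0 ..] xs]
      }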

I'm trying to stay within a vector representation rather than something like this, because it's the missing values that are sparse, not the data.

Any comments on the relative merits/feasibility/off-the-shelf-availability/likely performance of these options, or indeed pointers to entirely different alternatives, are welcome!

Edit:

  • It's been pointed out that the answer potentially depends on what kind of operations I intend to perform on the data. At the moment, it seems more convenient to store each observation in a single vector, rather than each variable. Since the entries in the vector will therefore refer to different variables, "fold"-like ops are unlikely.
  • I'm guessing option 2 will internally store the "valid bit" vector à la option 3 automatically if appropriate, so option 3 could be dropped?
Lawson answered 13/11, 2011 at 9:03 Comment(1)
The way you want to store the "valid" bits depends on how you are going to process your data. For instance, if you're going to map some function over your data, you can keep the bits separate, because they will still be correct after mapping. But if you are going to, say, fold over your data, it's more convenient to keep the bits together with the data items.Radii
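A small sketch of the distinction this comment draws (hypothetical names; the first function assumes the separate-bits layout of option 3, the second the zipped layout of option 2):

    import qualified Data.Vector.Unboxed as U

    -- With the validity bits stored separately, mapping over the
    -- values leaves the bits untouched and still correct:
    mapObs :: (U.Unbox a, U.Unbox b)
           => (a -> b) -> (U.Vector a, U.Vector Bool) -> (U.Vector b, U.Vector Bool)
    mapObs f (vals, valid) = (U.map f vals, valid)

    -- A fold, by contrast, must consult the bit for each element,
    -- so zipped (bit, value) pairs are more convenient:
    sumPresent :: U.Vector (Bool, Double) -> Double
    sumPresent = U.foldl' (\acc (ok, x) -> if ok then acc + x else acc) 0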

I'd go with option 3, but you should not use a vector to store the missing indices: that gives you O(nMissing) lookup time, which is unreasonably slow unless the missing data is extremely sparse. Data.IntMap should do the job well; you can then simply use the member function to check whether an index points to a missing observation. Hash tables are even better but probably not necessary.
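A minimal sketch of that lookup (the Column type and its names are mine, not the answerer's; the vector is assumed to be full length, with an arbitrary placeholder stored at missing positions):

    import qualified Data.IntMap as IM
    import qualified Data.Vector.Unboxed as U

    -- Full-length unboxed vector plus the set of missing indices.
    -- The value stored at a missing index is arbitrary and never exposed.
    data Column a = Column
      { colValues  :: U.Vector a
      , colMissing :: IM.IntMap ()  -- used as an index set via IM.member
      }

    lookupObs :: U.Unbox a => Column a -> Int -> Maybe a
    lookupObs (Column vs miss) i
      | IM.member i miss = Nothing
      | otherwise        = Just (vs U.! i)

(Data.IntSet would be the more natural type for a plain index set; the member check is the same.)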

Downtrodden answered 13/11, 2011 at 12:43 Comment(5)
If the vector of missing indices is sorted, a binary search would give O(log nMissing) for that too.Scimitar
@DanielFischer: right, but a) it's troublesome to actually keep an array sorted, and b) Data.IntMap (in contrast to Data.Map) typically searches faster than that, doesn't it? (I'm not sure, prove me wrong)Downtrodden
Re a): You only need to find the position to insert or delete and shift the items behind. It's just slow (unless there are very few missing items), not troublesome. Re b): IntMap lookup typically also needs O(log size) comparisons (since size <= 2^WORD_SIZE_IN_BITS), so which would have faster lookup is a matter of constant factors, I can easily believe either to win. However, since IntMap has better insert/delete behaviour and the more suitable API, I would also recommend that (until proven insufficient by experience).Scimitar
Thank you, those are good points. But why do you think that either of the strategies you mention would outperform whatever "specialised representation" Data.Vector.Unboxed would pick for the validity vector? I'm happy to be convinced that manual fine-tuning will perform better, just trying to understand why.Lawson
@BilalBarakat: the thing is, Data.Vector's elem function, which you would likely use to determine if an index is missing, has O(n) complexity because it can't assume the array is sorted. So you would either need to make sure it's sorted and implement your own e.g. elemInSorted function, or use some associative container (like Data.IntMap, which remains my recommendation), or use a completely different representation.Downtrodden
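For reference, a hand-rolled version of the elemInSorted idea mentioned in this comment; a sketch only, assuming the index vector is kept sorted in ascending order:

    import qualified Data.Vector.Unboxed as U

    -- O(log n) membership test in a sorted unboxed vector,
    -- replacing the O(n) scan that U.elem performs.
    elemInSorted :: (U.Unbox a, Ord a) => a -> U.Vector a -> Bool
    elemInSorted x v = go 0 (U.length v)
      where
        go lo hi
          | lo >= hi  = False
          | otherwise =
              let mid = (lo + hi) `div` 2
              in case compare (v U.! mid) x of
                   LT -> go (mid + 1) hi
                   GT -> go lo mid
                   EQ -> True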
