I am trying to use Haskell for data analysis. Because my datasets are reasonably large (hundreds of thousands and potentially millions of observations), I would ideally like to use an unboxed data structure for efficiency, say Data.Vector.Unboxed.
The problem is that the data contain some missing values. I want to avoid coding these as "99" or similar because that's just an ugly hack and a potential source of bugs. From my Haskell newbie point of view, I can think of the following options:
- A boxed vector of unpacked
Maybe
values. Something like (please correct if wrong):
data myMaybe a = Nothing | Just {-# UNPACK #-} !a
- An unboxed vector of (unboxable) tuples, whith a boolean element indicating missingness:
newtype instance Data.Vector.Unboxed.Vector (MyDatum a) = MyDatum (Data.Vector.Unboxed.Vector (Bool,a))
This may be the same approach as chosen by the OP of this question (moduloInt
forBool
), but the only answer doesn't seem to explicitly address the issue of missing values/sparsity (instead focusing on how to represent an entire array unboxed, rather than as a boxed vector of unboxed vectors). - A tuple of unboxed vectors, one with the values, the other with the indices at which missing values are to be injected, or the run lengths of non-missing values, or some equivalent information. This might be preferable to option 2. if missings are sparse?
I'm trying to stay within a vector representation rather than something like this, because it's the missing values that are sparse, not the data.
Any comments on the relative merits/feasibility/off-the-shelf-availability/likely performance of these options, or indeed pointers to entirely different alternatives, are welcome!
Edit:
- It's been pointed out that the answer potentially depends on what kind of operations I intend to perform on the data. At the moment, it seems more convenient to store each observation in a single vector, rather than each variable. Since the entries in the vector will therefore refer to different variables, "fold"-like ops are unlikely.
- I'm guessing 2. will internally store the "valid bit" vector à la 3. automatically if appropriate, so 3. could be dropped?