I have data in a normalised, tidy "long" structure that I want to upload to H2O and, if possible, analyse on a single machine (or reach a definitive finding that I need more hardware and software than is currently available to me). The data is large but not enormous: roughly 70 million rows of 3 columns in its efficient normalised form, and 300k x 80k once cast into a sparse matrix (the large majority of cells being zeroes).
The analytical tools in H2O need the data in the latter, wide, format. Part of the overall motivation is to see where the limits of various hardware setups are when analysing such data, but at the moment I'm struggling just to get the data into an H2O cluster (on a machine where R can hold it all in RAM), so I can't yet make judgments about size limits for analysis.
The trial data look like the sample below, where the three columns are "documentID", "wordID" and "count":
1 61 2
1 76 1
1 89 1
1 211 1
1 296 1
1 335 1
1 404 1
Not that it matters - this isn't a real-life dataset for me, just a test set - but the data come from https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/docword.nytimes.txt.gz (caution: large download).
To analyse it, I need a matrix with a row for each documentID, a column for each wordID, and cells holding the counts (the number of times that word appears in that document). In R (for example), this can be done with tidyr::spread or, as in this particular case where the dense data frame created by spread would be too large, tidytext::cast_sparse, which works fine with data this size so long as I'm happy for it to stay in R.
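For concreteness, the reshape step in R is a one-liner (a sketch, assuming the long data is in a data frame I've called `docword` with the three columns named as above):

```r
library(tidytext)

# cast_sparse returns a sparse dgCMatrix (from the Matrix package):
# one row per documentID, one column per wordID, cells holding count,
# with absent documentID/wordID pairs stored implicitly as zero.
m <- cast_sparse(docword, documentID, wordID, count)
```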
Now, the very latest version of H2O (available from h2o.ai but not yet on CRAN) has a version of the R function as.h2o that understands sparse matrices. This works well with smaller but still non-trivial data (e.g. in test cases it imports a 3,500 row x 7,000 column sparse matrix in 3 seconds, where the dense version takes 22 seconds), but when given my 300,000 x 80,000 sparse matrix it crashes with this error message:
Error in asMethod(object) : Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
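For reference, the failing call is simply this (a sketch, assuming the counts are in a sparse dgCMatrix `m`; note that 300,000 x 80,000 is about 2.4e10 cells, well over the 2^31 - 1 element limit on a dense R matrix, and the reference to cholmod_dense.c in the error suggests a dense conversion is being attempted somewhere along the way):

```r
library(h2o)
h2o.init(max_mem_size = "8G")

# m is a 300,000 x 80,000 sparse dgCMatrix of document-word counts.
# The same call succeeds on smaller sparse matrices, but here it fails
# with: Cholmod error 'problem too large' at file ../Core/cholmod_dense.c
h2o_frame <- as.h2o(m, destination_frame = "docword")
```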
As far as I can tell there are two ways forward:
- upload a long, tidy, efficient form of the data into H2O and do the reshaping "spread" operation in H2O.
- do the data reshaping in R (or any other language), save the resulting sparse matrix to disk in a sparse format, and upload from there into H2O
As far as I can tell, H2O doesn't have the functionality to do #1, i.e. the equivalent of tidytext::cast_sparse or tidyr::spread in R. Its data munging capabilities look to be very limited. But maybe I've missed something? So my first (not very optimistic) question is: can (and if so, how can) H2O "cast" or "spread" data from long to wide format?
Option #2 becomes the same as this older question, for which the accepted answer was to save the data in SVMlight format. However, it's not clear to me how to do this efficiently, and it's not clear that SVMlight format makes sense for data that isn't intended to be modelled with a support vector machine (for example, the data might be for an unsupervised learning problem). It would be much more convenient if I could save my sparse matrix in MatrixMarket format, which is supported by the Matrix package in R but, as far as I can tell, not by H2O. The MatrixMarket format looks very similar to my original long data; it's basically a space-delimited file whose rows look like rowno colno cellvalue (with a two-line header).