I'm currently playing with Apache Arrow's Java API (though I use it from Scala for the code samples) to get some familiarity with this tool.
As an exercise, I chose to load a CSV file into Arrow vectors and then save these to an Arrow file. The first part seemed easy enough, and I tried it like this:
val csvLines: Stream[Array[String]] = <open stream from CSV parser>
// There are other types of allocator, but things work with this one...
val allocator = new RootAllocator(Int.MaxValue)
// Initialize the vectors
val vectors = initVectors(csvLines.head, allocator)
// Put their mutators into an array for easy access
val mutators = vectors.map(_.getMutator)
// Work on the data, zipping it with its index
Stream.from(0)
  .zip(csvLines.tail) // Work on the tail (head contains the headers)
  .foreach(rowTup => // rowTup = (index, csvRow as an Array[String])
    Range(0, rowTup._2.size) // Iterate on each column...
      .foreach(columnNumber =>
        writeToMutator(
          mutators(columnNumber), // get that column's mutator
          idx=rowTup._1, // pass the current row number
          data=rowTup._2(columnNumber) // pass the entry of the current column
        )
      )
  )
With initVectors() and writeToMutator() defined as:
def initVectors(
    columns: Array[String],
    alloc: RootAllocator): Array[NullableVarCharVector] = {
  // Initialize a vector for each column
  val vectors = columns.map(colName =>
    new NullableVarCharVector(colName, alloc))
  // 4096 bytes of data, for 1024 values initially. This is arbitrary
  vectors.foreach(_.allocateNew(1 << 12, 1024))
  vectors
}
def writeToMutator(
    mutator: NullableVarCharVector#Mutator,
    idx: Int,
    data: String): Unit = {
  // The CSV may contain null values
  if (data != null) {
    // Encode explicitly as UTF-8 rather than relying on the platform default
    val bytes = data.getBytes("UTF-8")
    mutator.setSafe(idx, bytes, 0, bytes.length)
  } else {
    // Only mark the slot as null when there is no data;
    // calling setNull unconditionally would mark every slot as null
    mutator.setNull(idx)
  }
}
(I currently don't care about using the correct type, and store everything as strings, or VarChar in Arrow's terms.)
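A side note on the initial buffer size passed to allocateNew: in Scala (as in Java), `^` is bitwise XOR rather than exponentiation, so a power of two is better written as a shift. Below is a minimal check in plain Scala, with no Arrow dependency; the object name `BufferSizeCheck` is only an illustrative name.

```scala
object BufferSizeCheck {
  // `^` is bitwise XOR: 2 (0b0010) XOR 12 (0b1100) = 14 (0b1110)
  val xorResult: Int = 2 ^ 12
  // A left shift gives the intended power of two
  val shifted: Int = 1 << 12
  // math.pow also works, but returns a Double that must be converted
  val viaPow: Int = math.pow(2, 12).toInt

  def main(args: Array[String]): Unit = {
    println(s"2 ^ 12 = $xorResult")                 // prints "2 ^ 12 = 14"
    println(s"1 << 12 = $shifted")                  // prints "1 << 12 = 4096"
    println(s"math.pow(2, 12).toInt = $viaPow")     // prints "math.pow(2, 12).toInt = 4096"
  }
}
```

Since the size is arbitrary anyway, writing the literal 4096 would do just as well.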
So at this point I have a collection of NullableVarCharVector and can read from and write to them. Everything is fine so far. For the next step, though, I was left wondering how to actually wrap them together and serialize them to an Arrow file. I stumbled on the AbstractFieldWriter abstract class, but how to use its implementations is unclear.
So, my questions are mainly:
- What is the best way (there seem to be several) to save a bunch of vectors to an Arrow file?
- Are there other ways of loading CSV columns into Arrow vectors?
Edited to add: the metadata description page provides a good general overview of that topic.
The API's test classes seem to contain a few things that could help; I'll post a reply with a sample once I've tried it out.