My understanding of Spark's `fileStream()` method is that it takes three types as parameters: `Key`, `Value`, and `Format`. In case of text files, the appropriate types are `LongWritable`, `Text`, and `TextInputFormat`.
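
For reference, here is a minimal sketch of how I am calling it for the text-file case (the app name, batch interval, and directory are placeholders):

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("FileStreamTypes").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))

// Key = LongWritable, Value = Text, Format = TextInputFormat
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat]("hdfs:///some/dir")
```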
First, I want to understand the nature of these types. Intuitively, I would guess that the `Key` in this case is the line number of the file, and the `Value` is the text on that line. So, in the following example of a text file:

```
Hello
Test
Another Test
```

The first row of the `DStream` would have a `Key` of `1` (`0`?) and a `Value` of `Hello`.

Is this correct?
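
To make the question concrete, this is roughly how I would inspect the first few `(key, value)` pairs (continuing from the sketch above, so `ssc` and `lines` are the same placeholders):

```scala
// Print the first few (key, value) pairs of each batch, so I can
// see what the key actually holds for a text file.
lines.foreachRDD { rdd =>
  rdd.take(3).foreach { case (k, v) =>
    println(s"key=${k.get} value=${v.toString}")
  }
}

ssc.start()
ssc.awaitTermination()
```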
Second part of my question: I looked at the decompiled implementation of `ParquetInputFormat` and I noticed something curious:

```java
public class ParquetInputFormat<T>
        extends FileInputFormat<Void, T> {
    //...
```

```java
public class TextInputFormat
        extends FileInputFormat<LongWritable, Text>
        implements JobConfigurable {
    //...
```

`TextInputFormat` extends `FileInputFormat` of types `LongWritable` and `Text`, whereas `ParquetInputFormat` extends the same class of types `Void` and `T`.
Does this mean that I must create a `Value` class to hold an entire row of my parquet data, and then pass the types `<Void, MyClass, ParquetInputFormat<MyClass>>` to `ssc.fileStream()`?

If so, how should I implement `MyClass`?
EDIT 1: I have noticed a `readSupportClass` which is to be passed to `ParquetInputFormat` objects. What kind of class is this, and how is it used to parse the parquet file? Is there some documentation that covers this?
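
For reference, this is the kind of call I have been experimenting with, using the `GroupReadSupport` and `Group` classes from parquet-hadoop's example package (I am not at all sure this is the intended usage, and in older parquet releases the package prefix is `parquet.` rather than `org.apache.parquet.`):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.example.data.Group
import org.apache.parquet.hadoop.ParquetInputFormat
import org.apache.parquet.hadoop.example.GroupReadSupport

// Tell ParquetInputFormat which ReadSupport to use to materialize rows.
val hadoopConf = new Configuration()
hadoopConf.set(ParquetInputFormat.READ_SUPPORT_CLASS, classOf[GroupReadSupport].getName)

// Key is Void (Parquet records have no key); each value is a Group
// holding one row.
val parquetStream = ssc.fileStream[Void, Group, ParquetInputFormat[Group]](
  "hdfs:///parquet/dir",
  (path: Path) => path.toString.endsWith(".parquet"),
  newFilesOnly = true,
  conf = hadoopConf)
```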
EDIT 2: As far as I can tell, this is impossible. If anybody knows how to stream parquet files into Spark, then please feel free to share...