I have a Kinesis Firehose delivery stream that puts data into S3. However, in the data file the JSON objects have no separator between them, so it looks something like this:
{
"key1" : "value1",
"key2" : "value2"
}{
"key1" : "value1",
"key2" : "value2"
}
In Apache Spark I am doing this to read the data file:
df = spark.read.schema(schema).json(path, multiLine=True)
This reads only the first JSON object in the file; the rest are ignored because there is no separator.
How can I resolve this issue in Spark?
You can use wholeTextFiles and parse manually, but it is bad performance-wise. You can try to use a Hadoop input format with a custom delimiter if the structure is always delimited by }{, and then fix the records, but that is a hack. You can implement your own input format, but not in Python, and it is a lot of code for such a problem. But honestly, if the process is under your control, don't waste time fixing the symptoms; fix the problem (for example, have the producer append a newline to each record so Firehose writes newline-delimited JSON) :) – Pneumodynamics
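A minimal sketch of the wholeTextFiles approach from the comment, assuming each file fits in an executor's memory and that the sequence }{ only ever appears at object boundaries (path and schema are the same variables used in the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def split_objects(content):
    # Splitting on "}{" consumes one brace from each side of the
    # boundary, so restore them on the resulting fragments.
    # Note: this breaks if any string value contains "}{" itself.
    parts = content.split("}{")
    if len(parts) == 1:
        return parts
    return ([parts[0] + "}"]
            + ["{" + p + "}" for p in parts[1:-1]]
            + ["{" + parts[-1]])

# wholeTextFiles yields (path, full file content) pairs; flatMap the
# splitter over the contents to get one JSON string per object.
objects = spark.sparkContext.wholeTextFiles(path).flatMap(
    lambda kv: split_objects(kv[1]))

df = spark.read.schema(schema).json(objects)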
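And a sketch of the Hadoop-input-format idea: TextInputFormat honours the textinputformat.record.delimiter setting, so splitting on }{ yields records that are each missing a brace on one side, which can be patched afterwards (again assuming }{ never occurs inside values):

conf = {"textinputformat.record.delimiter": "}{"}

records = spark.sparkContext.newAPIHadoopFile(
    path,
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf=conf,
).map(lambda kv: kv[1])  # keep the text, drop the byte offset key

def fix_record(rec):
    # Interior records lost both braces to the delimiter; the first
    # record lost only its closing brace and the last only its opener.
    rec = rec.strip()
    if not rec.startswith("{"):
        rec = "{" + rec
    if not rec.endswith("}"):
        rec = rec + "}"
    return rec

df = spark.read.schema(schema).json(records.map(fix_record))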