PySpark: Reading a JSON data file with no separator between objects

I have a Kinesis Firehose delivery stream that delivers data to S3. However, in the data file the JSON objects have no separator between them, so the file looks something like this:

{
  "key1" : "value1",
  "key2" : "value2"
}{
  "key1" : "value1",
  "key2" : "value2"
}

In Apache Spark I am reading the data file like this:

df = spark.read.schema(schema).json(path, multiLine=True)

This reads only the first JSON object in the file; the rest are ignored because there is no separator.

How can I resolve this issue in Spark?

Mariannemariano asked 12/1, 2018 at 1:50 Comment(3)
Fix the upstream process? Anything you'll do in Spark will be at least somewhat inefficient and ugly. – Pneumodynamics
Makes sense, but I would like to know the RDD-based approach to solve this, or any better approach of course. – Mariannemariano
Off the top of my head: you can use wholeTextFiles and parse manually, but it is bad performance-wise. You can try to use a Hadoop input format with a custom delimiter if the structure is always delimited by }{, and then fix the records, but it is a hack (see the sketch below). You can implement your own input format, but not in Python, and it is a lot of code for such a problem. But honestly, if the process is under your control, don't waste time fixing the symptoms, fix the problem :) – Pneumodynamics
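
For reference, here is a minimal, untested sketch of the Hadoop input format idea from the last comment. It makes assumptions not in the question: the objects are always glued together exactly as }{, the path is a placeholder, and sc / sqlContext are the usual PySpark shell contexts. The custom record delimiter splits the stream at every }{, so the braces it consumes have to be restored before parsing:

raw = sc.newAPIHadoopFile(
    "path to your firehose output",                        # placeholder path
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf={"textinputformat.record.delimiter": "}{"})       # split records at every }{

# each record lost a brace at the split point, so restore it
records = raw.values()\
             .map(lambda s: s.strip())\
             .filter(lambda s: s)\
             .map(lambda s: s if s.startswith("{") else "{" + s)\
             .map(lambda s: s if s.endswith("}") else s + "}")

sqlContext.read.json(records).show()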

You can use sparkContext's wholeTextFiles API to read the JSON file into Tuple2(filename, whole text), parse the whole text into individual JSON strings, and then finally use sqlContext to read it as JSON into a dataframe.

sqlContext\
    .read\
    .json(sc
          .wholeTextFiles("path to your multiline json file")
          .values()                          # keep only the file contents
          .flatMap(lambda x: x
                   .replace("\n", "#!#")     # mark every original line break
                   .replace("}{", "}#!#{")   # also mark object boundaries glued on one line
                   .replace("{#!# ", "{")    # drop markers that fall inside an object
                   .replace("#!#}", "}")
                   .replace(",#!#", ",")
                   .split("#!#")))\
    .show()

You should get a dataframe like this:

+------+------+
|  key1|  key2|
+------+------+
|value1|value2|
|value1|value2|
+------+------+

You can modify the code according to your needs, though.
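
For example, if you want to keep the explicit schema from the question instead of letting Spark infer it, you should be able to build the same RDD of single-object JSON strings and pass it through the schema reader (a sketch, where schema is the StructType defined in the question):

jsons = sc.wholeTextFiles("path to your multiline json file")\
          .values()\
          .flatMap(lambda x: x.replace("\n", "#!#")
                              .replace("}{", "}#!#{")
                              .replace("{#!# ", "{")
                              .replace("#!#}", "}")
                              .replace(",#!#", ",")
                              .split("#!#"))

sqlContext.read.schema(schema).json(jsons).show()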

Rugen answered 13/1, 2018 at 2:16 Comment(2)
Hi, my data is structured as follows; what might you recommend if I wanted restaurant id as values in one column, and latitude and longitude in other columns? Thanks! ===> [{"restaurant_id": "1234", "infos": [{"timestamp": "2020-02-03T00:57:26.000Z", "longitude": "-123, "latitude": "456"}{"restaurant_id": "5678", "infos":[{"timestamp": "2.... – Conditional
Really helpful. Worked like a charm. Thanks. – Elemi
