How AWS Athena deals with single line JSONs?

Asked 7/6, 2020 at 16:4 Answered 26/9, 2023 at 8:7

Solved aws-glue amazon-athena amazon-kinesis-firehose

I am currently using Athena along with Kinesis Firehose, Glue Crawler. Kinesis Firehose is saving JSON to single line files as below

{"name": "Jone Doe"}{"name": "Jane Doe"}{"name": "Jack Doe"}

But I noticed that the athena query select count(*) from db.names returns 1 instead of 3. After searching for the problem. I found the following document.

https://aws.amazon.com/premiumsupport/knowledge-center/select-count-query-athena-json-records/?nc1=h_ls

The article says that JSON files files should stored with new lines.

{"name": "Jone Doe"}
{"name": "Jane Doe"}
{"name": "Jack Doe"}

Is there some smart tricks to run athena query on the single line JSON files?

Update

Thanks to @Constantine, AWS Athena is performing distributed processing. Since, single-line JSON files doesn't have seperator, It can't perform distributed processing. So, You must transform the files before saving it.

Kinesis Firehose offers transformation using Lambda, I added following transformation, in order to query data from AWS Athena.

const addNewLine = (data) => {
   const parsedData = JSON.parse(new Buffer.from(data,'base64').toString('utf8'));
   return new Buffer.from(JSON.stringify(parsedData) + '\n').toString('base64')
}

exports.handler = async (event, context) => {
   const output = event.records.map((record) => ({
       recordId: record.recordId,
       result: 'Ok',
       data: addNewLine(record.data),
   }));
   return { records: output };
};

I've come up with this code through following link AWS Firehose newline Character

Filial answered 7/6, 2020 at 16:4 Comment(0)

I believe there is no way a file with such JSON can be processed properly because a separator is required in order to distribute work. There is no explicit information in documentation on how to provide a custom separator, and most likely it is not possible in supported JSON SerDe libraries. Besides that, there is no distinct separator between given JSON objects that is not used inside JSON itself. In fact, there is no separator at all.

However, it is possible to use Firehose Data Transformation to buffer incoming data and invoke a Lambda function with each buffer asynchronously. There are predefined Lambda blueprints, and Kinesis Firehose Processing can be used in this case to add new line characters between JSON objects.

Each transformed record is supposed to contain recordId, result and Base64 encoded data with the transformed payload. There are multiple examples of such Lambda function, e.g. this python sample in Amazon AWS samples repos on GitHub.

Coble answered 8/6, 2020 at 13:13 Comment(1)

Thank you! I realized it's technically impossible! I will try to save it with separator as you said! – Filial 8/6, 2020 at 13:43

We don't need any processing here:

While creating the iot topic.

aws iot create-topic-rule --rule-name aws_iot_streaming --topic-rule-payload file://rule.json

Use this as rule.json file :

{
  "sql": "SELECT * FROM 'aws-iot-streaming'",
  "ruleDisabled": false,
  "awsIotSqlVersion": "2016-03-23",
  "actions": [{
      "firehose": {
          "roleArn": "arn:aws:iam::XXXXXXXXXXXX:role/iot-rule",
          "deliveryStreamName": "PUT-S3-fggfn",
          "separator": "\n"
      }
  }]
}

Use separator as "\n"

Aloisia answered 26/9, 2023 at 8:7 Comment(0)

Update

Recommended topics

Hot tags