I am currently using Athena along with Kinesis Firehose
, Glue Crawler
. Kinesis Firehose
is saving JSON to single line files as below
{"name": "Jone Doe"}{"name": "Jane Doe"}{"name": "Jack Doe"}
But I noticed that the athena query select count(*) from db.names
returns 1 instead of 3. After searching for the problem. I found the following document.
The article says that JSON files files should stored with new lines.
{"name": "Jone Doe"}
{"name": "Jane Doe"}
{"name": "Jack Doe"}
Is there some smart tricks to run athena query on the single line JSON files?
Update
Thanks to @Constantine, AWS Athena is performing distributed processing. Since, single-line JSON files doesn't have seperator, It can't perform distributed processing. So, You must transform the files before saving it.
Kinesis Firehose offers transformation using Lambda, I added following transformation, in order to query data from AWS Athena.
const addNewLine = (data) => {
const parsedData = JSON.parse(new Buffer.from(data,'base64').toString('utf8'));
return new Buffer.from(JSON.stringify(parsedData) + '\n').toString('base64')
}
exports.handler = async (event, context) => {
const output = event.records.map((record) => ({
recordId: record.recordId,
result: 'Ok',
data: addNewLine(record.data),
}));
return { records: output };
};
I've come up with this code through following link AWS Firehose newline Character