I am currently using Athena along with Kinesis Firehose
, Glue Crawler
. Kinesis Firehose
is saving JSON to single line files as below
{"name": "Jone Doe"}{"name": "Jane Doe"}{"name": "Jack Doe"}
But I noticed that the athena query select count(*) from db.names
returns 1 instead of 3. After searching for the problem. I found the following document.
The article says that JSON files files should stored with new lines.
{"name": "Jone Doe"}
{"name": "Jane Doe"}
{"name": "Jack Doe"}
Is there some smart tricks to run athena query on the single line JSON files?
Thanks to @Constantine, AWS Athena is performing distributed processing. Since, single-line JSON files doesn't have seperator, It can't perform distributed processing. So, You must transform the files before saving it.
Kinesis Firehose offers transformation using Lambda, I added following transformation, in order to query data from AWS Athena.
const addNewLine = (data) => {
const parsedData = JSON.parse(new Buffer.from(data,'base64').toString('utf8'));
return new Buffer.from(JSON.stringify(parsedData) + '\n').toString('base64')
}
exports.handler = async (event, context) => {
const output = event.records.map((record) => ({
recordId: record.recordId,
result: 'Ok',
data: addNewLine(record.data),
}));
return { records: output };
};
I've come up with this code through following link AWS Firehose newline Character
I believe there is no way a file with such JSON can be processed properly because a separator is required in order to distribute work. There is no explicit information in documentation on how to provide a custom separator, and most likely it is not possible in supported JSON SerDe libraries. Besides that, there is no distinct separator between given JSON objects that is not used inside JSON itself. In fact, there is no separator at all.
However, it is possible to use Firehose Data Transformation to buffer incoming data and invoke a Lambda function with each buffer asynchronously. There are predefined Lambda blueprints, and Kinesis Firehose Processing
can be used in this case to add new line characters between JSON objects.
Each transformed record is supposed to contain recordId
, result
and Base64 encoded data
with the transformed payload.
There are multiple examples of such Lambda function, e.g. this python sample in Amazon AWS samples repos on GitHub.