[SOLVED] s3 - how to get fast line count of file? wc -l is too slow

Does anyone have a quick way of getting the line count of a file hosted in S3? Preferably using the CLI or s3api, but I am open to python/boto as well. Note: the solution must run non-interactively, i.e. in an overnight batch.

Right now I am doing this. It works, but it takes around 10 minutes for a 20GB file:

 aws s3 cp s3://foo/bar - | wc -l
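
Since python/boto is acceptable, the same pipeline can be sketched in boto3 for a batch job. This is only a rough equivalent of the command above: it still downloads the whole object, so there is no reason to expect it to be faster. The function names and the 8 MiB chunk size are my own choices, not anything from AWS.

```python
def count_newlines(chunks):
    """Count newline bytes across an iterable of bytes chunks (same idea as wc -l)."""
    return sum(chunk.count(b"\n") for chunk in chunks)


def count_lines_streaming(bucket, key, client=None, chunk_size=8 * 1024 * 1024):
    """Stream an S3 object and count its newlines without keeping it in memory.

    `client` is an optional boto3 S3 client; one is created if omitted.
    """
    if client is None:
        import boto3  # imported lazily so the helper above works without boto3
        client = boto3.client("s3")
    body = client.get_object(Bucket=bucket, Key=key)["Body"]
    # StreamingBody.iter_chunks yields the object in pieces of chunk_size bytes.
    return count_newlines(body.iter_chunks(chunk_size=chunk_size))
```

Calling `count_lines_streaming("foo", "bar")` would then match the `s3://foo/bar` example, assuming credentials are available to the batch job.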

Solution

UPDATE 2024: Amazon S3 Select is no longer available to new users.

Here are two methods that might work for you...

Amazon S3 has a new feature called S3 Select that allows you to query files stored on S3.

You can perform a count of the number of records (lines) in a file and it can even work on GZIP files. Results may vary depending upon your file format.
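
As a rough illustration, the count could be run from boto3 along these lines. It is a minimal sketch that assumes the file is newline-delimited text that S3 Select can read as CSV without a header row, and that S3 Select is still enabled for your account. The function name is made up for this example; pass `compression="GZIP"` for gzipped files.

```python
def count_records_s3_select(bucket, key, compression="NONE", client=None):
    """Ask S3 Select to count the records in an object and return the number.

    Assumes one record per line; `client` is an optional boto3 S3 client.
    """
    if client is None:
        import boto3  # imported lazily; only needed when no client is passed
        client = boto3.client("s3")
    response = client.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression="SELECT COUNT(*) FROM S3Object",
        InputSerialization={
            "CSV": {"FileHeaderInfo": "NONE"},  # assumption: no header line
            "CompressionType": compression,     # "NONE", "GZIP" or "BZIP2"
        },
        OutputSerialization={"CSV": {}},
    )
    # The result comes back as an event stream; collect the Records payloads.
    parts = []
    for event in response["Payload"]:
        if "Records" in event:
            parts.append(event["Records"]["Payload"])
    return int(b"".join(parts).decode("utf-8").strip())
```

Because the counting happens on the S3 side, only the single result value is transferred back, which is why this can be quicker than piping the whole file through `wc -l`. A quoted field that contains a newline is treated as part of one record, so the result can differ from a raw line count.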

Amazon Athena is a similar option that might also be suitable. It can run SQL queries against files stored in Amazon S3.
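
A non-interactive Athena count might look roughly like the sketch below. It assumes you have already defined an Athena table over the file's S3 location. The table name, database name and results location are hypothetical parameters you would supply, and the function name is my own.

```python
import time


def count_rows_athena(table, database, output_location, client=None, poll_seconds=2.0):
    """Run SELECT COUNT(*) on an existing Athena table and return the count.

    `output_location` is an S3 URI where Athena may write its query results.
    """
    if client is None:
        import boto3  # imported lazily; only needed when no client is passed
        client = boto3.client("athena")
    query_id = client.start_query_execution(
        QueryString=f'SELECT COUNT(*) FROM "{table}"',
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]
    # Athena queries are asynchronous, so poll until the query finishes.
    while True:
        execution = client.get_query_execution(QueryExecutionId=query_id)
        status = execution["QueryExecution"]["Status"]
        if status["State"] == "SUCCEEDED":
            break
        if status["State"] in ("FAILED", "CANCELLED"):
            reason = status.get("StateChangeReason", "no reason given")
            raise RuntimeError(f"Athena query {status['State']}: {reason}")
        time.sleep(poll_seconds)
    rows = client.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    # Row 0 is the column header; row 1 holds the count.
    return int(rows[1]["Data"][0]["VarCharValue"])
```

The polling loop is what makes this usable in an overnight batch: the call blocks until Athena reports a final state and raises if the query fails.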