I took a look at the link and am trying to understand what S3 Select is.
Most applications have to retrieve the entire object and then filter out only the required data for further analysis. S3 Select enables applications to offload the heavy lifting of filtering and accessing data inside objects to the Amazon S3 service.
Based on the statement above, I am trying to imagine what the proper use case is.
For example, if I have a single Excel file with 100 million rows sitting on S3, can I use S3 Select to query just some of the rows instead of downloading all 100 million?
There are many use cases, but two that stand out are centralization and time efficiency.
Let's say you have this "single excel file with 100 million rows" in S3. If several people, departments, or branches need to access it, each of them would have to download it, store it, and process it. Because each of them downloads it separately, they would soon end up working on stale copies (a new version could have been uploaded to S3 in the meantime) or on different versions: one person's copy from today, another's from last week. With S3 Select, all of them query and get data from the single version of the object stored in S3.
Also, if you have 100 million records, fetching only the selected data can save you a lot of time. Just imagine one person needing only 10 records from this file and another person needing 1,000. Instead of each downloading 100 million records, the first person uses S3 Select to find just those 10 records, while the other gets just their 1,000. Neither has to download the whole file.
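To make this concrete, here is a minimal sketch of such a query using boto3's select_object_content call. It assumes the data is stored as a CSV object with a header row (S3 Select reads CSV, JSON, and Parquet, so an Excel workbook would have to be exported first). The bucket name, key, and the "region" column are hypothetical placeholders, not anything from your setup.

```python
def build_select_request(bucket, key, expression):
    """Build keyword arguments for the S3 client's select_object_content call.

    Assumes a CSV object whose first line holds the column names.
    """
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        # "USE" lets the SQL expression refer to columns by header name.
        "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}},
        "OutputSerialization": {"CSV": {}},
    }


def collect_records(event_stream):
    """Join the payload bytes of every 'Records' event and decode them."""
    chunks = []
    for event in event_stream:
        # The stream also carries events such as 'Stats' and 'End'; skip those.
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    # Join before decoding so a character split across chunks stays intact.
    return b"".join(chunks).decode("utf-8")


def select_rows(s3_client, bucket, key, expression):
    """Run the query inside S3 and return only the matching rows as CSV text."""
    response = s3_client.select_object_content(
        **build_select_request(bucket, key, expression)
    )
    return collect_records(response["Payload"])


# Hypothetical usage (needs boto3 and AWS credentials):
#
#   import boto3
#   s3 = boto3.client("s3")
#   rows = select_rows(
#       s3,
#       "my-example-bucket",          # placeholder bucket
#       "exports/big-file.csv",       # placeholder key
#       "SELECT * FROM S3Object s WHERE s.region = 'EU' LIMIT 10",
#   )
#   print(rows)
```

Only the rows matching the expression travel over the network; the filtering itself happens on the S3 side, which is where the time saving comes from.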
Even more benefits come from using S3 Select with Glacier, where you can't readily download your files when you need them.