I know this has been asked before, but I have spent hours trying to get this to work.
I have a directory structure like:
- datalake
--- datasets
----- foo
------- 00001.json
------- 00002.json
------- latest.json
----- bar
------- 00001.json
------- latest.json
My include path looks like:
s3://<bucket_name>/datalake/datasets/
I want to exclude everything that is not latest.json.
I have tried everything under the sun.
**0*
**/0**
*/0*
*0*
**0**
and many others.
Without fail, my crawler catalogs every .json.
I am checking the results of my crawl with Athena.
Am I seriously getting the exclude pattern wrong? Or am I somehow thinking about this entire thing the wrong way and my pattern is irrelevant?
For me, the answer ended up being related to the fact that I was using Athena to look at the updated catalog. According to this:
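Athena does not recognize the exclude patterns you specify for a Glue crawler. It queries every file under the table's S3 location, whether or not the crawler skipped it.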
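To sanity-check an exclude pattern without going through Athena, it can help to test it locally against the object keys relative to the include path. Below is a minimal sketch, assuming a simplified reading of Glue's glob rules: `*` matches within one path component, `**` crosses folder boundaries, and `?` matches a single character. Brackets, braces, and escapes are not handled. The helper names `glob_to_regex` and `is_excluded` are hypothetical and not part of any AWS library.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a simplified glob into a regex (assumed semantics, see above)."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern[i:i + 2] == "**":
            parts.append(".*")        # '**' may cross '/' boundaries
            i += 2
            continue
        ch = pattern[i]
        if ch == "*":
            parts.append("[^/]*")     # '*' stays inside one path component
        elif ch == "?":
            parts.append("[^/]")      # '?' is exactly one non-'/' character
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.compile("".join(parts))


def is_excluded(relative_path: str, patterns: list) -> bool:
    """True if any pattern matches the whole path relative to the include path."""
    return any(glob_to_regex(p).fullmatch(relative_path) for p in patterns)


# Keys from the question, relative to .../datalake/datasets/
keys = [
    "foo/00001.json",
    "foo/00002.json",
    "foo/latest.json",
    "bar/00001.json",
    "bar/latest.json",
]

# Keep only the keys that the pattern does not exclude
kept = [k for k in keys if not is_excluded(k, ["**/0*"])]
print(kept)
```

Under these assumed semantics, `**/0*` excludes the numbered files and keeps only the two latest.json keys, which fits with the pattern not being the real problem here. This only approximates the crawler's matching, so treat it as a quick check and not as a guarantee of what Glue will do.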