Tags: amazon-web-services, aws-glue, aws-glue-data-catalog

AWS Glue Crawler glob Exclude Pattern functionality


We need to ignore a few paths while crawling through a specific path. Below are the details:

Include Path: s3://dev-bronze/api/sp/reports/xyz/
Exclude Path: brand=abc/client=xxx/**

Full path: s3://dev-bronze/api/sp/reports/xyz/brand=abc/client=xxx/

We want to ignore a few clients' data, so I am using the glob above as the exclude pattern, but it doesn't seem to work. Any help would be highly appreciated.


Solution

  • The exclude patterns brand=abc/client=xxx/** and brand=abc/client=xxx** (note the missing /) behave differently.

    Exclude pattern brand=abc/client=xxx/** matches:

    s3://dev-bronze/api/sp/reports/xyz/brand=abc/client=xxx/<subfolder1>/file1.txt
    s3://dev-bronze/api/sp/reports/xyz/brand=abc/client=xxx/<subfolder2>/file2.txt
    

    This pattern will match objects in all subfolders of brand=abc/client=xxx/.

    Exclude pattern brand=abc/client=xxx** matches:

    s3://dev-bronze/api/sp/reports/xyz/brand=abc/client=xxx/file1.txt
    s3://dev-bronze/api/sp/reports/xyz/brand=abc/client=xxx/file2.txt
    

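    This pattern matches the folder brand=abc/client=xxx itself as well as all objects below it. Because ** matches any run of characters, it also matches sibling folders whose names merely begin with client=xxx (for example, a hypothetical client=xxxyz).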

    To exclude everything under brand=abc/client=xxx/, including the files directly in that folder, use the exclude pattern brand=abc/client=xxx**.
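
To see how the trailing / changes what a pattern can match, here is a minimal local sketch. It only approximates the documented glob rules (* stays within one folder, ** crosses folder boundaries); it is not Glue's own matcher, and it ignores brackets and braces. The helper names glob_to_regex and is_excluded are hypothetical. Paths are written relative to the include path, because exclude patterns are evaluated against it.

```python
import re

def glob_to_regex(pattern):
    """Translate a simplified glob into a compiled regex (approximation only)."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")        # ** crosses folder boundaries
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")     # * stays inside one path component
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")      # ? matches one character of a component
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))

def is_excluded(pattern, relative_path):
    """Return True if the whole relative path matches the exclude pattern."""
    return glob_to_regex(pattern).fullmatch(relative_path) is not None

with_slash = "brand=abc/client=xxx/**"
without_slash = "brand=abc/client=xxx**"

# The folder path itself: only the pattern without the slash can match it.
print(is_excluded(with_slash, "brand=abc/client=xxx"))
print(is_excluded(without_slash, "brand=abc/client=xxx"))

# A sibling folder sharing the prefix: also caught by the slash-less pattern.
print(is_excluded(without_slash, "brand=abc/client=xxxyz/file1.txt"))
```

In this approximation the slash-less pattern is the broader of the two, so check that no other client name starts with the same prefix before relying on it. Run the crawler to confirm what actually ends up in the Data Catalog.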

    Reference: Crawler Properties > Include and Exclude Patterns (AWS)
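
If the crawler is managed from code rather than the console, the exclude pattern can be set through boto3. This is a sketch under assumptions: the crawler name "xyz-reports-crawler" and the helper names are hypothetical, and the include path and pattern are the ones from the question. The Targets value passed to update_crawler becomes the crawler's target configuration, so include any other targets the crawler needs.

```python
def build_s3_target(include_path, exclusions):
    """Build one S3 target entry; exclusions are globs relative to include_path."""
    return {"Path": include_path, "Exclusions": list(exclusions)}

def apply_exclusions(crawler_name, target):
    """Send the target to Glue. Needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the payload helper works without AWS access
    glue = boto3.client("glue")
    glue.update_crawler(Name=crawler_name, Targets={"S3Targets": [target]})

target = build_s3_target(
    "s3://dev-bronze/api/sp/reports/xyz/",
    ["brand=abc/client=xxx**"],  # no slash before **, as recommended above
)
print(target)

# Hypothetical crawler name; uncomment to apply against a real account.
# apply_exclusions("xyz-reports-crawler", target)
```

To ignore several clients, add one pattern per client to the Exclusions list.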