We need to ignore a few paths while crawling through a specific path. Below are the details:
Include Path: s3://dev-bronze/api/sp/reports/xyz/
Exclude Path: brand=abc/client=xxx/**
Full path: "s3://dev-bronze/api/sp/reports/xyz/brand=abc/client=xxx/"
We want to ignore a few clients' data, so I am using the above glob as the exclude pattern, but it doesn't seem to work. Any help will be highly appreciated.
Clarifying the difference between the exclude patterns brand=abc/client=xxx/** and brand=abc/client=xxx** (note the missing /).
Exclude pattern brand=abc/client=xxx/** matches:
s3://dev-bronze/api/sp/reports/xyz/brand=abc/client=xxx/<subfolder1>/file1.txt
s3://dev-bronze/api/sp/reports/xyz/brand=abc/client=xxx/<subfolder2>/file2.txt
This pattern will match objects in all subfolders of brand=abc/client=xxx/.
Exclude pattern brand=abc/client=xxx** matches:
s3://dev-bronze/api/sp/reports/xyz/brand=abc/client=xxx/file1.txt
s3://dev-bronze/api/sp/reports/xyz/brand=abc/client=xxx/file2.txt
This pattern will match all objects in brand=abc/client=xxx/.
If you want to exclude files in brand=abc/client=xxx/, then use the exclude pattern brand=abc/client=xxx**.
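If you set the pattern from code rather than the console, a minimal sketch with boto3 could look like the following. The helper names (build_s3_target, apply_exclusions) and the crawler name "my-crawler" are made up for illustration and do not come from the question; only the include path and the exclude pattern are taken from it. The boto3 call is wrapped in a function and left commented out, so the snippet runs without AWS credentials.

```python
def build_s3_target(include_path, clients, brand="abc"):
    """Build one S3 target entry for a Glue crawler.

    Exclude patterns are written relative to the include path,
    one pattern per client that should be skipped.
    """
    exclusions = [f"brand={brand}/client={client}**" for client in clients]
    return {"Path": include_path, "Exclusions": exclusions}


def apply_exclusions(crawler_name, target):
    """Push the target to an existing crawler (needs boto3 and AWS credentials)."""
    import boto3  # imported here so build_s3_target works without boto3 installed

    glue = boto3.client("glue")
    # Assumption: the crawler has a single S3 target. The Targets value sent
    # here is the full target definition for the crawler, not a patch.
    glue.update_crawler(Name=crawler_name, Targets={"S3Targets": [target]})


target = build_s3_target("s3://dev-bronze/api/sp/reports/xyz/", ["xxx"])
print(target)

# Hypothetical crawler name; uncomment to apply against a real crawler.
# apply_exclusions("my-crawler", target)
```

To skip more clients, pass more names in the list. One thing to keep in mind: since ** matches zero or more characters, client=xxx** would also match a sibling folder whose name merely starts with client=xxx.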
Reference: Crawler Properties > Include and Exclude Patterns (AWS)