aws-glue aws-glue-data-catalog

AWS Crawler S3 Target Path Changes But Old Path Tables Included


I have an AWS Glue crawler whose S3 target path I switch in order to change the underlying table source. The problem is that tables are being created from both targets:

configuration:

aws glue get-crawler --name sand-main 
{
    "Crawler": {
        "Name": "sand-main",
        "Role": "Crawler-sand",
        "Targets": {
            "S3Targets": [
                {
                    "Path": "s3://sand-main-green/main",
                    "Exclusions": [
                        "checkpoints/**",
                        "IsActive.txt",
                        "isactive.txt"
                    ]
                }
            ],
            "JdbcTargets": [],
            "MongoDBTargets": [],
            "DynamoDBTargets": [],
            "CatalogTargets": []
        },
        "DatabaseName": "sand_main",
        "Description": "",
        "Classifiers": [],
        "RecrawlPolicy": {
            "RecrawlBehavior": "CRAWL_EVERYTHING"
        },
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "DELETE_FROM_DATABASE"
        },
        "LineageConfiguration": {
            "CrawlerLineageSettings": "DISABLE"
        },
        "State": "READY",
        "CrawlElapsedTime": 0,
        "CreationTime": "2020-09-30T14:07:25-06:00",
        "LastUpdated": "2021-01-28T11:32:15-07:00",
        "LastCrawl": {
            "Status": "SUCCEEDED",
            "LogGroup": "/aws-glue/crawlers",
            "LogStream": "sand-main",
            "MessagePrefix": "5bb1907d-2847-46ef-8712-3a50deb2b7a0",
            "StartTime": "2021-01-28T11:32:35-07:00"
        },
        "Version": 24,
        "Configuration": "{\"Version\":1.0,\"CrawlerOutput\":{\"Partitions\":{\"AddOrUpdateBehavior\":\"InheritFromTable\"}},\"Grouping\":{\"TableGroupingPolicy\":\"CombineCompatibleSchemas\"}}"
    }
}

I have a Lambda that switches the path from "Path": "s3://sand-main-green/main" to "Path": "s3://sand-main-blue/main".
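For reference, the Lambda's swap can be sketched as a pure helper that rewrites the S3 target path while carrying the exclusions over, with the actual Glue API call left as a comment. The function name `swap_s3_target` is my own; only the crawler name, paths, and exclusions come from the question.

```python
def swap_s3_target(targets: dict, new_path: str) -> dict:
    """Return a copy of the crawler's Targets with each S3 path replaced.

    Exclusions and all other target types are carried over unchanged.
    """
    updated = dict(targets)
    updated["S3Targets"] = [{**t, "Path": new_path} for t in targets["S3Targets"]]
    return updated

# Targets as returned by get-crawler in the question:
targets = {
    "S3Targets": [
        {
            "Path": "s3://sand-main-green/main",
            "Exclusions": ["checkpoints/**", "IsActive.txt", "isactive.txt"],
        }
    ],
    "JdbcTargets": [],
}
new_targets = swap_s3_target(targets, "s3://sand-main-blue/main")
# Inside the Lambda, the swapped targets would then be applied with:
# boto3.client("glue").update_crawler(Name="sand-main", Targets=new_targets)
```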

But I end up with tables:

Name -> Location
test -> s3://sand-main-blue/main/test

test_2398l50df -> s3://sand-main-green/main/test

I have DeleteBehavior set to DELETE_FROM_DATABASE, so I would expect tables for the old S3 path to be deleted. It feels like the crawler retains the history of its S3 targets. I do not want this behavior.


Solution

  • The crawler usually names a table after the last part of the S3 path ("test" in your example). If a table with that name is already present in the database, it creates a new table with random characters as a suffix (test_2398l50df in your example). Note that DELETE_FROM_DATABASE only removes tables whose source objects disappear from a path the crawler still targets; once you swap the target, the old path is no longer crawled at all, so the tables it produced are simply left behind.

    If you want the "test" table to point at the new path, follow these steps in order:

    1. Delete the "test" table (and any suffixed leftovers such as test_2398l50df) from the sand_main database.
    2. Update the crawler with the new path (s3://sand-main-blue/main).
    3. Run the crawler; this recreates "test" pointing at the new location.
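The cleanup step above can be automated before re-running the crawler. A minimal sketch, assuming the database name sand_main and the bucket names from the question; the filtering is done locally, with the AWS calls shown as comments, and the helper name `stale_tables` is my own:

```python
def stale_tables(tables: list, old_prefix: str) -> list:
    """Names of catalog tables whose location sits under the old S3 prefix."""
    return [
        t["Name"]
        for t in tables
        if t.get("StorageDescriptor", {}).get("Location", "").startswith(old_prefix)
    ]

# Table list shaped like Glue's get_tables response, using the question's data:
tables = [
    {"Name": "test",
     "StorageDescriptor": {"Location": "s3://sand-main-blue/main/test"}},
    {"Name": "test_2398l50df",
     "StorageDescriptor": {"Location": "s3://sand-main-green/main/test"}},
]
doomed = stale_tables(tables, "s3://sand-main-green/")

# In a real Lambda this would be wired up roughly as:
# glue = boto3.client("glue")
# tables = glue.get_tables(DatabaseName="sand_main")["TableList"]
# glue.batch_delete_table(DatabaseName="sand_main",
#                         TablesToDelete=stale_tables(tables, "s3://sand-main-green/"))
```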