
Is it possible to update and insert (upsert) data in an AWS Glue database using Glue?


I am using PySpark on AWS and receive gigabytes of data every day, some of which updates earlier records. For each incoming record, I want to look up its id in an existing table in a Glue database, update the row if the id already exists, and insert it if the id does not exist.

Is it possible to do this in AWS Glue?

Thanks!


Solution

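  • Yes, you can use the AWS Glue PySpark extensions for this. The sink below writes the frame to S3 and, with enableUpdateCatalog=True and updateBehavior="UPDATE_IN_DATABASE", updates the Data Catalog table's schema and partitions as part of the write. Note that it does not match rows by id on its own, so merge the incoming data with the existing records before writing.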

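    # Assumes glue_context is a GlueContext and data_frame is a DynamicFrame
    # (convert a Spark DataFrame with DynamicFrame.fromDF first).
    data_sink = glue_context.getSink(
        path="s3_path",                       # S3 location of the table data
        connection_type="s3",
        updateBehavior="UPDATE_IN_DATABASE",  # update schema and partitions in the catalog
        partitionKeys=["partition_column"],
        compression="snappy",
        enableUpdateCatalog=True,             # update the Data Catalog during the job run
    )
    # Point the sink at the catalog table to create or update.
    data_sink.setCatalogInfo(
        catalogDatabase=database_name,
        catalogTableName=table_name,
    )
    data_sink.setFormat("glueparquet")
    data_sink.writeFrame(data_frame)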
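
To make the "update if the id exists, insert if it does not" step concrete, here is a minimal sketch of the merge semantics in plain Python. The names `upsert_by_id`, `existing`, and `incoming` are hypothetical and not part of any Glue API. It only illustrates the logic you would apply to the data before handing it to the sink. In a Spark job the same result is commonly obtained by dropping existing rows whose id appears in the incoming batch (a left anti join) and then unioning the incoming rows.

```python
def upsert_by_id(existing, incoming, key="id"):
    """Return `existing` rows with `incoming` rows applied as upserts on `key`."""
    # Index the current rows by id; dicts preserve insertion order.
    merged = {row[key]: row for row in existing}
    for row in incoming:
        # Replaces the row when the id is already present, adds it otherwise.
        merged[row[key]] = row
    return list(merged.values())


# Hypothetical sample data: id 2 is updated, id 3 is inserted.
existing = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]
incoming = [{"id": 2, "value": "B"}, {"id": 3, "value": "c"}]
result = upsert_by_id(existing, incoming)
```

This sketch assumes each id appears at most once in the incoming batch. If it can repeat, the last occurrence wins here, so decide explicitly which record should be kept (for example, the one with the latest timestamp) before merging.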