I started using AWS Data Wrangler
and Athena
, to upload my data files onto S3, and being able to query them, respectively. My question is about the procedure to "safely" update the data in the table.
Here is what I did:
I used the AWS Data Wrangler,
wr.s3.to_parquet(
df=my_df,
path='s3://temp',
dataset=True,
partition_cols = ['date'],
concurrent_partitioning = True,
mode = append
)
to export my DataFrame
to S3 -as parquet
files.
I used AWS Glue to crawl my data files in S3, and create Table A
.
Via Athena, I can query my data in Table A
.
Now I would like to update my data in table A. My questions are:
Table A
.
This is not my desire. Is there a safe way to do this?Table A
. Using AWS Wrangler, How can I achieve this, without adding redundant rows in Table A
?I have tested (2), using mode=append
and indeed I ended up getting redundant rows in my Table A
.
I appreciate if you kindly share your thoughts
Thanks,
You can use the below API:
awswrangler.s3.merge_upsert_table
Ref: https://aws-sdk-pandas.readthedocs.io/en/stable/stubs/awswrangler.s3.merge_upsert_table.html