I am working on AWS Glue and writing a requests program to query Botify (with BQL). I need a JSON payload (required for the POST) that is built dynamically from the queried fields. The fields to be queried reside in a text file on S3. We should be able to read the S3 file and create the JSON string as given below.
Also, the field "myID" in the expected JSON should be replaced with the actual ID stored in a variable. Please help.
S3 file contents:
date_crawled
content_type
http_code
compliant.is_compliant
compliant.reason.http_code
compliant.reason.canonical
Expected JSON string:
payload = """
{
  "job_type": "export",
  "payload": {
    "username": "myID",
    "project": "abc123.com",
    "export_size": 50,
    "formatter": "csv",
    "formatter_config": {
      "delimiter": ",",
      "print_delimiter": "False",
      "print_header": "True",
      "header_format": "verbose"
    },
    "connector": "direct_download",
    "extra_config": {},
    "query": {
      "collections": ["crawl.20230515"],
      "query": {
        "dimensions": [
          "url",
          "crawl.20230515.date_crawled",
          "crawl.20230515.content_type",
          "crawl.20230515.http_code",
          "compliant.is_compliant",
          "compliant.reason.http_code",
          "compliant.reason.canonical"
        ],
        "metrics": [],
        "sort": [1]
      }
    }
  }
}
"""
I am new to Python. So any help is immensely appreciated.
Thanks.
I think this code should be sufficient; you also need to set up and configure your AWS credentials properly (in Glue that usually means the job's IAM role has read access to the S3 bucket). Hope this helps.
import json
import boto3

def save_s3_json_file(bucket_name, file_key, my_id):
    s3 = boto3.client('s3')

    # read the file content from S3
    response = s3.get_object(Bucket=bucket_name, Key=file_key)
    content = response['Body'].read().decode('utf-8')

    # extract dimension names from the file content, skipping blank lines
    dimensions = [line.strip() for line in content.splitlines() if line.strip()]

    # JSON structure based on your expectation
    expected_json = {
        "job_type": "export",
        "payload": {
            "username": my_id,  # "username" is filled with the actual ID you pass in
            "project": "abc123.com",
            "export_size": 50,
            "formatter": "csv",
            "formatter_config": {
                "delimiter": ",",
                "print_delimiter": "False",
                "print_header": "True",
                "header_format": "verbose"
            },
            "connector": "direct_download",
            "extra_config": {},
            "query": {
                "collections": ["crawl.20230515"],
                "query": {
                    "dimensions": ["url"],  # the fields read from S3 are appended below
                    "metrics": [],
                    "sort": [1]
                }
            }
        }
    }

    # append the fields read from S3; plain crawl fields get the collection prefix,
    # while compliant.* fields are left as-is, to match your expected JSON
    # (adjust this rule if your Botify project expects something different)
    collection = "crawl.20230515"
    for dim in dimensions:
        if dim.startswith("compliant."):
            expected_json['payload']['query']['query']['dimensions'].append(dim)
        else:
            expected_json['payload']['query']['query']['dimensions'].append(f"{collection}.{dim}")

    # return the payload as a JSON string
    return json.dumps(expected_json, indent=2)
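
Since you mentioned the payload is for a POST with requests, here is a rough sketch of how you might send the returned string to Botify from the Glue job. The bucket name, file key, ID, token, and the endpoint URL below are placeholders/assumptions; check the Botify API documentation for the exact export endpoint and headers for your account.

import requests

# hypothetical bucket, key and ID; replace with your own values
payload = save_s3_json_file("my-bucket", "fields/botify_fields.txt", "myID")

# the endpoint and auth header are assumptions based on Botify's token-based API;
# verify the exact URL and headers in the Botify documentation
response = requests.post(
    "https://api.botify.com/v1/jobs",
    data=payload,
    headers={
        "Authorization": "Token YOUR_BOTIFY_API_TOKEN",
        "Content-Type": "application/json",
    },
)
print(response.status_code, response.text)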