Tags: python, python-requests, python-jsons

Create JSON dynamically reading file from S3


I am working on AWS Glue and writing a `requests` program to query Botify (with BQL). I need a JSON payload (required for the POST) that is created dynamically from the queried fields. The fields that need to be queried reside in a text file on S3. The program should read the S3 file and build the JSON string given below.

Also, the field "myID" in the expected JSON should be replaced with the actual ID stored in a variable. Please help.

S3 file contents:

date_crawled
content_type
http_code
compliant.is_compliant
compliant.reason.http_code
compliant.reason.canonical

Expected JSON string:

payload = """
{
  "job_type": "export",
  "payload": {
    "username": "myID",
    "project": "abc123.com",
    "export_size": 50,
    "formatter": "csv",
    "formatter_config": {
            "delimiter": ",",
            "print_delimiter": "False",
            "print_header": "True",
            "header_format": "verbose"
        },
    "connector": "direct_download",
    "extra_config": {},
    "query": {
      "collections": ["crawl.20230515"],
      "query": {
        "dimensions": ["url",
                       "crawl.20230515.date_crawled",
                       "crawl.20230515.content_type",
                       "crawl.20230515.http_code",
                       "compliant.is_compliant",
                       "compliant.reason.http_code",
                       "compliant.reason.canonical"
"
                       ],
        "metrics": [],
        "sort": [1]
      }
    }
  }
}
"""

I am new to Python, so any help is immensely appreciated.

Thanks.


Solution

  • This code should be sufficient; you also need to set up and configure your AWS credentials properly. Hope this helps.

    import json

    import boto3

    def save_s3_json_file(bucket_name, file_key, my_id):
        s3 = boto3.client('s3')

        # read the file content from S3
        response = s3.get_object(Bucket=bucket_name, Key=file_key)
        content = response['Body'].read().decode('utf-8')

        # extract the field names from the file content, skipping blank lines
        field_names = [line.strip() for line in content.splitlines() if line.strip()]

        # plain field names belong to the crawl collection and get its prefix;
        # names that already contain a dot (e.g. "compliant.is_compliant")
        # are used as-is, matching your expected JSON
        collection = "crawl.20230515"
        dimensions = ["url"] + [
            f if "." in f else f"{collection}.{f}" for f in field_names
        ]

        # JSON structure based on your expectation
        expected_json = {
            "job_type": "export",
            "payload": {
                "username": my_id,  # "username" is filled from the my_id argument
                "project": "abc123.com",
                "export_size": 50,
                "formatter": "csv",
                "formatter_config": {
                    "delimiter": ",",
                    "print_delimiter": "False",
                    "print_header": "True",
                    "header_format": "verbose"
                },
                "connector": "direct_download",
                "extra_config": {},
                "query": {
                    "collections": [collection],
                    "query": {
                        "dimensions": dimensions,
                        "metrics": [],
                        "sort": [1]
                    }
                }
            }
        }

        # serialize to a JSON string ready to be sent as the POST body
        return json.dumps(expected_json, indent=2)