
AWS Glue job consuming data from external REST API


I'm trying to create a workflow where an AWS Glue ETL job pulls JSON data from an external REST API instead of S3 or any other AWS-internal source. Is that even possible? Has anyone done this? Please help!


Solution

  • Yes, I extract data from REST APIs such as Twitter, FullStory, Elasticsearch, etc. I usually use Python Shell jobs for the extraction because they start faster (relatively small cold start). When the extraction finishes, it triggers a Spark job that reads only the JSON items I need. I use the Python requests library for the API calls.

    To save the data to S3, you can do something like this:

    import boto3
    import json
    
    # Create an S3 resource handle
    s3 = boto3.resource('s3')
    
    tweets = []
    # Code that extracts tweets from the API goes here
    tweets_json = json.dumps(tweets)
    obj = s3.Object("my-tweets", "tweets.json")
    obj.put(Body=tweets_json)
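
For the extraction step itself, here is a rough sketch of how a Python Shell job might page through an API with requests and write the result to S3. The endpoint URL, the `cursor` parameter, the `items` / `next_cursor` response fields, and the object key are all hypothetical; adapt them to whatever your API actually returns. The items are written one JSON object per line, which is the layout Spark's JSON reader expects by default.

```python
import json
from typing import Callable, Iterable, Iterator, Optional


def iter_pages(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield items from a cursor-paginated API until no cursor is returned.

    Assumes each page looks like {"items": [...], "next_cursor": "..."};
    both field names are hypothetical.
    """
    cursor = None
    while True:
        page = fetch_page(cursor)
        yield from page.get("items", [])
        cursor = page.get("next_cursor")
        if not cursor:
            break


def to_json_lines(items: Iterable[dict]) -> str:
    """Serialize items as newline-delimited JSON, one object per line."""
    return "\n".join(json.dumps(item) for item in items)


def fetch_page_http(cursor: Optional[str]) -> dict:
    """Fetch one page from a hypothetical REST endpoint."""
    # Imported here so the helpers above can be used without the dependency.
    import requests

    params = {"cursor": cursor} if cursor else {}
    response = requests.get(
        "https://api.example.com/v1/items",  # hypothetical URL
        params=params,
        timeout=30,
    )
    response.raise_for_status()  # fail the job on HTTP errors
    return response.json()


def upload_to_s3(body: str, bucket: str, key: str) -> None:
    """Write the serialized payload to a single S3 object."""
    import boto3

    boto3.resource("s3").Object(bucket, key).put(Body=body.encode("utf-8"))


def run_extraction() -> None:
    """Pull every page and store the result; bucket and key are placeholders."""
    items = iter_pages(fetch_page_http)
    upload_to_s3(to_json_lines(items), "my-tweets", "raw/items.jsonl")
```

In the job script you would call `run_extraction()` at the end. Passing the page fetcher in as a function keeps the pagination logic separate from the HTTP call, so it can be exercised locally with canned responses before deploying.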