apache-spark-sql, azure-hdinsight

Spark SQL: How to consume JSON data from a REST service as a DataFrame


I need to read some JSON data from a web service that provides a REST interface, so I can query it from my Spark SQL code for analysis. I am able to read a JSON file stored in the blob store and use it.

I was wondering what the best way is to read the data from a REST service and use it like any other DataFrame.

BTW, I am using Spark 1.6 on a Linux cluster on HDInsight, if that helps. I would also appreciate it if someone could share code snippets for this, as I am still very new to the Spark environment.


Solution

  • On Spark 1.6:

    If you are on Python, use the requests library to fetch the data and then just create an RDD from it. There must be some similar library for Scala (relevant thread). Then just do the following (a fuller end-to-end sketch, fetching from an actual endpoint, follows after the Scala example below):

    json_str = '{"executorCores": 2, "kind": "pyspark", "driverMemory": 1000}'
    rdd = sc.parallelize([json_str])
    json_df = sqlContext.jsonRDD(rdd)
    json_df
    

    Code for Scala:

    val anotherPeopleRDD = sc.parallelize(
      """{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
    val anotherPeople = sqlContext.read.json(anotherPeopleRDD)
    

    This is from: http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
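
    Putting the Python path together, here is a minimal sketch of the idea above. It assumes a hypothetical endpoint (https://example.com/api/data), that the requests package is installed on the driver, and that it runs in a Spark 1.6 pyspark shell where sc and sqlContext already exist:

    import json
    import requests  # assumed to be installed on the driver node

    # Hypothetical REST endpoint -- replace with your service's URL
    url = "https://example.com/api/data"

    response = requests.get(url)
    response.raise_for_status()

    # If the service returns a JSON array, emit one JSON string per record
    # so that Spark SQL parses each element as a separate row
    payload = response.json()
    records = payload if isinstance(payload, list) else [payload]
    rdd = sc.parallelize([json.dumps(r) for r in records])

    df = sqlContext.read.json(rdd)
    df.printSchema()
    df.show()

    Note that requests runs only on the driver, so this approach suits modest payloads; for large responses you would want to land the data in blob storage first and read it from there, as you already do.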