
How to write a schema for the nested JSON below in PySpark


How do I write a schema for the JSON below?

  "place_results": {
      "title": "W2A Architects",
      "place_id": "ChIJ4SUGuHw5xIkRAl0856nZrBM",
      "data_id": "0x89c4397cb80625e1:0x13acd9a9e73c5d02",
      "data_cid": "1417747306467056898",
      "reviews_link": "httpshl=en",
      "photos_link": "https=en",
      "gps_coordinates": {
        "latitude": 40.6027801,
        "longitude": -75.4701499
      },
      "place_id_search": "http",
      "rating": 3.7,

I am getting nulls when I read the file with the schema below. How do I know the correct data type to use for each field?

    StructField('place_results', StructType([
        StructField('address', StringType(), True),
        StructField('data_cid', StringType(), True),
        StructField('data_id', StringType(), True),
        StructField('gps_coordinates', StringType(), True),
        StructField('open_state', StringType(), True),
        StructField('phone', StringType(), True),
        StructField('website', StringType(), True)
    ])),

Solution

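  • This should work. Note that gps_coordinates is a nested object, so it is declared as a StructType with DoubleType fields rather than as a StringType: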

    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    schema = StructType([
      StructField('place_results', 
                  StructType([
                    StructField('data_cid', StringType(), True), 
                    StructField('data_id', StringType(), True), 
                    StructField('gps_coordinates', StructType([
                      StructField('latitude', DoubleType(), True),
                      StructField('longitude', DoubleType(), True)]), True), 
                    StructField('photos_link', StringType(), True), 
                    StructField('place_id', StringType(), True), 
                    StructField('place_id_search', StringType(), True), 
                    StructField('rating', DoubleType(), True), 
                    StructField('reviews_link', StringType(), True), 
                    StructField('title', StringType(), True)]), True)
    ])
    

    I got this by letting Spark infer the schema from the file with this command:

    spark.read.option("multiLine", True).json("dbfs:/test/sample.json").schema
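
If you want to sanity-check the types without a Spark session, the sketch below walks a parsed JSON record with the standard library and prints a DDL-style schema string. It is only a rough approximation of what Spark's inference reports, not Spark's actual algorithm. The helper names `spark_type` and `ddl_schema` are made up for this example. The sample record is a hand-trimmed, hand-closed subset of the JSON in the question, since the original snippet is cut off. The sketch also assumes that field names are plain identifiers and that array elements all share one type.

```python
import json


def spark_type(value):
    """Map a parsed JSON value to a Spark SQL DDL type name (approximation)."""
    if isinstance(value, dict):
        # Nested JSON object -> STRUCT with one entry per key.
        fields = ", ".join(f"{k}: {spark_type(v)}" for k, v in value.items())
        return f"STRUCT<{fields}>"
    if isinstance(value, list):
        # Assumption: every element has the same type as the first one.
        return f"ARRAY<{spark_type(value[0]) if value else 'STRING'}>"
    if isinstance(value, bool):  # must come before int: bool subclasses int
        return "BOOLEAN"
    if isinstance(value, int):
        return "BIGINT"
    if isinstance(value, float):
        return "DOUBLE"
    return "STRING"  # strings, and nulls as a fallback


def ddl_schema(record):
    """Build a top-level DDL schema string from one parsed JSON record."""
    return ", ".join(f"{k} {spark_type(v)}" for k, v in record.items())


# Hand-trimmed subset of the record from the question (assumption).
sample = json.loads("""
{"place_results": {
    "title": "W2A Architects",
    "data_cid": "1417747306467056898",
    "gps_coordinates": {"latitude": 40.6027801, "longitude": -75.4701499},
    "rating": 3.7
}}
""")

ddl = ddl_schema(sample)
print(ddl)
```

The resulting string can be passed to `spark.read.schema(ddl)` in place of a StructType, since the reader also accepts a DDL-formatted string. Treat the schema that Spark itself infers as the authoritative one, and use this only as a quick cross-check.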