apache-spark, pyspark, ddl

Save the result of printSchema() to a variable in PySpark?


I'm using the printSchema() function to view the inferred schema of a JSON file. I want to save the output of this call in a variable and parse it line by line, so that I can extract the structure of the schema and convert it into a DDL schema for creating a table in Hive.

How can this be done?


Solution

  • If you inspect the source code for printSchema(), you will see that this function just does the following:

    print(self._jdf.schema().treeString())
    

    Therefore, you can save the output as follows:

    printSchemaString = df._jdf.schema().treeString()
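
Once the tree string is in a variable, the line-by-line parsing mentioned in the question could look roughly like the sketch below. It is only an illustration under several assumptions: the sample `tree_string` is made-up data standing in for the value returned by `treeString()`, the helper `tree_string_to_columns`, the table name `my_table`, and the `SPARK_TO_HIVE` mapping are hypothetical, and only flat schemas with a few primitive types are handled. Note also that `_jdf` is an internal attribute, so this approach may not be stable across Spark versions.

```python
import re

# Hypothetical sample of the text returned by df._jdf.schema().treeString()
# for a flat schema; a real schema will differ.
tree_string = (
    "root\n"
    " |-- name: string (nullable = true)\n"
    " |-- age: long (nullable = true)\n"
    " |-- score: double (nullable = false)\n"
)

# Assumed mapping from Spark type names to Hive column types (not exhaustive).
SPARK_TO_HIVE = {
    "string": "STRING",
    "integer": "INT",
    "long": "BIGINT",
    "double": "DOUBLE",
    "boolean": "BOOLEAN",
}

# Matches top-level field lines such as " |-- age: long (nullable = true)".
FIELD_LINE = re.compile(r"^ \|-- (\w+): (\w+) \(nullable = (?:true|false)\)$")


def tree_string_to_columns(tree):
    """Turn top-level primitive fields of a schema tree string into Hive column definitions."""
    columns = []
    for line in tree.splitlines():
        match = FIELD_LINE.match(line)
        if match is None:
            continue  # skips the "root" header and any nested field lines
        name, spark_type = match.groups()
        if spark_type not in SPARK_TO_HIVE:
            # struct, array, map, etc. are outside the scope of this sketch
            raise ValueError(f"unsupported type for column {name!r}: {spark_type}")
        columns.append(f"`{name}` {SPARK_TO_HIVE[spark_type]}")
    return columns


columns = tree_string_to_columns(tree_string)
ddl = "CREATE TABLE my_table (" + ", ".join(columns) + ")"
print(ddl)
```

If parsing the printed text proves brittle (nested structs and arrays add indented lines that this sketch skips or rejects), it may be simpler to iterate over `df.schema.fields`, which exposes each field's name and data type directly.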
    

    Other references: