I have ORC files stored in different folders on HDFS as follows:
/DATA/UNIVERSITY/DEPT/STUDENT/part-00000.orc
/DATA/UNIVERSITY/DEPT/CREDIT/part-00000.orc
I do not know how many columns are present in each of those tables (STUDENT, CREDIT, etc.). Is there a way to get the schemas from these files? I want to obtain the column names and their data types so that I can write CREATE statements for Hive EXTERNAL tables.
I found a way to get the details (column names and data types) I wanted via Spark:
# In the pyspark shell, sqlContext is already available; Spark reads the
# schema from the ORC file footers, so no table definition is needed.
data = sqlContext.sql("SELECT * FROM orc.`<HDFS_path>`")
data.printSchema()
This prints the output in the format below, which is exactly the information I want to extract from the ORC files on HDFS:
root
|-- <column_name1>: <datatype> (nullable = <true/false>)
|-- <column_name2>: <datatype> (nullable = <true/false>)
|-- <column_name3>: <datatype> (nullable = <true/false>)
|-- <column_name4>: <datatype> (nullable = <true/false>)
|-- <column_name5>: <datatype> (nullable = <true/false>)
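Building on this, here is a minimal sketch that turns the inferred schemas into Hive DDL skeletons. It assumes Spark 2.x (SparkSession and spark.read.orc are available); the table names and paths are taken from the folder layout above, and dataType.simpleString() maps Spark's primitive types to Hive-compatible names (complex types may need manual review).

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-schema-to-ddl").getOrCreate()

# Each subfolder under /DATA/UNIVERSITY/DEPT holds one table's ORC files.
for table in ["STUDENT", "CREDIT"]:
    path = "/DATA/UNIVERSITY/DEPT/{}".format(table)
    df = spark.read.orc(path)
    # StructField.dataType.simpleString() yields Hive-compatible names for
    # primitive types (int, string, double, ...).
    cols = ",\n  ".join(
        "{} {}".format(f.name, f.dataType.simpleString()) for f in df.schema.fields
    )
    print("CREATE EXTERNAL TABLE {} (\n  {}\n)\nSTORED AS ORC\nLOCATION '{}';".format(
        table, cols, path
    ))

Running this prints one CREATE EXTERNAL TABLE statement per folder, which can then be reviewed and executed in Hive.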