I want to retrieve data stored in a Hadoop Cloudera cluster, either via Hive, Spark, or SQL. I already have a SQL query written that should fetch data from the cluster. But before running it, I want to understand how to set up the connection/cursor to the cluster so that it knows where to read from or write to.
sc = spark.sparkContext
or, similarly, a bare HiveContext or SparkContext will not suffice. I presumably need to supply the URL of a cluster node somewhere. How do I do that? Any small example would suffice.
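For the Spark route, a minimal sketch of such a connection, assuming the cluster's Hive metastore is reachable at a hypothetical `thrift://<metastore-host>:9083` (substitute your own host; on Cloudera the real URI is in `hive-site.xml`):

```python
from pyspark.sql import SparkSession

# Hypothetical metastore address -- replace with your cluster's host.
spark = (
    SparkSession.builder
    .appName("cloudera-hive-read")
    # Point Spark at the cluster's Hive metastore so table names resolve there.
    .config("hive.metastore.uris", "thrift://<metastore-host>:9083")
    .enableHiveSupport()
    .getOrCreate()
)

# Once connected, plain SQL runs against the cluster's tables.
df = spark.sql("SELECT * FROM names_text")
df.show()
```

If your script already runs on an edge node of the cluster with `hive-site.xml` on Spark's classpath, `enableHiveSupport()` alone is usually enough and the explicit `config` line can be dropped.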
There are two ways to create the table in Hive:
1- Create the schema for an external table:
CREATE EXTERNAL TABLE IF NOT EXISTS names_text(
student_ID INT, FirstName STRING, LastName STRING,
year STRING, Major STRING)
COMMENT 'Student Names'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/andrena';
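The TEXTFILE/DELIMITED storage above is just comma-separated text. To make the positional mapping onto the schema concrete, here is a small Python sketch with hypothetical sample rows of the kind Hive would read from `/user/andrena`:

```python
import csv
import io

# Hypothetical sample of the comma-delimited text files under
# LOCATION '/user/andrena' (columns: student_ID, FirstName, LastName, year, Major).
sample = "1,Jane,Doe,junior,Math\n2,John,Smith,senior,CS\n"

# Hive splits each line on ',' and maps the fields positionally onto the schema.
rows = list(csv.reader(io.StringIO(sample)))
students = [
    {"student_ID": int(r[0]), "FirstName": r[1], "LastName": r[2],
     "year": r[3], "Major": r[4]}
    for r in rows
]
print(students[0]["FirstName"])  # → Jane
```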
2- a) Create the schema for a managed table:
CREATE TABLE IF NOT EXISTS Names(
student_ID INT, FirstName STRING, LastName STRING,
year STRING, Major STRING)
COMMENT 'Student Names'
STORED AS ORC;
b) Move the external table data to the managed table:
INSERT OVERWRITE TABLE Names SELECT * FROM names_text;
And finally, verify that the Hive warehouse stores the student names in the external and managed tables, respectively:
SELECT * FROM names_text;
SELECT * FROM Names;
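Assuming a SparkSession with Hive support already connected to the cluster (hypothetical setup), the same verification can be done programmatically instead of eyeballing two result sets:

```python
from pyspark.sql import SparkSession

# Assumes hive-site.xml (or hive.metastore.uris) points at the cluster's metastore.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# After the INSERT OVERWRITE copy, both tables should hold the same rows.
external_count = spark.sql("SELECT COUNT(*) AS n FROM names_text").first()["n"]
managed_count = spark.sql("SELECT COUNT(*) AS n FROM Names").first()["n"]
assert external_count == managed_count
```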