I'm trying to make storage plugin for Hadoop (hdfs) and Apache Drill. Actually I'm confused and I don't know what to set as port for hdfs:// connection, and what to set for location. This is my plugin:
{
"type": "file",
"enabled": true,
"connection": "hdfs://localhost:54310",
"workspaces": {
"root": {
"location": "/",
"writable": false,
"defaultInputFormat": null
},
"tmp": {
"location": "/tmp",
"writable": true,
"defaultInputFormat": null
}
},
"formats": {
"psv": {
"type": "text",
"extensions": [
"tbl"
],
"delimiter": "|"
},
"csv": {
"type": "text",
"extensions": [
"csv"
],
"delimiter": ","
},
"tsv": {
"type": "text",
"extensions": [
"tsv"
],
"delimiter": "\t"
},
"parquet": {
"type": "parquet"
},
"json": {
"type": "json"
},
"avro": {
"type": "avro"
}
}
}
So, is ti correct to set localhost:54310 because I got that with command:
hdfs -getconf -nnRpcAddresses
or it is :8020 ?
Second question, what do I need to set for location? My hadoop folder is in:
/usr/local/hadoop
, and there you can find /etc /bin /lib /log ... So, do I need to set location on my datanode, or?
Third question. When I'm connecting to Drill, I'm going through sqlline and than connecting on my zookeeper like:
!connect jdbc:drill:zk=localhost:2181
My question here is, after I make storage plugin and when I connect to Drill with zk, can I query hdfs file?
I'm very sorry if this is a noob question but I haven't find anything useful on internet or at least it haven't helped me. If you are able to explain me some stuff, I'll be very grateful.
As per Drill docs,
{
"type" : "file",
"enabled" : true,
"connection" : "hdfs://10.10.30.156:8020/",
"workspaces" : {
"root" : {
"location" : "/user/root/drill",
"writable" : true,
"defaultInputFormat" : null
}
},
"formats" : {
"json" : {
"type" : "json"
}
}
}
In "connection"
,
put namenode server address.
If you are not sure about this address.
Check fs.default.name
or fs.defaultFS
properties in core-site.xml
.
Coming to "workspaces"
,
you can save workspaces in this. In the above example, there is a workspace
with name root
and location /user/root/drill
.
This is your HDFS location.
If you have files under /user/root/drill
hdfs directory, you can query them using this workspace name.
Example: abc
is under this directory.
select * from dfs.root.`abc.csv`
After successfully creating the plugin, you can start drill and start querying .
You can query any directory irrespective to workspaces.
Say you want to query employee.json
in /tmp/data
hdfs directory.
Query is :
select * from dfs.`/tmp/data/employee.json`