apache · hadoop · hdfs · apache-drill · hadoop-plugins

Making a storage plugin in Apache Drill for HDFS


I'm trying to make a storage plugin for Hadoop (HDFS) in Apache Drill. I'm confused about what port to set for the hdfs:// connection, and what to set as the location. This is my plugin:

    {
      "type": "file",
      "enabled": true,
      "connection": "hdfs://localhost:54310",
      "workspaces": {
        "root": {
          "location": "/",
          "writable": false,
          "defaultInputFormat": null
        },
        "tmp": {
          "location": "/tmp",
          "writable": true,
          "defaultInputFormat": null
        }
      },
      "formats": {
        "psv": {
          "type": "text",
          "extensions": ["tbl"],
          "delimiter": "|"
        },
        "csv": {
          "type": "text",
          "extensions": ["csv"],
          "delimiter": ","
        },
        "tsv": {
          "type": "text",
          "extensions": ["tsv"],
          "delimiter": "\t"
        },
        "parquet": {
          "type": "parquet"
        },
        "json": {
          "type": "json"
        },
        "avro": {
          "type": "avro"
        }
      }
    }

So, is it correct to set localhost:54310? I got that with the command:

    hdfs getconf -nnRpcAddresses

or should it be :8020?

Second question: what do I need to set as the location? My Hadoop folder is at:

    /usr/local/hadoop

and in there you can find /etc, /bin, /lib, /log and so on. So, do I need to point the location at my datanode, or somewhere else?

Third question. When I connect to Drill, I go through sqlline and then connect to my ZooKeeper like this:

  !connect jdbc:drill:zk=localhost:2181 

My question here is: after I make the storage plugin and connect to Drill with zk, can I query HDFS files?

I'm very sorry if this is a noob question, but I haven't found anything useful on the internet, or at least it hasn't helped me. If you can explain some of this to me, I'll be very grateful.


Solution

  • As per the Drill docs, a file storage plugin for HDFS looks like this:

      {
        "type" : "file",
        "enabled" : true,
        "connection" : "hdfs://10.10.30.156:8020/",
        "workspaces" : {
          "root" : {
            "location" : "/user/root/drill",
            "writable" : true,
            "defaultInputFormat" : null
          }
        },
        "formats" : {
          "json" : {
            "type" : "json"
          }
        }
      }
    

    In "connection",

    put namenode server address.

    If you are not sure about this address. Check fs.default.name or fs.defaultFS properties in core-site.xml.
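    For example, a quick way to read that value from the command line (a sketch, assuming the Hadoop client tools are on your PATH and a typical /usr/local/hadoop install, as in the question):

     # Ask the Hadoop client for the configured default filesystem URI
     hdfs getconf -confKey fs.defaultFS

     # Or inspect core-site.xml directly
     grep -A1 'fs.defaultFS' /usr/local/hadoop/etc/hadoop/core-site.xml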

    Coming to "workspaces": you can define workspaces here. In the above example, there is a workspace named root with the location /user/root/drill. That location is an HDFS path.

    If you have files under the /user/root/drill HDFS directory, you can query them using this workspace name.

    Example: abc.csv is under this directory.

     select * from dfs.root.`abc.csv`
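
    Equivalently, you can switch to the workspace first with USE and then refer to the file by its relative name (a minimal sketch, using the same hypothetical abc.csv):

     USE dfs.root;
     SELECT * FROM `abc.csv` LIMIT 10;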
    

    After successfully creating the plugin, you can start Drill and begin querying.

    You can also query any HDFS directory directly, irrespective of workspaces.

    Say you want to query employee.json in the /tmp/data HDFS directory.

    The query is:

    select * from dfs.`/tmp/data/employee.json`
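
    Putting it together, a complete sqlline session might look like this (a sketch; it assumes sqlline is run from the Drill install directory and ZooKeeper is on localhost:2181, as in the question). The first query goes through the root workspace; the second addresses an absolute HDFS path directly:

     $ bin/sqlline -u jdbc:drill:zk=localhost:2181
     0: jdbc:drill:zk=localhost:2181> SELECT * FROM dfs.root.`abc.csv` LIMIT 5;
     0: jdbc:drill:zk=localhost:2181> SELECT * FROM dfs.`/tmp/data/employee.json`;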