Tags: hadoop, jdbc, apache-spark, apache-spark-sql, apache-drill

Integrating Spark SQL and Apache Drill through JDBC


I would like to create a Spark SQL DataFrame from the results of a query run over CSV data (on HDFS) with Apache Drill. I successfully configured Spark SQL to connect to Drill via JDBC:

// args[0] is the Drill JDBC URL, args[1] the table/view to wrap
Map<String, String> connectionOptions = new HashMap<String, String>();
connectionOptions.put("url", args[0]);
connectionOptions.put("dbtable", args[1]);
connectionOptions.put("driver", "org.apache.drill.jdbc.Driver");

// sqlc is an org.apache.spark.sql.SQLContext
DataFrame logs = sqlc.read().format("jdbc").options(connectionOptions).load();

Spark SQL performs two queries: the first one to get the schema, and the second one to retrieve the actual data:

SELECT * FROM (SELECT * FROM dfs.output.`my_view`) WHERE 1=0

SELECT "field1","field2","field3" FROM (SELECT * FROM dfs.output.`my_view`)

The first query succeeds, but in the second one Spark encloses the field names in double quotes, which Drill doesn't support, so the query fails.
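To make the failure concrete, here is a minimal plain-Scala sketch of how an identifier-quoting function shapes the column list Spark generates for the second query. The helpers `defaultQuote`, `passThrough`, and `selectList` are illustrative names, not Spark API:

```scala
// Spark's default dialect wraps identifiers in ANSI double quotes.
def defaultQuote(col: String): String = "\"" + col + "\""

// Drill (by default) does not accept double-quoted identifiers,
// so a pass-through leaves the names bare.
def passThrough(col: String): String = col

// Build the SELECT list the way the JDBC source does: quote each
// column name, then join with commas.
def selectList(cols: Seq[String], quote: String => String): String =
  cols.map(quote).mkString(",")

val cols = Seq("field1", "field2", "field3")
println(selectList(cols, defaultQuote)) // "field1","field2","field3"  (rejected by Drill)
println(selectList(cols, passThrough))  // field1,field2,field3        (accepted)
```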

Has anyone managed to get this integration working?

Thank you!


Solution

  • You can fix this by defining a custom JDBC dialect and registering it before using the JDBC connector:

    import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
    
    case object DrillDialect extends JdbcDialect {
    
      // Apply this dialect to all Drill JDBC connections
      def canHandle(url: String): Boolean = url.startsWith("jdbc:drill:")
    
      // Spark's default dialect wraps identifiers in double quotes,
      // which Drill rejects; pass column names through unchanged instead.
      override def quoteIdentifier(colName: String): String = colName
    }
    
    JdbcDialects.registerDialect(DrillDialect)
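A variant worth considering (my own suggestion, not part of the original answer): Drill's default identifier quote character is the backtick, so instead of passing names through unquoted, `quoteIdentifier` could wrap them in backticks, which also protects column names that collide with SQL reserved words. As a standalone sketch of what that override body would return:

```scala
// Alternative quoteIdentifier body: wrap the name in backticks,
// Drill's default identifier quote character. Shown as a pure
// function here; in the dialect it would replace the pass-through.
def backtickQuote(colName: String): String = "`" + colName + "`"

println(backtickQuote("field1")) // `field1`
```

Either way, make sure `JdbcDialects.registerDialect(DrillDialect)` runs before the first `read().format("jdbc")...load()` call, so the dialect is already registered when Spark generates the queries.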