Integrating Spark SQL and Apache Drill through JDBC
I would like to create a Spark SQL DataFrame from the results of a query performed over CSV data (on HDFS) with Apache Drill. I successfully configured Spark SQL to make it connect to Drill via JDBC:

Map<String, String> connectionOptions = new HashMap<String, String>();
connectionOptions.put("url", args[0]);
connectionOptions.put("dbtable", args[1]);
connectionOptions.put("driver", "org.apache.drill.jdbc.Driver");

DataFrame logs = sqlc.read().format("jdbc").options(connectionOptions).load();

Spark SQL performs two queries: the first one to get the schema, and the second one to retrieve the actual data:

SELECT * FROM (SELECT * FROM dfs.output.`my_view`) WHERE 1=0

SELECT "field1","field2","field3" FROM (SELECT * FROM dfs.output.`my_view`)

The first query succeeds, but in the second one Spark encloses the field names in double quotes, which Drill does not support (it expects backtick-quoted or unquoted identifiers), so the query fails.

Has anyone managed to get this integration working?

Thank you!

Freeze answered 18/2, 2016 at 8:15 Comment(0)
You can add a custom JdbcDialect for this and register it before using the JDBC connector:

import org.apache.spark.sql.jdbc.JdbcDialect

case object DrillDialect extends JdbcDialect {

  // Apply this dialect to any Drill JDBC URL
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:drill:")

  // Drill rejects ANSI double-quoted identifiers, so leave column names unquoted
  override def quoteIdentifier(colName: String): String = colName
}

JdbcDialects.registerDialect(DrillDialect)
Dropout answered 27/5, 2016 at 17:13 Comment(3)
This looks like Scala but OP's question was in Java. – Ferwerda
True, I nevertheless accept the answer as it points me to JdbcDialect. Thanks! – Freeze
Can you please post a PySpark version of the code? – Basanite
This is how the accepted answer code looks in Java:

import org.apache.spark.sql.jdbc.JdbcDialect;

public class DrillDialect extends JdbcDialect {

  // Drill rejects ANSI double-quoted identifiers, so leave column names unquoted
  @Override
  public String quoteIdentifier(String colName) {
    return colName;
  }

  // Apply this dialect to any Drill JDBC URL
  @Override
  public boolean canHandle(String url) {
    return url.startsWith("jdbc:drill:");
  }
}

Before creating the SparkSession, register the dialect:

import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.jdbc.JdbcDialects;

public static void main(String[] args) {
    JdbcDialects.registerDialect(new DrillDialect());
    SparkSession spark = SparkSession
      .builder()
      .appName("Drill Dialect")
      .getOrCreate();

     //More Spark code here..

    spark.stop();
}

Tried and tested with Spark 2.3.2 and Drill 1.16.0. Hope it helps you too!
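Returning the column name unchanged works as long as the names need no escaping. Since Drill quotes identifiers with backticks (as in dfs.output.`my_view` above), a more defensive quoteIdentifier could backtick-quote instead of leaving names bare. A standalone sketch of the two quoting styles (plain Java, no Spark dependency; the helper names are illustrative, not part of any API):

```java
public class QuotingSketch {
  // ANSI-style double quoting, roughly what Spark's default dialect emits
  static String ansiQuote(String colName) {
    return "\"" + colName + "\"";
  }

  // Backtick quoting, which Drill's SQL parser accepts;
  // embedded backticks are escaped by doubling them
  static String drillQuote(String colName) {
    return "`" + colName.replace("`", "``") + "`";
  }

  public static void main(String[] args) {
    System.out.println(ansiQuote("field1"));  // "field1" -> rejected by Drill
    System.out.println(drillQuote("field1")); // `field1` -> accepted by Drill
  }
}
```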

Maquette answered 22/4, 2021 at 10:26 Comment(0)
