Why does Spark append 'WHERE 1=0' to the end of a SQL query?
I am trying to execute a simple MySQL query using Apache Spark and create a data frame. But for some reason Spark appends 'WHERE 1=0' to the query I want to execute and throws an exception stating 'You have an error in your SQL syntax'.

val spark = SparkSession.builder.master("local[*]").appName("rddjoin").getOrCreate()
val mhost = "jdbc:mysql://localhost:3306/registry"
val mprop = new java.util.Properties
mprop.setProperty("driver", "com.mysql.jdbc.Driver")
mprop.setProperty("user", "root")
mprop.setProperty("password", "root")
val q = """select id from loaded_item"""
val res = spark.read.jdbc(mhost, q, mprop)
res.show(10)

The exception is as follows:

18/02/16 17:53:49 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
Exception in thread "main" com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'select id from loaded_item WHERE 1=0' at line 1
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at com.mysql.jdbc.Util.handleNewInstance(Util.java:425)
    at com.mysql.jdbc.Util.getInstance(Util.java:408)
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:944)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3973)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3909)
    at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2527)
    at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2680)
    at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2484)
    at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1858)
    at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:1966)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:62)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:114)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:52)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:307)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
    at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:193)
    at GenerateReport$.main(GenerateReport.scala:46)
    at GenerateReport.main(GenerateReport.scala)
18/02/16 17:53:50 INFO SparkContext: Invoking stop() from shutdown hook
Pumpernickel answered 16/2, 2018 at 12:42
The second parameter of your call to spark.read.jdbc is not correct. Instead of specifying a raw SQL query, you should pass either a table name qualified with the schema, or a valid SQL query enclosed in parentheses with an alias. In your case this would be val q = "registry.loaded_item". Another option, if you want to provide additional parameters (for example a WHERE clause), is to use the other overloads of DataFrameReader.jdbc.
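A minimal sketch of both working variants, reusing mhost and mprop from the question (the registry schema qualifier and the t alias are assumptions):

// Variant 1: a schema-qualified table name.
val tableDf = spark.read.jdbc(mhost, "registry.loaded_item", mprop)

// Variant 2: a query wrapped in parentheses and aliased,
// so the JDBC source can treat it as a derived table.
val q = "(select id from registry.loaded_item) AS t"
val queryDf = spark.read.jdbc(mhost, q, mprop)
queryDf.show(10)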

By the way, the reason you see the strange-looking query with WHERE 1=0 is that Spark tries to infer the schema of your data frame without loading any actual data. This query is guaranteed to return no rows, but the metadata of its result set gives Spark the schema information it needs.
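To illustrate the technique itself (a plain-JDBC sketch, not Spark's actual code path), the same metadata-only probe can be reproduced like this, reusing the connection settings from the question:

import java.sql.DriverManager

// The WHERE 1=0 predicate guarantees an empty result set, but the
// driver still returns ResultSetMetaData describing the columns.
val conn = DriverManager.getConnection(mhost, mprop)
try {
  val rs = conn.createStatement()
    .executeQuery("SELECT * FROM registry.loaded_item WHERE 1=0")
  val meta = rs.getMetaData
  for (i <- 1 to meta.getColumnCount)
    println(s"${meta.getColumnName(i)}: ${meta.getColumnTypeName(i)}")
} finally {
  conn.close()
}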

Ive answered 16/2, 2018 at 16:5
Thanks Werner! That explains a lot. Actually I am trying to execute a big query that involves this table: select distinct i.id from asset a JOIN item i ON a.latest_item_id = i.id JOIN loaded_item li ON li.id = i.id INNER JOIN jrn_article ja ON i.id = ja.id INNER JOIN publication p ON a.publication_id = p.id where li.dead = false and a.dead = false and (p.id is null or p.dead = false) and a.item_type = :itemType and a.origin_id = :originId and li.item_status IN ('DELIVER_PASS'). That's the query. Not really sure how I should go about creating a dataframe for this; any help appreciated. – Pumpernickel
You have three options: 1. create a view in your database and then load the view as described in the answer; 2. create Spark dataframes for all tables that are part of your query (asset, item, loaded_item, ...) and then use dataframe logic to get the desired results; 3. hide your query behind a subquery as described here. – Ive
A bit of a correction: that 2nd parameter (dbtable) can be either a table name qualified with the schema or a valid SQL query, as documented here. However, if it is a query, it has to be enclosed in parentheses and aliased, like this: "(select id from registry.loaded_item) AS sql". – Twobyfour
@Twobyfour Thanks for the hint. I have updated my answer. – Ive
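Putting the comments together, the big query from the follow-up could be loaded with the subquery trick; a sketch, assuming the named parameters :itemType and :originId are substituted with concrete values first (hypothetical stand-ins shown below):

// 'ARTICLE' and 42 are hypothetical stand-ins for the :itemType
// and :originId parameters from the original query.
val bigQuery =
  """(select distinct i.id
    |   from asset a
    |   join item i on a.latest_item_id = i.id
    |   join loaded_item li on li.id = i.id
    |   inner join jrn_article ja on i.id = ja.id
    |   inner join publication p on a.publication_id = p.id
    |  where li.dead = false
    |    and a.dead = false
    |    and (p.id is null or p.dead = false)
    |    and a.item_type = 'ARTICLE'
    |    and a.origin_id = 42
    |    and li.item_status in ('DELIVER_PASS')) AS big""".stripMargin

val bigDf = spark.read.jdbc(mhost, bigQuery, mprop)
bigDf.show()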
This is how I got a result (using PySpark):

mysql_url = "jdbc:mysql://mysqlhost:3306/dbname"
mysql_username = "dbUser"
mysql_password = "dbPass"
mysql_properties = {
  "user": mysql_username,
  "password": mysql_password,
  "driver": "com.mysql.cj.jdbc.Driver"  # Connector/J 8.x driver class
}

# Execute a select query: wrap it in parentheses and give it an alias,
# so Spark can treat it as a derived table.
query = "(select * from table limit 1) as tabletemp"
df = spark.read.jdbc(url=mysql_url, table=query, properties=mysql_properties)
df.show()
Danczyk answered 10/4 at 13:44
