How do I set "for fetch only" when querying IBM DB2 using the JDBC driver from Spark?

I have some code that queries a DB2 database. It works if I don't include "for fetch only", but returns an error if I do. I was wondering whether it's already being applied, or how I could set it.

connection_url = f"jdbc:db2://{host}:{port}/{database}:user={username};password={password};"

df = (spark
  .read
  .format("jdbc")
  .option("driver", "com.ibm.db2.jcc.DB2Driver")
  .option("url", connection_url)
  .option("query", query)
  .load())
return df

Error when I include for fetch only:

com.ibm.db2.jcc.am.SqlSyntaxErrorException: DB2 SQL Error: SQLCODE=-104, SQLSTATE=42601, SQLERRMC=for;

and the detailed traceback is:


/databricks/spark/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
    162             return self._df(self._jreader.load(self._spark._sc._jvm.PythonUtils.toSeq(path)))
    163         else:
--> 164             return self._df(self._jreader.load())
    165 
    166     def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,

/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1302 
   1303         answer = self.gateway_client.send_command(command)
-> 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)
   1306 

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    115     def deco(*a, **kw):
    116         try:
--> 117             return f(*a, **kw)
    118         except py4j.protocol.Py4JJavaError as e:
    119             converted = convert_exception(e.java_exception)

/databricks/spark/python/lib/py4j-0.10.9.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    324             value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325             if answer[1] == REFERENCE_TYPE:
--> 326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
    328                     format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling o4192.load.
: com.ibm.db2.jcc.am.SqlSyntaxErrorException: DB2 SQL Error: SQLCODE=-104, SQLSTATE=42601, SQLERRMC=for;
                 
;), DRIVER=4.25.13
    at com.ibm.db2.jcc.am.b6.a(b6.java:810)
    at com.ibm.db2.jcc.am.b6.a(b6.java:66)
    at com.ibm.db2.jcc.am.b6.a(b6.java:140)
    at com.ibm.db2.jcc.am.k3.c(k3.java:2824)
    at com.ibm.db2.jcc.am.k3.d(k3.java:2808)
    at com.ibm.db2.jcc.am.k3.a(k3.java:2234)
    at com.ibm.db2.jcc.am.k4.a(k4.java:8242)
    at com.ibm.db2.jcc.t4.ab.i(ab.java:206)
    at com.ibm.db2.jcc.t4.ab.b(ab.java:96)
    at com.ibm.db2.jcc.t4.p.a(p.java:32)
    at com.ibm.db2.jcc.t4.av.i(av.java:150)
    at com.ibm.db2.jcc.am.k3.al(k3.java:2203)
    at com.ibm.db2.jcc.am.k4.bq(k4.java:3730)
    at com.ibm.db2.jcc.am.k4.a(k4.java:4609)
    at com.ibm.db2.jcc.am.k4.b(k4.java:4182)
    at com.ibm.db2.jcc.am.k4.bd(k4.java:780)
    at com.ibm.db2.jcc.am.k4.executeQuery(k4.java:745)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.getQueryOutputSchema(JDBCRDD.scala:68)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:58)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:241)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:36)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:385)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:356)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:323)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:323)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:222)
    at sun.reflect.GeneratedMethodAccessor704.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
    at py4j.Gateway.invoke(Gateway.java:295)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:251)
    at java.lang.Thread.run(Thread.java:750)

I've searched IBM's documentation and Stack Overflow using every permutation of search terms I can think of, but this is eluding me.

I've also read documentation about setting the isolation level, since I get a similar failure when running queries with WITH UR, and figured that if I could work out why that fails, I'd understand why FOR FETCH ONLY fails (there's an answer here). It left things clear as mud, though: I couldn't use it to find an analogous solution for FOR FETCH ONLY.
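
For what it's worth, the nearest thing to a workaround I'm aware of for the WITH UR side is Spark's documented sessionInitStatement option, which runs a custom SQL statement right after each database session is opened. A sketch, with the caveat that SET CURRENT ISOLATION is a Db2 LUW statement and I haven't verified it against z/OS:

    # Sketch: set the isolation level once per session instead of appending
    # WITH UR to the query. sessionInitStatement is a documented Spark JDBC
    # reader option; whether SET CURRENT ISOLATION is accepted by Db2 for z/OS
    # is an unverified assumption.
    df = (spark
        .read
        .format("jdbc")
        .option("driver", "com.ibm.db2.jcc.DB2Driver")
        .option("url", connection_url)
        .option("sessionInitStatement", "SET CURRENT ISOLATION = UR")
        .option("query", "select id_number from myschema.mytable")
        .load())

But I haven't found an analogous out-of-band switch for FOR FETCH ONLY.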

edit: queries that run and don't run

Runs in DbVisualizer and pyspark

select
  id_number
from
  myschema.mytable
FETCH FIRST
  10 ROWS ONLY

another one

select
  id_number
from
  myschema.mytable

Runs in DbVisualizer but not in pyspark

select
  id_number
from
  myschema.mytable
FETCH FIRST
  10 ROWS ONLY FOR FETCH ONLY

another one

select
  id_number
from
  myschema.mytable
FOR FETCH ONLY

edit 2:

As an example, I run this code:

connection_url = f"jdbc:db2://{host}:{port}/{database}:user={username};password={password};"

df = (spark
  .read
  .format("jdbc")
  .option("driver", "com.ibm.db2.jcc.DB2Driver")
  .option("url", connection_url)
  .option("query", """
    select
      id_number
    from
      myschema.mytable
    FOR FETCH ONLY
  """)
  .load())
return df

It doesn't work. Then I run this code:

connection_url = f"jdbc:db2://{host}:{port}/{database}:user={username};password={password};"

df = (spark
  .read
  .format("jdbc")
  .option("driver", "com.ibm.db2.jcc.DB2Driver")
  .option("url", connection_url)
  .option("query", """
    select
      id_number
    from
      myschema.mytable
    -- FOR FETCH ONLY
  """)
  .load())
return df

It does work. I then went into DbVisualizer and verified that both versions of the query run there, so it's not a syntax error as far as I can tell.

DbVisualizer says the database major version is 12 and the minor is 1, and I believe it's z/OS. I'm using JDBC driver version 4.25.13 in both pyspark and DbVisualizer, downloaded from Maven here.

edit 3:

This query runs fine in DbVisualizer but fails in pyspark:

    select
      id_number
    from
      myschema.mytable
    FOR READ ONLY
Savdeep asked 30/1, 2023 at 16:35 — Comments (5):

Don't you think showing the actual query that you try to run might be helpful? — Abiding

I added some examples of code that run in DbVisualizer but don't run in pyspark. It's when I add "FOR FETCH ONLY" that a query stops working. I've tried a bunch of variations; it's those three words that cause the problem. "WITH UR" also doesn't work, but I don't use that flag, so it's not as big an issue as "FOR FETCH ONLY" not working. Does that help? — Savdeep

Are you sure you are connecting to the same database in both cases? What is the database version and platform? — Abiding

I've clarified in edit 2, since it's the queries. They point to the same database/connection; as far as I can tell that's the only difference, so it's not the wrong database. DbVisualizer says the database major version is 12 and the minor is 1, and I believe the OS is z/OS. — Savdeep

I believe that page says "Alternative syntax and synonyms: FOR FETCH ONLY can be specified as a synonym for FOR READ ONLY." I added an edit 3: when I do FOR READ ONLY, it works in DbVisualizer but not in pyspark. — Savdeep

Alright, I found out what's happening. tl;dr: Spark already does it.

The documentation here states:

A query that will be used to read data into Spark. The specified query will be parenthesized and used as a subquery in the FROM clause. Spark will also assign an alias to the subquery clause. As an example, spark will issue a query of the following form to the JDBC Source.

SELECT <columns> FROM (<user_specified_query>) spark_gen_alias

I'm fairly certain the relevant code is here:

    val sqlText = options.prepareQuery +
      s"SELECT $columnList FROM ${options.tableOrQuery} $myTableSampleClause" +
      s" $myWhereClause $getGroupByClause $getOrderByClause $myLimitClause $myOffsetClause"

So FOR FETCH ONLY ends up inside the subquery, and DB2 doesn't allow it there.
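
To make that concrete, here's a rough reconstruction in Python of the statement Spark ends up sending during schema resolution. The alias name and the WHERE 1=0 clause are my reading of the Spark source, not captured driver output:

    # Illustrative reconstruction: Spark parenthesizes and aliases the user query.
    user_query = "select id_number from myschema.mytable FOR FETCH ONLY"
    wrapped = f"SELECT * FROM ({user_query}) SPARK_GEN_SUBQ_0 WHERE 1=0"
    # DB2 rejects FOR FETCH ONLY inside a subselect, hence SQLCODE=-104 at "for".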

Fortunately, though, it looks like the CONCUR_READ_ONLY JDBC option is already set, which is equivalent to FOR READ ONLY per the documentation here:

JDBC setting                Db2® cursor setting        IBM Informix® cursor setting
CONCUR_READ_ONLY            FOR READ ONLY              FOR READ ONLY
CONCUR_UPDATABLE            FOR UPDATE                 FOR UPDATE
HOLD_CURSORS_OVER_COMMIT    WITH HOLD                  WITH HOLD
TYPE_FORWARD_ONLY           SCROLL not specified       SCROLL not specified
TYPE_SCROLL_INSENSITIVE     INSENSITIVE SCROLL         SCROLL
TYPE_SCROLL_SENSITIVE       SENSITIVE STATIC,          Not supported
                            SENSITIVE DYNAMIC, or
                            ASENSITIVE, depending on
                            the cursorSensitivity
                            Connection and
                            DataSource property

The relevant code in spark is:

    stmt = conn.prepareStatement(sqlText,
        ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)

from here

As a side note, even if it weren't specified explicitly in the code above, CONCUR_READ_ONLY is the default concurrency for a ResultSet in java.sql:

Concurrency                   Description
ResultSet.CONCUR_READ_ONLY    Creates a read-only result set. This is the default.
ResultSet.CONCUR_UPDATABLE    Creates an updateable result set.

source
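
So the practical upshot: drop FOR FETCH ONLY (or FOR READ ONLY) from the query text and let Spark's cursor settings do the work. Reusing my example from above:

    # No FOR FETCH ONLY needed: Spark already opens the result set with
    # TYPE_FORWARD_ONLY / CONCUR_READ_ONLY, which DB2 treats as FOR READ ONLY.
    df = (spark
        .read
        .format("jdbc")
        .option("driver", "com.ibm.db2.jcc.DB2Driver")
        .option("url", connection_url)
        .option("query", "select id_number from myschema.mytable")
        .load())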

Savdeep answered 3/2, 2023 at 0:46
