Run a SQL query on a PySpark DataFrame
I am using Databricks and I have already loaded some data tables.

However, I have a complex SQL query that I want to run against these data tables, and I wonder if I could avoid translating it into PySpark.

Is that possible?

To give an example:

In SQL:

with table2 as (
   select column1, column1
   from database.table1
   where
         start_date <= DATE '2019-03-01' and
         end_date >= DATE '2019-03-31'
)

In PySpark I would already have table1 loaded, but the following does not work because it cannot find table1.

query = "(
    select column1, column1
           from table1
               where
         start_date <= DATE '2019-03-01' and
         end_date >= DATE '2019-03-31'            )"
table2 = spark.sql(query)

Thanks

Conductance answered 7/8, 2019 at 10:43 Comment(1)
Did you register it as a temp table in PySpark? Or is it from Hive? – Linkage
Try using database_name.table_name instead of just table_name in the query.

query = "(
    select column1, column1
           from *database_name.table_name* 
               where
         start_date <= DATE '2019-03-01' and
         end_date >= DATE '2019-03-31' )" 

If you are using PySpark, then run it with:

spark.sql(query)
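
If the qualified name still cannot be resolved, it can help to first confirm that the table is actually registered in the metastore. A minimal sketch, assuming database_name.table_name is only a placeholder for your own table:

# Placeholder names: swap in your own database_name / table_name.
# List what the metastore exposes in that database.
spark.sql("SHOW TABLES IN database_name").show()

# Or check for one specific table (available since Spark 3.3).
print(spark.catalog.tableExists("database_name.table_name"))

# If the table shows up, the qualified name resolves inside spark.sql:
table2 = spark.sql(query)
table2.show(5)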
Kobe answered 22/9, 2019 at 12:11 Comment(2)
Does not work; the query is not even a valid Python string. – Suspense
This doesn't answer the question. OP said to assume that a DataFrame table1 exists in the Python session. Your response just queries a database table. – Elda
  • Since Spark 3.4, SparkSession.sql supports parameterized SQL, so you can write it like this:
# spark -> your SparkSession object

table1 = spark.sql('''
select column1, column1
from database.table1
where
    start_date <= DATE '2019-03-01' 
    and end_date >= DATE '2019-03-31'
'''
)

# pass table1 as a keyword argument to `.sql` and reference it in the query as {tb}.
table2 = spark.sql('''
select column1, column1
from {tb}
    where start_date <= DATE '2019-03-01' 
    and end_date >= DATE '2019-03-31'
''', tb=table1)
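
The parameterized form of spark.sql can also bind scalar values, such as the two date bounds, through named parameter markers (:name) and the args argument; plain Python objects such as datetime.date are accepted here from Spark 3.5 onward (in 3.4 the values must be strings containing SQL literals). A minimal sketch combining this with the DataFrame reference above; the parameter names start and end are just illustrative:

import datetime

# Sketch: bind the dates via `args` (named parameters) and the DataFrame via
# a keyword argument; `start` and `end` are arbitrary parameter names.
table2 = spark.sql('''
select column1, column1
from {tb}
where start_date <= :start
  and end_date >= :end
''', args={"start": datetime.date(2019, 3, 1),
           "end": datetime.date(2019, 3, 31)},
    tb=table1)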
  • Before Spark 3.4 it is slightly more involved, but still easy: register the DataFrame as a temporary view and query it by name:
table1 = spark.sql('''
select column1, column1
from database.table1
where
    start_date <= DATE '2019-03-01' 
    and end_date >= DATE '2019-03-31'
'''
)

# create a temporary view
table1.createOrReplaceTempView('view_table1')

# now you can refer to the view by name
table2 = spark.sql('''
select column1, column1
from view_table1
    where start_date <= DATE '2019-03-01' 
    and end_date >= DATE '2019-03-31'
''')

Pernas answered 1/12, 2023 at 7:7 Comment(0)
