How to load table from SQLite from PySpark?
I am trying to load a table from an SQLite .db file stored on a local disk. Is there any way to do this in PySpark?

My solution works, but it is not elegant. I read the table using Pandas through sqlite3. Schema information is not passed along (which may or may not be a problem). Is there a direct way to load the table without going through Pandas?

import sqlite3
import pandas as pd

db_path = 'alocalfile.db'
query = 'SELECT * FROM ATableToLoad'

conn = sqlite3.connect(db_path)
a_pandas_df = pd.read_sql_query(query, conn)

# sqlContext is the SQLContext instance from the PySpark shell
a_spark_df = sqlContext.createDataFrame(a_pandas_df)

I have not figured out how to do this with JDBC in PySpark.
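For reference, SQLite does expose each column's declared type via `PRAGMA table_info`, which could be mapped to a Spark schema by hand. A sketch, assuming the table name from the snippet above; the type mapping is illustrative, not exhaustive:

```python
import sqlite3

def sqlite_columns(db_path, table):
    """Return [(column_name, declared_type), ...] for a SQLite table."""
    conn = sqlite3.connect(db_path)
    try:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk)
        return [(row[1], row[2]) for row in
                conn.execute('PRAGMA table_info("{}")'.format(table))]
    finally:
        conn.close()

# Illustrative mapping from common SQLite declared types to Spark SQL type
# names; adjust for your own columns.
SQLITE_TO_SPARK = {
    'INTEGER': 'long',
    'REAL': 'double',
    'TEXT': 'string',
    'BLOB': 'binary',
}
```

One could then build a `StructType` from these pairs and pass it as the second argument to `createDataFrame`, so the Spark types are explicit rather than inferred from Pandas.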

Quince answered 16/8, 2016 at 22:16 Comment(1)
What schema information? You mean datatypes? Not that sqlite really has them...Extractor

First, you need to start pyspark with the JDBC driver jar on the classpath. Download the SQLite JDBC driver and pass its jar path in the command below: https://bitbucket.org/xerial/sqlite-jdbc/downloads/sqlite-jdbc-3.8.6.jar

pyspark --conf spark.executor.extraClassPath=<jdbc.jar> --driver-class-path <jdbc.jar> --jars <jdbc.jar> --master <master-URL>

For an explanation of the above pyspark command, see this post:

Apache Spark : JDBC connection not working

Now, to read the SQLite database file, simply load it into a Spark DataFrame:

df = sqlContext.read.format('jdbc') \
    .options(url='jdbc:sqlite:Chinook_Sqlite.sqlite',
             dbtable='employee', driver='org.sqlite.JDBC') \
    .load()

Call df.printSchema() to see your schema.
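As a side note, Spark's JDBC reader also accepts a parenthesized subquery with an alias in place of a bare table name in `dbtable`, so filtering can be pushed down to SQLite instead of being done in Spark. A sketch of the options, reusing the URL and table above (the WHERE clause is a made-up example, and the read itself still needs the JDBC jar configured as described):

```python
# dbtable may be a subquery with an alias instead of a table name,
# so SQLite evaluates the filter before rows ever reach Spark.
jdbc_options = {
    'url': 'jdbc:sqlite:Chinook_Sqlite.sqlite',
    'driver': 'org.sqlite.JDBC',
    # Hypothetical filter for illustration only:
    'dbtable': "(SELECT * FROM employee WHERE Title = 'Manager') AS t",
}
# df = sqlContext.read.format('jdbc').options(**jdbc_options).load()
```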

Full code: https://github.com/charles2588/bluemixsparknotebooks/blob/master/Python/sqllite_jdbc_bluemix.ipynb

Thanks, Charles.

Glori answered 17/8, 2016 at 16:51 Comment(0)

Based on @charles gomes's answer:

from pyspark.sql import SparkSession

spark = SparkSession.builder\
           .config('spark.jars.packages', 'org.xerial:sqlite-jdbc:3.34.0')\
           .getOrCreate()

df = spark.read.format('jdbc') \
        .options(driver='org.sqlite.JDBC', dbtable='my_table',
                 url='jdbc:sqlite:/my/path/alocalfile.db')\
        .load()

For other JAR versions, please refer to the Maven Repository.
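Unrelated to the Spark side, a quick way to double-check the file path and table name before the JDBC read is the stdlib sqlite3 module. A small sketch, assuming only that the file is a valid SQLite database:

```python
import sqlite3

def list_tables(db_path):
    """Return the table names stored in a SQLite database file."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
        return [r[0] for r in rows]
    finally:
        conn.close()
```

If the table you pass as `dbtable` is not in this list, the JDBC read will fail with a driver error, so this is a cheap sanity check.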

Decompose answered 9/4, 2021 at 10:35 Comment(0)

Just an alternative way of doing the above:

  1. Download the sqlite-jdbc.jar file.

  2. Set up SparkConf:

    SQLITE_JAR_PATH = "<jdbc_jar_path>"
    from pyspark import SparkContext, SparkConf
    from pyspark.sql import SparkSession
    
    conf = SparkConf()
    conf.set("spark.jars", SQLITE_JAR_PATH)
    
    spark = SparkSession.builder \
        .config(conf=conf) \
        .appName("Sample") \
        .getOrCreate()
    
  3. Now, read the table:

    db_url = "jdbc:sqlite:<.db_file_path>"  # replace <.db_file_path> with your .db file path
    db_table_name = 'employee'
    
    df = spark.read.format('jdbc') \
        .options(driver='org.sqlite.JDBC', dbtable=db_table_name, url=db_url) \
        .load()
    
    df.show()  # To display the DataFrame content
    
Guilder answered 9/7, 2024 at 17:34 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.