Reading zip file into Apache Spark dataframe
Using Apache Spark (PySpark) I can read/load a text file into a Spark DataFrame and load that DataFrame into a SQL database, as follows:

df = spark.read.csv("MyFilePath/MyDataFile.txt", sep="|", header="true", inferSchema="true")
df.show()
.............
#load df into an SQL table
df.write(.....)

Question: How can we achieve the same if the data file is inside a zip file? The zip file contains only one text file, of size 6 GB.

Ottie answered 7/7, 2022 at 2:13 Comment(1)
see How to open/stream .zip files through Spark? – Akanke

I have created a sample dataset, employee.txt, inside a .zip archive, and used the pandas library to read the zip-compressed text file. There may be other approaches, but this is the simplest one.

Records: employee.txt

Name;dept;age
Ravi kumar;Data Science;29
Amitesh Kumar;QA;29
Rohit Kumar;Sales;29
Ahimanyu;java;29
# import required modules
import zipfile
import pandas as pd

# read the dataset using the compression zip
pdf = pd.read_csv(r'C:\Users\ravi\Documents\pyspark test\dataset\employee.zip', compression='zip', sep=';')

# create a Spark session and convert the pandas DataFrame to a Spark DataFrame
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("zip reader").getOrCreate()
sparkDF = spark.createDataFrame(pdf)
sparkDF.show()  # show() prints the DataFrame itself and returns None

#mysql connection details

driver = "com.mysql.jdbc.Driver"
url = "jdbc:mysql://127.0.0.1:3306/test"
user = "root"
pwd = "India@123"

# write the final output to the RDBMS
sparkDF.write.format("jdbc").option("driver", driver)\
    .option("url", url)\
    .option("dbtable", "employee")\
    .option("user", user)\
    .option("password", pwd)\
    .save()


Final Output:

+-------------+------------+---+
|         Name|        dept|age|
+-------------+------------+---+
|   Ravi kumar|Data Science| 29|
|Amitesh Kumar|          QA| 29|
|  Rohit Kumar|       Sales| 29|
|     Ahimanyu|        java| 29|
+-------------+------------+---+
Endocrinotherapy answered 7/7, 2022 at 4:2 Comment(3)
I am using Apache Spark; is there a workaround for this? The compression option of the spark.read.csv method does not seem to support zip. At the end of their page, you'll find: the compression value can be one of the known case-insensitive shortened names (none, bzip2, gzip, lz4, snappy and deflate). CSV built-in functions ignore this option. – Ottie
This is the best approach to read a zip file into a Spark DataFrame; otherwise you have to store the zip content in an RDD and then convert it into a DataFrame, since the DataFrame reader does not support zip compression. I tried sparkDF = spark.read.format("csv").option("delimiter", ";").option("compression", "zip").load(r'C:\Users\Admin\Documents\pyspark test\dataset\employee.zip') and it gives an error: "pyspark.sql.utils.IllegalArgumentException: Codec [zip] is not available. Known codecs are bzip2, deflate, uncompressed, lz4, gzip, snappy, none." The method above is the best approach; I recommend you try it. – Endocrinotherapy
As mentioned in my post, the data file is too big (6 GB). pandas is known to be slow (or error-prone) with huge datasets: in our experience, pandas either takes more than 24 hours just to load the data into the db or, more often than not, gives up partway through. Apache Spark, on the other hand, loads the same file into the same db in under 20 minutes. We're using Databricks and Azure SQL Db; files are stored in Azure Data Lake Storage Gen2. – Ottie

Spark supports the following compression formats:

  • bzip2
  • deflate
  • snappy
  • lz4
  • gzip

The compression format should be automatically detected, but you can specify it when you read the file with e.g. .option("compression", "snappy").

Thus, there is no native support for the zip format: you'll need to use one of the supported compression formats instead. Tools like 7-Zip can compress to the bzip2 and gzip formats. You can even compress to gzip from Java, e.g. GZIP compression to a byte array.
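For instance, a minimal sketch (file names are illustrative) of re-packing the single text file from the zip archive into gzip using only the Python standard library, since Spark reads gzip natively; the copy is streamed in chunks, so even a 6 GB member never has to fit in memory:

```python
import gzip
import shutil
import zipfile

def zip_to_gzip(zip_path, gzip_path):
    """Re-compress the first member of a zip archive as a gzip file."""
    with zipfile.ZipFile(zip_path) as zf:
        member = zf.namelist()[0]  # the archive holds a single text file
        with zf.open(member) as src, gzip.open(gzip_path, "wb") as dst:
            shutil.copyfileobj(src, dst)  # chunked copy, constant memory

# Hypothetical usage with the question's file:
# zip_to_gzip("MyFilePath/MyDataFile.zip", "MyFilePath/MyDataFile.txt.gz")
# df = spark.read.csv("MyFilePath/MyDataFile.txt.gz", sep="|",
#                     header=True, inferSchema=True)
```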

One alternative is to write a custom decompressor for the zip format, e.g. https://mcmap.net/q/1330715/-spark-scala-opening-zipped-csv-files
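A rough sketch of that custom-decompressor route in PySpark (names are illustrative, and each archive is inflated in full on one executor, so it only suits archives that fit in executor memory): read the raw zip bytes with sc.binaryFiles, inflate them with the standard-library zipfile module, and emit the text lines.

```python
import io
import zipfile

def unzip_lines(payload):
    """Decompress one (path, bytes) pair from sc.binaryFiles, yielding text lines."""
    _path, data = payload
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for name in zf.namelist():
            # Wrap the binary member stream so we can iterate over text lines
            for line in io.TextIOWrapper(zf.open(name), encoding="utf-8"):
                yield line.rstrip("\n")

# With a SparkContext `sc` and SparkSession `spark` available:
# rdd = sc.binaryFiles("MyFilePath/MyDataFile.zip").flatMap(unzip_lines)
# df = spark.read.csv(rdd, sep="|", header=True, inferSchema=True)
```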

Monotint answered 15/5 at 7:9 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.