Unzip folder stored in Azure Databricks FileStore

I've uploaded a *.zip file to my Azure Databricks FileStore:

[screenshot of the uploaded zip in FileStore]

Now I would like to unzip it and store the contents at dbfs:/FileStore/tables/rfc_model.

I know this should be easy, but I get confused working in Databricks notebooks...

Thank you for your help!

UPDATE:

I've tried these commands with no success:

%sh unzip /FileStore/tables/rfc_model.zip

and

%sh unzip dbfs:/FileStore/tables/rfc_model.zip

UPDATE:

I've copied the code created by @Sim into my Databricks notebook but this error appears:

[screenshot of the error message]

Any idea how to fix this?

Excruciation answered 16/1, 2020 at 9:49 Comment(2)
what did you try?Doordie
in python you can use Zipfile36 for this as well.Chiccory

When you use %sh you are executing shell commands on the driver node using its local filesystem. However, /FileStore/ is not in the local filesystem, which is why you are experiencing the problem. You can see that by trying:

%sh ls /FileStore
# ls: cannot access '/FileStore': No such file or directory

vs.

dbutils.fs.ls("/FileStore")
// resX: Seq[com.databricks.backend.daemon.dbutils.FileInfo] = WrappedArray(...)

You either have to use an unzip utility that can work with the Databricks file system, or copy the zip from the FileStore to the driver's local disk, unzip it there, and then copy the result back to /FileStore.

You can address the local file system using file:/..., e.g.,

dbutils.fs.cp("/FileStore/file.zip", "file:/tmp/file.zip")
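
Putting the two together, a minimal round trip might look like this in a Scala notebook cell (just a sketch; the /tmp paths and the rfc_model name are assumptions taken from the question):

// 1. copy the zip from DBFS to the driver's local disk
dbutils.fs.cp("/FileStore/tables/rfc_model.zip", "file:/tmp/rfc_model.zip")

// 2. unzip on the driver's local disk (shelling out from Scala)
import sys.process._
"unzip -o /tmp/rfc_model.zip -d /tmp/rfc_model".!!

// 3. copy the unzipped folder back into DBFS
dbutils.fs.cp("file:/tmp/rfc_model", "/FileStore/tables/rfc_model", recurse = true)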

Hope this helps.

Side note 1: Databricks file system management is not super intuitive, especially when it comes to the file store. For example, in theory, the Databricks file system (DBFS) is mounted locally as /dbfs/. However, /dbfs/FileStore does not address the file store, while dbfs:/FileStore does. You are not alone. :)

Side note 2: if you need to do this for many files, you can distribute the work to the cluster workers by creating a Dataset[String] with the file paths and then ds.map { name => ... }.collect(). The collect action will force execution. In the body of the map function you will have to use shell APIs instead of %sh.
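
For example, a rough sketch of that pattern (the zip paths and the shell call are illustrative assumptions; each zip must be reachable from the workers, e.g. via the /dbfs mount):

import spark.implicits._
import sys.process._

// hypothetical list of zip files on storage that the workers can reach
val zipPaths = Seq("/dbfs/mnt/data/a.zip", "/dbfs/mnt/data/b.zip").toDS()

// each task shells out to `unzip` on whichever worker it runs on
val results = zipPaths.map { path =>
  val target = path.stripSuffix(".zip")
  val exitCode = s"unzip -o $path -d $target".!
  (path, exitCode)
}.collect()   // collect() forces execution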

Side note 3: a while back I used the following Scala utility to unzip on Databricks. Can't verify it still works but it could give you some ideas.

  import java.io.{FileInputStream, FileOutputStream}
  import java.util.zip.ZipInputStream

  def unzipFile(zipPath: String, outPath: String): Unit = {
    val fis = new FileInputStream(zipPath)
    val zis = new ZipInputStream(fis)
    val filePattern = """(.*/)?(.*)""".r
    println("Unzipping...")
    Stream.continually(zis.getNextEntry).takeWhile(_ != null).foreach { file =>
      // @todo need a consistent path handling abstraction
      //       to address DBFS mounting idiosyncrasies
      // directory part of the entry name ("" for entries at the archive root)
      val dirPart = Option(filePattern.findAllMatchIn(file.getName).next().group(1)).getOrElse("")
      val dirToCreate = outPath.replaceAll("/dbfs", "") + dirPart
      dbutils.fs.mkdirs(dirToCreate)
      val filename = outPath + file.getName
      if (!filename.endsWith("/")) {
        println(s"FILE: ${file.getName} to $filename")
        val fout = new FileOutputStream(filename)
        val buffer = new Array[Byte](1024)
        Stream.continually(zis.read(buffer)).takeWhile(_ != -1).foreach(fout.write(buffer, 0, _))
        fout.close()
      }
    }
    zis.close()
  }
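
For what it's worth, a hypothetical call would look something like this (the paths are assumptions; note the trailing slash on outPath, since the function concatenates it directly with entry names):

unzipFile("/tmp/rfc_model.zip", "/dbfs/mnt/models/rfc_model/")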
Fighter answered 21/1, 2020 at 6:0 Comment(4)
One question: When you copy a folder with subfolders to DBFS it literally moves everything in one folder without my schema. How would you do this kind of copying?Excruciation
@Excruciation I am not sure what you mean by "without my schema". Files/folders don't have a schema. Schema is typically associated with Spark reading data from files/folders.Fighter
I was referring to file structure, like Folder - Subfolders - Files, when I drag and drop Folder with subfolders to DBFS it is not stored as original structure.Excruciation
I never rely on UI for critical data operations as UI behavior can change outside of my control. I'd recommend using code for any data migrations you have to do. Azure storage can be mounted in DBFS to /mnt/.... dbutils.fs.cp(from, to, recurse = true) will preserve folder structure... but it does all the work from the driver so it can be slow. Internally, we have utilities that can move/copy directories using cluster workers, e.g., you can distribute a list of directories or from/to paths in a dataset and call dbutils.fs.* from a flatMap() operation.Fighter

This works:

%sh
unzip /dbfs/FileStore/tables/rfc_model.zip

If needed, copy the results to the destination in DBFS (use -r, since the unzipped rfc_model is a directory):

%sh
cp -r rfc_model /dbfs/FileStore/tables
Animalcule answered 7/5, 2020 at 12:2 Comment(0)

This works:

%sh unzip /dbfs/FileStore/tables/rfc_model.zip -d /dbfs/FileStore/tables/rfc_model
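
The -d flag tells unzip where to extract, so this writes the archive contents straight into a new rfc_model folder through the /dbfs FUSE mount. A quick sanity check from a notebook cell might be (just a sketch):

display(dbutils.fs.ls("/FileStore/tables/rfc_model"))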
Mayst answered 3/1, 2023 at 18:47 Comment(1)
As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center.Openair

You can also use Python's zipfile module:

import zipfile

# target unzipped location
FileLocation = '/dbfs/FileStore/Unzipped'

# open the archive and extract everything into the target folder;
# the with block closes the zip file when done
with zipfile.ZipFile('/dbfs/FileStore/[pathtozipfile].zip') as zObj:
    zObj.extractall(FileLocation)
Wiper answered 15/8 at 10:26 Comment(0)
