Delta Lake Table metadata
Where does Delta Lake store the table metadata? I am using Spark 2.6 (not Databricks) on my standalone machine. My assumption was that if I restart Spark, the tables created in Delta Lake would be dropped (I am trying this from a Jupyter notebook), but that is not the case.

Tight answered 21/7, 2020 at 10:22 Comment(7)
Where is the location you store the table? It will be stored on the storage/file system you specify. It won't be deleted automatically.Hauberk
@Hauberk I am trying it in my local Windows environment and all my tables are under C:/tmp. Even if I delete the folder, the table metadata information is kept.Tight
Yep. This is expected. I guess you are using external tables (using location in the create table statement). If so, you need to run drop table to delete the table from metastore (drop table doesn't delete the folder used by an external table), and also delete the table folder manually.Hauberk
@Hauberk Thanks for your comment. Yes, I am using "location" in the create table statement. One question: where is the table metadata (in this case, the list of available tables) stored, since I don't have Hive in my local Windows environment?Tight
If you don't configure Hive, it will use Derby, which provides the metastore functions using your local file system. By default, you should be able to see a spark-warehouse directory in your current working directory.Hauberk
@Hauberk Thanks a lot, that clarifies it.Tight
Cool. I will wrap up the above discussion into an answer.Hauberk

There are two types of tables in Apache Spark: external tables and managed tables. When a table is created with the LOCATION keyword in the CREATE TABLE statement, it is an external table. Otherwise, it is a managed table, and its data is stored under the directory specified by the Spark SQL configuration spark.sql.warehouse.dir, whose default value is the spark-warehouse directory in the current working directory.
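As a sketch, the difference shows up only in the CREATE TABLE statement; the table names and the path below are hypothetical:

```sql
-- Managed table: data files are placed under spark.sql.warehouse.dir
CREATE TABLE events_managed (id INT, ts TIMESTAMP) USING DELTA;

-- External table: data files are placed at the path given by LOCATION
CREATE TABLE events_external (id INT, ts TIMESTAMP) USING DELTA
LOCATION '/tmp/delta/events_external';
```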

Besides the data, Spark also needs to store the table metadata in the Hive Metastore, so that Spark knows where the data is when a user queries the table by name. The Hive Metastore is usually a database. If a user doesn't configure one, Spark uses an embedded database called Derby to store the table metadata on the local file system.

The DROP TABLE command behaves differently depending on the table type. For a managed table, DROP TABLE removes the table from the Hive Metastore and deletes the data. For an external table, DROP TABLE removes the table from the Hive Metastore but keeps the data on the file system. Hence, the data files of an external table need to be deleted from the file system manually by the user.
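For example, assuming a managed table t_managed and an external table t_external (hypothetical names):

```sql
DROP TABLE t_managed;   -- removed from the metastore, data files deleted
DROP TABLE t_external;  -- removed from the metastore, data files remain on disk
```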

Hauberk answered 24/7, 2020 at 0:56 Comment(0)

Delta Lake stores its table metadata in the _delta_log folder inside the table's own location. The table can additionally be registered in Hive, but that depends on your metastore configuration.
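The _delta_log folder holds numbered JSON commit files, each a list of actions; readers reconstruct table state by replaying them in order. Below is a minimal, simplified sketch of that layout using only the Python standard library; the field values are illustrative, and this is not a full implementation of the Delta transaction protocol:

```python
import json
import os
import tempfile

# Build a fake table directory with a _delta_log folder, mirroring the
# on-disk layout Delta Lake uses for its transaction log.
table_dir = tempfile.mkdtemp()
log_dir = os.path.join(table_dir, "_delta_log")
os.makedirs(log_dir)

# A first commit typically records protocol and table-metadata actions,
# one JSON object per line (values here are made up for illustration).
commit = [
    {"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}},
    {"metaData": {"id": "example-table", "format": {"provider": "parquet"}}},
]
with open(os.path.join(log_dir, "00000000000000000000.json"), "w") as f:
    for action in commit:
        f.write(json.dumps(action) + "\n")

# A reader replays the log file line by line to recover the table state.
actions = []
with open(os.path.join(log_dir, "00000000000000000000.json")) as f:
    for line in f:
        actions.append(json.loads(line))

print(actions[1]["metaData"]["id"])  # → example-table
```

The key point is that this metadata lives next to the data, under the table's own directory, which is why deleting the table folder also deletes the Delta log (but not any entry in an external metastore).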

For more information, see the Delta Lake paper: https://www.vldb.org/pvldb/vol13/p3411-armbrust.pdf

Myology answered 21/3, 2023 at 3:26 Comment(0)
