Where does Delta Lake store the table metadata? I am using Spark 2.6 (not Databricks) on a standalone machine. My assumption was that if I restart Spark, a table created in Delta Lake would be dropped (trying from a Jupyter notebook), but that is not the case.
There are two types of tables in Apache Spark: external tables and managed tables. When a table is created with the LOCATION keyword in the CREATE TABLE statement, it's an external table. Otherwise, it's a managed table, and its location is under the directory specified by the Spark SQL conf spark.sql.warehouse.dir, whose default value is the spark-warehouse directory in the current working directory.
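For example, here is a minimal PySpark sketch of the two kinds of CREATE TABLE. The table names and the /tmp path are made up for illustration, and it assumes the Delta Lake package is on the classpath with the usual Delta session configs:

```python
from pyspark.sql import SparkSession

# Assumes the delta-spark package is available; these two configs are
# the standard way to enable Delta Lake on a plain Spark session.
spark = (
    SparkSession.builder
    .appName("table-types-demo")
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Managed table: data lands under spark.sql.warehouse.dir
# (by default ./spark-warehouse/events).
spark.sql("CREATE TABLE events (id INT, name STRING) USING delta")

# External table: data stays at the path supplied via LOCATION.
spark.sql("""
    CREATE TABLE events_ext (id INT, name STRING)
    USING delta
    LOCATION '/tmp/events_ext'
""")
```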
Besides the data, Spark also needs to store the table metadata in the Hive Metastore, so that Spark knows where the data is when a user queries the table by name. The Hive Metastore is usually a database. If a user doesn't configure a database for the Hive Metastore, Spark uses an embedded database called Derby to store the table metadata on the local file system.
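You can see this on disk with a sketch like the following (the warehouse path and table name are illustrative). Without an external metastore configured, a metastore_db Derby directory and a derby.log file appear in the working directory:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("metastore-demo")
    # Where managed table *data* goes:
    .config("spark.sql.warehouse.dir", "/tmp/my-warehouse")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("CREATE TABLE IF NOT EXISTS t (id INT)")
# After this, the table data lives under /tmp/my-warehouse/t, while the
# table *metadata* (schema, location) sits in the embedded Derby
# database under ./metastore_db — which is why the table survives a
# Spark restart from the same working directory.
```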
The DROP TABLE command behaves differently depending on the table type. When a table is a managed table, DROP TABLE removes the table from the Hive Metastore and deletes the data. If the table is an external table, DROP TABLE removes the table from the Hive Metastore but keeps the data on the file system; hence, the data files of an external table need to be deleted from the file system manually by the user.
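A short illustration of the difference, reusing the hypothetical tables from the sketch above:

```python
# Managed table: DROP TABLE removes both the metastore entry and the data.
spark.sql("DROP TABLE IF EXISTS events")

# External table: DROP TABLE removes only the metastore entry; the files
# at /tmp/events_ext survive and must be cleaned up manually, e.g.:
spark.sql("DROP TABLE IF EXISTS events_ext")

import shutil
shutil.rmtree("/tmp/events_ext", ignore_errors=True)  # manual cleanup
```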
Delta Lake itself stores the table metadata in the _delta_log folder inside the table's location. It can also be registered in Hive, but that depends on the log store configuration.
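A quick way to confirm this is to list the _delta_log folder next to the table data (the path below is illustrative and assumes the external table created earlier):

```python
import os

table_path = "/tmp/events_ext"
log_path = os.path.join(table_path, "_delta_log")
print(os.listdir(log_path))
# Typically prints JSON commit files such as
# ['00000000000000000000.json', ...] plus periodic checkpoint files.
```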
For more information, please read the Delta Lake paper: https://www.vldb.org/pvldb/vol13/p3411-armbrust.pdf
location in the create table statement). If so, you need to run drop table to delete the table from the metastore (drop table doesn't delete the folder used by an external table), and also delete the table folder manually. – Hauberk

spark-warehouse in your current work directory. – Hauberk