How to drop a column from a Databricks Delta table?

I have recently started exploring Databricks and ran into a situation where I need to drop a column from a Delta table. When I worked with PostgreSQL it was as easy as

ALTER TABLE main.metrics_table 
DROP COLUMN metric_1;

I looked through the Databricks documentation on DELETE, but it only covers deleting rows that match a predicate.

I've also found docs on DROP DATABASE, DROP FUNCTION and DROP TABLE, but absolutely nothing on how to delete a column from a Delta table. What am I missing here? Is there a standard way to drop a column from a Delta table?

Realist answered 31/1, 2019 at 9:15 Comment(0)

There is no drop column option on Databricks tables: https://docs.databricks.com/spark/latest/spark-sql/language-manual/alter-table-or-view.html#delta-schema-constructs

Remember that unlike in a relational database, there are physical Parquet files in your storage; your "table" is just a schema that has been applied to them.

In the relational world you can update the table metadata to remove a column easily; in the big-data world you have to rewrite the underlying files.

Technically, Parquet can handle schema evolution (see Schema evolution in parquet format), but the Databricks implementation of Delta does not. It is probably just too complicated to be worth it.

Therefore the solution in this case is to create a new table and insert into it only the columns you want to keep from the old table.
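
For illustration, a minimal sketch of that approach, assuming a SparkSession named spark and the question's hypothetical main.metrics_table; metric_2 and metric_3 are made-up stand-ins for the columns to keep.

# Sketch only: create a new Delta table holding every column except metric_1
spark.sql("""
    CREATE TABLE main.metrics_table_new
    USING DELTA
    AS SELECT metric_2, metric_3        -- list every column except metric_1
    FROM main.metrics_table
""")

# Once the new table is verified, drop the old one and take over its name
spark.sql("DROP TABLE main.metrics_table")
spark.sql("ALTER TABLE main.metrics_table_new RENAME TO main.metrics_table")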

Snigger answered 31/1, 2019 at 10:7 Comment(4)
Yeah, it's complicated when you try to add a new column that has the same name as a deleted column. – Avenge
Now, creating the new table from the old table will still leave the data in the data files, correct? – Abreact
@Snigger could you please update your answer to reflect that dropping a column is now supported as of the Delta 2.0 release (both Databricks and otherwise)? docs.delta.io/latest/delta-batch.html#drop-columns – Dilute
There has been a DROP COLUMN option in public preview since DB 11.0. – Greylag

Use the code below:

# Read the existing table and drop the unwanted column
df = spark.sql("SELECT * FROM <DB Name>.<Table Name>")
df1 = df.drop("<Column Name>")

# Keep the original table as a backup: remove any previous backup, then rename
spark.sql("DROP TABLE IF EXISTS <DB Name>.<Table Name>_OLD")
spark.sql("ALTER TABLE <DB Name>.<Table Name> RENAME TO <DB Name>.<Table Name>_OLD")

# Recreate the table under its original name, without the dropped column
df1.write.format("delta").mode("overwrite").option("overwriteSchema", "true").saveAsTable("<DB Name>.<Table Name>")
Throb answered 10/11, 2020 at 20:11 Comment(4)
This hasn't dropped the column for me... I guess you need to physically delete the files / run VACUUM. – Stephi
It will not preserve the automatic value generation of columns created using the GENERATED option. – Brunt
@EugeneLycenok, it works fine for me. Are you sure you used the same commands as above? – Brunt
Yes. Databricks might have fixed the issue since then, as overwriteSchema should intuitively do the job, but it did not for me. – Stephi

Databricks Runtime 10.2+ supports dropping columns if you enable Column Mapping mode:

ALTER TABLE <table_name> SET TBLPROPERTIES (
  'delta.minReaderVersion' = '2',
  'delta.minWriterVersion' = '5',
  'delta.columnMapping.mode' = 'name'
)

And then drops will work:

ALTER TABLE table_name DROP COLUMN col_name
ALTER TABLE table_name DROP COLUMNS (col_name_1, col_name_2, ...)

Onanism answered 27/6, 2022 at 15:59 Comment(3)
This does not work for me. When setting the TBLPROPERTIES, I get the following error message: Error in SQL statement: ParseException: no viable alternative at input 'ALTER TABLE '/my_dir/my_table''. Any idea why? – Mack
@Mack use backticks for the path! – Onanism
Careful, this change in properties can break downstream operations (such as reading from Synapse SQL) and it cannot be disabled; see the limitations in the documentation: docs.delta.io/latest/delta-column-mapping.html – Unreconstructed

One way that I figured out to make this work is to first drop the table and then recreate it from the dataframe with the overwriteSchema option set to true. You also need to use mode = overwrite so that it recreates the physical files using the new schema that the dataframe contains.

Breakdown of the steps:

  1. Read the table into a dataframe.
  2. Drop the columns that you don't want in your final table.
  3. Drop the actual table from which you read the data.
  4. Now save the newly created dataframe, with the columns dropped, under the same table name.
  5. Make sure you use two options when saving the dataframe as a table: .mode("overwrite").option("overwriteSchema", "true")

The steps above recreate the same table with the extra column(s) removed (a sketch follows below). Hope it helps someone facing a similar issue.
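
A minimal PySpark sketch of these steps, assuming a SparkSession named spark; the database, table and column names are placeholders. Because .mode("overwrite") with overwriteSchema recreates the table in place, the explicit drop from step 3 is folded into the final write here.

# Placeholder names; assumes a SparkSession `spark` (e.g. in a Databricks notebook)
df = spark.read.table("my_db.my_table")   # 1. read the table into a dataframe
df = df.drop("unwanted_col")              # 2. drop the column(s) you don't want

# 3-5. overwrite the table with the new schema; mode("overwrite") together with
# overwriteSchema replaces the table, so no separate DROP TABLE is issued here
(df.write
   .format("delta")
   .mode("overwrite")
   .option("overwriteSchema", "true")
   .saveAsTable("my_db.my_table"))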

Darmstadt answered 30/9, 2020 at 15:3 Comment(0)

You can overwrite the table without the column if the table isn't too large.

df = spark.read.table('table')
df = df.drop('col')
df.write.format('delta')\
        .option("overwriteSchema", "true")\
        .mode('overwrite')\
        .saveAsTable('table')
Kaiulani answered 30/5, 2022 at 11:50 Comment(1)
Be cautious about partitioning. – Chopping

As of Delta Lake 1.2, you can drop columns; see the latest ALTER TABLE docs.

Here's a fully working example if you're interested in a snippet you can run locally:

# create a Delta Lake table
columns = ["language","speakers"]
data = [("English", "1.5"), ("Mandarin", "1.1"), ("Hindi", "0.6")]
rdd = spark.sparkContext.parallelize(data)
df = rdd.toDF(columns)

df.write.format("delta").saveAsTable("default.my_cool_table")

spark.sql("select * from `my_cool_table`").show()
+--------+--------+
|language|speakers|
+--------+--------+
|Mandarin|     1.1|
| English|     1.5|
|   Hindi|     0.6|
+--------+--------+

Here's how to drop the language column:

spark.sql("""ALTER TABLE `my_cool_table` SET TBLPROPERTIES (
   'delta.columnMapping.mode' = 'name',
   'delta.minReaderVersion' = '2',
   'delta.minWriterVersion' = '5')""")

spark.sql("alter table `my_cool_table` drop column language")

Verify that the language column isn't included in the table anymore:

spark.sql("select * from `my_cool_table`").show()

+--------+
|speakers|
+--------+
|     1.1|
|     1.5|
|     0.6|
+--------+
Josi answered 30/7, 2022 at 14:0 Comment(1)
This does not work for me. When setting the TBLPROPERTIES, I get the following error message: Error in SQL statement: ParseException: no viable alternative at input 'ALTER TABLE '/my_dir/my_table''. Any idea why? – Mack

This works only if the column you want to drop was added after the table was created.

If that is the case, and if it is possible for you to recover any data inserted after the table was altered, you may consider using the table history to restore the table to a previous version.

With

DESCRIBE HISTORY <TABLE_NAME> 

you can check all the available versions of your table (operation 'ADD COLUMN' will create a new table version).

Afterwards, with RESTORE it is possible to roll the table back to any available version.

RESTORE <TABLE_NAME> VERSION AS OF <VERSION_NUMBER>

Here you have more information about TIME TRAVEL

Blah answered 11/5, 2022 at 9:25 Comment(0)
