Change the datatype of a column in a Delta table

Is there a SQL command that I can easily use to change the datatype of an existing column in a Delta table? I need to change the column datatype from BIGINT to STRING. Below is the SQL command I'm trying to use, but with no luck.

%sql ALTER TABLE [TABLE_NAME] ALTER COLUMN [COLUMN_NAME] STRING

Error I'm getting:

org.apache.spark.sql.AnalysisException
ALTER TABLE CHANGE COLUMN is not supported for changing column 'bam_user' with type 
'IntegerType' to 'bam_user' with type 'StringType'
Cantara answered 31/5, 2021 at 6:59
What error do you get? (Suffragan)
@AlexOtt, added the error message in the question. (Cantara)
Do you have any idea or reference why it doesn't work? (Subchloride)

SQL doesn't support this, but it can be done in Python:

from pyspark.sql import functions as F

# set dataset location and columns with new types
table_path = '/mnt/dataset_location...'
types_to_change = {
    'column_1' : 'int',
    'column_2' : 'string',
    'column_3' : 'double'
}

# load to dataframe, change types
df = (
    spark.read
    .format('delta')
    .load(table_path)
    .withColumns(
        {
            col: F.col(col).cast(typ)
            for col, typ in types_to_change.items()
        }
    )
)

# save df with new types overwriting the schema
(
    df.write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", True)
    .save(f"dbfs:{table_path}")
)
Stringy answered 10/11, 2022 at 15:25

Suppose you want to change the data type of column "column_name" in table "delta_table_name" to "new_data_type" (for example, "int"):

from pyspark.sql import functions as F

(
    spark.read
    .table("delta_table_name")
    .withColumn(
        "Column_name",
        F.col("Column_name").cast("new_data_type")
    )
    .write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", True)
    .saveAsTable("delta_table_name")
)
Allochthonous answered 19/11, 2022 at 15:46
  1. Read the table using Spark.
  2. Use the withColumn method to cast the column you want.
  3. Write the table back with mode "overwrite" and overwriteSchema set to True.

Reference: https://docs.databricks.com/delta/update-schema.html#explicitly-update-schema-to-change-column-type-or-name

from pyspark.sql import functions as F

(
    spark.read
    .table("<TABLE NAME>")
    .withColumn(
        "<COLUMN NAME>",
        F.col("<COLUMN NAME>").cast("<DATA TYPE>")
    )
    .write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", True)
    .saveAsTable("<TABLE NAME>")
)
Pinky answered 25/1, 2023 at 9:56

There is no option to change a column's data type or drop a column in place. You can read the data into a DataFrame, change the data type with withColumn() (and drop columns with drop()), then overwrite the table.
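
A minimal sketch of that approach, assuming a Delta table named "events" with a column "user_id" to cast to string and a column "tmp_col" to drop (all names are hypothetical):

from pyspark.sql import functions as F

# read the Delta table into a DataFrame ("events" is a placeholder name)
df = spark.read.table("events")

# cast one column to its new type and drop an unwanted column
df = (
    df.withColumn("user_id", F.col("user_id").cast("string"))
    .drop("tmp_col")
)

# overwrite the table in place, allowing the schema change
(
    df.write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", True)
    .saveAsTable("events")
)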

Instrumentalist answered 10/6, 2021 at 17:31
What other options are there if the table is very large, since this approach needs additional time and compute? (Cantara)

There is no real way to do this using SQL, unless you copy to a different table altogether. That option means inserting the data into a new table, dropping the old one, and re-creating it with the new structure, and it is therefore risky.
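
For illustration, a minimal sketch of that copy-and-swap route, written with spark.sql and using CREATE TABLE ... AS SELECT plus a rename rather than a DROP and re-CREATE; it assumes the "person" table defined below and casts "id" to string:

# build a copy of the table with the new column type, then swap it in
spark.sql("""
    CREATE TABLE person_new USING DELTA AS
    SELECT CAST(id AS STRING) AS id, name, age, class, address
    FROM person
""")
spark.sql("DROP TABLE person")
spark.sql("ALTER TABLE person_new RENAME TO person")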

The cleaner way to do this in Python is as follows:

Let's say this is your table:

CREATE TABLE person (id INT, name STRING, age INT, class INT, address STRING);
INSERT INTO person VALUES
    (100, 'John', 30, 1, 'Street 1'),
    (200, 'Mary', NULL, 1, 'Street 2'),
    (300, 'Mike', 80, 3, 'Street 3'),
    (400, 'Dan', 50, 4, 'Street 4');

You can check the table structure using the following:

DESCRIBE TABLE person

If you need to change the id column to STRING, this is the code:

from pyspark.sql import functions as F

df = spark.read.table("person")

df1 = df.withColumn("id", F.col("id").cast("string")) 

(
    df1.write
    .format("parquet")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable("person")
)

A couple of pointers: the format in this example is Parquet. Historically that was the Databricks default (recent runtimes default to Delta), so only omit the "format" line if it matches your default. Also note that Python is very sensitive to indentation.

Regarding Databricks: if the format is "delta", it is safest to specify it explicitly.

Also, if the table is partitioned, it's important to specify that in the code, for example:

(
    df1.write 
    .format("delta") 
    .mode("overwrite") 
    .partitionBy("col_to_partition1", "col_to_partition2") 
    .option("overwriteSchema", "true") 
    .save(table_location)
)

where table_location is the path where the Delta table is saved.

(some of this answer is based on this)

Lardaceous answered 24/3, 2022 at 0:59
