Databricks: Ingesting CSV data to a Delta Live Table in Python triggers "invalid characters in table name" error - how to set column mapping mode?

First off, can I just say that I am learning Databricks at the time of writing this post, so I'd welcome simpler, cruder solutions as well as more sophisticated ones.

I am reading a CSV file like this:

df1 = spark.read.format("csv").option("header", True).load(path_to_csv_file)

Then I'm saving it as a Delta Live Table like this:

df1.write.format("delta").save("table_path")

The CSV headers contain characters such as spaces, &, and /, and I get this error:

AnalysisException: Found invalid character(s) among " ,;{}()\n\t=" in the column names of your schema. Please enable column mapping by setting table property 'delta.columnMapping.mode' to 'name'. For more details, refer to https://docs.databricks.com/delta/delta-column-mapping.html Or you can use alias to rename it.

The documentation I've seen on the issue explains how to set the column mapping mode to 'name' AFTER a table has been created using ALTER TABLE, but does not explain how to set it at creation time, especially when using the DataFrame API as above. Is there a way to do this?
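
For reference, the after-the-fact fix shown in the docs looks something like this (my_table is a placeholder; the docs also bump the table's reader/writer protocol versions):

ALTER TABLE my_table SET TBLPROPERTIES (
  'delta.columnMapping.mode' = 'name',
  'delta.minReaderVersion' = '2',
  'delta.minWriterVersion' = '5'
);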

Is there a better way to get CSV into a new table?


UPDATE:

Reading the docs here and here, and inspired by Robert's answer, I tried this first:

spark.conf.set("spark.databricks.delta.defaults.columnMapping.mode", "name")

Still no luck; I get the same error. It's interesting how hard it is for a beginner to write a CSV file with spaces in its headers to a Delta Live Table.

Rickyrico answered 16/6, 2022 at 9:4 Comment(0)

Thanks to Hemant on the Databricks community forum, I have found the answer.

(df1.write.format("delta")
    .option("delta.columnMapping.mode", "name")
    .option("path", "table_path")
    .saveAsTable("new_table"))

Now I can either query it with SQL:

SELECT * FROM new_table;

or load it into a Spark DataFrame:

delta_df = spark.read.format("delta").load("table_path")
display(delta_df)

SQL Way

This method does the same thing but in SQL.

First, create a CSV-backed table for your CSV file:

CREATE TABLE table_csv
  USING CSV
  OPTIONS (path '/path/to/file.csv', header 'true', mode 'FAILFAST');
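
Here header 'true' treats the first CSV row as the column names, and mode 'FAILFAST' makes the read abort on malformed rows instead of silently dropping or null-filling them.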

Then create a Delta table using the CSV-backed table:

CREATE TABLE delta_table
  USING DELTA
  TBLPROPERTIES ("delta.columnMapping.mode" = "name")
  AS SELECT * FROM table_csv;

SELECT * FROM delta_table;
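
As I understand it, with column mapping set to 'name', Delta keeps a separate physical name for each logical column, which is what allows headers containing spaces, & or / to be stored.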

I've verified that I get the same error as I did with Python if I omit the TBLPROPERTIES clause.

I guess the Python answer would be to use spark.sql and run this from Python; that way I could embed the CSV path variable in the SQL.
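
Something like this sketch (untested; csv_path is a variable I'd introduce for the purpose, and the table names are the ones from above):

csv_path = "/path/to/file.csv"  # hypothetical path variable

spark.sql(f"""
  CREATE TABLE table_csv
  USING CSV
  OPTIONS (path '{csv_path}', header 'true', mode 'FAILFAST')
""")

spark.sql("""
  CREATE TABLE delta_table
  USING DELTA
  TBLPROPERTIES ('delta.columnMapping.mode' = 'name')
  AS SELECT * FROM table_csv
""")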

Rickyrico answered 16/6, 2022 at 20:59 Comment(0)

The following worked for me:

spark.conf.set("spark.databricks.delta.properties.defaults.columnMapping.mode","name")

In your original post, ".properties." was missing
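
For example (a sketch reusing df1 and the path from the question; with this session default set, the original plain save should go through):

# Session-wide default: new Delta tables get column mapping by name.
spark.conf.set("spark.databricks.delta.properties.defaults.columnMapping.mode", "name")

# The original write from the question, unchanged.
df1.write.format("delta").save("table_path")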

Gulfweed answered 16/3, 2023 at 18:55 Comment(1)
I'd go with this answer as this was the answer posted internally. – Sedulity

You can set the option in the Spark configuration of the cluster you are using; that enables the mode for everything that runs on that cluster.

You could also set the config at runtime like this:

spark.conf.set("spark.databricks.<name-of-property>", <value>)
Caravan answered 16/6, 2022 at 9:8 Comment(1)
I've added a section to my question about what I tried (having corrected part of your answer), which unfortunately didn't work. – Rickyrico
