I would like to populate an Azure SQL DW from an Azure Databricks notebook environment. I am using the built-in connector with pyspark:
sdf.write \
.format("com.databricks.spark.sqldw") \
.option("forwardSparkAzureStorageCredentials", "true") \
.option("dbTable", "test_table") \
.option("url", url) \
.option("tempDir", temp_dir) \
.save()
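In case it matters, url is a regular SQL Server JDBC connection string and temp_dir points at blob storage, along these lines (placeholder values, not my real resources):

# placeholder server, database, credentials and storage account
url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydw;user=myuser;password=mypassword;encrypt=true;loginTimeout=30;"
temp_dir = "wasbs://mycontainer@mystorageaccount.blob.core.windows.net/tmpdir"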
This works fine, but it fails as soon as I include a string column with sufficiently long content. I get the following error:
Py4JJavaError: An error occurred while calling o1252.save. : com.databricks.spark.sqldw.SqlDWSideException: SQL DW failed to execute the JDBC query produced by the connector.
Underlying SQLException(s): - com.microsoft.sqlserver.jdbc.SQLServerException: HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: HadoopSqlException: String or binary data would be truncated. [ErrorCode = 107090] [SQLState = S0001]
As I understand it, this is because the default string type is NVARCHAR(256). It is possible to configure this (reference), but the maximum NVARCHAR length is 4k characters. My strings occasionally reach 10k characters, so I am curious how I can export certain columns as text/longtext instead.
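If I read the referenced docs correctly, the option in question is maxStrLength, which raises the default string column type but is still capped at NVARCHAR(4000):

# bumps string columns from NVARCHAR(256) to NVARCHAR(4000), the cap,
# which is still too small for my ~10k-character values
sdf.write \
.format("com.databricks.spark.sqldw") \
.option("forwardSparkAzureStorageCredentials", "true") \
.option("dbTable", "test_table") \
.option("url", url) \
.option("tempDir", temp_dir) \
.option("maxStrLength", "4000") \
.save()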
I would guess that the following would work, if only the preActions were executed after the table was created. They aren't, and therefore it fails.
sdf.write \
.format("com.databricks.spark.sqldw") \
.option("forwardSparkAzureStorageCredentials", "true") \
.option("dbTable", "test_table") \
.option("url", url) \
.option("tempDir", temp_dir) \
.option("preActions", "ALTER TABLE test_table ALTER COLUMN value NVARCHAR(MAX);") \
.save()
Similarly, postActions are executed only after the data has been inserted, so they run too late: the load itself already fails on the long strings.
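For completeness, the postActions variant I mean is the same write with the statement moved to that option:

# postActions only run after the load, which already fails on the long strings
sdf.write \
.format("com.databricks.spark.sqldw") \
.option("forwardSparkAzureStorageCredentials", "true") \
.option("dbTable", "test_table") \
.option("url", url) \
.option("tempDir", temp_dir) \
.option("postActions", "ALTER TABLE test_table ALTER COLUMN value NVARCHAR(MAX);") \
.save()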
Any ideas?