From the documentation:
The Avro package provides function to_avro to encode a column as binary in Avro format, and from_avro() to decode Avro binary data into a column. Both functions transform one column to another column, and the input/output SQL data type can be a complex type or a primitive type.
The function to_avro acts as replacement for the convertToAvro
method:
import static org.apache.spark.sql.avro.functions.*;
//put the avro schema of the struct column into a string
//in my example I assume that the struct consists of a two fields:
//a long field (s1) and a string field (s2)
String schema = "{\"type\":\"record\",\"name\":\"mystruct\"," +
"\"namespace\":\"topLevelRecord\",\"fields\":[{\"name\":\"s1\"," +
"\"type\":[\"long\",\"null\"]},{\"name\":\"s2\",\"type\":" +
"[\"string\",\"null\"]}]},\"null\"]}";
data = ...
//add an additional column containing the struct as binary column
Dataset<Row> data2 = df.withColumn("to_avro", to_avro(data.col("myStruct"), schema));
df2.printSchema();
df2.show(false);
prints
root
|-- a: string (nullable = true)
|-- b: string (nullable = true)
|-- mystruct: struct (nullable = true)
| |-- s1: long (nullable = true)
| |-- s2: string (nullable = true)
|-- to_avro: binary (nullable = true)
+----+----+----------+----------------------------+
|a |b |mystruct |to_avro |
+----+----+----------+----------------------------+
|foo1|bar1|[1, one] |[00 02 00 06 6F 6E 65] |
|foo2|bar2|[3, three]|[00 06 00 0A 74 68 72 65 65]|
+----+----+----------+----------------------------+
To convert the avro column back, the function from_avro can be used:
Dataset<Row> data3 = data2.withColumn("from_avro", from_avro(data2.col("to_avro"), schema));
df3.printSchema();
df3.show();
Output:
root
|-- a: string (nullable = true)
|-- b: string (nullable = true)
|-- mystruct: struct (nullable = true)
| |-- s1: long (nullable = true)
| |-- s2: string (nullable = true)
|-- to_avro: binary (nullable = true)
|-- from_avro: struct (nullable = true)
| |-- s1: long (nullable = true)
| |-- s2: string (nullable = true)
+----+----+----------+--------------------+----------+
| a| b| mystruct| to_avro| from_avro|
+----+----+----------+--------------------+----------+
|foo1|bar1| [1, one]|[00 02 00 06 6F 6...| [1, one]|
|foo2|bar2|[3, three]|[00 06 00 0A 74 6...|[3, three]|
+----+----+----------+--------------------+----------+
A word about the udf: in the question you performed the transformation to the avro format within the udf. I would prefer to include only the actual business logic in the udf and keep the format transformation outside. This separates the logic and the format transformation. If necessary, you can drop the original column mystruct
after creating the avro column.