How to add a nested column to a DataFrame

Asked 1/3, 2018 at 9:49 Answered 2/2 at 4:2

Solved scala apache-spark apache-spark-sql

I have a dataframe df with the following schema:

root
 |-- city_name: string (nullable = true)
 |-- person: struct (nullable = true)
 |    |-- age: long (nullable = true)
 |    |-- name: string (nullable = true)

What I want to do is add a nested column, say car_brand to my person structure. How would I do it?

The expected final schema would look like this:

root
 |-- city_name: string (nullable = true)
 |-- person: struct (nullable = true)
 |    |-- age: long (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- car_brand: string (nullable = true)

Villainage answered 1/3, 2018 at 9:49 Comment(0)

You can unpack the struct and add it to a new one, including the new column at the same time. For example, adding "bmw" to all persons in the dataframe be done like this:

df.withColumn("person", struct($"person.*", lit("bmw").as("car_brand")))

Byrann answered 1/3, 2018 at 10:1 Comment(4)

This solution does not work with three level nested fields. How can we add nested_col field here? root |-- city_name: string (nullable = true) |-- person: struct (nullable = true) | |-- age: long (nullable = true) | |-- name: string (nullable = true) |-- nested_col: string (nullable = true) – Thrombosis 1/5, 2018 at 0:19

@Thrombosis If I understand correctly you want to add the nested_col to the person struct? You can do this by using the method in the answer, just changelit("bmw") to $"nested_col". – Byrann 1/5, 2018 at 1:19

No, Actually I want to add a column when there is one more level of nesting per say name here, assuming name is struct here just like person. – Thrombosis 1/5, 2018 at 1:36

@Thrombosis If there isn't too many columns then you could do something like follows: withColumn("person", struct($"person.age", struct($"person.name", $"nested_col").as("name))). However, this will change a bit depending on the type of the column (here your name column is a string, not a struct). If you want a more detailed answer for your use case, I would recommend asking a new question on the site with more information (you can leave the link in a comment here). – Byrann 1/5, 2018 at 1:51

Adding a new nested column within person:

df = df.withColumn(
        "person",
        struct(
            $"person.*",
            struct(
                lit("value_1").as("person_field_1"),
                lit("value_2").as("person_field_2"),
            ).as("nested_column_within_person")
       )
    )

Final schema :

root
 |-- city_name: string (nullable = true)
 |-- person: struct (nullable = true)
 |    |-- age: long (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- nested_column_within_person: struct (nullable = true)
 |    |    |-- person_field_1: string (nullable = true)
 |    |    |-- person_field_2: string (nullable = true)

Euonymus answered 17/4, 2019 at 8:16 Comment(1)

Is there any syntax changed in the recent versions? i'm getting SyntaxError: invalid syntax for cdc_df.withColumn("AuditPayload", struct( lit("value_1").as("person_field_1"),lit("value_2").as("person_field_2"),).as("nested_column_within_person"))) – Poussin 18/3, 2020 at 2:18

import pyspark.sql.functions as func
dF = dF.withColumn(
        "person",
   func.struct(
            "person.age",
                func.struct(
                            "person.name",
                            func.lit(None).alias("NestedCol_Name")
                    ).alias("name")
       )
       )
O/P Schema:-
root
 |-- city_name: string (nullable = true)
 |-- person: struct (nullable = false)
 |    |-- age: string (nullable = true)
 |    |-- name: struct (nullable = false)
 |    |    |-- name: string (nullable = true)
 |    |    |-- NestedCol_Name: null (nullable = true)

Nickeliferous answered 11/12, 2020 at 11:43 Comment(1)

this solution will help to add the nested column at more than one level and as per above example it will add a nested column i.e. ('NestedCol_Name') inside name column which we already have. Hope it answers @Sindhu's question – Nickeliferous 11/12, 2020 at 11:44

The function withField is available starting Spark 3.1. As per the doc, "it can be used to add/replace a nested field in StructType by name".

In this case, it can be used as follows: -

import org.apache.spark.sql.functions

df.withColumn("person", functions.col("person").withField("car_brand", functions.col("some car brand here")))

Dubai answered 2/2 at 4:2 Comment(0)

Recommended topics

Hot tags