PySpark "explode" dict in column

-RECORD 17-----------------------------------------------------------------
 item        | 20380109
 true_recoms | {"5556867":1,"5801144":5,"7397596":21}

schema_json = StructType(fields=[
    StructField("item", StringType()),
    StructField("recoms", StringType())
])
df.select(col("true_recoms"), from_json(col("true_recoms"), schema_json)).show(5)

+--------+--------------------+------+
|    item|         true_recoms|true_r|
+--------+--------------------+------+
|31746548|{"32731749":3,"31...|   [,]|
|17359322|{"17359392":1,"17...|   [,]|
|31480894|{"31480598":1,"31...|   [,]|
| 7265665|{"7265891":1,"503...|   [,]|
|31350949|{"32218698":1,"31...|   [,]|
+--------+--------------------+------+
only showing top 5 rows

The schema is defined incorrectly. You declare it as a struct with two string fields

item
recoms

while neither field is present in the JSON document, so from_json fills both with null (the [,] in your output).

Unfortunately from_json can return only structs or arrays of structs (at least in Spark versions before 2.4), so redefining the schema as

MapType(StringType(), LongType())

is not an option.

Personally, I would use a UDF:

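from pyspark.sql.functions import udf, explode
import json

@udf("map<string, bigint>")
def parse(s):
    # Parse the JSON object string into a dict; Spark maps it to map<string, bigint>.
    try:
        return json.loads(s)
    except (TypeError, json.JSONDecodeError):
        # Null or malformed input becomes a null map instead of failing the job.
        return None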
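
The body of the UDF is ordinary Python, so its behavior can be checked without a Spark session. The sketch below uses a hypothetical helper, parse_plain, which is not part of the PySpark API. It follows the same try/except idea as the UDF and additionally treats a None input as unparseable, which is an assumption about how null cells should be handled.

```python
import json


def parse_plain(s):
    """Hypothetical stand-in for the UDF body: JSON object string -> dict, or None."""
    try:
        return json.loads(s)
    except (TypeError, json.JSONDecodeError):
        # json.loads(None) raises TypeError; malformed JSON raises JSONDecodeError.
        return None


good = parse_plain('{"5556867":1,"5801144":5,"7397596":21}')
bad = parse_plain('{"5556867":1,')   # truncated JSON
missing = parse_plain(None)          # simulates a null cell
print(good, bad, missing)
```

Returning None for bad input means the row gets a null map, and explode produces no output rows for it.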

which can be applied like this:

df = spark.createDataFrame(
    [(31746548, """{"5556867":1,"5801144":5,"7397596":21}""")],
    ("item", "true_recoms")
)

df.select("item", explode(parse("true_recoms")).alias("recom_item", "recom_cnt")).show()
# +--------+----------+---------+
# |    item|recom_item|recom_cnt|
# +--------+----------+---------+
# |31746548|   5801144|        5|
# |31746548|   7397596|       21|
# |31746548|   5556867|        1|
# +--------+----------+---------+
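
What explode does to a map column can be sketched in plain Python: each key/value pair becomes its own row, with the other selected columns repeated. The helper explode_map below is hypothetical and only illustrates the row shape. It is not how Spark implements explode.

```python
import json


def explode_map(item, recoms_json):
    """Hypothetical helper: yield one (item, key, value) tuple per JSON object entry."""
    for recom_item, recom_cnt in json.loads(recoms_json).items():
        yield (item, recom_item, recom_cnt)


rows = list(explode_map(31746548, '{"5556867":1,"5801144":5,"7397596":21}'))
for row in rows:
    print(row)
```

The keys come back as strings because JSON object keys are always strings, which matches the map<string, bigint> return type. Spark does not guarantee the order of the exploded rows, as the output above shows.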
