I think you should use array
and explode
to do this, you do not need any complex logic with UDFs or custom functions.
array
will combine columns into a single column, or annotate columns.
explode
will convert an array column into a set of rows.
All you need to do is:
- annotate each column with you custom label (eg. 'milk')
- combine your labelled columns into a single column of 'array' type
- explode the labels column to generate labelled rows
- drop irrelevant columns
df = (
df.withColumn('labels', F.explode( # <-- Split into rows
F.array( # <-- Combine columns
F.array(F.lit('milk'), F.col('qty_on_hand_milk')), # <-- Annotate column
F.array(F.lit('bread'), F.col('qty_on_hand_bread')),
F.array(F.lit('eggs'), F.col('qty_on_hand_eggs')),
)
))
.withColumn('CATEGORY', F.col('labels')[0])
.withColumn('qty_on_hand', F.col('labels')[1])
).select('store_id', 'CATEGORY', 'qty_on_hand')
Note how you can pull out elements of an array column simply using col('foo')[INDEX]
;
there is no specific need to split them out into separate columns.
This approach is also robust over different data types, because it doesn't try to force the same schema over every row (unlike using a struct).
eg. If 'qty_on_hand_bread' is a string, this still works, the resulting schema will just be:
root
|-- store_id: long (nullable = false)
|-- CATEGORY: string (nullable = true)
|-- qty_on_hand: string (nullable = true) <-- Picks best schema on the fly
Here is the same code, step by step to make it obvious what's happening here:
import databricks.koalas as ks
import pyspark.sql.functions as F
# You don't need koalas, it's just less verbose for adhoc dataframes
df = ks.DataFrame({
"store_id": [100, 200, 300],
"qty_on_hand_milk": [30, 55, 20],
"qty_on_hand_bread": [105, 85, 125],
"qty_on_hand_eggs": [35, 65, 90],
}).to_spark()
df.show()
# Annotate each column with your custom label per row. ie. v -> ['label', v]
df = df.withColumn('label1', F.array(F.lit('milk'), F.col('qty_on_hand_milk')))
df = df.withColumn('label2', F.array(F.lit('bread'), F.col('qty_on_hand_bread')))
df = df.withColumn('label3', F.array(F.lit('eggs'), F.col('qty_on_hand_eggs')))
df.show()
# Create a new column which combines the labeled values in a single column
df = df.withColumn('labels', F.array('label1', 'label2', 'label3'))
df.show()
# Split into individual rows
df = df.withColumn('labels', F.explode('labels'))
df.show()
# You can now do whatever you want with your labelled rows, eg. split them into new columns
df = df.withColumn('CATEGORY', F.col('labels')[0])
df = df.withColumn('qty_on_hand', F.col('labels')[1])
df.show()
...and the output from each step:
|store_id|qty_on_hand_milk|qty_on_hand_bread|qty_on_hand_eggs|
+--------+----------------+-----------------+----------------+
| 100| 30| 105| 35|
| 200| 55| 85| 65|
| 300| 20| 125| 90|
+--------+----------------+-----------------+----------------+
+--------+----------------+-----------------+----------------+----------+------------+----------+
|store_id|qty_on_hand_milk|qty_on_hand_bread|qty_on_hand_eggs| label1| label2| label3|
+--------+----------------+-----------------+----------------+----------+------------+----------+
| 100| 30| 105| 35|[milk, 30]|[bread, 105]|[eggs, 35]|
| 200| 55| 85| 65|[milk, 55]| [bread, 85]|[eggs, 65]|
| 300| 20| 125| 90|[milk, 20]|[bread, 125]|[eggs, 90]|
+--------+----------------+-----------------+----------------+----------+------------+----------+
+--------+----------------+-----------------+----------------+----------+------------+----------+--------------------+
|store_id|qty_on_hand_milk|qty_on_hand_bread|qty_on_hand_eggs| label1| label2| label3| labels|
+--------+----------------+-----------------+----------------+----------+------------+----------+--------------------+
| 100| 30| 105| 35|[milk, 30]|[bread, 105]|[eggs, 35]|[[milk, 30], [bre...|
| 200| 55| 85| 65|[milk, 55]| [bread, 85]|[eggs, 65]|[[milk, 55], [bre...|
| 300| 20| 125| 90|[milk, 20]|[bread, 125]|[eggs, 90]|[[milk, 20], [bre...|
+--------+----------------+-----------------+----------------+----------+------------+----------+--------------------+
+--------+----------------+-----------------+----------------+----------+------------+----------+------------+
|store_id|qty_on_hand_milk|qty_on_hand_bread|qty_on_hand_eggs| label1| label2| label3| labels|
+--------+----------------+-----------------+----------------+----------+------------+----------+------------+
| 100| 30| 105| 35|[milk, 30]|[bread, 105]|[eggs, 35]| [milk, 30]|
| 100| 30| 105| 35|[milk, 30]|[bread, 105]|[eggs, 35]|[bread, 105]|
| 100| 30| 105| 35|[milk, 30]|[bread, 105]|[eggs, 35]| [eggs, 35]|
| 200| 55| 85| 65|[milk, 55]| [bread, 85]|[eggs, 65]| [milk, 55]|
| 200| 55| 85| 65|[milk, 55]| [bread, 85]|[eggs, 65]| [bread, 85]|
| 200| 55| 85| 65|[milk, 55]| [bread, 85]|[eggs, 65]| [eggs, 65]|
| 300| 20| 125| 90|[milk, 20]|[bread, 125]|[eggs, 90]| [milk, 20]|
| 300| 20| 125| 90|[milk, 20]|[bread, 125]|[eggs, 90]|[bread, 125]|
| 300| 20| 125| 90|[milk, 20]|[bread, 125]|[eggs, 90]| [eggs, 90]|
+--------+----------------+-----------------+----------------+----------+------------+----------+------------+
+--------+----------------+-----------------+----------------+----------+------------+----------+------------+--------+-----------+
|store_id|qty_on_hand_milk|qty_on_hand_bread|qty_on_hand_eggs| label1| label2| label3| labels|CATEGORY|qty_on_hand|
+--------+----------------+-----------------+----------------+----------+------------+----------+------------+--------+-----------+
| 100| 30| 105| 35|[milk, 30]|[bread, 105]|[eggs, 35]| [milk, 30]| milk| 30|
| 100| 30| 105| 35|[milk, 30]|[bread, 105]|[eggs, 35]|[bread, 105]| bread| 105|
| 100| 30| 105| 35|[milk, 30]|[bread, 105]|[eggs, 35]| [eggs, 35]| eggs| 35|
| 200| 55| 85| 65|[milk, 55]| [bread, 85]|[eggs, 65]| [milk, 55]| milk| 55|
| 200| 55| 85| 65|[milk, 55]| [bread, 85]|[eggs, 65]| [bread, 85]| bread| 85|
| 200| 55| 85| 65|[milk, 55]| [bread, 85]|[eggs, 65]| [eggs, 65]| eggs| 65|
| 300| 20| 125| 90|[milk, 20]|[bread, 125]|[eggs, 90]| [milk, 20]| milk| 20|
| 300| 20| 125| 90|[milk, 20]|[bread, 125]|[eggs, 90]|[bread, 125]| bread| 125|
| 300| 20| 125| 90|[milk, 20]|[bread, 125]|[eggs, 90]| [eggs, 90]| eggs| 90|
+--------+----------------+-----------------+----------------+----------+------------+----------+------------+--------+-----------+
+--------+--------+-----------+
|store_id|CATEGORY|qty_on_hand|
+--------+--------+-----------+
| 100| milk| 30|
| 100| bread| 105|
| 100| eggs| 35|
| 200| milk| 55|
| 200| bread| 85|
| 200| eggs| 65|
| 300| milk| 20|
| 300| bread| 125|
| 300| eggs| 90|
+--------+--------+-----------+