Multiplication of members of two arrays
I have the following table:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

cols = [  'a1',   'a2']
data = [([2, 3], [4, 5]),
        ([1, 3], [2, 4])]

df = spark.createDataFrame(data, cols)
df.show()
#  +------+------+
#  |    a1|    a2|
#  +------+------+
#  |[2, 3]|[4, 5]|
#  |[1, 3]|[2, 4]|
#  +------+------+

I know how to multiply an array by a scalar, but how do I multiply the members of one array with the corresponding members of another array?

Desired result:

#  +------+------+-------+
#  |    a1|    a2|    res|
#  +------+------+-------+
#  |[2, 3]|[4, 5]|[8, 15]|
#  |[1, 3]|[2, 4]|[2, 12]|
#  +------+------+-------+
Caudex asked 28/6, 2021 at 14:29

As in your scalar example, you can access the second array by index from within the transform function. This assumes both arrays have the same length:

from pyspark.sql.functions import expr

cols = [  'a1',   'a2']
data = [([2, 3], [4, 5]),
        ([1, 3], [2, 4])]

df = spark.createDataFrame(data, cols)

df = df.withColumn("res", expr("transform(a1, (x, i) -> a2[i] * x)"))
df.show()

# +------+------+-------+
# |    a1|    a2|    res|
# +------+------+-------+
# |[2, 3]|[4, 5]|[8, 15]|
# |[1, 3]|[2, 4]|[2, 12]|
# +------+------+-------+
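For intuition: `transform(a1, (x, i) -> a2[i] * x)` passes each element of `a1` to the lambda along with its index `i`, and multiplies it by the element of `a2` at that same index. A plain-Python sketch of the same semantics (outside Spark, just for illustration; `elementwise_product` is a name introduced here):

```python
# Plain-Python sketch of transform(a1, (x, i) -> a2[i] * x):
# each element of a1 is paired with its index and multiplied
# by the element of a2 at that index.
def elementwise_product(a1, a2):
    return [x * a2[i] for i, x in enumerate(a1)]

print(elementwise_product([2, 3], [4, 5]))  # [8, 15]
print(elementwise_product([1, 3], [2, 4]))  # [2, 12]
```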
Mannerless answered 28/6, 2021 at 16:2
Thank you, this version looks very smooth. – Caudex

Assuming the arrays can have different sizes:

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()

cols = ['a1', 'a2']
data = [([2, 3], [4, 5]),
        ([1, 3], [2, 4]),
        ([1, 3], [2, 4, 6])]

df = spark.createDataFrame(data, cols)
df = df.withColumn("res", expr("transform(arrays_zip(a1, a2), x -> coalesce(x.a1 * x.a2, 0))"))

df.show(truncate=False)
# +------+---------+----------+
# |a1    |a2       |res       |
# +------+---------+----------+
# |[2, 3]|[4, 5]   |[8, 15]   |
# |[1, 3]|[2, 4]   |[2, 12]   |
# |[1, 3]|[2, 4, 6]|[2, 12, 0]|
# +------+---------+----------+
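For intuition: `arrays_zip` pads the shorter array with null, so the product for an unmatched position is null, and `coalesce(..., 0)` turns that into 0. A plain-Python sketch of the same semantics using `zip_longest` (outside Spark, just for illustration; `elementwise_product_padded` is a name introduced here):

```python
from itertools import zip_longest

# Sketch of arrays_zip + coalesce: the shorter list is padded
# with None (Spark's null); a product involving None becomes 0.
def elementwise_product_padded(a1, a2):
    out = []
    for x, y in zip_longest(a1, a2):  # pads the shorter list with None
        out.append(x * y if x is not None and y is not None else 0)
    return out

print(elementwise_product_padded([1, 3], [2, 4, 6]))  # [2, 12, 0]
```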
Saintsimonianism answered 28/6, 2021 at 19:56
Thank you, it's really wise to think about that case! I may use this in the future. – Caudex

Use a user-defined function (UDF) to perform the multiplication, then call it:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

# Multiply the two arrays element by element
# (this version assumes arrays of length 2).
def multiply(x, y):
    return [x[0] * y[0], x[1] * y[1]]

multiply_cols = udf(multiply, ArrayType(IntegerType()))

df1 = df.withColumn("res", multiply_cols('a1', 'a2'))

df1.show()

# +------+------+-------+
# |    a1|    a2|    res|
# +------+------+-------+
# |[2, 3]|[4, 5]|[8, 15]|
# |[1, 3]|[2, 4]|[2, 12]|
# +------+------+-------+

https://docs.databricks.com/spark/latest/spark-sql/udf-python.html
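The UDF above hard-codes indices 0 and 1, so it only works for length-2 arrays. A possible generalization (a sketch; `multiply_arrays` is a name introduced here) uses `zip`, which works for arrays of any equal length, and the plain-Python function could be wrapped with `udf(..., ArrayType(IntegerType()))` exactly as above:

```python
# Hypothetical generalization of the length-2 UDF: multiply
# corresponding elements of two lists of any equal length.
def multiply_arrays(x, y):
    return [a * b for a, b in zip(x, y)]

print(multiply_arrays([2, 3], [4, 5]))  # [8, 15]
```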

Flaunch answered 28/6, 2021 at 15:33
Thanks for a great answer. In this case I decided to use a UDF-free approach. – Caudex