Spark Dataframe distinguish columns with duplicated name

R

13

151

So as I know in Spark Dataframe, that for multiple columns can have the same name as shown in below dataframe snapshot:

[
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=125231, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0047, 3: 0.0, 4: 0.0043})),
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=145831, f=SparseVector(5, {0: 0.0, 1: 0.2356, 2: 0.0036, 3: 0.0, 4: 0.4132})),
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=147031, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=149231, f=SparseVector(5, {0: 0.0, 1: 0.0032, 2: 0.2451, 3: 0.0, 4: 0.0042}))
]

Above result is created by join with a dataframe to itself, you can see there are 4 columns with both two a and f.

The problem is is there when I try to do more calculation with the a column, I cant find a way to select the a, I have try df[0] and df.select('a'), both returned me below error mesaage:

AnalysisException: Reference 'a' is ambiguous, could be: a#1333L, a#1335L.

Is there anyway in Spark API that I can distinguish the columns from the duplicated names again? or maybe some way to let me change the column names?

Ramtil answered 18/11, 2015 at 11:16 Comment(0)

A

79

I would recommend that you change the column names for your join.

df1.select(col("a") as "df1_a", col("f") as "df1_f")
   .join(df2.select(col("a") as "df2_a", col("f") as "df2_f"), col("df1_a" === col("df2_a"))

The resulting DataFrame will have schema

(df1_a, df1_f, df2_a, df2_f)

Alvardo answered 18/11, 2015 at 11:33 Comment(8)

You may need to fix your answer since the quotes aren't adjusted properly between column names. – Meiosis 20/1, 2018 at 10:13

@SamehSharaf I assume that you are the one down voting my answer? But the answer is in fact 100% correct - I'm simply using the scala '-shorthand for column selection, so there is in fact no problem with quotes. – Alvardo 20/1, 2018 at 11:57

@GlennieHellesSindholt, fair point. It is confusing because the answer is tagged as python and pyspark. – Gingersnap 8/4, 2018 at 9:59

What if each dataframe contains 100+ columns and we just need to rename one column name that is the same? Surely, can't manually type in all those column names in the select clause – Softshoe 25/2, 2020 at 16:31

In that case you could go with df1.withColumnRenamed("a", "df1_a") – Alvardo 26/2, 2020 at 14:0

@GlennieHellesSindholt would you be able to write an pyspark equivalent of this answer? please – Anemic 30/6, 2020 at 8:27

@Dee Just have a look at the answer below from zero323. – Alvardo 1/7, 2020 at 11:44

@GlennieHellesSindholt Wondering if schema change approach could solve my issue: #63966539 – Fernandina 20/9, 2020 at 21:28

S

161

Lets start with some data:

from pyspark.mllib.linalg import SparseVector
from pyspark.sql import Row

df1 = sqlContext.createDataFrame([
    Row(a=107831, f=SparseVector(
        5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
    Row(a=125231, f=SparseVector(
        5, {0: 0.0, 1: 0.0, 2: 0.0047, 3: 0.0, 4: 0.0043})),
])

df2 = sqlContext.createDataFrame([
    Row(a=107831, f=SparseVector(
        5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
    Row(a=107831, f=SparseVector(
        5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
])

There are a few ways you can approach this problem. First of all you can unambiguously reference child table columns using parent columns:

df1.join(df2, df1['a'] == df2['a']).select(df1['f']).show(2)

##  +--------------------+
##  |                   f|
##  +--------------------+
##  |(5,[0,1,2,3,4],[0...|
##  |(5,[0,1,2,3,4],[0...|
##  +--------------------+

You can also use table aliases:

from pyspark.sql.functions import col

df1_a = df1.alias("df1_a")
df2_a = df2.alias("df2_a")

df1_a.join(df2_a, col('df1_a.a') == col('df2_a.a')).select('df1_a.f').show(2)

##  +--------------------+
##  |                   f|
##  +--------------------+
##  |(5,[0,1,2,3,4],[0...|
##  |(5,[0,1,2,3,4],[0...|
##  +--------------------+

Finally you can programmatically rename columns:

df1_r = df1.select(*(col(x).alias(x + '_df1') for x in df1.columns))
df2_r = df2.select(*(col(x).alias(x + '_df2') for x in df2.columns))

df1_r.join(df2_r, col('a_df1') == col('a_df2')).select(col('f_df1')).show(2)

## +--------------------+
## |               f_df1|
## +--------------------+
## |(5,[0,1,2,3,4],[0...|
## |(5,[0,1,2,3,4],[0...|
## +--------------------+

Santiagosantillan answered 18/11, 2015 at 11:44 Comment(5)

Thanks for your editing for showing so many ways of getting the correct column in those ambiguously cases, I do think your examples should go into the Spark programming guide. I've learned a lot! – Ramtil 18/11, 2015 at 13:3

small correction: df2_r = **df2** .select(*(col(x).alias(x + '_df2') for x in df2.columns)) instead of df2_r = df1.select(*(col(x).alias(x + '_df2') for x in df2.columns)). For the rest, good stuff – Rose 21/10, 2019 at 13:36

I agree with this should be part of the Spark programming guide. Pure gold. I was able to finally untangle the source of ambiguity selecting columns by the old names before doing the join. The solution of programmatically appending suffixes to the names of the columns before doing the join all the ambiguity wnet away. – Speechmaker 12/4, 2020 at 19:53

@Ramtil : Did you understand why the renaming was needed df1_a = df1.alias("df1_a") and why we can't use df1 and df2 directly? This answer did not explain why the renaming was needed to make select('df1_a.f') work – Zolazoldi 1/2, 2021 at 17:35

@Zolazoldi It's in application to the original problem where there is one table df being joined with itself. Perhaps the solution would make more sense if it had written df.alias("df1_a") and df.alias("df2_a"). – Ultrasonics 3/3, 2021 at 23:41

G

82

There is a simpler way than writing aliases for all of the columns you are joining on by doing:

df1.join(df2,['a'])

This works if the key that you are joining on is the same in both tables.

See https://kb.databricks.com/data/join-two-dataframes-duplicated-columns.html

Grammer answered 19/6, 2018 at 16:55 Comment(5)

this is the actual answer as of Spark 2+ – Cricket 13/11, 2018 at 16:44

And for Scala: df1.join(df2, Seq("a")) – Lipoid 28/1, 2019 at 12:36

page was moved to: kb.databricks.com/data/… – Copenhagen 21/6, 2019 at 14:27

Glad I kept scrolling, THIS is the much better answer. If columns have different names, then no ambiguity issue. If columns have the same name, do this. There is little reason to every need to deal with ambiguous col names with this method. – Millur 26/1, 2021 at 16:56

I am doing the same but I am joining based on the two columns, this will work with more than one column? if yes, then I don't know why it is not working for me. df1.join(df2,['a','b']) – Lebna 14/9, 2022 at 16:45

A

79

I would recommend that you change the column names for your join.

df1.select(col("a") as "df1_a", col("f") as "df1_f")
   .join(df2.select(col("a") as "df2_a", col("f") as "df2_f"), col("df1_a" === col("df2_a"))

The resulting DataFrame will have schema

(df1_a, df1_f, df2_a, df2_f)