Joining two DataFrames in Spark SQL and selecting columns of only one

I have two DataFrames in Spark SQL (D1 and D2).

I am trying to inner join them, D1.join(D2, "some column"), and get back the data of only D1, not the complete data set.

Both D1 and D2 have the same columns.

Could someone please help me with this?

I am using Spark 1.6.

Striking answered 2/8, 2016 at 13:2 Comment(0)

Let's say you want to join on the "id" column. Then you could write:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._    
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id").select($"d1.*")
Haygood answered 2/8, 2016 at 13:8 Comment(3)
Hi, this answer helps. I have a silly question: what does the $ sign imply? I get an error when I try to write the $ sign in my code. – Striking
It's an alternative way of accessing DataFrame columns, as explained here, but you have to import the implicits package (I've edited my answer). – Haygood
Well, there is a "leftsemi" join option as well. It includes rows from the left table which have a matching row on the right. Also, there is no need to dedup, as the matching rows are included only once. – Orenorenburg
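
Regarding the $ question above, here is a minimal sketch (using the same d1/d2 aliases as the answer) of the equivalent join written with the col() function, which does not require the implicits import:

import org.apache.spark.sql.functions.col

// col("...") builds the same Column as $"...", but without importing sqlContext.implicits._
d1.as("d1").join(d2.as("d2"), col("d1.id") === col("d2.id")).select(col("d1.*"))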

You could use left_semi:

d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id", "left_semi")

A semi-join takes only the rows from the left dataset for which the join condition is met.

There's also another interesting join type: left_anti, which works similarly to left_semi but keeps only those rows where the condition is not met.
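
For completeness, a minimal sketch of the anti-join, assuming the same d1/d2 DataFrames and "id" join column as above:

// keeps only the rows of d1 that have no matching row in d2
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id", "left_anti")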

Defoe answered 30/5, 2019 at 15:13 Comment(0)

As an alternate answer, you could also do the following without adding aliases:

d1.join(d2, d1("id") === d2("id"))
  .select(d1.columns.map(c => d1(c)): _*)
Spake answered 30/1, 2018 at 11:12 Comment(1)
Hi @nsanglar, thanks, this was super helpful. However, if I have a derived column like .withColumn("derived_id", coalesce(d2("id"), d1("id"))) and I want this column along with all the columns of d1, how would I modify the select statement? – Woodcraft
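
One possible way to do what this comment asks (a sketch, not part of the original answer): build the derived column separately and append it to the list of d1's columns inside the select:

import org.apache.spark.sql.functions.coalesce

// derived column built from both sides of the join
val derived = coalesce(d2("id"), d1("id")).as("derived_id")

// all columns of d1, plus the derived column
d1.join(d2, d1("id") === d2("id"))
  .select((d1.columns.map(c => d1(c)) :+ derived): _*)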

You should use a leftsemi join, which is similar to an inner join; the difference is that a leftsemi join returns all columns from the left dataset and ignores all columns from the right dataset.

You can try something like the following in Scala to join Spark DataFrames using the leftsemi join type.

empDF.join(deptDF, empDF("emp_dept_id") === deptDF("dept_id"), "leftsemi")
  .show(false)

If you are using Python, the equivalent PySpark DataFrame join looks like this:

empDF.join(deptDF, empDF.emp_dept_id == deptDF.dept_id, "leftsemi") \
  .show(truncate=False)
Kacerek answered 27/10, 2020 at 14:31 Comment(0)

You can just select the columns that you need from each DataFrame. For example, given df_1:

col_a | col_b | col_c | col_d
v1a   | v2a   | v3a   | v4a
v2a   | v2b   | v3b   | v4b
...

And df_2:

col_a | col_b | col_x | col_y
v1a   | v2a   | vxa   | vya
v2a   | v2b   | vxb   | vyb
...

For example:

merged_df = (df_1.join(df_2, on=["col_a", "col_b"], how="inner")
                 .select(df_1["col_a"], df_1["col_b"], df_1["col_c"], df_2["col_x"], df_2["col_y"]))

Your result, merged_df, will be:

col_a | col_b | col_c | col_x | col_y
v1a   | v2a   | v3a   | vxa   | vya
v2a   | v2b   | v3b   | vxb   | vyb
...
Iceskate answered 25/5 at 6:38 Comment(0)
