How to check for intersection of two DataFrame columns in Spark

Using either PySpark or SparkR (preferably both), how can I get the intersection of two DataFrame columns? For example, in SparkR I have the following DataFrames:

newHires <- data.frame(name = c("Thomas", "George", "George", "John"),
                       surname = c("Smith", "Williams", "Brown", "Taylor"))
salesTeam <- data.frame(name = c("Lucas", "Bill", "George"),
                        surname = c("Martin", "Clark", "Williams"))
newHiresDF <- createDataFrame(newHires)
salesTeamDF <- createDataFrame(salesTeam)

# intersect works on entire DataFrames
newSalesHire <- intersect(newHiresDF, salesTeamDF)
head(newSalesHire)

        name  surname
    1 George Williams

#Intersect does not work for single columns
newSalesHire <- intersect(newHiresDF$name, salesTeamDF$name)
head(newSalesHire)

Error in as.vector(y) : no method for coercing this S4 class to a vector

How can I get intersect to work for single columns?

Fiat answered 24/5, 2017 at 21:0 Comment(1)
Cumbersome commented: This works fine in PySpark: spark.createDataFrame(["a","b","x"],StringType()).intersect(spark.createDataFrame(["z","y","x"],StringType()))

The intersect function operates on two Spark DataFrames, not on Column objects such as newHiresDF$name. Use the select function to extract the column you need from each DataFrame as a single-column DataFrame, then intersect those.

In SparkR:

newSalesHire <- intersect(select(newHiresDF, 'name'), select(salesTeamDF,'name'))

In pyspark:

newSalesHire = newHiresDF.select('name').intersect(salesTeamDF.select('name')) 
Cockburn answered 25/5, 2017 at 8:4 Comment(0)
