I have created two DataFrames in PySpark from my Hive tables as follows:
data1 = spark.sql("""
SELECT ID, MODEL_NUMBER, MODEL_YEAR ,COUNTRY_CODE
from MODEL_TABLE1 where COUNTRY_CODE in ('IND','CHN','USA','RUS','AUS')
""")
Each country has millions of unique IDs in alphanumeric format.
data2 = spark.sql("""
SELECT ID,MODEL_NUMBER, MODEL_YEAR, COUNTRY_CODE
from MODEL_TABLE2 where COUNTRY_CODE in ('IND','CHN')
""")
I want to join these two DataFrames on the ID column using PySpark.
How can I repartition the data so that it is distributed uniformly across the partitions?
Can I use the following to repartition my data?
newdf1 = data1.repartition(100, "ID")
newdf2 = data2.repartition(100, "ID")
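To check whether this actually spreads the rows evenly, I was thinking of something like the sketch below. It is only a rough outline: the helper names (partition_counts, skew_ratio, join_on_id) are hypothetical names of my own, the inner join type is an assumption, and 100 is just the partition count from above, not a tuned value.

```python
def partition_counts(df):
    """Return the row count of each non-empty partition of a Spark DataFrame."""
    # Imported lazily so skew_ratio below can be used without a Spark install.
    from pyspark.sql.functions import spark_partition_id

    rows = df.groupBy(spark_partition_id().alias("pid")).count().collect()
    # Partitions with zero rows do not appear in the grouped result.
    return [row["count"] for row in rows]


def skew_ratio(counts):
    """Largest partition size divided by the mean size (1.0 means perfectly even)."""
    if not counts:
        return 0.0
    mean = sum(counts) / len(counts)
    return max(counts) / mean if mean else 0.0


def join_on_id(left, right, num_partitions=100):
    """Repartition both sides on ID with the same partition count, then join."""
    left_p = left.repartition(num_partitions, "ID")
    right_p = right.repartition(num_partitions, "ID")
    # Inner join is an assumption; the join type is not specified above.
    return left_p.join(right_p, on="ID", how="inner")
```

With this, skew_ratio(partition_counts(newdf1)) close to 1.0 would suggest the IDs are spread evenly, while a much larger value would point to a few oversized partitions.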
What would be the best way to partition the data so that the join runs faster?