Let's assume we have the following Spark DataFrame:
+---+------+---+
| id| name|age|
+---+------+---+
| 1|Andrew| 25|
| 1|Andrew| 25|
| 1|Andrew| 26|
| 2| Maria| 30|
+---+------+---+
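For reference, this DataFrame can be built with a minimal sketch like the one below (it assumes an active SparkSession bound to the name spark; the app name is illustrative):

from pyspark.sql import SparkSession

# Local session for experimentation; the builder options are an assumption
spark = SparkSession.builder.appName('dedup-example').getOrCreate()

df = spark.createDataFrame(
    [(1, 'Andrew', 25), (1, 'Andrew', 25), (1, 'Andrew', 26), (2, 'Maria', 30)],
    ['id', 'name', 'age'],
)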
distinct() does not accept any arguments, which means that you cannot select which columns should be taken into account when dropping the duplicates. Thus, the following command will drop the duplicate records, taking into account all the columns of the DataFrame:
df.distinct().show()
+---+------+---+
| id| name|age|
+---+------+---+
| 1|Andrew| 26|
| 2| Maria| 30|
| 1|Andrew| 25|
+---+------+---+
Now, in case you want to drop the duplicates considering ONLY id and name, you'd have to run select() prior to distinct(). For example,
>>> df.select(['id', 'name']).distinct().show()
+---+------+
| id| name|
+---+------+
| 2| Maria|
| 1|Andrew|
+---+------+
But in case you want to drop the duplicates only over a subset of columns, like above, while keeping ALL the columns, then distinct() is not your friend.

dropDuplicates() will drop the duplicates detected over the provided set of columns, but it will also return all the columns appearing in the original DataFrame. Called with no arguments, it is equivalent to distinct():
df.dropDuplicates().show()
+---+------+---+
| id| name|age|
+---+------+---+
| 1|Andrew| 26|
| 2| Maria| 30|
| 1|Andrew| 25|
+---+------+---+
dropDuplicates() is thus more suitable when you want to drop duplicates over a selected subset of columns while also keeping all the columns:
df.dropDuplicates(['id', 'name']).show()
+---+------+---+
| id| name|age|
+---+------+---+
| 2| Maria| 30|
| 1|Andrew| 25|
+---+------+---+
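Note that dropDuplicates() does not guarantee which of the duplicate rows is kept; in the output above, either the age-25 or the age-26 row could have survived for (1, Andrew). If you need a deterministic choice, one option is a window-based sketch like the following (keeping the row with the highest age per (id, name) is just an illustrative assumption):

from pyspark.sql import Window
from pyspark.sql import functions as F

# Rank rows within each (id, name) group, highest age first
w = Window.partitionBy('id', 'name').orderBy(F.col('age').desc())

# Keep only the top-ranked row per group, then drop the helper column
df.withColumn('rn', F.row_number().over(w)).filter(F.col('rn') == 1).drop('rn').show()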
For more details, refer to the article distinct() vs dropDuplicates() in Python.