Drop a column with same name using column index in pyspark

Asked 18/12, 2019 at 18:35 Answered 15/5 at 22:23

This is my dataframe. I'm trying to drop the duplicate columns that share a name, using their index:

df = spark.createDataFrame([(1,2,3,4,5)],['c','b','a','a','b'])
df.show()

Output:

+---+---+---+---+---+
|  c|  b|  a|  a|  b|
+---+---+---+---+---+
|  1|  2|  3|  4|  5|
+---+---+---+---+---+

I got the index of each column:

col_dict = {x: col for x, col in enumerate(df.columns)}
col_dict

Output:

{0: 'c', 1: 'b', 2: 'a', 3: 'a', 4: 'b'}

Now I need to drop the duplicate columns that share a name.

Ascribe answered 18/12, 2019 at 18:35

There is no method for dropping columns by index. One way to achieve this is to rename the duplicate columns and then drop them.

Here is an example you can adapt:

df_cols = df.columns
# get the index of the first occurrence of each name that appears exactly twice
duplicate_col_index = list(set([df_cols.index(c) for c in df_cols if df_cols.count(c) == 2]))

# rename by adding suffix '_duplicated'
for i in duplicate_col_index:
    df_cols[i] = df_cols[i] + '_duplicated'

# rename the column in DF
df = df.toDF(*df_cols)

# remove flagged columns
cols_to_remove = [c for c in df_cols if '_duplicated' in c]
df.drop(*cols_to_remove).show()

+---+---+---+
|  c|  a|  b|
+---+---+---+
|  1|  4|  5|
+---+---+---+
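The snippet above handles names that appear exactly twice. A minimal sketch of a more general variant, assuming you want to keep exactly one occurrence of every name: give every column a temporary positional name, select the positions to keep, then restore the original names. The helper plan_dedup and the temporary names _c0, _c1, ... are hypothetical and not part of PySpark; the planning step is plain Python, and only the commented last line touches the DataFrame.

```python
def plan_dedup(columns, keep="last"):
    """Plan how to keep one occurrence of each column name.

    Returns a tuple (temp_names, kept_temp_names, final_names):
      temp_names:      one unique positional name per column ('_c0', '_c1', ...)
      kept_temp_names: temporary names of the occurrences to keep, in original order
      final_names:     original names of the kept columns
    """
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")

    # Map each name to the index of the occurrence we want to keep.
    chosen = {}
    for i, name in enumerate(columns):
        if keep == "last" or name not in chosen:
            chosen[name] = i

    kept_indexes = sorted(chosen.values())
    temp_names = [f"_c{i}" for i in range(len(columns))]
    kept_temp_names = [temp_names[i] for i in kept_indexes]
    final_names = [columns[i] for i in kept_indexes]
    return temp_names, kept_temp_names, final_names


temp, kept, final = plan_dedup(['c', 'b', 'a', 'a', 'b'], keep="last")
print(kept, final)

# With a Spark DataFrame `df` (assumed to exist), the plan could be applied as:
# df = df.toDF(*temp).select(*kept).toDF(*final)
```

Because every column is renamed by position before the select, this does not depend on a suffix and works for names that appear three or more times. Passing keep="first" keeps the earliest occurrence instead.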

Schlock answered 18/12, 2019 at 22:48

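@blackbishop's answer is a good one, and I voted for it. However, there is one potential issue: if the DataFrame already has a unique column whose name contains _duplicated (for example a_duplicated), the substring check will drop that column as well. This is unlikely, but with large volumes of user-submitted data it is a real concern. The version below picks the columns to remove by index instead of by substring match: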

df_cols = df.columns
# get the index of the first occurrence of each name that appears exactly twice
duplicate_col_index = list(set([df_cols.index(c) for c in df_cols if df_cols.count(c) == 2]))

# rename by adding suffix '_duplicated'
for i in duplicate_col_index:
    df_cols[i] = df_cols[i] + '_duplicated'

# rename the column in DF
df = df.toDF(*df_cols)

# remove flagged columns
cols_to_remove = [df_cols[i] for i in duplicate_col_index]  # changed: pick the renamed columns by index

df.drop(*cols_to_remove).show()
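One remaining edge case: if a column named a_duplicated already exists, renaming a duplicate a to a_duplicated creates a new name clash, and dropping by that name could remove the original column too. A minimal sketch of choosing a suffix that cannot clash, assuming plain string column names; safe_suffix is a hypothetical helper, not a PySpark function.

```python
def safe_suffix(columns, base="_duplicated"):
    """Return a suffix such that name + suffix is not an existing column name."""
    existing = set(columns)
    suffix = base
    n = 0
    # Append a counter until no renamed column would equal an existing name.
    while any(name + suffix in existing for name in columns):
        n += 1
        suffix = f"{base}{n}"
    return suffix


cols = ['a', 'a', 'a_duplicated']
suffix = safe_suffix(cols)
print(suffix)
```

The returned suffix can then be used in place of the hard-coded '_duplicated' in the rename loop above.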

Valentine answered 15/5 at 22:23
