Drop duplicate rows in python vaex

I am working with Vaex in Python and can't find a way to drop duplicate rows from a dataframe. pandas, for example, has the drop_duplicates() method. Is there a similar function in Vaex?

Alarmist answered 16/7, 2020 at 14:42 Comment(0)

It seems there is none yet, but we should expect this functionality at some point.

In the meantime, there is an attempt from the creator of vaex

Tasia answered 27/2, 2021 at 18:48 Comment(0)

I went with this groupby approach:

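import vaex

# Sample data: the row (3, 'c', 0) appears twice
df = vaex.from_arrays(x=[1, 2, 3, 4, 1, 2, 3, 4],
                      s=['a', 'b', 'c', 'd', 'A', 'b', 'c', 'D'],
                      q=[0, 0, 0, 0, 0, 1, 0, 0])
# Helper numeric column, used only as something to aggregate
df['new'] = df.x
# Group by every original column, then keep only those columns
dfg = df.groupby(['x', 's', 'q']).agg({'new': 'sum'})[['x', 's', 'q']]
dfg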

The idea is to add a helper numeric column, group by all of the original columns, sum the helper column, and then drop that sum. What remains is one row per unique combination of the original columns.

Pluckless answered 10/12, 2021 at 16:7 Comment(3)
This works, but keep in mind that the output is in memory. If your group-by output is too big to fit in RAM, this approach will not work. - Hiller
Surely vaex works out of core, so being too big to fit in RAM is not an issue? - Pluckless
It does, and the groupby aggregation is out of core as well, but the resulting dataframe is in memory. So just be careful when doing groupbys over lots of columns. - Hiller
