I am working with vaex in Python, and I don't know how to drop duplicate rows in a dataframe. In pandas, for example, there is the method drop_duplicates(). Is there a similar function in vaex?
Drop duplicate rows in python vaex
It seems there is none yet, but we should expect this functionality at some point.
In the meantime, there is an attempt from the creator of vaex
I went with this groupby approach:
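import vaex

# Sample data: only the row (3, 'c', 0) occurs twice.
df = vaex.from_arrays(x=[1, 2, 3, 4, 1, 2, 3, 4],
                      s=['a', 'b', 'c', 'd', 'A', 'b', 'c', 'D'],
                      q=[0, 0, 0, 0, 0, 1, 0, 0])
# Helper numeric column, needed only so there is something to aggregate.
df['new'] = df.x
# Group over all original columns, then keep only those columns.
dfg = df.groupby(['x', 's', 'q']).agg({'new': "sum"})['x', 's', 'q']
dfg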
So basically you add a numeric column, group over the original columns, sum the new column, and then discard that sum. What remains is the list of unique (grouped) combinations of the original columns.
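To make the logic concrete without depending on vaex, here is a minimal plain-Python sketch of the same idea, using the sample data from above. It is not vaex code and not how vaex implements groupby; the function name drop_duplicates_by_grouping is hypothetical, and a row count stands in for the summed helper column.

```python
from collections import defaultdict


def drop_duplicates_by_grouping(columns):
    """Deduplicate rows by grouping on every column.

    `columns` maps a column name to a list of values (all lists equally long).
    Hypothetical helper for illustration only; it is not part of vaex.
    """
    names = list(columns)
    # One group per distinct row. The count is the throwaway aggregate,
    # standing in for the summed helper column in the answer.
    groups = defaultdict(int)
    for row in zip(*(columns[name] for name in names)):
        groups[row] += 1
    # Discard the aggregate and keep only the group keys (the unique rows).
    unique_rows = list(groups)  # dicts keep first-seen order
    return {name: [row[i] for row in unique_rows]
            for i, name in enumerate(names)}


data = {
    "x": [1, 2, 3, 4, 1, 2, 3, 4],
    "s": ["a", "b", "c", "d", "A", "b", "c", "D"],
    "q": [0, 0, 0, 0, 0, 1, 0, 0],
}
deduped = drop_duplicates_by_grouping(data)
print(len(deduped["x"]))  # 7: only the row (3, 'c', 0) was duplicated
```

This sketch keeps rows in first-seen order; the row order of a vaex groupby result may differ.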
This works, but keep in mind that the output is held in memory. If your group-by output is too big to fit in RAM, this approach will not work. – Hiller
Surely vaex works out of core, so being too big to fit in RAM is not an issue? – Pluckless
It does, and the groupby aggregation is out of core as well, but the resulting dataframe is in memory. So just be careful when doing groupbys with lots of columns. – Hiller