Drop duplicate rows in python vaex

Asked 16/7, 2020 at 14:42 Answered 10/12, 2021 at 16:7

I am working with vaex in Python, and I don't know how to drop duplicate rows from a dataframe. pandas, for example, has the drop_duplicates() method. Is there a similar function in vaex?

Alarmist answered 16/7, 2020 at 14:42 Comment(0)

It seems there is none yet, but we should expect this functionality at some point.

In the meantime, there is an attempt from the creator of vaex

Tasia answered 27/2, 2021 at 18:48 Comment(0)

I went with this groupby approach:

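import vaex

# Sample data: only the row (3, 'c', 0) appears twice.
df = vaex.from_arrays(x=[1, 2, 3, 4, 1, 2, 3, 4],
                      s=['a', 'b', 'c', 'd', 'A', 'b', 'c', 'D'],
                      q=[0, 0, 0, 0, 0, 1, 0, 0])
# Helper numeric column so there is something to aggregate.
df['new'] = df.x
# Group on all original columns, then keep only those columns.
dfg = df.groupby(['x', 's', 'q']).agg({'new': "sum"})['x', 's', 'q']
dfg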

The idea is to add a numeric helper column, group by all of the original columns, sum the helper column, and then drop that sum. What remains is one row per unique combination of the original columns.
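To see why grouping on every column removes duplicates, here is a minimal sketch of the same logic in plain Python. It does not use vaex at all; `collections.Counter` stands in for the group-by, and the names `groups` and `unique_rows` are made up for illustration.

```python
from collections import Counter

# Same sample data as in the answer, as plain Python lists (no vaex needed).
x = [1, 2, 3, 4, 1, 2, 3, 4]
s = ['a', 'b', 'c', 'd', 'A', 'b', 'c', 'D']
q = [0, 0, 0, 0, 0, 1, 0, 0]

# "Group by" every column: each distinct (x, s, q) tuple is one group,
# and the Counter value is the aggregate (how many rows fell in the group).
groups = Counter(zip(x, s, q))

# Dropping the aggregate leaves exactly one row per group, i.e. the
# de-duplicated rows, in first-seen order.
unique_rows = list(groups)

print(len(unique_rows))      # 7
print(groups[(3, 'c', 0)])   # 2
```

Only (3, 'c', 0) occurs twice, so 8 input rows become 7. Rows that differ only in letter case, such as (1, 'a', 0) and (1, 'A', 0), count as distinct, and the same holds for the vaex group-by.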

Pluckless answered 10/12, 2021 at 16:7 Comment(3)

This works, but keep in mind that the output is held in memory. If your group-by output is too big to fit in RAM, this approach will not work. – Hiller 18/12, 2021 at 22:41

Surely vaex works out of core, so output that is too big to fit in RAM is not an issue? – Pluckless 20/12, 2021 at 5:50

It does, and the groupby aggregation is also out of core, but the resulting dataframe is in memory. So just be careful when doing groupbys with lots of columns. – Hiller 20/12, 2021 at 9:12
