How to filter observations for the multiple values passed in the I expression of Pydatatable frame?
Asked Answered
C

1

4

I have a data frame with two columns as shown below,

DT_EX = dt.Frame({'film':['Don','Warriors','Dragon','Chicago','Lion','Don','Chicago','Warriors'],
                  'gross':[400,500,600,100,200,300,900,1000]})

Here in first case i would like to filter the observations whose film is Don or Chicago as written in below code,

DT_EX[((f.film=="Don") | (f.film=="Chicago")),:]

In a second i would apply filter for 3 values as,

DT_EX[((f.film=="Don") | (f.film=="Chicago") | (f.film=="Lion")),:]

In case of filtering for more than 5 or 10 values, we are supposed to make a logical expression for these many values,and it would definatly be a time consuming task.

Is there any datatable way to get it done faster? like there are %in% %chin% kind of filtering options available in R data.table.

Calandra answered 29/4, 2020 at 5:36 Comment(2)
Could you please clarify what would be your expected output, and for users that don't use R (like me) wouldn't understand the question.Elysia
How to specify multiple filter values in the expression of I as per the pydatatable sytax DT[I,J,...]... In the mentioned examples you can see that the frame should return the observations matched with the film names such as Don, Chicago, Lion.. My requirement is that if i have to filter the observations for 10 values out of 50, how should i give in 10 values in a single expression?. in R, it goes like DT[film %in% c("required matches")].. I hope it's clear now.Calandra
R
6

Python equivalent of R's %in operator is called simply in. Unfortunately, this operator hasn't been implemented in datatable yet, the relevant feature request is https://github.com/h2oai/datatable/issues/699.

In the meanwhile, I'd recommend to use the standard reduce functor with or_ operator:

>>> import functools
>>> import operator
>>>
>>> films = ['Lion', 'Chicago', 'Don']
>>> filter = functools.reduce(operator.or_, (f.film == item for item in films))
>>> DT_EX[filter, :]
   | film     gross
-- + -------  -----
 0 | Don        400
 1 | Chicago    100
 2 | Lion       200
 3 | Don        300
 4 | Chicago    900

[5 rows x 2 columns]
Retrogressive answered 29/4, 2020 at 18:53 Comment(1)
In my case the list of logical expressions is too large, so accepted answer results in python kernel die. My workaround: https://mcmap.net/q/1274806/-py-datatable-39-in-39-operatorBastard

© 2022 - 2024 — McMap. All rights reserved.