py-datatable 'in' operator?
I am unable to perform a standard in operation with a pre-defined list of items. I am looking to do something like this:

# Construct a simple example frame
from datatable import *
df = Frame(V1=['A','B','C','D'], V2=[1,2,3,4])

# Filter frame to a list of items (THIS DOES NOT WORK)
items = ['A','B']
df[f.V1 in items,:]

This example results in the error:

TypeError: A boolean value cannot be used as a row selector

Unfortunately, there doesn't appear to be a built-in object for in operations. I would like to use something like the %in% operator that is native to the R language. Is there any method for accomplishing this in Python?

I can take this approach with the use of multiple 'equals' operators, but this is inconvenient when you want to consider a large number of items:

df[(f.V1 == 'A') | (f.V1 == 'B'),:]

datatable 0.10.1
python 3.6

Hewie answered 14/6, 2020 at 22:6 Comment(3)
Pandas has Series.isin for this, but it doesn't look like datatable has anything similar. (The datatable documentation seems really sparse.)Farver
@martineau: The question needs a version of in that broadcasts over the LHS, which Python's in can't do.Farver
See also: stackoverflow.com/questions/61494957Towboat
You could also try this out:

First import all the necessary packages as,

import datatable as dt
from datatable import by,f,count
import functools
import operator

Create a sample datatable:

DT = dt.Frame(V1=['A','B','C','D','E','B','A'], V2=[1,2,3,4,5,6,7])

Make a list of values to be filtered among the observations, in your case it is

sel_obs = ['A','B']

Now create a filter expression using the functools and operator modules:

filter_rows = functools.reduce(operator.or_,(f.V1==obs for obs in sel_obs))

Finally apply the above created filter on datatable

DT[filter_rows,:]

Its output is:

Out[6]: 
   | V1  V2
-- + --  --
 0 | A    1
 1 | B    2
 2 | B    6
 3 | A    7

[4 rows x 2 columns]

You can just play around with operators to do different types of filtering.

@sammyweemy's solution should also work.

Charis answered 16/6, 2020 at 2:55 Comment(5)
Nice play with functools. The next step will probably be speed tests, as pydatatable's primary aim (I might be wrong) is speed.Marte
This is much faster as the number of rows increase.Marte
The datatable goal as stated by the original author Matt Dowle is "It provides a high-performance version of base R's data.frame with syntax and feature enhancements for ease of use, convenience and programming speed.". Although this requires the use of two other modules, I think it's the best answer in the interim as the pydatatable team develops the capability to use the in operator.Hewie
still too slow to filter 1000 out of 1M row - quite disappointingUncomfortable
@Uncomfortable how slow is slow? what's the benchmark? what are you comparing it against?Marte
It turns out that when you pass a list of expressions as a row selector to Python datatable, it combines them with a logical or.

So you can just do:

import datatable
df = datatable.Frame(V1=['A','B','C','D'], V2=[1,2,3,4])

items = ['A','B']
df[[datatable.f.V1 == i for i in items],:]

Note that there are some caveats: this behaviour isn't described in the docs, and I don't know whether it will always work. Moreover, it only works for filtering on a single column - if you tried to filter rows where V1 == 'A' or V2 == 1, the list approach would create duplicates.

If you need fancier filtering, you can adjust the filter expression inside the list, such as:

df[([(datatable.f.V1 == i) & (datatable.f.V2 >= 2) for i in items]),:]

This returns just the second row of the example, as expected.

Pivot answered 29/4, 2021 at 16:32 Comment(2)
Any idea why this doesn't work for != ? ie df[[datatable.f.V1 != i for i in items],:] Would expect to keep rows where V1 is 'C' and 'D' instead looks like we append two extra rows?Avelin
@Avelin I was rather lazy with my wording when I wrote this answer, it seems. In your case it basically filters all the rows where V1 is not A, and then all the rows where V1 is not B. You can run df[:,[datatable.f.V1 != i for i in items]] to get a better idea of what the expression returns. In your case, you can do something like: df[sum([(datatable.f.V1 != i) for i in items]) == len(items),:] or perhaps df[[ i not in items for i in df[:, 'V1'].to_numpy()],:], if converting one column to numpy is not an issue. Note that there is now a section on filtering datatables via list comprehension in the docs:Pivot
I have not found an in function in pydatatable; however, there is a hack for your use case:

items = ['A','B']
regex = '|'.join(items)
df[f.V1.re_match(regex),:]


   | V1  V2
-- + --  --
 0 | A    1
 1 | B    2

[2 rows x 2 columns]

This is from an answer to a different question: link. I could not find this function in the docs either. Hopefully, with time, the documentation will improve and more functions will be included as well.

Marte answered 15/6, 2020 at 9:40 Comment(1)
re_match is deprecated. Use re.match instead: datatable.readthedocs.io/en/latest/api/re/match.htmlEscallop
In my case the list of items is very large, so the accepted answer kills the Python kernel. My workaround is to create a temporary Frame with the list as its key and a column holding an arbitrary value for all rows, then left-join my Frame with the temporary Frame and keep all rows where that column is not NA:

import datatable as dt
from datatable import f, join

DT1 = dt.Frame(A=['a', 'b', 'c', 'd'])
TEMP_DT = dt.Frame(A=['a', 'b'], FOO=[1, 1])
TEMP_DT.key = 'A'

DT1[:, :, join(TEMP_DT)][~dt.isna(f.FOO), :]
Magog answered 25/8, 2023 at 20:32 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.