py-datatable 'in' operator?
I am unable to perform a standard in operation with a pre-defined list of items. I am looking to do something like this:

# Construct a simple example frame
from datatable import *
df = Frame(V1=['A','B','C','D'], V2=[1,2,3,4])

# Filter frame to a list of items (THIS DOES NOT WORK)
items = ['A','B']
df[f.V1 in items,:]

This example results in the error:

TypeError: A boolean value cannot be used as a row selector

Unfortunately, there doesn't appear to be a built-in object for in operations. I would like to use something like the %in% operator that is native to the R language. Is there any method for accomplishing this in Python?

I can take this approach with the use of multiple 'equals' operators, but this is inconvenient when you want to consider a large number of items:

df[(f.V1 == 'A') | (f.V1 == 'B'),:]

datatable 0.10.1
python 3.6

Hewie answered 14/6, 2020 at 22:6 Comment(3)
Pandas has Series.isin for this, but it doesn't look like datatable has anything similar. (The datatable documentation seems really sparse.)Farver
@martineau: The question needs a version of in that broadcasts over the LHS, which Python's in can't do.Farver
See also: stackoverflow.com/questions/61494957Towboat
You could also try this out:

First import all the necessary packages as,

import datatable as dt
from datatable import by,f,count
import functools
import operator

Create a sample datatable:

DT = dt.Frame(V1=['A','B','C','D','E','B','A'], V2=[1,2,3,4,5,6,7])

Make a list of values to be filtered among the observations, in your case it is

sel_obs = ['A','B']

Now create a filter expression using the functools and operator modules:

filter_rows = functools.reduce(operator.or_,(f.V1==obs for obs in sel_obs))

Finally apply the above created filter on datatable

DT[filter_rows,:]

Its output is:

Out[6]: 
   | V1  V2
-- + --  --
 0 | A    1
 1 | B    2
 2 | B    6
 3 | A    7

[4 rows x 2 columns]

You can just play around with operators to do different types of filtering.

@sammyweemy's solution should also work.

Charis answered 16/6, 2020 at 2:55 Comment(5)
Nice play with functools. The next step will probably be speed tests, as pydatatable's primary aim (I might be wrong) is speed.Marte
This is much faster as the number of rows increase.Marte
The datatable goal as stated by the original author Matt Dowle is "It provides a high-performance version of base R's data.frame with syntax and feature enhancements for ease of use, convenience and programming speed.". Although this requires the use of two other modules, I think it's the best answer in the interim as the pydatatable team develops the capability to use the in operator.Hewie
still too slow to filter 1000 out of 1M row - quite disappointingUncomfortable
@Uncomfortable how slow is slow? what's the benchmark? what are you comparing it against?Marte
It turns out that when you pass a list of expressions as a row selector to Python datatable, it combines them with a logical or.

So you can just do:

import datatable
df = datatable.Frame(V1=['A','B','C','D'], V2=[1,2,3,4])

items = ['A','B']
df[[datatable.f.V1 == i for i in items],:]

Note that there are some caveats: this behaviour isn't described in the docs, and I don't know whether it will always work. Moreover, it only works for filtering on a single column - if you tried to filter rows where V1 == 'A' or V2 == 1, the list approach would create duplicates.

If you need fancier filtering, you can adjust the filter expression inside the list, such as:

df[([(datatable.f.V1 == i) & (datatable.f.V2 >= 2) for i in items]),:]

This returns just the second row of the example, as expected.

Pivot answered 29/4, 2021 at 16:32 Comment(2)
Any idea why this doesn't work for != ? ie df[[datatable.f.V1 != i for i in items],:] Would expect to keep rows where V1 is 'C' and 'D' instead looks like we append two extra rows?Avelin
@Avelin I was rather lazy with my wording when I wrote this answer, it seems. In your case it basically filters all the rows where V1 is not A, and then all the rows where V1 is not B. You can run df[:,[datatable.f.V1 != i for i in items]] to get a better idea of what the expression returns. In your case, you can do something like: df[sum([(datatable.f.V1 != i) for i in items]) == len(items),:] or perhaps df[[ i not in items for i in df[:, 'V1'].to_numpy()],:], if converting one column to numpy is not an issue. Note that there is now a section on filtering datatables via list comprehension in the docs:Pivot
I have not found an in function in pydatatable; however, there is a hack for your use case:

items = ['A','B']
regex = '|'.join(items)
df[f.V1.re_match(regex),:]


   | V1  V2
-- + --  --
 0 | A    1
 1 | B    2

[2 rows x 2 columns]

This is from an answer to a different question: link. I could not find this function in the docs either. Hopefully, with time, the documentation will improve and more functions will be included as well.

Marte answered 15/6, 2020 at 9:40 Comment(1)
re_match is deprecated. Use re.match instead: datatable.readthedocs.io/en/latest/api/re/match.htmlEscallop
In my case the list of items is very large, so the accepted answer kills the Python kernel. My workaround is to create a temporary Frame with the list as its key and a column holding an arbitrary value for all rows, then left-join my Frame with the temporary Frame and keep all rows where that column is not NA:

import datatable as dt
from datatable import f, join

DT1 = dt.Frame(A=['a', 'b', 'c', 'd'])
TEMP_DT = dt.Frame(A=['a', 'b'], FOO=[1, 1])
TEMP_DT.key = 'A'

DT1[:, :, join(TEMP_DT)][~dt.isna(f.FOO), :]
Magog answered 25/8, 2023 at 20:32 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.