Python data.table row filter by regex

Asked 10/2, 2019 at 21:29 Answered 22/7 at 15:11

What is the equivalent of R data.table's %like% in datatable for Python?

Short example:

import datatable as dt
from datatable import f

dt_foo_bar = dt.Frame({"n": [1, 3], "s": ["foo", "bar"]})
dt_foo_bar[f.s == "foo", :]  # works to filter by "foo"

I had expected something like this to work:

dt_foo_bar[re.match("fo",f.s),:]

But it raises a TypeError: "expected string or bytes-like object". I'd love to start using the new datatable package in Python the way I use data.table in R, but I work with text data far more than with numeric data.

Thanks in advance.

Henleyonthames asked 10/2, 2019 at 21:29 Comment(0)

Since version 0.9.0, datatable has included the .re_match() method, which performs regular expression filtering. For example:

>>> import datatable as dt
>>> dt_foo_bar = dt.Frame(N=[1, 3, 5], S=["foo", "bar", "fox"])
>>> dt_foo_bar[dt.f.S.re_match("fo."), :]
     N  S  
--  --  ---
 0   1  foo
 1   5  fox

[2 rows x 2 columns]

In general, .re_match() applies to a column expression and produces a new boolean column indicating whether each value matches the given regular expression or not.
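To make the "boolean column" idea concrete, here is a plain-Python sketch of the kind of mask such a filter produces. It assumes the pattern has to match the entire string (that is my reading of the datatable docs, so treat it as an assumption), and match_mask is a hypothetical helper, not a datatable function.

```python
import re


def match_mask(values, pattern):
    """Return one boolean per value: True when the whole string matches."""
    compiled = re.compile(pattern)
    # Missing values (None) never match.
    return [v is not None and compiled.fullmatch(v) is not None for v in values]


names = ["foo", "bar", "fox"]

print(match_mask(names, "fo."))  # [True, False, True]
print(match_mask(names, "fo"))   # [False, False, False]
```

Under that assumption, the pattern "fo" from the question selects nothing, while "fo." or "fo.*" selects both "foo" and "fox".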

Saire answered 6/3, 2019 at 20:13 Comment(3)

I could not find this feature, or any string related data processing in the documentation. – Vesuvius 13/6, 2020 at 8:38

Do you also know how to generate a new column based on the regex? I use this at the moment, but it doesn't look like the best way with the to_list conversion: DT['new_name'] = Frame([re.sub('some_regex_pattern','value_for_new_column', s) for s in DT[:, "column_for_regex"].to_list()[0]]) – Stanwin 15/6, 2020 at 11:33

@Stanwin I'm afraid that's not possible right now – Saire 16/6, 2020 at 22:1
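Until there is a native option, the list round-trip from the comment above can at least be tidied into a small helper. This is a minimal sketch using only the standard library; regex_replace_column and the sample data are hypothetical and not part of datatable.

```python
import re


def regex_replace_column(values, pattern, replacement):
    """Apply re.sub to every string in a list, passing None (missing) through."""
    compiled = re.compile(pattern)
    return [None if v is None else compiled.sub(replacement, v) for v in values]


# Hypothetical sample data standing in for DT[:, "column_for_regex"].to_list()[0]
raw = ["foo-1", "bar-22", None]
cleaned = regex_replace_column(raw, r"-\d+$", "")
print(cleaned)  # ['foo', 'bar', None]
```

The returned list can then be wrapped in a Frame and assigned as a new column, as in the comment above. Passing None through avoids the TypeError that re.sub raises on missing values.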

Since version 1.0.0, datatable contains the re.match() function, which tests whether values in a string column match a regular expression. Sticking with @Pasha's example for consistency:

import datatable as dt


dt_foo_bar = dt.Frame(N=[1, 3, 5], S=["foo", "bar", "fox"])
dt_foo_bar[dt.re.match(dt.f.S, "fo."), :]
     N  S  
--  --  ---
 0   1  foo
 1   5  fox

[2 rows x 2 columns]

See datatable re.match docs for further details.

The older re_match function has been deprecated since version 1.0.0 and will be removed in 1.1.0.

Laugh answered 22/7 at 10:18 Comment(0)

With pandas, this can be done using df[df['column_name'].str.contains(pattern, regex=True)].

Example with pandas:

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({"n": [1, 3], "s": ["foo", "bar"]})

# Filter rows where the 's' column matches the regex 'r$' (values ending in 'r')
pattern = "r$"
filtered_df = df[df['s'].str.contains(pattern, regex=True)]

print(filtered_df)

OUTPUT:

   n    s
1  3  bar

Explanation:

Create a DataFrame: Use pandas to create a DataFrame, similar to how you'd create a data.table in Python.

Filter Rows: Use str.contains() to filter rows based on whether the 's' column contains the specified pattern. The regex=True argument allows for regular expression matching.
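One thing worth knowing is that str.contains() searches anywhere in the string, which is looser than a whole-string match. A short sketch of the difference, assuming pandas 1.1 or later (where str.fullmatch() is available); the three-row frame is just illustrative data:

```python
import pandas as pd

df = pd.DataFrame({"n": [1, 3, 5], "s": ["foo", "bar", "fox"]})

# str.contains: the pattern may match anywhere inside the value
contains_fo = df[df["s"].str.contains("fo", regex=True)]

# str.fullmatch: the pattern must cover the entire value
fullmatch_fo = df[df["s"].str.fullmatch("fo")]
fullmatch_fo_dot = df[df["s"].str.fullmatch("fo.")]

print(contains_fo["s"].tolist())       # ['foo', 'fox']
print(fullmatch_fo["s"].tolist())      # []
print(fullmatch_fo_dot["s"].tolist())  # ['foo', 'fox']
```

So a partial pattern such as "fo" works with str.contains(), but needs to be extended to "fo." or "fo.*" when whole-string matching is used.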

Aeriela answered 22/7 at 15:11 Comment(0)
