Pandera validate get all valid rows

Asked 12/11, 2021 at 19:11 Answered 6/12, 2021 at 4:5

I am trying to use the pandera library (I am very new to it) for pandas DataFrame validation. I want to drop the rows that are not valid according to the schema. How can I do that?

For example, the pandera schema looks like this:

import pandera as pa
import pandas as pd
import numpy as np

schema: pa.DataFrameSchema = pa.DataFrameSchema(columns={
  'Col1': pa.Column(str),
  'Col2': pa.Column(float, checks=pa.Check(lambda x: (0 <= x <= 1)), nullable=True),
})

df: pd.DataFrame = pd.DataFrame({
    "Col1": ["1", "2", "3", np.nan],
    "Col2": [0.3, 0.4, 5, 0.2],
})

When I apply validation to the df, I want to get this result:

   Col1  Col2
0     1   0.3
1     2   0.4

The other rows, which contain errors, should be dropped.

Optimism answered 12/11, 2021 at 19:11 Comment(1)

I created and added the pandera tag to this question, so @Prashant, please approve this suggested edit to get the new tag applied to your question. – Pseudoscope 12/11, 2021 at 19:31

pandera author here!

Currently you have to use a try/except block with lazy validation. Note that the SchemaErrors.failure_cases df doesn't always include an index value, for example when a column's type is incorrect. The index is only populated for checks that produce an index-aligned boolean dataframe/series.

By default the check_fn function fed into pa.Check should take a pandas Series as input. I fixed your custom check like so:

import pandera as pa
import pandas as pd
import numpy as np

schema: pa.DataFrameSchema = pa.DataFrameSchema(columns={
  'Col1': pa.Column(str),
  'Col2': pa.Column(
      float, checks=pa.Check(lambda series: series.between(0, 1)), nullable=True
    ),
})

df: pd.DataFrame = pd.DataFrame({
    "Col1": ["1", "2", "3", np.nan],
    "Col2": [0.3, 0.4, 5, 0.2],
})

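try:
    # lazy=True collects every failure instead of raising on the first one;
    # if nothing fails, the validated df is returned unchanged
    filtered_df = schema(df, lazy=True)
except pa.errors.SchemaErrors as exc:
    # drop every row whose index appears in the failure cases
    filtered_df = df[~df.index.isin(exc.failure_cases["index"])]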

print(f"filtered df:\n{filtered_df}")

Output:

filtered df:
  Col1  Col2
0    1   0.3
1    2   0.4

To check value ranges I'd recommend using the built-in pa.Check.in_range check.

In other cases, be aware of the element_wise=True kwarg in pa.Check: it changes the expected signature of the check_fn argument, so the function receives one value at a time instead of a whole Series.

Stair answered 6/12, 2021 at 4:5 Comment(1)

How come there isn't an index provided for rows that fail the initial type check? I want to be able to remove those from the dataframe as well. – Willawillabella 25/1, 2023 at 16:43

This is the first time I have heard of Pandera, but it looks like a cool library. After a bit of digging around, I found that you can catch the validation error and filter out the failing indexes:

fail_index = []
try:
    schema.validate(df)
except pa.errors.SchemaError as ex:
    fail_index = ex.failure_cases['index']

clean_df = df[~df.index.isin(fail_index)]

Documentation about SchemaError: https://pandera.readthedocs.io/en/stable/reference/generated/pandera.errors.SchemaError.html#pandera.errors.SchemaError

Dirkdirks answered 13/11, 2021 at 3:10 Comment(1)

Thanks for replying. I tried that already, but there are two problems: 1. The SchemaError exception is raised as soon as the first invalid row is encountered. 2. validate also has a lazy mode, which raises pa.errors.SchemaErrors after validating all rows, but failure_cases does not always contain an index. It is sometimes None. – Optimism 13/11, 2021 at 3:36

I have the same issue with failure_cases not always having an index - might be a bug (sorry for replying with an answer, I have no reputation).
Here's a minimal reproduction:

import pandas, pandera

df = pandas.DataFrame({"c1": ["9"]})
# other checks also fail, e.g.:
# pandera.Column(str, checks=pandera.Check.le(10))
schema = pandera.DataFrameSchema({"c1": pandera.Column(int)})

try:
    schema.validate(df, lazy=True)
except pandera.errors.SchemaErrors as err:
    print(err.failure_cases)

Output:

  schema_context column           check check_number failure_case index
0         Column     c1  dtype('int64')         None       object  None

I would expect the index to be 0 here, not None. I suspect pandera.Checks do something special which doesn't happen for data type mismatch errors or for errors that occur before the Check finishes (such as the TypeError("operator not supported... you get with the str le check).
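Until that is resolved, one possible workaround is to treat failure cases with a missing index as column-level problems and only drop rows for the rest. The sketch below uses pandas alone, with a hand-built stand-in for failure_cases modeled on the output above. The c2 column, the check label "in_range(0, 1)", and the variable names are invented for illustration and are not real pandera output.

```python
import pandas as pd

# Hand-built stand-in for SchemaErrors.failure_cases (illustrative values only)
failure_cases = pd.DataFrame({
    "column": ["c1", "c2"],
    "check": ["dtype('int64')", "in_range(0, 1)"],
    "failure_case": ["object", 5.0],
    "index": [None, 2],
})

df = pd.DataFrame({"c1": ["9", "8", "7"], "c2": [0.1, 0.2, 5.0]})

# Failures with an index point at specific rows; the rest concern a whole column
row_level = failure_cases[failure_cases["index"].notna()]
column_level = failure_cases[failure_cases["index"].isna()]

# Drop the offending rows, and report the columns that failed as a whole
clean_df = df[~df.index.isin(row_level["index"])]
unusable_columns = column_level["column"].unique().tolist()

print(clean_df)
print(unusable_columns)
```

The column-level failures still have to be handled separately, for example by coercing the column's dtype before validating again.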

Stranger answered 2/12, 2021 at 3:30 Comment(0)
