Redshift/Postgres: how can I ignore rows that generate errors? (Invalid JSON in json_extract_path_text)
W

9

10

I'm trying to run a query in redshift where I'm selecting using json_extract_path_text. Unfortunately, some of the JSON entries in this database column are invalid.

What happens: When the query hits an invalid JSON value, it stops with a "JSON parsing error".

What I want: Ignore any rows with invalid JSON in that column, but return any rows where it can parse the JSON.

Why I can't make it do what I want: I don't think I understand error handling in Redshift/Postgres. It should be possible to simply skip any rows that generate errors, but I tried entering EXEC SQL WHENEVER SQLERROR CONTINUE (based on the Postgres docs) and got a "syntax error at or near SQLERROR".

Wyler answered 14/8, 2014 at 21:14 Comment(6)
JSON is either valid or it isn't. You can't just "skip" a corrupted section, because that might break everything after that point. You would need to correct the bad parts yourself, or fix the source that's generating the bad JSON.Pyelonephritis
I'm not skipping a section of JSON, I want to skip the entire database row which contains the corrupt JSON. Or really, just skip any row that generates errors during processing.Wyler
Er ... Redshift doesn't have json features, does it? What's the output of SELECT version() ? Are you sure you're not using Amazon RDS PostgreSQL 9.3 instead? As for the command you supplied, that's for Embedded SQL in C (ecpg), and doesn't apply here at all. You'd want a PL/PgSQL BEGIN ... EXCEPTION block.Avert
This is JSON in a VARCHAR(65535) -- which I did not design. The function I'm talking about is indeed part of redshift. If someone wrote up an example of the BEGIN ... EXCEPTION block I'd probably accept it.Wyler
Running into same problem, need to skip bad json rows, did you figure out any more on this?Preinstruct
@KyleGobel, the best way seems to be to wrap the json call in a new function that uses an exception block, similar to this. Unfortunately for my particular case, I was using a third-party interface (Periscope) that didn't allow creating functions, and I have been forced to move on to another task leaving this problem unresolved for the time being.Wyler
J
11

Create a python UDF:

create or replace function f_json_ok(js varchar(65535))
returns boolean
immutable
as $$
    # keep NULL input as NULL rather than flagging it as invalid
    if js is None:
        return None

    import json
    try:
        json.loads(js)
        return True
    except:
        return False
$$ language plpythonu;

Use it like so:

select *
from schema.table
where 'DesiredValue' = 
    case 
        when f_json_ok(json_column) then json_extract_path_text(json_column, 'Key') 
        else 'nope' 
    end 
Jog answered 25/5, 2016 at 15:45 Comment(3)
Currently, if NULL is passed into the function, the function would return False. However, in databases, I think NULL should remain NULL. So in my version, after the import, I added if js is None: return None.Entresol
Can't you change this to js text instead of varchar? Also, parsed = is not needed; you're doing nothing with it, so why assign it?Raviv
@JavierBuzzi no: You can create an Amazon Redshift table with a TEXT column, but it is converted to a VARCHAR(256) column that accepts variable-length values with a maximum of 256 characters. docs.aws.amazon.com/redshift/latest/dg/r_Character_types.htmlJog
S
6

Edit: it seems like Redshift only supports Python UDFs so this answer will not work. I'm going to leave this answer here for posterity (and in the event someone finds this who isn't using Redshift).

Potentially relevant: here is a plpgsql function which will try to decode JSON and return a default value if that fails:

CREATE OR REPLACE FUNCTION safe_json(i text, fallback json) RETURNS json AS $$
BEGIN
    RETURN i::json;
EXCEPTION
    WHEN others THEN
        RETURN fallback;
END;
$$ LANGUAGE plpgsql IMMUTABLE RETURNS NULL ON NULL INPUT;

Then you can use it like this:

SELECT
    …
FROM (
    SELECT safe_json(my_text, '{"error": "invalid JSON"}'::json) AS my_json
    FROM my_table
) as x

to guarantee that you'll always have valid JSON.

Sulfanilamide answered 26/7, 2015 at 14:4 Comment(2)
This apparently doesn't work in redshift because "[0A000] ERROR: Create Function is not supported for language plpgsql."Hale
Whoops. That's annoying. I've added a note that this will not work on Redshift (but left it around in case it's useful for other people).Sulfanilamide
P
2

I assume the JSON data is actually stored in a TEXT column rather than a JSON column (otherwise you would not have been able to store non-JSON in there in the first place).

If there is some pattern to the data that would allow you to make a regex that detects the valid rows, or the invalid ones, then you could use a CASE statement. For example:

SELECT CASE
    WHEN mycol !~ 'not_json' THEN json_extract_path_text(mycol, ....)
    ELSE NULL
END AS mystuff
...

replacing not_json with a regex that detects the non-JSON formatted values.
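For example, if the valid values are always JSON objects or arrays (an assumption about your data; this is a crude heuristic, not a real JSON validator), a pattern as simple as "starts with { or [" might be enough. Here 'Key' and mytable are placeholders:

SELECT CASE
    WHEN mycol ~ '^[{[]' THEN json_extract_path_text(mycol, 'Key')
    ELSE NULL
END AS mystuff
FROM mytable;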

This may or may not be practical depending on the format of your data.

According to the answers on this question, it is apparently possible to completely verify arbitrary JSON data with some regex implementations, but alas not the one used by PostgreSQL.

Prerecord answered 15/8, 2014 at 7:27 Comment(1)
I'd prefer not to separately validate the JSON with regex, partly because I don't think there would be a simple expression that would work, and partly because I think there should be a way to just ignore the rows after an error. Still I'm giving you an upvote for actually trying to understand the question and not just assuming I'm confused about what Redshift and JSON are.Wyler
H
2

Update: The UDF solution seems perfect. At the time I wrote this, that answer wasn't there. What follows is just a set of workarounds.

Although json_extract_path_text can't ignore errors, Redshift's COPY command does have a MAXERROR parameter.

So, you can use something like this instead:

COPY raw_json FROM 's3://data-source' 
CREDENTIALS 'aws_access_key_id;aws_secret_access_key'
JSON 's3://json_path.json'
MAXERROR 1000;

The next pitfall is in the json_path.json file: you can't use a bare $ to refer to the root element:

{
    "jsonpaths": [
        "$['_id']",
        "$['type']",
        "$" <--------------- this will fail.
    ]
}

So it would be convenient to have a "top-level" element containing the other fields, like this (then $['data'] is everything in your record):

{
    "data": {
        "id": 1
        ...
    }
}
{
    "data": {
        "id": 2
        ...
    }
}

If you can't change the source format, Redshift's UNLOAD will help:

UNLOAD ('select_statement')
TO 's3://object_path_prefix'

It's easy to write a select_statement that concatenates { "data" : + old string + }, for example:
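A minimal sketch of that trick, assuming the raw JSON strings live in a raw_json column of a source_table (both names are hypothetical):

UNLOAD ('SELECT ''{"data":'' || raw_json || ''}'' FROM source_table')
TO 's3://object_path_prefix'
CREDENTIALS 'aws_access_key_id;aws_secret_access_key';
-- each unloaded line is now wrapped in a top-level "data" object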

Then, Redshift rocks again!

Hardman answered 9/12, 2015 at 6:38 Comment(0)
Y
1

Redshift is missing a lot of Postgres functionality, such as error handling.

The way I handle this is:

  1. Use CREATE TABLE AS to create a 'fixup' table with the JSON field and whatever the key is on the main table you're trying to query. Make sure you set the DISTKEY and SORTKEY to your JSON field.

  2. Add two columns to the fixup table: valid_json (BOOLEAN) and extract_test (VARCHAR).

  3. Try to UPDATE extract_test with some text from the JSON field using JSON_EXTRACT_PATH_TEXT.

  4. Use the errors from that to spot common characters which are screwing up the JSON. If I'm importing from web log data, I might find ???? or something similar

  5. Use UPDATE table SET valid_json = false for JSON fields with that value

  6. Finally, clear the bad JSON fields in the original table using UPDATE original_table SET json_field = NULL FROM fixup_table f WHERE original_table.id = f.id AND f.valid_json = FALSE

It's still manual, but far quicker than fixing line by line on a big table, and by using the right DISTKEY/SORTKEY on your fixup table you can make the queries run quickly.
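A rough sketch of steps 1-3, 5 and 6, using hypothetical names (a main_table with an id key and a json_field column):

-- 1. Fixup table keyed and sorted on the JSON field
CREATE TABLE fixup_table
DISTKEY (json_field)
SORTKEY (json_field)
AS SELECT id, json_field FROM main_table;

-- 2. Working columns
ALTER TABLE fixup_table ADD COLUMN valid_json BOOLEAN;
ALTER TABLE fixup_table ADD COLUMN extract_test VARCHAR(256);

-- 3. Probe the JSON; the errors point at the offending rows/characters
UPDATE fixup_table SET extract_test = JSON_EXTRACT_PATH_TEXT(json_field, 'some_key');

-- 5. Flag rows containing a known-bad pattern (here, the ???? example)
UPDATE fixup_table SET valid_json = FALSE WHERE json_field LIKE '%????%';

-- 6. NULL out the bad JSON in the original table
UPDATE main_table SET json_field = NULL
FROM fixup_table f
WHERE main_table.id = f.id AND f.valid_json = FALSE;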

Yordan answered 17/4, 2015 at 0:46 Comment(0)
M
1

Redshift now supports passing a boolean argument that lets you treat invalid JSON as null:

select json_extract_path_text('invalid', 'path', true)

returns null

https://docs.aws.amazon.com/redshift/latest/dg/JSON_EXTRACT_PATH_TEXT.html

Mithridate answered 30/1, 2018 at 9:34 Comment(0)
A
0

You can use the following function:

CREATE OR REPLACE FUNCTION isValidJSONv2(i varchar(MAX)) RETURNS int stable AS $CODE$
import json
try:
    if i is None:
        return 0
    # parse purely to test validity; the parsed value isn't needed
    json.loads(i)
    return 1
except:
    return 0
$CODE$ language plpythonu;

The problem is that if you use the JSON parsing functions directly in the select, the error is still thrown. You would have to filter the valid and the invalid JSON rows into different tables. I have posted the issue here: https://forums.aws.amazon.com/thread.jspa?threadID=232468
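A minimal sketch of that split, assuming a my_table with a json_column (illustrative names):

-- rows whose JSON parses cleanly
CREATE TABLE my_table_valid AS
SELECT * FROM my_table WHERE isValidJSONv2(json_column) = 1;

-- rows with broken JSON, kept aside for inspection or repair
CREATE TABLE my_table_invalid AS
SELECT * FROM my_table WHERE isValidJSONv2(json_column) = 0;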

And answered 27/5, 2016 at 11:58 Comment(0)
V
0

@Puneet's answer looked promising, but it still got tripped up on "null" despite pre-processing the data in Python using json.dumps(). I ended up doing this:

,case 
    when context like '{"symptomId": null%' then null 
    when context like '{"symptomId%' then cast(json_extract_path_text(context, 'symptomId', true) as int) 
    else null
end as symptom_id

My concern is that some other pattern will emerge that also breaks. Perhaps the boolean "true" flag will catch such a scenario.

Varden answered 20/9, 2021 at 20:32 Comment(0)
R
0

Amazon Redshift now supports a final null_if_invalid argument to json_extract_path_text, which provides the desired behavior.

For example, to extract the my_field key from a json_column column while ignoring parse errors, use the following:

select json_extract_path_text(json_column, 'my_field', true) from my_table;

Redshift Reference

Reticle answered 26/11, 2021 at 8:30 Comment(0)
