Return elements of Redshift JSON array on separate rows

I have a Redshift table that looks like this:

 id | metadata
---------------------------------------------------------------------------
 1  | [{"pet":"dog"},{"country":"uk"}]
 2  | [{"pet":"cat"}]
 3  | []
 4  | [{"country":"germany"},{"education":"masters"},{"country":"belgium"}]
  • All array elements have just one field.
  • There is no guarantee that a particular field will feature in any of an array's elements.
  • A field name can be repeated in an array.
  • The array elements can be in any order.

I want to get back a table that looks like this:

 id |   field   |  value
------------------------
 1  | pet       | dog
 1  | country   | uk
 2  | pet       | cat
 4  | country   | germany
 4  | education | masters
 4  | country   | belgium

I can then combine this with my queries on the rest of the input table.

I have tried playing around with the Redshift JSON functions, but without being able to write functions, use loops, or declare variables in Redshift, I really can't see a way to do this!

Please let me know if I can clarify anything else.

Grits answered 12/5, 2015 at 15:48 Comment(0)

Thanks to an inspiring blog post, I've been able to craft a solution. It works as follows:

  1. Create a look-up table to effectively 'iterate' over the elements of each array. The number of rows in this table has to be equal to or greater than the maximum array length. Let's say this is 4 (it can be calculated using SELECT MAX(JSON_ARRAY_LENGTH(metadata)) FROM input_table):

    CREATE VIEW seq_0_to_3 AS (
        SELECT 0 AS i UNION ALL
        SELECT 1 UNION ALL
        SELECT 2 UNION ALL
        SELECT 3
    );
    
  2. From this, we can create one row per JSON element:

    WITH exploded_array AS (
        -- cross join each row against the sequence, keeping one row per array element
        SELECT id, JSON_EXTRACT_ARRAY_ELEMENT_TEXT(metadata, seq.i) AS json
        FROM input_table, seq_0_to_3 AS seq
        WHERE seq.i < JSON_ARRAY_LENGTH(metadata)
    )
    SELECT *
    FROM exploded_array;
    

    Producing:

     id | json
    ------------------------------
     1  | {"pet":"dog"}
     1  | {"country":"uk"}
     2  | {"pet":"cat"}
     4  | {"country":"germany"}
     4  | {"education":"masters"}
     4  | {"country":"belgium"}
    
  3. However, I needed to extract the field names/values. As I can't see any way to extract JSON field names using Redshift's limited functions, I'll do this with a regular expression:

    WITH exploded_array AS (
        SELECT id, JSON_EXTRACT_ARRAY_ELEMENT_TEXT(metadata, seq.i) AS json
        FROM input_table, seq_0_to_3 AS seq
        WHERE seq.i < JSON_ARRAY_LENGTH(metadata)
    )
    SELECT id, field, JSON_EXTRACT_PATH_TEXT(json, field) AS value
    FROM (
        -- pull each element's single field name out with a regex
        SELECT id, json, REGEXP_SUBSTR(json, '[^{"]\\w+[^"]') AS field
        FROM exploded_array
    ) fields;
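
    Producing the desired output:

     id |   field   |  value
    ------------------------
     1  | pet       | dog
     1  | country   | uk
     2  | pet       | cat
     4  | country   | germany
     4  | education | masters
     4  | country   | belgium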
    
Grits answered 12/5, 2015 at 18:26 Comment(0)

It's now possible in Redshift to treat strings in either array format [] or object format {} as parsable JSON structures. First, let's make a temp table based on your data:

create temporary table #t1 (id int, json_str varchar(100));
truncate table #t1;
insert into #t1 values (1, '[{"pet":"dog"},{"country":"uk"}]');
insert into #t1 values (2, '[{"pet":"cat"}]');
insert into #t1 values (3, '[]');
insert into #t1 values (4, '[{"country":"germany"},{"education":"masters"},{"country":"belgium"}]');

This common table expression (CTE) implicitly converts the json_str field into a formal JSON structure of SUPER type. If the table's field were already SUPER type, we could skip this step.

drop table if exists #t2;
create temporary table #t2 as
with cte as 
    (select 
        x.id,
        json_parse(x.json_str) as json_str -- convert string to SUPER structure 
    from
        #t1 x
    )
select
    x.id
    ,unnested
from
    cte x, x.json_str as unnested -- aliases on both the cte and x.json_str are required!
order by 
    id
;
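
At this point #t2 holds one row per array element (id 3's empty array contributes no rows; the exact rendering of the SUPER values may vary):

 id | unnested
------------------------------
 1  | {"pet":"dog"}
 1  | {"country":"uk"}
 2  | {"pet":"cat"}
 4  | {"country":"germany"}
 4  | {"education":"masters"}
 4  | {"country":"belgium"}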

Now we have an exploded list of key/value pairs to easily extract:

select
    t2.id
    ,json_key -- this is the extracted key
    ,cast(json_val as varchar) as json_val -- eliminates the double quote marks
from
    #t2 t2, unpivot t2.unnested as json_val at json_key -- "at some_label" (e.g. json_key) extracts the key
order by
    id
;

A different way to render the info is to let the parsing engine turn keys into columns. This isn't what you asked for, but it's potentially interesting:

select 
    id
    ,cast(t2.unnested.country as varchar) -- data is already parsed into rows, so it's directly addressable now
    ,cast(t2.unnested.education as varchar)
    ,cast(t2.unnested.pet as varchar)
from
    #t2 t2
;
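
Since each unnested row holds a single-field object, the other columns come back null, giving a sparse result:

 id | country | education | pet
--------------------------------
 1  |         |           | dog
 1  | uk      |           |
 2  |         |           | cat
 4  | germany |           |
 4  |         | masters   |
 4  |         |           | belgium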

If you want more info on this, search for parsing the SUPER data type. If the data already existed as SUPER in the Redshift table, these latter 2 queries would work natively against the table, with no need for a temp table.
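
For example, here's a minimal, untested sketch assuming the source table is input_table with metadata already stored as a SUPER column (the unnest and unpivot steps are simply chained in one FROM clause):

select
    t.id
    ,json_key
    ,cast(json_val as varchar) as json_val
from
    input_table t
    ,t.metadata as unnested                   -- unnest the SUPER array
    ,unpivot unnested as json_val at json_key -- extract key/value from each element
order by
    t.id
;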

Rimini answered 7/3, 2022 at 18:30 Comment(1)
This is very interesting to me. Thanks for adding after such a long time! You say the last query should work without the first, but you use t2.unnested in the last query. But unnested doesn't exist unless we run the first query. Could you add an update to show how to make the queries work directly with your source table? – Bran

There is a generic version of CREATE VIEW seq_0_to_3; let's call it seq_0_to_n. It can be generated by:

CREATE VIEW seq_0_to_n AS (
    SELECT row_number() over (ORDER BY TRUE)::integer - 1 AS i
    FROM <insert_large_enough_table>
    LIMIT <number_less_than_table_entries>
);

This helps in generating large sequences as a view.
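
As a concrete (hypothetical) instantiation, any sufficiently large table works; for example, the STL_SCAN system table is usually big enough on an active cluster:

CREATE VIEW seq_0_to_999 AS (
    SELECT row_number() over (ORDER BY TRUE)::integer - 1 AS i
    FROM stl_scan
    LIMIT 1000
);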

Leek answered 2/12, 2016 at 11:33 Comment(0)
