Redshift: convert comma-delimited values into rows

I am wondering how to convert comma-delimited values into rows in Redshift. I am afraid that my own solution isn't optimal. Please advise. I have a table where one of the columns contains comma-separated values. For example:

I have:

user_id|user_name|user_action
-----------------------------
1      | Shone   | start,stop,cancell...

I would like to see

user_id|user_name|parsed_action 
------------------------------- 
1      | Shone   | start        
1      | Shone   | stop         
1      | Shone   | cancell      
....
Kerin answered 4/8, 2014 at 5:43 Comment(1)
Please see a viable answer to this question here: https://mcmap.net/q/393277/-redshift-split-single-dynamic-column-into-multiple-rows-in-new-table/46785509#46785509 – Betteanne

A slight improvement over the existing answer is to use a second "numbers" table that enumerates all of the possible list lengths and then use a cross join to make the query more compact.

Redshift does not have a straightforward method for creating a numbers table that I am aware of, but we can use a bit of a hack from https://www.periscope.io/blog/generate-series-in-redshift-and-mysql.html to create one using row numbers.

Specifically, if we assume the number of rows in cmd_logs is larger than the maximum number of commas in the user_action column, we can create a numbers table by counting rows. To start, let's assume there are at most 99 commas in the user_action column:

select 
  (row_number() over (order by true))::int as n
into numbers
from cmd_logs
limit 100;

If we want to get fancy, we can compute the number of commas from the cmd_logs table to create a more precise set of rows in numbers:

select
  n::int
into numbers
from
  (select 
      row_number() over (order by true) as n
   from cmd_logs)
cross join
  (select 
      max(regexp_count(user_action, '[,]')) as max_num 
   from cmd_logs)
where
  n <= max_num + 1;

Once there is a numbers table, we can do:

select
  user_id, 
  user_name, 
  split_part(user_action,',',n) as parsed_action 
from
  cmd_logs
cross join
  numbers
where
  split_part(user_action,',',n) is not null
  and split_part(user_action,',',n) != '';
Barring answered 13/8, 2015 at 21:35 Comment(2)
What if the assumption doesn't hold, i.e., the maximum number of commas in the user_action equivalent is greater than the number of rows in the cmd_logs equivalent? This is true in my case and I end up losing values -- for reference, the maximum number of commas I have is ~5,000. – Brummell
@Brummell You can always just SELECT from a larger table that will have the amount of rows you need -- whatever table you use is pretty much a dummy table anyway. If your Redshift has the generate_series() function (see this question), you can just SELECT generate_series AS n FROM generate_series(1, 10000). Alternatively, see this answer. – Penstemon

Another idea is to transform your CSV string into JSON first, followed by JSON extract, along the following lines:

... '["' || replace( user_action, '.', '", "' ) || '"]' AS replaced

... JSON_EXTRACT_ARRAY_ELEMENT_TEXT(replaced, numbers.i) AS parsed_action

Where "numbers" is the table from the first answer. The advantage of this approach is the ability to use built-in JSON functionality.

Upcast answered 11/1, 2017 at 2:36 Comment(0)

If you know that there are not many actions in your user_action column, you can use a recursive sub-query with union all and thereby avoid the auxiliary numbers table.

But it requires you to know the number of actions for each user, so either adjust the initial table or create a view or a temporary table for it.

Data preparation

Assuming you have something like this as a table:

create temporary table actions 
(
    user_id varchar,
    user_name varchar,
    user_action varchar
);

I'll insert some values in it:

insert into actions
values (1, 'Shone', 'start,stop,cancel'),
       (2, 'Gregory', 'find,diagnose,taunt'),
       (3, 'Robot', 'kill,destroy');

Here's an additional table with the temporary counts:

create temporary table actions_with_counts 
(
    id varchar,
    name varchar,
    num_actions integer,
    actions varchar
);

insert into actions_with_counts (
    select user_id,
           user_name,
           regexp_count(user_action, ',') + 1 as num_actions,
           user_action
    from actions
);

This will be our "input table", and it looks just as you'd expect:

select * from actions_with_counts;
id | name    | num_actions | actions
---+---------+-------------+--------------------
2  | Gregory | 3           | find,diagnose,taunt
3  | Robot   | 2           | kill,destroy
1  | Shone   | 3           | start,stop,cancel

Again, you can adjust the initial table and skip adding the counts as a separate table; a sketch of that variant appears after the output below.

Sub-query to flatten the actions

Here's the unnesting query:

with recursive tmp (user_id, user_name, idx, user_action) as 
(
    select id,
           name,
           1 as idx,
           split_part(actions, ',', 1) as user_action
    from actions_with_counts

    union all

    select user_id,
           user_name,
           idx + 1 as idx,
           split_part(actions, ',', idx + 1)
    from actions_with_counts
         join tmp on actions_with_counts.id = tmp.user_id
    where idx < num_actions
)
select user_id, user_name, user_action as parsed_action
from tmp
order by user_id;

This will create a new row for each action, and the output would look like this:

user_id | user_name | parsed_action
--------+-----------+--------------
1       | Shone     | start
1       | Shone     | stop
1       | Shone     | cancel
2       | Gregory   | find
2       | Gregory   | diagnose
2       | Gregory   | taunt
3       | Robot     | kill
3       | Robot     | destroy
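
As mentioned above, the separate counts table can be skipped entirely. Here is a rough, untested sketch of that variant (mine, not the original answer's), computing the count inside the CTE itself and reading directly from the original actions table:

with recursive tmp (user_id, user_name, num_actions, idx, user_action) as
(
    -- anchor member: first action of each row, with the total count carried along
    select user_id,
           user_name,
           regexp_count(user_action, ',') + 1 as num_actions,
           1 as idx,
           split_part(user_action, ',', 1) as user_action
    from actions

    union all

    -- recursive member: step to the next index until num_actions is reached
    select a.user_id,
           a.user_name,
           tmp.num_actions,
           tmp.idx + 1,
           split_part(a.user_action, ',', tmp.idx + 1)
    from actions a
         join tmp on a.user_id = tmp.user_id
    where tmp.idx < tmp.num_actions
)
select user_id, user_name, user_action as parsed_action
from tmp
order by user_id;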
Baillargeon answered 19/4, 2021 at 14:1 Comment(0)

Here are two ways to achieve this.

In my example, I'm assuming that I am accepting a comma-separated list of values. My values look like schema.table.column.

The first involves using a recursive CTE.

drop table if exists #dep_tbl;

create table #dep_tbl as
select 'schema.foobar.insert_ts,schema.baz.load_ts' as dep
;

with recursive tmp (level, dep_split, to_split) as
                   (
                       select 1                          as level
                            , split_part(dep, ',', 1) as dep_split
                            , regexp_count(dep, ',')  as to_split
                       from #dep_tbl

                       union all

                       select tmp.level + 1                            as level
                            , split_part(a.dep, ',', tmp.level + 1) as dep_split_u
                            , tmp.to_split
                       from #dep_tbl a
                                inner join tmp on tmp.dep_split is not null
                           and tmp.level <= tmp.to_split
                   )
select dep_split from tmp;

the above yields:

|dep_split|
|schema.foobar.insert_ts|
|schema.baz.load_ts|

The second involves a stored procedure.

CREATE OR REPLACE PROCEDURE so_test(dependencies_csv varchar(max))
 LANGUAGE plpgsql
AS $$
DECLARE
    dependencies_csv_vals varchar(max);
BEGIN

    drop table if exists #dep_holder;

    create table #dep_holder
    (
        avoid varchar(60000)
    );

    IF dependencies_csv is not null THEN
        dependencies_csv_vals:='('||replace(quote_literal(regexp_replace(dependencies_csv,'\\s','')),',', '\'),(\'') ||')';
        execute 'insert into #dep_holder values '||dependencies_csv_vals||';';
    END IF;

END;
$$
;

call so_test('schema.foobar.insert_ts,schema.baz.load_ts');

select * from #dep_holder;

the above yields:

|avoid|
|schema.foobar.insert_ts|
|schema.baz.load_ts|

In conclusion

If you only care about one single column in your input (the X-delimited values), then I think the stored procedure is easier/faster.

However, if you have other columns you care about and want to keep those columns along with your comma-separated value column now transformed to rows, OR if you want to know the argument (the original list of delimited values), I think the recursive CTE is the way to go. In that case, you can just add those other columns to the columns selected in the recursive query.

Electrolysis answered 13/6, 2022 at 18:58 Comment(0)

You can get the expected result with the following query. I'm using "UNION ALL" to convert the column into rows.

select user_id, user_name, split_part(user_action,',',1) as parsed_action from cmd_logs
union all
select user_id, user_name, split_part(user_action,',',2) as parsed_action from cmd_logs
union all
select user_id, user_name, split_part(user_action,',',3) as parsed_action from cmd_logs
Caddell answered 7/8, 2014 at 8:28 Comment(2)
Why was that downvoted? This is the cleanest working solution. You just have to get rid of empty values afterwards (if there is no value at the requested position it will return an empty string). – Principled
This only handles 3 comma-separated values. – Sunfast

Here's my equally-terrible answer.

I have a users table, and then an events table with a column that is just a comma-delimited string of users at said event, e.g.:

event_id | user_ids
1        | 5,18,25,99,105

In this case, I used the LIKE and wildcard functions to build a new table that represents each event-user edge.

SELECT e.event_id, u.id as user_id
FROM events e
LEFT JOIN users u ON e.user_ids like '%' || u.id || '%'

It's not pretty, but I throw it in a WITH clause so that I don't have to run it more than once per query. I'll likely just build an ETL to create that table every night anyway.
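
One caveat (my note, not part of the original answer): a bare wildcard match lets id 5 also match rows containing 25 or 105. Wrapping both sides in delimiters avoids that, for example:

SELECT e.event_id, u.id AS user_id
FROM events e
-- Leading/trailing commas ensure each id matches only whole tokens.
LEFT JOIN users u ON ',' || e.user_ids || ',' LIKE '%,' || u.id::varchar || ',%'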

Also, this only works if you have a second table that does have one row per unique possibility. If not, you could do LISTAGG to get a single cell with all your values, export that to a CSV and reupload that as a table to help.
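
For reference, a minimal LISTAGG sketch of that workaround (my example, assuming a users table with an id column):

-- Collapse all user ids into a single comma-separated cell.
SELECT LISTAGG(id::varchar, ',') WITHIN GROUP (ORDER BY id) AS all_user_ids
FROM users;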

Like I said: a terrible, no-good solution.

Caliber answered 7/9, 2016 at 16:43 Comment(0)

Late to the party, but I got something working (albeit very slow).

with nums as (select n::int n
from
  (select 
      row_number() over (order by true) as n
   from table_with_enough_rows_to_cover_range)
cross join
  (select 
      max(json_array_length(json_column)) as max_num 
   from table_with_json_column )
where
  n <= max_num + 1)
select *, json_extract_array_element_text(json_column,nums.n-1) parsed_json
from  nums, table_with_json_column
where json_extract_array_element_text(json_column,nums.n-1) != ''
and nums.n <= json_array_length(json_column) 

Thanks to Bob Baxley's answer for the inspiration.

Overijssel answered 11/1, 2017 at 18:36 Comment(0)

Just an improvement on the answer above (https://mcmap.net/q/389089/-redshift-convert-comma-delimited-values-into-rows):

Generate the numbers table using the following SQL, adapted from https://discourse.looker.com/t/generating-a-numbers-table-in-mysql-and-redshift/482:

SELECT 
  p0.n 
  + p1.n*2 
  + p2.n * POWER(2,2) 
  + p3.n * POWER(2,3)
  + p4.n * POWER(2,4)
  + p5.n * POWER(2,5)
  + p6.n * POWER(2,6)
  + p7.n * POWER(2,7) 
  as number  
INTO numbers
FROM  
  (SELECT 0 as n UNION SELECT 1) p0,  
  (SELECT 0 as n UNION SELECT 1) p1,  
  (SELECT 0 as n UNION SELECT 1) p2, 
  (SELECT 0 as n UNION SELECT 1) p3,
  (SELECT 0 as n UNION SELECT 1) p4,
  (SELECT 0 as n UNION SELECT 1) p5,
  (SELECT 0 as n UNION SELECT 1) p6,
  (SELECT 0 as n UNION SELECT 1) p7
ORDER BY 1
LIMIT 100

"ORDER BY" is there only in case you want paste it without the INTO clause and see the results

Bigford answered 16/6, 2018 at 11:4 Comment(0)

Create a stored procedure that will parse the string dynamically and populate a temp table, then select from the temp table.

Here is the magic code:

  CREATE OR REPLACE PROCEDURE public.sp_string_split( "string" character varying )
AS $$
DECLARE 
  cnt INTEGER := 1;
    no_of_parts INTEGER := (select REGEXP_COUNT ( string , ','  ));
    sql VARCHAR(MAX) := '';
    item character varying := '';
BEGIN

  -- Create table
  sql := 'CREATE TEMPORARY TABLE IF NOT EXISTS split_table (part VARCHAR(255)) ';
  RAISE NOTICE 'executing sql %', sql ;
  EXECUTE sql;

  <<simple_loop_exit_continue>>
  LOOP
    item = (select split_part("string",',',cnt)); 
    RAISE NOTICE 'item %', item ;
    sql := 'INSERT INTO split_table SELECT '''||item||''' ';
    EXECUTE sql;
    cnt = cnt + 1;
    EXIT simple_loop_exit_continue WHEN (cnt >= no_of_parts + 2);
  END LOOP;

END ;
$$ LANGUAGE plpgsql;


Usage example:

call public.sp_string_split('john,smith,jones');

select * from split_table;

Ricciardi answered 15/4, 2020 at 14:18 Comment(0)

You can try the COPY command to load your file into Redshift tables:

copy table_name from 's3://mybucket/myfolder/my.csv' CREDENTIALS 'aws_access_key_id=my_aws_acc_key;aws_secret_access_key=my_aws_sec_key' delimiter ','

You can use the delimiter ',' option.

For more details on COPY command options, you can visit this page:

http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html

Weaks answered 4/8, 2014 at 6:57 Comment(1)
You can do something like this: #17309169 – Weaks
