Redshift split single dynamic column into multiple rows in new table

With a table like:

       uid               |            segmentids
-------------------------+----------------------------------------
 f9b6d54b-c646-4bbb-b0ec | 4454918|4455158|4455638|4455878|4455998
 asd7a0s9-c646-asd7-b0ec | 1265899|1265923|1265935|1266826|1266596
 gd3355ff-cjr8-assa-fke0 | 2237557|2237581|2237593
 laksnfo3-kgi5-fke0-b0ec | 4454918|4455158|4455638|4455878

How to create a new table with:

      uid               |         segmentids
-------------------------+---------------------------
 f9b6d54b-c646-4bbb-b0ec |           4454918
 f9b6d54b-c646-4bbb-b0ec |           4455158
 f9b6d54b-c646-4bbb-b0ec |           4455638
 f9b6d54b-c646-4bbb-b0ec |           4455878
 f9b6d54b-c646-4bbb-b0ec |           4455998
 asd7a0s9-c646-asd7-b0ec |           1265899
 asd7a0s9-c646-asd7-b0ec |           1265923
 asd7a0s9-c646-asd7-b0ec |           1265935 
 asd7a0s9-c646-asd7-b0ec |           1266826
 asd7a0s9-c646-asd7-b0ec |           1266596

The number of segments is dynamic and can vary with each record. I tried the split function with a delimiter, but it requires a fixed index into the string, and the number of segments (and therefore the index) varies here.
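
For example (assuming the split function tried is split_part), it only returns the element at a single, hard-coded position, so it cannot expand a variable number of segments on its own:

-- each call returns exactly one element at a fixed position
SELECT split_part('4454918|4455158|4455638|4455878|4455998', '|', 2);  -- '4455158'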

Any suggestions?

Threadbare answered 17/10, 2017 at 7:29 Comment(3)
You tag several dbms products, how come?Libbylibeccio
what is dynamic here? you are using index =1 in your example - show some different sample to explain pleaseAgitato
i understand the question - please change your tags to be just the one you are using as the answer is completely different!Dambro

Here is the Redshift answer; it will work with up to 10,000 segment id values per row.

test data

create table test_split (uid varchar(50),segmentids varchar(max));
insert into test_split
    values
      ('f9b6d54b-c646-4bbb-b0ec','4454918|4455158|4455638|4455878|4455998'),
      ('asd7a0s9-c646-asd7-b0ec','1265899|1265923|1265935|1266826|1266596'),
      ('asd7345s9-c646-asd7-b0ec','1235935|1263456|1265675696'),
      ('as345a0s9-c646-asd7-b0ec','12765899|12658883|12777935|144466826|1266226|12345')
;

code

with ten_numbers as (select 1 as num union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9 union select 0)
  , generated_numbers AS
(
    SELECT (1000 * t1.num) + (100 * t2.num) + (10 * t3.num) + t4.num AS gen_num
    FROM ten_numbers AS t1
      JOIN ten_numbers AS t2 ON 1 = 1
      JOIN ten_numbers AS t3 ON 1 = 1
      JOIN ten_numbers AS t4 ON 1 = 1
)
  , splitter AS
(
    SELECT *
    FROM generated_numbers
    WHERE gen_num BETWEEN 1 AND (SELECT max(REGEXP_COUNT(segmentids, '\\|') + 1)
                                 FROM test_split)
)
  --select * from splitter;
  , expanded_input AS
(
    SELECT
      uid,
      split_part(segmentids, '|', s.gen_num) AS segment
    FROM test_split AS ts
      JOIN splitter AS s ON 1 = 1
    WHERE split_part(segmentids, '|', s.gen_num) <> ''
)
SELECT * FROM expanded_input;

The first two CTE steps (ten_numbers and generated_numbers) are used to generate 10,000 numbered rows; this is needed because generate_series is not supported in Redshift.
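
For comparison only, here is the single generate_series call that would replace those two CTEs on plain PostgreSQL; in Redshift it is a leader-node-only function, so it cannot be joined against user tables like test_split, hence the workaround:

-- PostgreSQL only: generates the numbers 1..10000 in one call.
-- Not usable above because Redshift restricts generate_series to the leader node.
SELECT gen_num FROM generate_series(1, 10000) AS gen_num;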

The next step (splitter) keeps only as many of those rows as the maximum number of delimiters + 1, which is the maximum number of segments in any row.
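
To see where that bound comes from, the same REGEXP_COUNT expression can be run on its own against the test data; the largest value it returns (6, for the last sample row) is the number of splitter rows kept:

-- Per-row segment counts for the sample data; the maximum here is 6.
SELECT uid, REGEXP_COUNT(segmentids, '\\|') + 1 AS segment_count
FROM test_split;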

Finally, we cross join splitter with the input data, pick out the value at each position using split_part, and then exclude blank parts (which occur where a row has fewer than the maximum number of segments).
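
As a small illustration of the blank-part behaviour, split_part returns an empty string when asked for a position past the last delimiter, which is exactly what the final WHERE clause filters out:

SELECT split_part('2237557|2237581|2237593', '|', 3);  -- '2237593'
SELECT split_part('2237557|2237581|2237593', '|', 4);  -- '' (no 4th segment, so that row is dropped)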

Dambro answered 17/10, 2017 at 8:14 Comment(5)
Working like a charm. Still trying to figure out how. I get the indexing part, but how are you matching it with uids as in the original table?Threadbare
And why are we using values like 1000, 100 and 10 in ten_numbers?Threadbare
it is generating 10000 rows of output, with gen_num incrementing for each row. Normally you could use the generate_series function for this, but that is not possible in Redshift, so this is a clever workaround. If you don't need 10000 then you could remove that part; there is virtually no overhead in doing this though.Dambro
The matching is done as a cross join (FROM test_split AS ts JOIN splitter AS s ON 1 = 1): every test_split row gets merged with every splitter row, and we only have splitter rows from 1 up to the max number of segments anywhere in the table.Dambro
It's beautiful; however, in my time I've come to learn the value of immediately readable code. The proper solution is to include a stage in the pipeline that makes functional transformations in Spark or similar. Then this difficult logic becomes df.withColumn("segment_id", functions.explode("seg_array")), after converting the delimited column to an array via select(split(col("segmentids"), "\\|").alias("seg_array"))Felty

You can iterate over the SUPER array returned by split_to_array -- see the "Unnesting and flattening" section of this post. Using the same test_split table as the previous answer:

WITH seg_array AS
  (SELECT uid,
          split_to_array(segmentids, '|') segs
   FROM test_split)
SELECT uid,
       segmentid::int
FROM seg_array a,
     a.segs AS segmentid;
Fiora answered 22/4, 2022 at 6:16 Comment(0)

Redshift now has the SUPER data type and the split_to_array function, which is similar to PostgreSQL's string_to_array.

Redshift now also supports unnesting arrays through a syntax similar to a LATERAL join in PostgreSQL.

Using these techniques, we may write the same transformation in 2022 as

WITH split_up AS (
    SELECT
      uid
    , split_to_array(segmentids, '|') AS segment_array
    FROM test_split
)
SELECT
  su.uid
, CAST(sid AS VARCHAR) AS segmentid
FROM split_up su
   , su.segment_array AS sid;
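
A note on the last FROM clause: the comma between split_up su and su.segment_array sid is Redshift's PartiQL navigation syntax for unnesting a SUPER array; it behaves like the LATERAL join mentioned above, producing one output row per array element. The '|' delimiter also has to be passed explicitly, since split_to_array otherwise splits on commas.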
Simone answered 22/3, 2022 at 2:21 Comment(0)
