Connected Components

I have a set of data that has been created by matching together similar sub-items, and then GROUPing these similar items by "category".

Now, the resultant categories must be grouped so that related categories fall together within each "group_id". In the example below, one match is A->B->C->D->E->F->G, obtained by recursing through the rows.

I've posted my current answer, which works on this simple data set; but because the actual data contains up to 1M rows, with up to 60 categories per "group_id", the query fails with an "out of spool space" error on real data.

Please note:

  • Due to company restrictions, I cannot use stored procedures.
  • I can't use user-defined functions (UDFs).
  • I can't use user-defined types (UDTs).

A correct answer will

  • Be written for Teradata or compatible
  • Be more efficient than my answer
  • Respect the restrictions I mention above

Sample Input: (image in original post)

Desired Output: (image in original post)
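For readers landing here: the task above is the classic connected-components problem on a graph whose edges are the (category_1, category_2) pairs. As a point of reference (not Teradata code), a hypothetical union-find sketch in Python, using pairs mirroring the sample data, produces the desired grouping:

```python
# Union-find sketch of the connected-components grouping described above.
# The pairs are illustrative, mirroring the A..G chain and W..Z cluster.
def find(parent, x):
    # Path-halving find: walk to the root, flattening the tree as we go.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def components(pairs):
    parent = {}
    for a, b in pairs:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[rb] = ra  # union the two components
    groups = {}
    for node in parent:
        groups.setdefault(find(parent, node), set()).add(node)
    return sorted(sorted(g) for g in groups.values())

pairs = [('A','B'), ('B','C'), ('C','D'), ('D','E'), ('E','F'), ('F','G'),
         ('W','X'), ('W','Y'), ('W','Z')]
print(components(pairs))
```

The challenge in the question is doing exactly this without procedural code, purely in set-based SQL.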

Boisleduc answered 23/10, 2016 at 15:15 Comment(1)

You need a recursive approach, but your WITH RECURSIVE creates a humongous intermediate result, which exhausts your spool space.

For a similar process I used the following approach (originally using a WHILE loop in a stored procedure):

CREATE MULTISET VOLATILE TABLE vt_tmp, NO LOG AS
 (
  SELECT group_id, category_1, category_2,
     -- assign a unique number to each (group_id, category_1) pair
     Dense_Rank() Over (ORDER BY group_id, category_1) AS rnk
  FROM match_detail
  -- remove when your source data is already unique
  GROUP BY 1,2,3 -- same result as a DISTINCT, but processed before DENSE_RANK
 )
WITH DATA
PRIMARY INDEX (category_2)
ON COMMIT PRESERVE ROWS;

Now repeat the following update until 0 rows processed:

-- find matching categories and assign them a common number    
UPDATE vt_tmp FROM
 ( SELECT e2.group_id, e2.category_1, Min(e1.rnk) AS minrnk
   FROM vt_tmp e1 JOIN vt_tmp e2
   ON e1.category_2 = e2.category_2
   AND e1.rnk < e2.rnk
   GROUP BY e2.group_id, e2.category_1
 ) x
SET rnk = x.minrnk
WHERE 
  vt_tmp.group_id = x.group_id
AND vt_tmp.category_1 = x.category_1
;
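The loop's logic can be modelled outside SQL. The following Python sketch (a hypothetical `converge` function, single group_id, illustrative pairs including the self-referencing rows from the sample data) mimics the repeated UPDATE: every pass pulls each category_1's rank down to the minimum rank seen for any shared category_2, until a pass changes nothing:

```python
# Python model of the repeated UPDATE: each category_1 starts with
# rnk = dense_rank(category_1); on every pass, rows sharing a category_2
# adopt the minimum rnk recorded for it. The loop stops at a fixpoint,
# i.e. when a pass updates zero rows.
def converge(pairs):
    cats = sorted({c1 for c1, _ in pairs})
    rnk = {c1: i + 1 for i, c1 in enumerate(cats)}  # dense_rank over category_1
    changed = True
    while changed:
        changed = False
        # minimum rnk per shared category_2, as of the start of this pass
        best = {}
        for c1, c2 in pairs:
            best[c2] = min(best.get(c2, rnk[c1]), rnk[c1])
        for c1, c2 in pairs:
            if best[c2] < rnk[c1]:
                rnk[c1] = best[c2]
                changed = True
    return rnk

pairs = [('A','A'), ('A','B'), ('B','A'), ('B','B'), ('B','C'), ('C','B'),
         ('C','C'), ('W','W'), ('W','X'), ('X','W'), ('X','X')]
print(converge(pairs))
```

Each pass here corresponds to one execution of the UPDATE; the fixpoint check corresponds to "0 rows processed". Ranks only ever decrease, so the loop is guaranteed to terminate.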

Finally, to get the related categories:

SELECT group_id, category_1 AS category, rnk AS related_categories
FROM vt_tmp
UNION
SELECT group_id, category_2, rnk 
FROM vt_tmp

For an exact match of your expected result you need to add a DENSE_RANK:

SELECT group_id, category, Dense_Rank() Over (PARTITION BY group_id ORDER BY related_categories)
FROM
 (
   SELECT group_id, category_1 AS category, rnk AS related_categories
   FROM vt_tmp
   UNION
   SELECT group_id, category_2, rnk 
   FROM vt_tmp
 ) AS dt
Allman answered 23/10, 2016 at 17:5 Comment(8)
Thank you. +1 for your answer. I will try this when I get in front of a computer. One thing - is there any way to iterate the update statement automatically (in SQLA) e.g. without repeating the statement 60 times? I had thought to use an UPDATE using a cross-join, but unfortunately TD (14.00.03 I believe) required that UPDATEs be 1:1Boisleduc
@transistor1: You probably don't need 60 recursions, but SQLA doesn't support loops. You might put the update in a macro and then call it instead, e.g. exec macro; .IF ACTIVITYCOUNT = 0 THEN GOTO .finish; ... repeat IF/EXEC n times; LABEL .finish; final SELECT.Allman
This may work.. for whatever reason, I believe I can create macros in my workdb. I wasn't aware they supported iteration.Boisleduc
@transistor1: Macros don't support iteration, they just simplify reusing the same source code multiple times. Btw, you can always create macros within your own user; DBAs don't restrict that.Allman
That .IF syntax is BTEQ syntax, right? Does that work in SQLA? I ask because the person who will be running this query (weekly) will be using SQLA.Boisleduc
@transistor1: Yep, it's BTEQ syntax, but also supported by SQLA. Maybe you can convince your DBA, that this is perfect task for a Stored Procedure.Allman
Thanks! At the risk of getting into an extended conversation, let's just say I've already been down that road and there is too much red tape & corporate politics surrounding that option at this time...Boisleduc
"For to the one who has, more will be given, and he will have an abundance, but from the one who has not, even what he has will be taken away." Matt 13:12. Thanks for the help!Boisleduc

The concept of this solution is to avoid cycles while traversing the edges.
This is done by computing each path's bitmap on the fly and skipping any edge whose category_2 is already set in the bitmap.
Every path starts with a self-referencing edge (e.g. 'A'--'A') so that the first node's bit is set.
This lowers the number of result records of the iterative sub-query (edges) to 70.


The limitation of this solution is that the bitmap size must be defined in advance.
Every hex digit in the VARBYTE literal stands for 4 bits:
64 bits = '0000000000000000'xb
128 bits = '00000000000000000000000000000000'xb
etc.
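To make the bitmap idea concrete, here is a hypothetical Python sketch (not part of the Teradata query) of the same cycle-avoiding traversal; the bitwise AND test and OR correspond to the query's getbit and setbit calls:

```python
# Sketch of the bitmap trick: while extending a path edge by edge, keep a
# "visited" bitmap (one bit per category, numbered 0..n-1 like the
# dense_rank() - 1 in the query) and refuse any edge whose target bit is
# already set -- that is what prevents cycles. Python ints are arbitrary
# precision, so unlike the VARBYTE literal no fixed size must be chosen.
def paths_without_cycles(edges, start):
    # edges: adjacency dict, node -> list of neighbours
    nodes = sorted(set(edges) | {v for vs in edges.values() for v in vs})
    number = {c: i for i, c in enumerate(nodes)}  # bit position per category
    results = []

    def walk(node, bitmap, path):
        results.append(path)
        for nxt in edges.get(node, []):
            bit = 1 << number[nxt]
            if bitmap & bit == 0:                      # getbit(...) = 0
                walk(nxt, bitmap | bit, path + [nxt])  # setbit(...)

    walk(start, 1 << number[start], [start])  # self-edge seeds the bitmap
    return results

edges = {'A': ['B'], 'B': ['A', 'C'], 'C': ['B']}
print(paths_without_cycles(edges, 'A'))
```

Every emitted path is cycle-free, which is exactly how the recursive edges sub-query keeps its row count small.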


with        category_group_bitmap (group_id,category_1,bitflag_group)
            as 
            (
                select      group_id
                           ,category_1

                           ,dense_rank () over 
                            (
                                partition by    group_id 
                                order by        sum (distinct category_2_n)
                            ) as bitflag_group

                from        edges

                group by    group_id
                           ,category_1
            )  

           ,recursive edges (n,group_id,category_1,category_2,categories_num,bitmap,category_2_n)
            as
            (
                select      1                                           as n
                           ,m.group_id
                           ,m.category_1
                           ,m.category_2
                           ,gc.categories_num
                           ,setbit ('0000000000000000'xb,m.category_2_n)  as bitmap
                           ,m.category_2_n

                from                    match_detail_category_2_n   as m

                            join        group_categories_num        as gc

                            on          gc.group_id =
                                        m.group_id

                where       m.category_1    =
                            m.category_2

                union all

                select      e.n + 1                             as n
                           ,e.group_id
                           ,e.category_1
                           ,m.category_2
                           ,e.categories_num
                           ,setbit (e.bitmap,m.category_2_n)    as bitmap
                           ,m.category_2_n

                from                    edges                       as e

                            join        match_detail_category_2_n   as m

                            on          m.group_id  =
                                        e.group_id

                                    and m.category_1    =
                                        e.category_2

                where       e.n <   e.categories_num - 1
                        and getbit (e.bitmap,m.category_2_n) = 0
            )

           ,match_detail_category_2_n (group_id,category_1,category_2,category_2_n)
            as 
            (   
                select      m.group_id
                           ,category_1
                           ,category_2

                           ,cast 
                            (
                                dense_rank () over 
                                (
                                    partition by    group_id 
                                    order by        category_2
                                ) - 1 

                                as byteint
                            )

                from        match_detail as m
            )

           ,group_categories_num (group_id,categories_num)
            as 
            (   
                select      group_id
                           ,count (distinct category_1)

                from        match_detail

                group by    group_id
            )

select      *

from        category_group_bitmap
;
Reptile answered 28/10, 2016 at 6:40 Comment(8)
@transistor1, just wanted to verify you've seen this solution.Twocolor
@transistor1, when would we know if it survives production?Twocolor
If I can confirm that this works more efficiently than @dnoeth's solution, I'll change the accepted answer. I'm not going to provide any timeline. The bounty was created specifically for him, though, so that won't change-- and of course it's already awarded, so I can't change it if I wanted to! Whenever I'm searching for a Teradata solution, I almost always land on one of his answers; he really came through for me here as well, and I appreciate that. That being said, your solution is really cool and it looks very promising.Boisleduc
When I started working on the Teradata database, @Allman was the one and only saviour, so no need to convince me here :-) Also, dnoeth's solution is not based on a bitmap and therefore is not limited by bitmap size, so in any case it should be marked as the correct solution. I'm simply curious. P.s. Earlier I posted here a solution based on stored procedures that I'm using in our production env. The logic is similar to dnoeth's solution and it's working on hundreds of millions of records, executing for 1-1.5 minutes.Twocolor
thanks for your understanding :-). I don't know if there's such a thing as a "Teradata MVP," but if there is, he should get it. I'm testing your answer now. Yes, I saw that answer, but in my current situation it's nearly impossible to get SP access where I am working now. I am in a very large company and there is lots of red tape & politics unfortunately.Boisleduc
Let us continue this discussion in chat.Twocolor
@DuduMarkovitz: No need for multiple bitmaps, you can use a VARBYTE of any (?) size instead: select setbit ('000000000000000000000000000000000000000000000000'xb,64); Would be interesting to compare performance of the WHILE-loop and this recursion.Allman
@Allman - I've overlooked this in the documentation - this is so cool ! P.s. TD 14.10/15.10 doc.: "The maximum supported size (n) for VARBYTE is 8192 bytes."Twocolor

This works, but causes an "out of spool space" issue on real data.

Schema creation:

CREATE VOLATILE TABLE match_detail (
    group_id bigint
    , category_1 varchar(255)
    , category_2 varchar(255)
) PRIMARY INDEX (group_id)
    ON COMMIT PRESERVE ROWS;

INSERT INTO match_detail VALUES (1,'A','B');
INSERT INTO match_detail VALUES (1,'A','A');
INSERT INTO match_detail VALUES (1,'B','A');
INSERT INTO match_detail VALUES (1,'B','C');
INSERT INTO match_detail VALUES (1,'B','B');
INSERT INTO match_detail VALUES (1,'C','B');
INSERT INTO match_detail VALUES (1,'C','D');
INSERT INTO match_detail VALUES (1,'C','C');
INSERT INTO match_detail VALUES (1,'D','C');
INSERT INTO match_detail VALUES (1,'D','E');
INSERT INTO match_detail VALUES (1,'D','D');
INSERT INTO match_detail VALUES (1,'E','D');
INSERT INTO match_detail VALUES (1,'E','F');
INSERT INTO match_detail VALUES (1,'E','E');
INSERT INTO match_detail VALUES (1,'F','E');
INSERT INTO match_detail VALUES (1,'F','G');
INSERT INTO match_detail VALUES (1,'F','F');
INSERT INTO match_detail VALUES (1,'G','F');
INSERT INTO match_detail VALUES (1,'G','G');
INSERT INTO match_detail VALUES (1,'W','X');
INSERT INTO match_detail VALUES (1,'W','W');
INSERT INTO match_detail VALUES (1,'W','Y');
INSERT INTO match_detail VALUES (1,'W','Z');
INSERT INTO match_detail VALUES (1,'X','W');
INSERT INTO match_detail VALUES (1,'X','X');
INSERT INTO match_detail VALUES (1,'Y','W');
INSERT INTO match_detail VALUES (1,'Y','Y');
INSERT INTO match_detail VALUES (1,'Z','W');
INSERT INTO match_detail VALUES (1,'Z','Z');
INSERT INTO match_detail VALUES (2,'L','L');
INSERT INTO match_detail VALUES (2,'M','N');
INSERT INTO match_detail VALUES (2,'N','M');
INSERT INTO match_detail VALUES (2,'M','M');
INSERT INTO match_detail VALUES (2,'N','N');

Query:

WITH

 related_cats AS (
    SELECT
        group_id
        , category_1
        , SUM(DISTINCT bitflag) As bitflag_total
        , DENSE_RANK() OVER (
            PARTITION BY
                group_id
            ORDER BY
                group_id
                , bitflag_total
        ) As bitflag_group

    FROM bitflags

    GROUP BY 1, 2
)

, bitflags As (
    SELECT
        DISTINCT
            group_id
            , category_1
            , category_2
            , CAST
            (
                2 ** (DENSE_RANK() OVER (
                    PARTITION BY
                        group_id

                    ORDER BY group_id
                        , category_2) - 1)

                As bigint
            ) As bitflag

    FROM cat_join
    WHERE depth = 1
)

, RECURSIVE cat_join AS (
    SELECT DISTINCT
        c1.group_id
        , c1.category_1
        , c1.category_2
        , CAST
        (
            n.num_categories - 1
            As integer
        ) As max_depth
        , CASE
            WHEN c1.category_1 = c1.category_2 THEN 1
            ELSE max_depth
        END As depth
        , 1 As recursion

    FROM matches c1
    INNER JOIN num_categories n
        ON c1.group_id = n.group_id

    UNION ALL

    SELECT
        r1.group_id
        , r1.category_1
        , r2.category_2
        , r1.max_depth
        , CASE
            WHEN r1.category_1 = r1.category_2 THEN 1
            WHEN r1.category_1 = r2.category_2 THEN 1
            ELSE r1.depth - 1
        END As cur_depth
        , recursion + 1

        FROM cat_join r1
        INNER JOIN matches r2
            ON r1.group_id = r2.group_id
            AND r1.category_2 = r2.category_1

        WHERE
        r1.category_1 <> r2.category_2
        AND r1.category_1 <> r2.category_1
        AND
            (r1.depth - 1) > 0
)

, matches AS (
    SELECT
        d.group_id
        , d.category_1
        , d.category_2

    FROM match_detail d
    GROUP BY d.group_id, d.category_1, d.category_2
)

, num_categories AS (
    SELECT
        u.group_id
        , COUNT(DISTINCT u.category_2) AS num_categories

    FROM categories u
    GROUP BY u.group_id
)

, categories AS (
    SELECT DISTINCT
        u.group_id
        , u.category_1
        , u.category_2

    FROM match_detail u
)

SELECT *
FROM related_cats
Boisleduc answered 23/10, 2016 at 15:16 Comment(9)
Why is this posted as an answer? Shouldn't it be part of the question?Plastometer
Because it works for the simple data set, and I wanted to prevent the question from being extremely long, which detracts from the question itself.Boisleduc
Off-topic: How did you find out that within the WITH clause, a query_name can reference a following query_name, e.g. related_cats references bitflags?! In all other database it goes the other way around.Twocolor
@DuduMarkovitz: I found out the hard way! It didn't work when using the correct order. It was very frustrating. I spent a lot of time Googling before I finally found that out. It seems like a bug.Boisleduc
@DuduMarkovitz: You're not alone, I was telling every customer to open an incident (iirc DR160077) for years. This is a known "feature" since TD14, it's supposed to be fixed in TD16 (after 4 years). The interesting part will be how they handle the workaround with wrongly ordered CTEs, will those queries fail after the fix?Allman
@dnoeth, "design limitation" :-). The support is talking about a DR for TD16. I'm still trying to figure out if they're going to break the current behavior.Twocolor
@DuduMarkovitz: Apparently no "intelligent design" :)Allman
@dnoeth, most likely, "no design, no Q.A."Twocolor
@dnoeth, P.s. - According to Teradata's support, both ways should work in TD16Twocolor
