Connected Components

I have a set of data that has been created by matching together similar sub-items, and then GROUPing these similar items by "category".

Now, the resultant categories must be grouped so that related categories fall together within each "group_id". In the example below, one match is A->B->C->D->E->F->G, obtained by recursing through the rows.

I've posted my current answer, which works on this simple data set; but because the actual data contains up to 1M rows, with up to 60 categories per "group_id", the query fails with an "out of spool space" error on real data.

Please note:

  • Due to company restrictions, I cannot use stored procedures.
  • I can't use user-defined functions (UDFs).
  • I can't use user-defined types (UDTs).

A correct answer will

  • Be written for Teradata or compatible
  • Be more efficient than my answer
  • Respect the restrictions I mention above

Sample Input: (image in original post)

Desired Output: (image in original post)
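For readers landing here: the task above is the classic connected-components problem on a graph whose edges are the (category_1, category_2) pairs. As a point of reference (not Teradata code), a hypothetical union-find sketch in Python, using pairs mirroring the sample data, produces the desired grouping:

```python
# Union-find sketch of the connected-components grouping described above.
# The pairs are illustrative, mirroring the A..G chain and W..Z cluster.
def find(parent, x):
    # Path-halving find: walk to the root, flattening the tree as we go.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def components(pairs):
    parent = {}
    for a, b in pairs:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[rb] = ra  # union the two components
    groups = {}
    for node in parent:
        groups.setdefault(find(parent, node), set()).add(node)
    return sorted(sorted(g) for g in groups.values())

pairs = [('A','B'), ('B','C'), ('C','D'), ('D','E'), ('E','F'), ('F','G'),
         ('W','X'), ('W','Y'), ('W','Z')]
print(components(pairs))
```

The challenge in the question is doing exactly this without procedural code, purely in set-based SQL.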

Boisleduc answered 23/10, 2016 at 15:15 Comment(1)

You need a recursive approach, but your WITH RECURSIVE creates a humongous intermediate result, which exhausts your spool space.

For a similar process I used the following approach (originally using a WHILE loop in a stored procedure):

CREATE MULTISET VOLATILE TABLE vt_tmp, NO LOG AS
 (
  SELECT group_id, category_1, category_2,
     -- assign a unique number to each (group_id, category_1) pair
     Dense_Rank() Over (ORDER BY group_id, category_1) AS rnk
  FROM match_detail
  -- remove when your source data is already unique
  GROUP BY 1,2,3 -- same result as a DISTINCT, but processed before DENSE_RANK
 )
WITH DATA
PRIMARY INDEX (category_2)
ON COMMIT PRESERVE ROWS;

Now repeat the following update until 0 rows processed:

-- find matching categories and assign them a common number    
UPDATE vt_tmp FROM
 ( SELECT e2.group_id, e2.category_1, Min(e1.rnk) AS minrnk
   FROM vt_tmp e1 JOIN vt_tmp e2
   ON e1.category_2 = e2.category_2
   AND e1.rnk < e2.rnk
   GROUP BY e2.group_id, e2.category_1
 ) x
SET rnk = x.minrnk
WHERE 
  vt_tmp.group_id = x.group_id
AND vt_tmp.category_1 = x.category_1
;
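The loop's logic can be modelled outside SQL. The following Python sketch (a hypothetical `converge` function, single group_id, illustrative pairs including the self-referencing rows from the sample data) mimics the repeated UPDATE: every pass pulls each category_1's rank down to the minimum rank seen for any shared category_2, until a pass changes nothing:

```python
# Python model of the repeated UPDATE: each category_1 starts with
# rnk = dense_rank(category_1); on every pass, rows sharing a category_2
# adopt the minimum rnk recorded for it. The loop stops at a fixpoint,
# i.e. when a pass updates zero rows.
def converge(pairs):
    cats = sorted({c1 for c1, _ in pairs})
    rnk = {c1: i + 1 for i, c1 in enumerate(cats)}  # dense_rank over category_1
    changed = True
    while changed:
        changed = False
        # minimum rnk per shared category_2, as of the start of this pass
        best = {}
        for c1, c2 in pairs:
            best[c2] = min(best.get(c2, rnk[c1]), rnk[c1])
        for c1, c2 in pairs:
            if best[c2] < rnk[c1]:
                rnk[c1] = best[c2]
                changed = True
    return rnk

pairs = [('A','A'), ('A','B'), ('B','A'), ('B','B'), ('B','C'), ('C','B'),
         ('C','C'), ('W','W'), ('W','X'), ('X','W'), ('X','X')]
print(converge(pairs))
```

Each pass here corresponds to one execution of the UPDATE; the fixpoint check corresponds to "0 rows processed". Ranks only ever decrease, so the loop is guaranteed to terminate.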

Finally, to get the related categories:

SELECT group_id, category_1 AS category, rnk AS related_categories
FROM vt_tmp
UNION
SELECT group_id, category_2, rnk 
FROM vt_tmp

For an exact match of your expected result you need to add a DENSE_RANK:

SELECT group_id, category, Dense_Rank() Over (PARTITION BY group_id ORDER BY related_categories)
FROM
 (
   SELECT group_id, category_1 AS category, rnk AS related_categories
   FROM vt_tmp
   UNION
   SELECT group_id, category_2, rnk 
   FROM vt_tmp
 ) AS dt
Allman answered 23/10, 2016 at 17:5 Comment(8)
Thank you. +1 for your answer. I will try this when I get in front of a computer. One thing - is there any way to iterate the update statement automatically (in SQLA) e.g. without repeating the statement 60 times? I had thought to use an UPDATE using a cross-join, but unfortunately TD (14.00.03 I believe) required that UPDATEs be 1:1Boisleduc
@transistor1: You probably don't need 60 recursions, but SQLA doesn't support loops. You might put the update in a macro and then call it instead, e.g. exec macro; .IF ACTIVITYCOUNT = 0 THEN GOTO .finish; ... repeat IF/EXEC n times; LABEL .finish; final SELECT.Allman
This may work.. for whatever reason, I believe I can create macros in my workdb. I wasn't aware they supported iteration.Boisleduc
@transistor1: Macros don't support iteration, they just simplify reusing the same source code multiple times. Btw, you can always create macros within your own user; DBAs don't restrict that.Allman
That .IF syntax is BTEQ syntax, right? Does that work in SQLA? I ask because the person who will be running this query (weekly) will be using SQLA.Boisleduc
@transistor1: Yep, it's BTEQ syntax, but also supported by SQLA. Maybe you can convince your DBA, that this is perfect task for a Stored Procedure.Allman
Thanks! At the risk of getting into an extended conversation, let's just say I've already been down that road and there is too much red tape & corporate politics surrounding that option at this time...Boisleduc
"For to the one who has, more will be given, and he will have an abundance, but from the one who has not, even what he has will be taken away." Matt 13:12. Thanks for the help!Boisleduc

The concept of this solution is to avoid cycles while traversing the edges.
This is done by computing each path's bitmap on the fly and skipping any edge whose category_2 is already set in the bitmap.
Every path starts with a self-referencing edge (e.g. 'A'--'A') so that the first node's bit is set.
This lowers the number of result records of the iterative sub-query (edges) to 70.


The limitation of this solution is that the bitmap size must be defined in advance.
Every hex digit in the VARBYTE literal stands for 4 bits:
64 bits = '0000000000000000'xb
128 bits = '00000000000000000000000000000000'xb
etc.
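To make the bitmap idea concrete, here is a hypothetical Python sketch (not part of the Teradata query) of the same cycle-avoiding traversal; the bitwise AND test and OR correspond to the query's getbit and setbit calls:

```python
# Sketch of the bitmap trick: while extending a path edge by edge, keep a
# "visited" bitmap (one bit per category, numbered 0..n-1 like the
# dense_rank() - 1 in the query) and refuse any edge whose target bit is
# already set -- that is what prevents cycles. Python ints are arbitrary
# precision, so unlike the VARBYTE literal no fixed size must be chosen.
def paths_without_cycles(edges, start):
    # edges: adjacency dict, node -> list of neighbours
    nodes = sorted(set(edges) | {v for vs in edges.values() for v in vs})
    number = {c: i for i, c in enumerate(nodes)}  # bit position per category
    results = []

    def walk(node, bitmap, path):
        results.append(path)
        for nxt in edges.get(node, []):
            bit = 1 << number[nxt]
            if bitmap & bit == 0:                      # getbit(...) = 0
                walk(nxt, bitmap | bit, path + [nxt])  # setbit(...)

    walk(start, 1 << number[start], [start])  # self-edge seeds the bitmap
    return results

edges = {'A': ['B'], 'B': ['A', 'C'], 'C': ['B']}
print(paths_without_cycles(edges, 'A'))
```

Every emitted path is cycle-free, which is exactly how the recursive edges sub-query keeps its row count small.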


with        category_group_bitmap (group_id,category_1,bitflag_group)
            as 
            (
                select      group_id
                           ,category_1

                           ,dense_rank () over 
                            (
                                partition by    group_id 
                                order by        sum (distinct category_2_n)
                            ) as bitflag_group

                from        edges

                group by    group_id
                           ,category_1
            )  

           ,recursive edges (n,group_id,category_1,category_2,categories_num,bitmap,category_2_n)
            as
            (
                select      1                                           as n
                           ,m.group_id
                           ,m.category_1
                           ,m.category_2
                           ,gc.categories_num
                           ,setbit ('0000000000000000'xb,m.category_2_n)  as bitmap
                           ,m.category_2_n

                from                    match_detail_category_2_n   as m

                            join        group_categories_num        as gc

                            on          gc.group_id =
                                        m.group_id

                where       m.category_1    =
                            m.category_2

                union all

                select      e.n + 1                             as n
                           ,e.group_id
                           ,e.category_1
                           ,m.category_2
                           ,e.categories_num
                           ,setbit (e.bitmap,m.category_2_n)    as bitmap
                           ,m.category_2_n

                from                    edges                       as e

                            join        match_detail_category_2_n   as m

                            on          m.group_id  =
                                        e.group_id

                                    and m.category_1    =
                                        e.category_2

                where       e.n <   e.categories_num - 1
                        and getbit (e.bitmap,m.category_2_n) = 0
            )

           ,match_detail_category_2_n (group_id,category_1,category_2,category_2_n)
            as 
            (   
                select      m.group_id
                           ,category_1
                           ,category_2

                           ,cast 
                            (
                                dense_rank () over 
                                (
                                    partition by    group_id 
                                    order by        category_2
                                ) - 1 

                                as byteint
                            )

                from        match_detail as m
            )

           ,group_categories_num (group_id,categories_num)
            as 
            (   
                select      group_id
                           ,count (distinct category_1)

                from        match_detail

                group by    group_id
            )

select      *

from        category_group_bitmap
;
Reptile answered 28/10, 2016 at 6:40 Comment(8)
@transistor1, just wanted to verify you've seen this solution.Twocolor
@transistor1, when would we know if it survives production?Twocolor
If I can confirm that this works more efficiently than @dnoeth's solution, I'll change the accepted answer. I'm not going to provide any timeline. The bounty was created specifically for him, though, so that won't change-- and of course it's already awarded, so I can't change it if I wanted to! Whenever I'm searching for a Teradata solution, I almost always land on one of his answers; he really came through for me here as well, and I appreciate that. That being said, your solution is really cool and it looks very promising.Boisleduc
When I started working on the Teradata database, @Allman was the one and only saviour, so no need to convince me here :-) Also, dnoeth's solution is not based on a bitmap and therefore is not limited by bitmap size, so in any case it should be marked as the correct solution. I'm simply curious. P.s. Earlier I posted here a solution based on stored procedures that I'm using in our production env. The logic is similar to dnoeth's solution and it's working on hundreds of millions of records, executing for 1-1.5 minutes.Twocolor
thanks for your understanding :-). I don't know if there's such a thing as a "Teradata MVP," but if there is, he should get it. I'm testing your answer now. Yes, I saw that answer, but in my current situation it's nearly impossible to get SP access where I am working now. I am in a very large company and there is lots of red tape & politics unfortunately.Boisleduc
Let us continue this discussion in chat.Twocolor
@DuduMarkovitz: No need for multiple bitmaps, you can use a VARBYTE of any (?) size instead: select setbit ('000000000000000000000000000000000000000000000000'xb,64); Would be interesting to compare performance of the WHILE-loop and this recursion.Allman
@Allman - I've overlooked this in the documentation - this is so cool ! P.s. TD 14.10/15.10 doc.: "The maximum supported size (n) for VARBYTE is 8192 bytes."Twocolor

This works, but causes an "out of spool space" issue on real data.

Schema creation:

CREATE VOLATILE TABLE match_detail (
    group_id bigint
    , category_1 varchar(255)
    , category_2 varchar(255)
) PRIMARY INDEX (group_id)
    ON COMMIT PRESERVE ROWS;

INSERT INTO match_detail VALUES (1,'A','B');
INSERT INTO match_detail VALUES (1,'A','A');
INSERT INTO match_detail VALUES (1,'B','A');
INSERT INTO match_detail VALUES (1,'B','C');
INSERT INTO match_detail VALUES (1,'B','B');
INSERT INTO match_detail VALUES (1,'C','B');
INSERT INTO match_detail VALUES (1,'C','D');
INSERT INTO match_detail VALUES (1,'C','C');
INSERT INTO match_detail VALUES (1,'D','C');
INSERT INTO match_detail VALUES (1,'D','E');
INSERT INTO match_detail VALUES (1,'D','D');
INSERT INTO match_detail VALUES (1,'E','D');
INSERT INTO match_detail VALUES (1,'E','F');
INSERT INTO match_detail VALUES (1,'E','E');
INSERT INTO match_detail VALUES (1,'F','E');
INSERT INTO match_detail VALUES (1,'F','G');
INSERT INTO match_detail VALUES (1,'F','F');
INSERT INTO match_detail VALUES (1,'G','F');
INSERT INTO match_detail VALUES (1,'G','G');
INSERT INTO match_detail VALUES (1,'W','X');
INSERT INTO match_detail VALUES (1,'W','W');
INSERT INTO match_detail VALUES (1,'W','Y');
INSERT INTO match_detail VALUES (1,'W','Z');
INSERT INTO match_detail VALUES (1,'X','W');
INSERT INTO match_detail VALUES (1,'X','X');
INSERT INTO match_detail VALUES (1,'Y','W');
INSERT INTO match_detail VALUES (1,'Y','Y');
INSERT INTO match_detail VALUES (1,'Z','W');
INSERT INTO match_detail VALUES (1,'Z','Z');
INSERT INTO match_detail VALUES (2,'L','L');
INSERT INTO match_detail VALUES (2,'M','N');
INSERT INTO match_detail VALUES (2,'N','M');
INSERT INTO match_detail VALUES (2,'M','M');
INSERT INTO match_detail VALUES (2,'N','N');

Query:

WITH

 related_cats AS (
    SELECT
        group_id
        , category_1
        , SUM(DISTINCT bitflag) As bitflag_total
        , DENSE_RANK() OVER (
            PARTITION BY
                group_id
            ORDER BY
                group_id
                , bitflag_total
        ) As bitflag_group

    FROM bitflags

    GROUP BY 1, 2
)

, bitflags As (
    SELECT
        DISTINCT
            group_id
            , category_1
            , category_2
            , CAST
            (
                2 ** (DENSE_RANK() OVER (
                    PARTITION BY
                        group_id

                    ORDER BY group_id
                        , category_2) - 1)

                As bigint
            ) As bitflag

    FROM cat_join
    WHERE depth = 1
)

, RECURSIVE cat_join AS (
    SELECT DISTINCT
        c1.group_id
        , c1.category_1
        , c1.category_2
        , CAST
        (
            n.num_categories - 1
            As integer
        ) As max_depth
        , CASE
            WHEN c1.category_1 = c1.category_2 THEN 1
            ELSE max_depth
        END As depth
        , 1 As recursion

    FROM matches c1
    INNER JOIN num_categories n
        ON c1.group_id = n.group_id

    UNION ALL

    SELECT
        r1.group_id
        , r1.category_1
        , r2.category_2
        , r1.max_depth
        , CASE
            WHEN r1.category_1 = r1.category_2 THEN 1
            WHEN r1.category_1 = r2.category_2 THEN 1
            ELSE r1.depth - 1
        END As cur_depth
        , recursion + 1

        FROM cat_join r1
        INNER JOIN matches r2
            ON r1.group_id = r2.group_id
            AND r1.category_2 = r2.category_1

        WHERE
        r1.category_1 <> r2.category_2
        AND r1.category_1 <> r2.category_1
        AND
            (r1.depth - 1) > 0
)

, matches AS (
    SELECT
        d.group_id
        , d.category_1
        , d.category_2

    FROM match_detail d
    GROUP BY d.group_id, d.category_1, d.category_2
)

, num_categories AS (
    SELECT
        u.group_id
        , COUNT(DISTINCT u.category_2) AS num_categories

    FROM categories u
    GROUP BY u.group_id
)

, categories AS (
    SELECT DISTINCT
        u.group_id
        , u.category_1
        , u.category_2

    FROM match_detail u
)

SELECT *
FROM related_cats
Boisleduc answered 23/10, 2016 at 15:16 Comment(9)
Why is this posted as an answer? Shouldn't it be part of the question?Plastometer
Because it works for the simple data set, and I wanted to prevent the question from being extremely long, which detracts from the question itself.Boisleduc
Off-topic: How did you find out that within the WITH clause, a query_name can reference a following query_name, e.g. related_cats references bitflags?! In all other database it goes the other way around.Twocolor
@DuduMarkovitz: I found out the hard way! It didn't work when using the correct order. It was very frustrating. I spent a lot of time Googling before I finally found that out. It seems like a bug.Boisleduc
@DuduMarkovitz: You're not alone, I was telling every customer to open an incident (iirc DR160077) for years. This is a known "feature" since TD14, it's supposed to be fixed in TD16 (after 4 years). The interesting part will be how they handle the workaround with wrongly ordered CTEs, will those queries fail after the fix?Allman
@dnoeth, "design limitation" :-). The support is talking about a DR for TD16. I'm still trying to figure out if they're going to break the current behavior.Twocolor
@DuduMarkovitz: Apparently no "intelligent design" :)Allman
@dnoeth, most likely, "no design, no Q.A."Twocolor
@dnoeth, P.s. - According to Teradata's support, both ways should work in TD16Twocolor
