Packing Ranges in Two Dimensions
Asked Answered
S

3

2

My Problem

Consider a set of data with two intervals. For instance, consider a student schedule of classes. Each record has a begin and end date, and each class has a period start time and a period end time. But this schedule is not 'normalized' in the sense that some records overlap. So if you search for records encompassing a given date and period for a student, you might get multiple matches.

Here's a contrived example. I represent the dates as integers to simplify the problem:

declare @schedule table (
    student char(3),
    fromDate int,
    toDate int,
    fromPeriod int,
    toPeriod int
)

insert @schedule values
    ('amy', 1, 7, 7, 9),
    ('amy', 3, 9, 5, 8), 
    ('amy', 10, 12, 1, 3), 
    ('ted', 1, 5, 11, 14),
    ('ted', 7, 11, 13, 16); 

Amy's date and period ranges either overlap or are adjacent. If I queried for 'date 5 period 7', I would get two matches. I need these reworked so that they represent the same 'area' but no longer overlap.

Ted's periods overlap but his dates do not. This means there's no real overlap, so no need to re-work anything.

My Research

I've read many posts and some articles on working overlapping intervals. Namely:

I've implemented one from Itzik from a blog entitled 'solutions-packing-date-and-time-intervals-puzzle' that has worked great for one particular project. I don't think it's a stable link, but I've found a copy of it here.

But I'm having difficulty extending the knowledge in those resources to my problem at hand. It might be my limitation. I have trouble following them. I've studied Itzik's solution and have come to understand a lot of it, but I do recall there's one piece I just couldn't understand. Or it might be that those solutions only work with singular ranges.

My Attempt

I resolved this question by treating the ranges as literal rectangle objects. It works. I've even made a version of it somewhat performant in my own application. So I'll post it as a solution in case it is of use to anyone with the same issue.

But it is so long and involved and there are enough quirks to it (e.g. buffering lines, looping shapes, working with float values, rounding issues) that I can't help but think that there's a much better way. Can the concepts of my listed resources be extended to dual ranges? Or do some SRID's allow cutting rectangles with zero-length lines?

Expected Results:

There is no one answer to this problem, because you can aggregate ranges and deconstruct them in different ways. But to minimize the number of resulting rectangles, there are really only two acceptable answers. Visually, with dates on the X axis and periods on the Y axis, overlapping ranges can start out like this:

 +------------+
 |            |
 |    +------------+
 |    ||||||||     |  <- 2 overlapping rectangles
 +----|            |
      |            |
      +------------+

We can rework it this way:

 +---+ +-----+
 |   | |     |
 |   | |     | +---+  <- 3 non-overlapping 
 |   | |     | |   |     vertically cut rectangles
 +---| |     | |   |
       |     | |   |
       +-----+ +---+

Or this way:

 +-----------+
 +-----------+

 +-----------------+  <- 3 non-overlapping 
 +-----------------+     horizontally cut rectangles

       +-----------+
       +-----------+

Going with vertical cuts, the results would look like this:

+-------------------------------------------+
|student|fromDate|toDate|fromPeriod|toPeriod|
|-------------------------------------------|
|amy    |1       |2     |7         |9       |
|amy    |3       |7     |5         |9       |
|amy    |8       |9     |5         |8       |
|amy    |10      |12    |1         |3       |
|ted    |1       |5     |11        |14      |
|ted    |7       |11    |13        |16      |
+-------------------------------------------+

Going with horizontal cuts, the results would look like this:

+-------------------------------------------+
|student|fromDate|toDate|fromPeriod|toPeriod|
|-------------------------------------------|
|amy    |1       |7     |9         |9       |
|amy    |1       |9     |7         |8       |
|amy    |3       |9     |5         |6       |
|amy    |10      |12    |1         |3       |
|ted    |1       |5     |11        |14      |
|ted    |7       |11    |13        |16      |
+-------------------------------------------+

Either is acceptable. Though, to keep it deterministic and tractable, you would want to choose one strategy and stick with it.

Subdominant answered 19/12, 2019 at 22:18 Comment(4)
Please show us the result that you would expect for your sample data.Cookery
There is no one answer to this problem, because you can aggregate ranges and recombine them in different ways.: so your result is not actually deterministic? This sounds like an akward requirement...Cookery
@GMB, it can very well be deterministic. You can write an algorithm that, with a particular input, gives the same expected output. I'm just saying that different criteria for the results can be acceptable for the goal at hand.Subdominant
@GMB, reworked the expected results. Hopefully it addresses your concern better. Thanks for your feedback.Subdominant
S
1

I had the pleasure of attending one of Itzik's trainings, and presented this to him. I only asked for possible references but he provided the solution! Presented here with his verbal permission. Some small modifications for explanatory value and to work with the existing numbers table in the question. Any errors as a result are of course my own:

WITH pixellateAndBinPeriods AS (
    SELECT
        s.student,
        date = dates.i,
        period = periods.i,

        -- periods.i increments by 1 if the period before it is consecutive, by more than 1 otherwise
        -- denserank increments by 1 no mater what
        -- subtracting out the denserank produces the same number when consecutive, a different number when not
        periodBin = 
            periods.i -- increments in order if consecutive
            - DENSE_RANK() OVER (PARTITION BY s.student, dates.i ORDER BY periods.i) -- increments in order no matter what
            -- subtracting in these conditions creates an distinct identifier for consecutive series

    FROM @schedule AS s
    INNER JOIN #numbers dates 
        ON dates.i >= s.fromdate 
        AND dates.i <= s.todate
    INNER JOIN #numbers periods 
        ON periods.i >= s.fromperiod 
        AND periods.i <= s.toperiod
),
packPeriodsAndBinDates AS (
    SELECT
        student,
        date,
        fromperiod = MIN(period),
        toperiod = MAX(period),

        -- Same logic as periodBin, but also partition by from and to period to group together 
        -- consecutive dates only if they have the same period ranges.
        dateBin = date - DENSE_RANK() OVER (PARTITION BY student, MIN(period), MAX(period) ORDER BY date)

    FROM pixellateAndBinPeriods
    GROUP BY
        student,
        date,
        periodBin
)
-- pack the dates
SELECT
    student,
    fromdate = MIN(date),
    todate = MAX(date),
    fromperiod,
    toperiod
INTO #normalized
FROM packPeriodsAndBinDates
GROUP BY
    student,
    fromperiod,
    toperiod,
    dateBin
ORDER BY
    student,
    fromdate,
    fromperiod;

Unnormalized Schedule:

schedule

Pixelation:

#pixellateAndBinPeriods

Packing by Period:

packPeriodsAndBinDates

Packing by Date:

enter image description here

Subdominant answered 19/12, 2019 at 22:18 Comment(0)
S
0

Numbers table:

To address the problem geometrically as I indicate in my post, you have to work with the SQL Server geometry data type. Unfortunately, to get each individual shape or point inside a geometry value, you have to call for the shape by index. A numbers table helps with this. So I do that first (swap this out for your preferred implementation).

create table #numbers (i int);

declare @i int = 1;
while @i <= 100 begin
    insert #numbers values (@i);
    set @i += 1;
end;

Aggregate the Ranges:

The first required task is to convert the numeric ranges to geometric rectangles. Point creates the corner points. STUnion and STEnvelope serve to turn these into a rectangle. Also, since we desire ranges to merge together when they are integer-adjacent, we add 1 to the 'to' fields before geometric conversion.

Then the rectangles must be unioned so that there are no overlaps. This is done by UnionAggregate. The result is a geometry object of rectilinearPolygons (boxy shapes).

The geometry object can still have multiple rectillinearPolygons. So these are listed and output as individual shapes to rectilinears.

with

    aggregateRectangles as (
    
        select      student, 
                    rectilinears = geometry::UnionAggregate(rectangle)
        from        @schedule s
        cross apply (select 
                        minPt = geometry::Point(s.fromDate, s.fromPeriod, 0),
                        maxPt = geometry::Point(s.toDate + 1, s.toPeriod + 1, 0)
                    ) extremePoints
        cross apply (select rectangle = minPt.STUnion(maxPt).STEnvelope()) enveloped
        group by    student 

    )

    select      ar.student, 
                r.rectilinear,
                mm.minY,
                mm.maxY
    into        #rectilinears
    from        aggregateRectangles ar
    join        #numbers n on n.i between 1 and ar.rectilinears.STNumGeometries()
    cross apply (select rectilinear = ar.rectilinears.STGeometryN(n.i)) r
    cross apply (select envelope = r.rectilinear.STEnvelope()) e
    cross apply (select 
                    minY = e.envelope.STPointN(1).STY, 
                    maxY = e.envelope.STPointN(3).STY 
                ) mm;

SideNote - Performance Option:

I'm not implementing it here. But if you're working with big-data, and your 'rectilinears' (plural) field above is shared among many groupings (such as many students having the same schedule), then save the Well-known-text version of the rectilinear object (Just do ToString()). After this, create a second dataset with distinct rectilinears and perform the remaining geometric operations on that condensed dataset. Join it back in to the student-level later on. This has significantly improved performance in my real case.

Decompose the Ranges:

Next, those rectilinears have to be decomposed back into rectangles. Splitters are created by creating vertical lines at the x coordinates of each point. The y axis could just as easily been chosen, I just chose x for my own semantics. Both axis could also have been chosen, but this would result in more records than necessary.

Unfortunately, SQL Server does not split a shape if the splitter has zero-width (set-theoretically, that's inappropriate, but I imagine that you can't represent the result properly in WKT format). So we need to give the splitters a buffer so that they have an area. There's STBuffer, though I've had trouble with it so I just create one manually.

With this, the rectangles are split. When they're split, they still all reside in the same geometry object, so they enumerated and then inserted individually into the #rectangles table.

with

    createSplitters as (

        select      r.student,
                    rectilinear = geometry::STGeomFromText(r.rectilinear.ToString(), 0),
                    splitters = geometry::UnionAggregate(sp.splitter)
        from        #rectilinears r
        join        #numbers n on n.i between 1 and r.rectilinear.STNumPoints()
        cross apply (select 
                        x = r.rectilinear.STPointN(n.i).STX,
                        buffer = 0.001
                    ) px
        cross apply (select splitter =
                        geometry::Point(x - buffer, minY - buffer, 0).STUnion(
                            geometry::Point(x + buffer, maxY + buffer, 0)
                        ).STEnvelope()
                    ) sp
        group by    r.student,
                    r.rectilinear.ToString()

    )

    select      student,
                rectangle = rectangles.STGeometryN(n.i)
    into        #rectangles
    from        createSplitters sp
    cross apply (select 
                    rectangles = rectilinear.STDifference(sp.splitters)
                ) r
    join        #numbers n on n.i between 1 and r.rectangles.STNumGeometries();

Parse the Ranges:

That's the crux of it. What remains is simply to extract the proper values from the rectangles to give the ranges.

To do this, we first invoke STEnvelope to ensure the rectangles are only represented by their corner points. Then we round the corner points to undo the effects of our buffer, and any issues with float representation. We also subtract 1 from the 'to' fields to undo what we did before converting to geometric points.

select      student,
            fromDate = round(minPt.STX,0),
            toDate = round(maxPt.STX,0) - 1,
            fromPeriod = round(minPt.STY,0),
            toPeriod = round(maxPt.STY,0) - 1
into        #normalized
from        #rectangles r
cross apply (select 
                minPt = r.rectangle.STPointN(1), 
                maxPt = r.rectangle.STPointN(3)
            ) corners
order by    student, fromDate, fromPeriod;

Optional - Visualize Before and After:

I've made it this far, so I minus well give a visual representation of the before and after results. Press the 'Spatial Results' tab in SSMS, choose 'student' as the label column.

Not the integer representation requires filling of a whole interval. So, for instance, a "to" date of 2 is really a "to" date of 2 to 2.99999.

select      student,
            unnormalized = 
                geometry::Point(fromDate, fromPeriod, 0).STUnion(
                    geometry::Point(toDate + 0.99, toPeriod + 0.99, 0)
                ).STEnvelope(),
            normalized = null
from        @schedule s

select      student,
            unnormalized = null,
            normalized = 
                geometry::Point(fromDate, fromPeriod, 0).STUnion(
                    geometry::Point(toDate + 0.99, toPeriod + 0.99, 0)
                ).STEnvelope()
from        #normalized;
Subdominant answered 19/12, 2019 at 22:33 Comment(0)
S
0

that is a very creative solution and an interesting read!!

A rather simplistic approach:

with 

    a as (    

        select student, fromdate from @schedule union
        select student, todate+1 from @schedule    

    ),

    b as (

        select   *, 
                 todate = (
                     select   min(aa.fromdate) 
                     from a as aa 
                     where aa.student = a.student 
                     and aa.fromdate > a.fromdate
                 ) - 1 
        from     a
    )

    select    *
    from      b
    where     exists (
                  select   * 
                  from     @schedule as s 
                  where    s.student = b.student 
                  and      s.fromdate < b.todate 
                  and      s.todate > b.fromdate
              );
Steakhouse answered 20/12, 2019 at 10:32 Comment(3)
Thanks Iptr. But this only does merging on the date range and ignores the period range. I need merging on the date and period ranges simultaneously. If you can extend your approach to work for that, that would be spectacular.Subdominant
I should add though that I like that this was posted so that people can see the one-dimensional equivalent of what I'm trying to do. I believe it's one of the approaches described in the links under 'My Research'.Subdominant
you can get access to the periods by replacing the exists with a cross apply (cross apply is an implicit exists) and then adjust the period limits accordingly. Also the LEAD() instead of the subquery to get the todate might be more performant.Steakhouse

© 2022 - 2024 — McMap. All rights reserved.