Remove rows with same keys
Asked Answered
F

4

5

I have data with rows like the following:

Name1  Name2  Name3  Col  
aaa    bbb    ccc    ...  
abc    ddd    ddd    1  
abc    ddd    ddd    2  
abc    ddd    ddd    3  
fff    fff    fff    ...  
ggg    ggg    hhh    4  
ggg    ggg    hhh    5  

(Name1, Name2 and Name3 are primary keys)

How do I remove the first rows from the set of data with the same 3 primary keys? (leaving only the last row of the set)

I.e. the result from the above would be:

Name1  Name2  Name3  Col  
aaa    bbb    ccc    ...  
abc    ddd    ddd    3  
fff    fff    fff    ...  
ggg    ggg    hhh    5  
Fetch answered 4/3, 2013 at 0:36 Comment(0)
P
5

Assuming your source data is in the correct order, and you want the last record in each set, there aren't any out-of-the-box transformations that would handle this situation. However, a Script Transformation can deal with it fairly easily.

Here's a sample Data Flow:

enter image description here

I'm using FF_SRC_AllRows and FF_DST_SelectedRows as flat file source and destination (respectively) for simplicity's sake, using the sample data you provided; your specific needs will vary. The script transformation SCR_SelectLastRow is configured as a transformation (both input and output):

enter image description here

Select all your input columns (with usage type ReadOnly):

enter image description here

Create one output (I named mine OutgoingRows, but you can name it whatever you wish), and set the SynchronousInputID property to None. This will let your script filter out the rows you don't want.

enter image description here

Add output columns corresponding to the input columns:

enter image description here

And use code along these lines:

/* Microsoft SQL Server Integration Services Script Component
*  Write scripts using Microsoft Visual C# 2008.
*  ScriptMain is the entry point class of the script.*/

using System;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    class IncomingRowData
    {
        public string Name1;
        public string Name2;
        public string Name3;
        public string Col;
        public IncomingRowData(IncomingRowsBuffer Row)
        {
            Name1 = Row.Name1;
            Name2 = Row.Name2;
            Name3 = Row.Name3;
            Col = Row.Col;
        }
        public bool KeysDiffer(IncomingRowData other)
        {
            return (Name1 != other.Name1
                || Name2 != other.Name2
                || Name3 != other.Name3);
        }
        public void WriteToOutputBuffer(OutgoingRowsBuffer Row)
        {
            Row.AddRow();
            Row.Name1 = Name1;
            Row.Name2 = Name2;
            Row.Name3 = Name3;
            Row.Col = Col;
        }
    }

    private IncomingRowData _previousRow;

    public override void IncomingRows_ProcessInputRow(IncomingRowsBuffer Row)
    {
        if (_previousRow == null)
        {
            _previousRow = new IncomingRowData(Row);
        }
        IncomingRowData currentRow = new IncomingRowData(Row);
        if (currentRow.KeysDiffer(_previousRow))
        {
            _previousRow.WriteToOutputBuffer(this.OutgoingRowsBuffer);
        }
        _previousRow = currentRow;
    }

    public override void FinishOutputs()
    {
        if (_previousRow != null)
        {
            _previousRow.WriteToOutputBuffer(this.OutgoingRowsBuffer);
        }
        base.FinishOutputs();
    }
}

One nice thing about this technique is that it allows you to handle the data in a single pass and requires neither the use of staging tables nor keeping the entire source data set in memory. Depending on how big your data set is, either of those could cause significant performance issues.

Patricio answered 4/3, 2013 at 19:36 Comment(1)
Thank you for all your responses! I used Edmund Schweppe's answer, and my program worked! Thanks again.Fetch
F
1

Recommendation #1: Do this in your source query if at all possible.

Assuming that's not possible, and assuming that you always want the maximum value for Col to be selected, you can use the Aggregate component in your data flow.

Just add all of your columns to the aggregate input, and for the operations choose "Group By" for Name1, Name2, and Name3, and "Maximum" for Col.

Unfortunately, the aggregate component is an asynchronous component - meaning your entire flow will pause while the data flows into it, because it won't know the "maximum" value of each set until it's read in every row.

Frohman answered 4/3, 2013 at 0:43 Comment(1)
True; however, the OP was looking to get the last row, not the row with the maximum value.Patricio
F
1
SELECT name1, 
       name2, 
       name3, 
       col 
FROM   (SELECT name1, 
               name2, 
               name3, 
               col, 
               Max(rn) 
                 over ( 
                   PARTITION BY name1, name2, name3 ) AS max_rn, 
               rn 
        FROM   (SELECT name1, 
                       name2, 
                       name3, 
                       col, 
                       Row_number() 
                         over ( 
                           PARTITION BY name1, name2, name3 
                           ORDER BY col ) AS rn 
                FROM   test1)) 
WHERE  max_rn = rn; 

You can try this where test1 is tablename

Fiddling answered 4/3, 2013 at 1:52 Comment(1)
I like the use of PARTITION BY. However, the OP asked for the last row, not the one with the largest value of col. Further, while the odds are pretty good that there's a SQL Server database somewhere in the OP's environment, the original question doesn't mention a database at all, let alone one that supports PARTITION BY.Patricio
A
1

You need to group your data and select max Col value.

FLOW:

enter image description here

Data:

enter image description here

Aggregate element:

enter image description here

Result: enter image description here

If you using SQL Table and can write query:

SQLFIDDLEExample

SELECT Name1, Name2, Name3, MAX(Col) Col
FROM Table1
GROUP BY Name1, Name2, Name3

Result:

| NAME1 | NAME2 | NAME3 | COL |
-------------------------------
|   aaa |   bbb |   ccc | ... |
|   abc |   ddd |   ddd |   3 |
|   fff |   fff |   fff | ... |
|   ggg |   ggg |   hhh |   5 |
Astrolabe answered 5/3, 2013 at 6:49 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.