SSIS package runs 500x longer on one server

I have an SSIS package--two data flow tasks, 8 components each, reading from two flat files, nothing spectacular. If I run it in BIDS, it takes reliably about 60 seconds. I have a sandbox DB server with the package running in a job which also takes reliably 30-60 seconds. On my production server, the same job with the same package takes anywhere from 30 seconds to 12 hours.

With logging enabled on the package, it looks like it bogs down--initially at least--in the pre-execute phase of one or the other (or both) data flow tasks. But I can also see the data coming in--slowly, in chunks, so I think it does move on from there later. The IO subsystem gets pounded, and SSIS generates many large temp files (about 150MB worth--my input data files are only about 24MB put together) and is reading and writing vigorously from those files (thrashing?).

Of note, if I point my BIDS instance of the package at the production server, it still only takes about 60 seconds to run! So it must be something with running dtexec there, not the DB itself.

I've already tried to optimize my package, reducing input row byte size, and I made the two data flow tasks run in series rather than in parallel--to no avail.

Both DB servers are running MSSQL 2008 R2 64-bit, same patch level. Both servers are VMs on the same host, with the same resource allocation. Load on the production server should not be that much higher than on the sandbox server right now. The only difference I can see is that the production server is running Windows Server 2008, while the sandbox is on Windows Server 2008 R2.

Help! Any ideas to try are welcome: what could be causing this huge discrepancy?

Appendix A

Here's what my package looks like…

The control flow is extremely simple:

[Control Flow screenshot]

The data flow looks like this:

[Data Flow screenshot]

The second data flow task is exactly the same, just with a different source file and destination table.

Notes

The completion constraint in the Control Flow is only there to make the tasks run serially to try and cut down on resources needed concurrently (not that it helped solve the problem)…there is no actual dependency between the two tasks.

I'm aware of potential issues with blocking and partially-blocking transforms (can't say I understand them completely, but somewhat at least) and I know the aggregate and merge join are blocking and could cause problems. However, again, this all runs fine and quickly in every other environment except the production server…so what gives?

The reason for the Merge Join is to make the task wait for both branches of the Multicast to complete. The right branch finds the minimum datetime in the input and deletes all records in the table after that date, while the left branch carries the new input records for insertion--so if the right branch proceeds before the aggregate and deletion, the new records will get deleted (this happened). I'm unaware of a better way to manage this.
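
In plain T-SQL, that right branch amounts to roughly the following (a sketch only: EventTable and ApptTime are the names quoted in my comments below, the datetime type is assumed, and #NewRows merely stands in for the rows arriving from the flat file):

-- Rough set-based equivalent of the right branch (illustration only).
DECLARE @MinAppt datetime;               -- assumed type

SELECT @MinAppt = MIN(ApptTime)          -- what the Aggregate computes
FROM   #NewRows;                         -- stand-in for the incoming flat-file rows

DELETE FROM EventTable                   -- what the "Delete records" component runs,
WHERE  ApptTime >= @MinAppt;             -- parameterized in the package as ApptTime >= ?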

The error output from "Delete records" is always empty--this is deliberate, as I don't actually want any rows from that branch in the merge (the merge is only there to synchronize completion as explained above).

See comment below about the warning icons.

Grearson answered 16/12, 2013 at 8:24 Comment(13)
Have you rebuilt indexes and column statistics? Do the SQL Servers have the same config? – Rakel
At the moment, there are no indexes on the two tables in question (our writing is currently more expensive than any of our reads)…and yes, I mean there's no primary key either; this is imported data from a vintage 70's system and it's a mess. Remember, though, that running the package from BIDS and importing data to the same server has no performance problem--so I'm reluctant to look to the data. Config the same how so? As far as I know everything relevant is the same; what should I check in particular? – Horsetail
Pop a screenshot of what these packages actually look like, please. Control + data flow. – Alishiaalisia
@S'pht'Kr: did you check the warnings in your package? – Unstrung
@VijaykumarHadalgi: Yes, but I just re-checked them. On the Flat File Source, it was a metadata-out-of-sync warning, which was because I shortened some of the fields (to reduce row memory size)--fixed. On Delete Records, it's "No rows will be sent to error output", which is expected. On the Destination, it's a truncation warning, which is fine. The warnings in the Control Flow are just a rollup of the warnings from inside the Data Flow tasks. Warnings are the same in both Data Flow tasks. – Horsetail
I'm not an SSIS guy, but I found this by accident the other day; hope it can point you in the right direction: sqlservercentral.com/Forums/Topic472867-148-2.aspx – Rennin
You probably already did this: open the editor of your "OLE DB Destination" and check that your "Data access mode:" is "Table or view - fast load" with a check next to "Table lock" and a huge number (2147483647) in "Maximum insert commit size". – Unbidden
You can rule out the "OLE DB Destination" and "Flat File Destination" as the bottleneck by removing them. Replace the "OLE DB Destination" with a Union All Transformation that connects to nothing downstream. If this resolves your runtime issue, you know that the bottleneck must be in one of those two. – Unbidden
If you have logging turned on, preferably to SQL Server, add the OnPipelineRowsSent event. You can then determine where it is spending all of its time. See this post. Your IO subsystem getting slammed and generating all these temp files is because you are no longer able to keep all the information in memory (due to your async transformations). – Alishiaalisia
@Alishiaalisia Good stuff! I had enabled logging for OnProgress, but adding OnPipelineRowsSent and using the scripts at that page (though I had to update them for SSIS 2008 vs. 2005, I think) are giving me some more visibility. It looks like the bog-down is in my "Merge Join" component, which is processing ~60 records per second. I'll have to look into the buffer settings again…but I thought it would automatically scale up (and my whole dataset should fit in under the 100MB limit). Still baffled as to why this only happens on the one server! – Horsetail
@Alishiaalisia Could you post your last comment as an answer? Ultimately this led me to the solution. – Horsetail
@Grearson So what was the solution? Can you add that? – Sirenasirenic
@Sirenasirenic Elaborated in a comment on the accepted answer... – Horsetail

If you have logging turned on, preferably to SQL Server, add the OnPipelineRowsSent event. You can then determine where it is spending all of its time. See this post. Your IO subsystem getting slammed and generating all these temp files is because you are no longer able to keep all the information in memory (due to your async transformations).

The relevant query from the linked article is the following. It looks at events in the sysdtslog90 (SQL Server 2008+ users substitute sysssislog) and performs some time analysis on them.

;
WITH PACKAGE_START AS
(
    SELECT DISTINCT
        Source
    ,   ExecutionID
    ,   Row_Number() Over (Order By StartTime) As RunNumber
    FROM
        dbo.sysdtslog90 AS L
    WHERE
        L.event = 'PackageStart'
)
, EVENTS AS
(
    SELECT
        SourceID
    ,   ExecutionID
    ,   StartTime
    ,   EndTime
    ,   Left(SubString(message, CharIndex(':', message, CharIndex(':', message, CharIndex(':', message, CharIndex(':', message, 56) + 1) + 1) + 1) + 2, Len(message)), CharIndex(':', SubString(message, CharIndex(':', message, CharIndex(':', message, CharIndex(':', message, CharIndex(':', message, 56) + 1) + 1) + 1) + 2, Len(message)) ) - 2) As DataFlowSource
    ,   Cast(Right(message, CharIndex(':', Reverse(message)) - 2) As int) As RecordCount
    FROM
        dbo.sysdtslog90 AS L
    WHERE
        L.event = 'OnPipelineRowsSent'
)
, FANCY_EVENTS AS
(
    SELECT
        SourceID
    ,   ExecutionID
    ,   DataFlowSource
    ,   Sum(RecordCount) RecordCount
    ,   Min(StartTime) StartTime
    ,   (
            Cast(Sum(RecordCount) as real) /
            Case
                When DateDiff(ms, Min(StartTime), Max(EndTime)) = 0
                    Then 1
                Else DateDiff(ms, Min(StartTime), Max(EndTime))
            End
        ) * 1000 As RecordsPerSec
    FROM
        EVENTS DF_Events
    GROUP BY
        SourceID
    ,   ExecutionID
    ,   DataFlowSource
)
SELECT
    'Run ' + Cast(RunNumber As varchar) As RunName
,   S.Source
,   DF.DataFlowSource
,   DF.RecordCount
,   DF.RecordsPerSec
,   Min(S.StartTime) StartTime
,   Max(S.EndTime) EndTime
,   DateDiff(ms, Min(S.StartTime), Max(S.EndTime)) Duration
FROM
    dbo.sysdtslog90 AS S
    INNER JOIN
        PACKAGE_START P
        ON S.ExecutionID = P.ExecutionID
    LEFT OUTER JOIN
        FANCY_EVENTS DF
        ON S.SourceID = DF.SourceID
        AND S.ExecutionID = DF.ExecutionID
WHERE
    S.message <> 'Validating'
GROUP BY
    RunNumber
,   S.Source
,   DataFlowSource
,   RecordCount
,   DF.StartTime
,   RecordsPerSec
,   Case When S.Source = P.Source Then 1 Else 0 End
ORDER BY
    RunNumber
,   Case When S.Source = P.Source Then 1 Else 0 End Desc
,   DF.StartTime
,   Min(S.StartTime);

You were able to use this query to discern that the Merge Join component was the lagging component. Why it performs differently between the two servers, I can't say at this point.

If you have the ability to create a table in your destination system, you could modify your process to have two data flows (and eliminate the costly async components).

  1. The first data flow would take the Flat file and Derived columns and land that into a staging table.
  2. You then have an Execute SQL Task fire off to handle the Get Min Date + Delete logic (a sketch of that statement follows this list).
  3. Then you have your second data flow querying from your staging table and snapping it right into your destination.
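
A minimal sketch of what the Execute SQL Task in step 2 might run, assuming a hypothetical dbo.EventStaging table loaded by the first data flow and the EventTable/ApptTime names mentioned in the question's comments:

-- Step 2 (hypothetical): find the earliest incoming appointment time in staging,
-- then clear the overlapping rows from the destination before step 3 re-inserts them.
DECLARE @MinAppt datetime;               -- assumed type

SELECT @MinAppt = MIN(ApptTime)
FROM   dbo.EventStaging;                 -- hypothetical staging table from step 1

DELETE FROM dbo.EventTable
WHERE  ApptTime >= @MinAppt;
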
Alishiaalisia answered 24/12, 2013 at 14:55 Comment(1)
Yep, you were right about not being able to hold everything in memory. But most of the stuff around the 'net about optimizing your buffers talks about maxing them out, size-wise...in my case, making the buffers smaller did the trick. My inference is that because memory pressure conditions were greater on the production server the buffers weren't fitting in memory. So, I actually reduced DefaultBufferMaxRows from the default 10,000 to 1,000. If I had the RAM headroom, the larger buffers would probably make it faster--but since the buffers were too big, they wedged up the whole system. – Horsetail
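
A rough way to sanity-check a DefaultBufferMaxRows choice like this (illustrative only; dbo.EventTable is the table named in the comments) is to total the destination table's declared column widths and compare max_row_bytes x DefaultBufferMaxRows against DefaultBufferSize. The pipeline's actual row width can differ from the table's declared width, but this gives a ballpark:

-- Worst-case bytes per row for the destination table (illustrative query).
-- max_length is in bytes; (max)-typed columns report -1 and would skew the total.
SELECT SUM(c.max_length) AS max_row_bytes
FROM   sys.columns AS c
WHERE  c.object_id = OBJECT_ID(N'dbo.EventTable');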

The steps below will help improve your SSIS performance.

  • Ensure connection managers all have DelayValidation set to True.
  • Ensure that ValidateExternalMetadata is set to False.
  • Set DefaultBufferMaxRows and DefaultBufferSize to correspond to the table's row sizes.
  • Use query hints wherever possible (see the sketch below): http://technet.microsoft.com/en-us/library/ms181714.aspx
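
To make the last bullet concrete: the only ad-hoc statement in this package (per the comments below) is the parameterized delete, presumably run by the "Delete records" component, so a query hint there would look something like the following. Whether any hint actually helps depends on the workload, so treat it as an illustration rather than a recommendation.

-- Illustration only: the package's delete (quoted in the comments) with a query
-- hint appended. OPTION (RECOMPILE) requests a fresh plan on each execution.
DELETE FROM EventTable
WHERE  ApptTime >= ?          -- ? is the OLE DB parameter marker
OPTION (RECOMPILE);
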
Syllabi answered 20/12, 2013 at 14:54 Comment(3)
Okay, 1: Did that already. 2: Hmm…haven't tried that, will give it a shot. 3: Can you elaborate on that just a bit? I have already optimized my row sizes down so that more rows will fit in the buffer…but I'm unclear how to do this effectively…I looked into this once, but came away with the conclusion that slimming down the row byte size was really the only important thing here. 4: My only SQL command in the whole task is "DELETE FROM EventTable WHERE ApptTime >= ?"…so I'm not sure how much I can optimize that. – Horsetail
Documentation for "ValidateExternalMetadata" states: "Gets or sets a value that controls whether a component validates its column metadata against its external data source at design time." Doesn't that mean it shouldn't have any effect at runtime/outside of BIDS? And did you mean to do that on every single component? Or sources and destinations? – Horsetail
To be totally honest, I am not extremely familiar with the ValidateExternalMetadata property. This article seems to shed some light on some perf tweaks that may help you further: technet.microsoft.com/library/Cc966529 – Syllabi
