How to test (unit test) an ETL process?

5

16

I know several small companies that do not test their ETL processes, but that seems suboptimal from a software engineering perspective.

How do people usually do testing/unit tests/functional tests on an ETL process?

Christianity answered 14/6, 2016 at 10:15 Comment(2)
You'd need to generate an empty test database, set up test cases (in the data sources), run your ETL, then test the resulting data in the target test database. Much more convoluted than an application unit test, which is why it isn't done much. – Ceto
@Nick.McDermaid thanks for the information. – Christianity
6

Testing an ETL is usually a problem. More precisely, the testing itself isn't the problem; the problem is how to get reasonable test data. An ETL is typically tested on production data. Aside from the security issue, the problem with production data is that it does not cover the functionality of the ETL sufficiently (typically about 40% of business rules are not covered by a production data sample) and it takes too much time to process.

We recently developed a test data generator (for more detail, look for GTL QAceGen: Business Logic Driven Data Generator on the Informatica Marketplace) which generates test data into source tables/files based on a business rule specification. The tool takes into consideration any foreign keys applied, and it works with any major ETL tool and database.

This tool helps speed up the testing cycle by at least 50% (compared to manual testing) and covers 100% of the business rules. It also generates quite detailed reports and, more importantly, these tests can be repeated at any time (i.e. regression tests).
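
GTL QAceGen is a commercial product and its internals are not described here. Purely to illustrate the general idea of business-rule-driven test data that still respects foreign keys, a toy Python sketch; every table, rule and value below is hypothetical, not the tool's implementation:

```python
import random

# Toy rule-driven test data generation: one order per business rule, each order
# pointing at an existing customer so the orders -> customers foreign key holds.
BUSINESS_RULES = ["discount_over_1000", "foreign_currency", "cancelled_order"]

def generate_test_data(n_customers: int = 5):
    customers = [{"customer_id": i, "name": f"customer_{i}"} for i in range(1, n_customers + 1)]
    orders = []
    for order_id, rule in enumerate(BUSINESS_RULES, start=1):
        orders.append({
            "order_id": order_id,
            "customer_id": random.choice(customers)["customer_id"],  # FK stays valid
            "rule_exercised": rule,
            "amount": 1500 if rule == "discount_over_1000" else 100,
        })
    return customers, orders

customers, orders = generate_test_data()
print(f"{len(customers)} customers, {len(orders)} orders covering {len(BUSINESS_RULES)} rules")
```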

Cat answered 4/11, 2016 at 11:18 Comment(0)
12

We recently worked on a project where the governance board demanded 'You must have Unit Tests' and so we tried our best.

What worked for us was to have each ETL solution start and end with a QA/test package.

Anything unexpected discovered by these packages was logged into an audit table, and a Fail Package event was then raised to stop the entire job. We figured it was better to run with yesterday's good data than to risk reporting against possibly bad 'today' data.

The starting package would do database schema and data sanity checks. The data sanity checks looked for duplicate or missing data caused by a lack of referential integrity in the source systems. The schema checks ensured that any schema changes that did not get applied during continuous integration were detected.
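
A minimal sketch of the kind of sanity checks such a starting package might run, written as "zero rows = pass" SQL driven from Python; the staging tables and columns (stg_orders, stg_customers, customer_id) are hypothetical and not from the original answer:

```python
import sqlite3

# Hypothetical sanity checks: duplicates in a staging table, and child rows
# whose parent is missing (broken referential integrity in the source system).
SANITY_CHECKS = {
    "duplicate_orders": """
        SELECT order_id, COUNT(*) AS n
        FROM stg_orders
        GROUP BY order_id
        HAVING COUNT(*) > 1
    """,
    "orphaned_orders": """
        SELECT o.order_id
        FROM stg_orders o
        LEFT JOIN stg_customers c ON c.customer_id = o.customer_id
        WHERE c.customer_id IS NULL
    """,
}

def run_checks(conn: sqlite3.Connection, checks: dict) -> bool:
    """Run each check, log the result to an audit table, return overall pass/fail."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS etl_audit "
        "(check_name TEXT, bad_rows INTEGER, run_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    all_ok = True
    for name, sql in checks.items():
        bad_rows = len(conn.execute(sql).fetchall())
        conn.execute("INSERT INTO etl_audit (check_name, bad_rows) VALUES (?, ?)", (name, bad_rows))
        if bad_rows:
            all_ok = False  # caller raises the equivalent of a "Fail Package" event
    conn.commit()
    return all_ok
```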

The end package would check the results of any transformations. These included:

  • Comparing record counts between source and destination
  • Checking specific transforms (eg: all date values changed to appropriate SK value, all string values RTrimed)
  • Ensuring all SK fields were populated (-1 instead of nulls)

Most of these tests were SQL statements that used the built-in schema objects of our database, so they were not too onerous to create.
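
For illustration, a couple of post-load checks in the same "zero rows = pass" style, which the runner sketched earlier could execute and log to the audit table; the fact table and surrogate key columns (fact_orders, order_date_sk, customer_sk) are hypothetical:

```python
# Hypothetical end-of-ETL result checks of the kind listed above.
RESULT_CHECKS = {
    # every source row should have landed in the destination fact table
    "rows_missing_from_destination": """
        SELECT o.order_id
        FROM stg_orders o
        LEFT JOIN fact_orders f ON f.order_id = o.order_id
        WHERE f.order_id IS NULL
    """,
    # surrogate key columns must be populated (-1 for unknown, never NULL)
    "null_surrogate_keys": """
        SELECT order_id
        FROM fact_orders
        WHERE order_date_sk IS NULL OR customer_sk IS NULL
    """,
}
```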

In addition, as part of our development process we would create views that had the end result of any transformations we were doing. We would make use of these views to validate our package transformations.

Each of these checks created a record in our special audit table. That way we could provide a comprehensive list of all the tests and checks we had done on each run of the process, to satisfy the governance people.

(We also had a separate set of packages that would unit test each QA test by creating dummy tables, populating them, running the test, then confirming the appropriate audit record was written. As Nick stated, this was a lot of work and of little real value.)

Illegal answered 29/6, 2016 at 6:11 Comment(0)
8

You can unit test ETLs.

End-to-end tests are good, but slow, expensive and difficult to construct and keep stable.

Unit testing ETLs is highly desirable in order to test all data permutations, but it is generally put in the too-hard basket. However, it is possible to write true unit tests for ETLs that run quickly and reliably.

We have found that the key is to decompose the ETL into two separate parts. Since an ETL is Extract-Transform-Load, the trick is to separate the T from the E&L: make a pure Transform function that transforms an input dataset into an output dataset, then call this function from the Extract and Load module.

The Extract and Load module isn't suitable for unit testing because it will generally involve external data sources and sinks, access tokens and user permissions, etc.

But all of the testable logic should be in the Transform component. Test this function from any unit testing framework - you will be able to pass in predefined datasets and test the transformed output against expected results. With some thinking we have even managed to create unit tests that test multi-stage updates of datasets onto each other.

Our particular implementation was done on Databricks in Scala, but the concept should work on any platform.
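
The answer's implementation was Scala on Databricks; as a language-neutral illustration, here is a minimal Python sketch of the same idea: a pure Transform function plus a unit test that needs no database or cluster. The column names and cleansing rules are hypothetical:

```python
import unittest
from datetime import date

def transform(orders: list) -> list:
    """Pure Transform step: no I/O, so it can be unit tested with in-memory data."""
    out = []
    for row in orders:
        out.append({
            "order_id": row["order_id"],
            # trim strings, as a typical cleansing rule
            "customer_name": row["customer_name"].strip(),
            # derive a surrogate-key-style value; -1 marks a missing date
            "order_date_sk": int(row["order_date"].strftime("%Y%m%d")) if row["order_date"] else -1,
        })
    return out

class TransformTest(unittest.TestCase):
    def test_trims_names_and_handles_missing_dates(self):
        source = [
            {"order_id": 1, "customer_name": "  Alice ", "order_date": date(2020, 8, 16)},
            {"order_id": 2, "customer_name": "Bob", "order_date": None},
        ]
        result = transform(source)
        self.assertEqual(result[0]["customer_name"], "Alice")
        self.assertEqual(result[0]["order_date_sk"], 20200816)
        self.assertEqual(result[1]["order_date_sk"], -1)

if __name__ == "__main__":
    unittest.main()
```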

Geranium answered 16/8, 2020 at 23:23 Comment(1)
Keeping the T separate will promote modular code. How does the development practice evolve when running notebook-style? – Leontine
3

We've set up a system where for each ETL procedure we have defined an input dataset and an expected result dataset. Then we have created a system which, utilizing Robot Framework, runs three-part tests for each ETL procedure where the first part inserts the input dataset into the source data tables, the second part runs the ETL, and the third part compares the actual results with our expected results.
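
The answer drives this with Robot Framework; as a rough, self-contained Python sketch of the same three-part structure (seed the source tables, run the ETL, diff the result against the expected dataset), with hypothetical table names and a trivial stand-in ETL:

```python
import sqlite3

def run_three_part_test(run_etl, input_rows, expected_rows):
    """Sketch of the three-part test. `run_etl` is whatever callable
    kicks off the ETL procedure against the test database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE src_orders (order_id INTEGER, amount REAL)")
    conn.execute("CREATE TABLE dw_orders  (order_id INTEGER, amount REAL)")

    # 1. insert the input dataset into the source tables
    conn.executemany("INSERT INTO src_orders VALUES (?, ?)", input_rows)

    # 2. run the ETL procedure under test
    run_etl(conn)

    # 3. compare the actual result with the expected dataset
    actual = sorted(conn.execute("SELECT * FROM dw_orders").fetchall())
    assert actual == sorted(expected_rows), f"mismatch: {actual}"

# Example usage with a trivial stand-in ETL that just copies rows across:
run_three_part_test(
    run_etl=lambda conn: conn.execute("INSERT INTO dw_orders SELECT * FROM src_orders"),
    input_rows=[(1, 10.0), (2, 20.0)],
    expected_rows=[(1, 10.0), (2, 20.0)],
)
```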

This works pretty well for us, but there are a couple of downsides: first, we create the test datasets manually for each ETL procedure, which takes some work; second, it means that testing for "unexpected" inputs is not done.

For the automated unit testing we have a separate environment in which we can install builds of our entire DW automatically.

Bensen answered 1/7, 2016 at 4:19 Comment(0)
0

Testing in the ETL process fits into the following stages:

  1. Identify Business requirements
  2. Validate Data sources
  3. Prepare test cases
  4. Extract Data from different sources
  5. Apply transformation logic to validate data
  6. Load data into the destination
  7. Reporting analysis

We can also categorize the ETL testing process as follows:

  • Product validation
  • Source-to-target data testing
  • Metadata testing
  • Performance testing
  • Integration and quality testing
  • Report testing

Dialectical answered 13/8, 2020 at 5:16 Comment(0)
