Newly inserted or updated row count in pentaho data integration
Asked Answered
K

3

7

I am new to Pentaho Data Integration; I need to integrate one database to another location as ETL Job. I want to count the number of insert/updat during the ETL job, and insert that count to another table . Can anyone help me on this?

Kinslow answered 20/10, 2015 at 7:18 Comment(0)
B
5

I don't think that there's a built-in functionality for returning the number of affected rows of an Insert/Update step in PDI to date.

Nevertheless, most database vendors are able to provide you with the ability to get the number of affected rows from a given operation.

In PostgreSQL, for instance, it would look like this:

/* Count affected rows from INSERT */
WITH inserted_rows AS (
    INSERT INTO ...
    VALUES
        ...
    RETURNING 1
)
SELECT count(*) FROM inserted_rows;

/* Count affected rows from UPDATE */
WITH updated_rows AS (
    UPDATE ...
    SET ...
    WHERE ...
    RETURNING 1
)
SELECT count(*) FROM updated_rows;

However, you're aiming to do that from within a PDI job, so I suggest that you try to get to a point where you control the SQL script.

Suggestion: Save the source data in a file on the target DB server, then use it, perhaps with a bulk loading functionality, to insert/update, then save the number of affected rows into a PDI variable. Note that you may need to use the SQL script step in the Job's scope.

EDIT: the implementation is a matter of chosen design, so the suggested solution is one of many. On a very high level, you could do something like the following.

  • Transformation I - extract data from source
    • Get the data from the source, be it a database or anything else
    • Prepare it for output in a way that it fits the target DB's structure
    • Save a CSV file using the text file output step on the file system
  • Parent Job
    • If the PDI server is the same as the target DB server:
      • Use the Execute SQL Script step to:
        • Read data from the file and perform the INSERT/UPDATE
        • Write the number of affected rows into a table (ideally, this table could also contain the time-stamp of the operation so you could keep track of things)
    • If the PDI server is NOT the same as the target DB server:
      • Upload the source data file to the server, e.g. with the FTP/SFTP file upload steps
      • Use the Execute SQL Script step to:
        • Read data from the file and perform the INSERT/UPDATE
        • Write the number of affected rows into a table

EDIT 2: another suggested solution

As suggested by @user3123116, you can use the Compare Fields step (if not part of your environment, check the marketplace for it).

The only shortcoming I see is that you have to query the target database before inserting/updating, which is, of course, less performant.

Eventually it could look like so (note that this is just the comparison and counting part): field compare

Also note that you can split the input of the source data stream (COPY, not DISTRIBUTE), and do your insert/update, but this stream must wait for the stream of the field comparison to end the query on the target database, otherwise you might end up with the wrong statistics.

Beane answered 20/10, 2015 at 8:2 Comment(0)
B
3

The "Compare Fields" step will take 2 streams as input for comparison, and its output is 4 distinct streams for "Identical", Changed", "Added", and "Removed" records. You can count those 4, and then process the "Changed", "Added", and "Removed" records with an Insert/Update.

Betthezul answered 21/10, 2015 at 15:30 Comment(1)
Correct @user3123116, there's that solution, too. Some versions don't have this step out of the box and you should also note that this requires an extra query on the target data and therefore will be less performant.Beane
B
2

You can do it from the Logging option inside the Transformation settings. Please follow the below steps :

  1. Click on Edit menu --> Settings
  2. Switch to Logging Tab
  3. Select Step from the left menu
  4. Provide the Log Connection & Log table name(Say StepLog)
  5. Select the required fields for logging(LINES_OUTPUT - for inserted count & LINES_UPDATED - for updated count)
  6. Click on SQL button and create the table by clicking on the Execute button
  7. Now all the steps will be logged into the Log table(StepLog), you can use it for further actions.
  8. Enjoy
Burress answered 26/10, 2015 at 6:30 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.