How to load data faster with Talend and SQL Server

I use Talend to load data into a SQL Server database.

It appears that the weakest point of my job is not the data processing but the actual load into the database, which runs no faster than 17 rows/sec.

The odd thing is that I can launch 5 jobs at the same time, and they will each load at 17 rows/sec.

What could explain this slowness and how could I improve the speed?

Thanks

New information:

The transfer speed between my desktop and the server is about 1 MByte/s

My job commits every 10,000 rows

I use SQL Server 2008 R2

And the schema I use for my jobs is like this:

(screenshot of the job layout)

Lacee asked 14/4, 2014 at 15:30
Does Talend have a "Bulk Load", "Bulk Insert" or "Bulk Copy" feature? – Luisaluise
And check your hardware. 17 rows/sec is pathetic even for single-threaded individual SQL statements. Maybe your server is more like a mobile phone or a laptop. What is the latency between the app and the server? – Voccola
The slowness could be due to any number of factors, such as the network speed between Talend and your DB or the way the job is written in Talend. Talend does have bulk load components for SQL Server, but even with the normal SQL Server components the performance you are experiencing is very poor. I would suggest you add more details: what SQL are you running, is any processing being done on the Talend side, which Talend components are you using in the job, etc. – Uniocular
You should look for the "commit every n rows" feature. It's available on most of the outputs. – Artless
I already commit every 10,000. And my server is not super strong, but it's not a smartphone either. I'll try to be more specific in my question. – Lacee
Can you add a screenshot of your job setup and screenshots of how you have your SQL Server components configured? – Babiche
Can we also see the configuration of the tXMLMap? I can't see a reason why this is not simply a tMap, as neither the input nor the output appears to be XML, but that shouldn't be your performance problem. – Babiche

Database INSERT OR UPDATE methods are incredibly costly because the database cannot batch all of the commits and execute them at once; it must process them row by row (ACID transactions force this: if an insert were attempted and failed, all of the other records in that commit would have to fail with it).

Instead, for large bulk operations it is always best to determine in advance whether each record will be inserted or updated before sending anything to the database, and then send two separate batches: one of INSERTs and one of UPDATEs.

A typical job needing this functionality assembles the data that is to be INSERTed OR UPDATEd and then queries the database table for the existing primary keys. If a primary key already exists, you send that record as an UPDATE; otherwise it is an INSERT. The logic for this can easily be done in a tMap component.
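The existence check itself is just a keyed lookup. A minimal sketch of the sort of query that lookup step might run, assuming a destination table `destination` with a primary key column `PK` (both names are placeholders):

```sql
-- Hypothetical lookup: return only the primary keys that already exist,
-- limited to the keys present in the incoming batch.
SELECT destination.PK
FROM destination
WHERE destination.PK IN (/* keys from the incoming batch */);
```

Rows whose key comes back match the lookup (so they become UPDATEs); rows with no match are the join rejects (so they become INSERTs).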

Insert or Update Job Example

In this job we have some data that we wish to INSERT OR UPDATE into a database table that contains some pre-existing data:

Initially loaded data

And we wish to add the following data to it:

Insert or Update data

The job works by throwing the new data into a tHashOutput component so it can be used multiple times in the same job (it simply holds it in memory, or for large volumes can cache it to disk).

Following on from this, one copy of the data is read out of a tHashInput component and straight into a tMap. Another tHashInput component is used to run a parameterised query against the table:

Parameterised Query Parameter Config

You may find this guide to Talend and parameterised queries useful. From here, the returned records (i.e. only the ones already in the database) are used as a lookup to the tMap.

This is then configured as an INNER JOIN to find the records that need to be UPDATEd, with the rejects from the INNER JOIN being the ones to INSERT:

tMap configuration

These outputs then flow to separate tMySQLOutput components to UPDATE or INSERT as necessary. Finally, when the main subjob is complete, we commit the changes.

Babiche answered 15/4, 2014 at 13:26
Thanks, I'll look into it and come back with my solution (and timing results). – Lacee
I'm just building a quick job to show you how it might be done. About to upload screenshots, so that might help you. – Babiche
This job looks nice, but I think you can do the same with a staging table. Check my answer. – Artless
Thank you so much for taking the time to show your job. I've tried to do the same but I'm a bit confused. What kind of items are the "Records in DB", "Insert" and "Update" items? The ones with a table and a magnifying glass? I'm also confused by the tMysqlCommit; what is the point of this item? Thanks again =) – Lacee
That component is just a tLogRow. You can use it to dump the data to the console, which is pretty useful for debugging and seeing the state of the data as it passes through your job. It does have performance implications though, so in production you'll want to deactivate them where possible, or remove them where that isn't possible (components that rely on strictly bound input row names will fail because they're expecting the data from the tLogRow). The tMySQLCommit is used to commit the data; you should also have a tMySQLRollback component linked ON ERROR for ACID transactions. – Babiche
Thanks for your answer, and sorry for the delay, I was on holiday. I'm still encountering problems with that job. Is there any way I could share my job or screenshots to show you where I'm stuck? It does not seem possible right here :/ Thanks again so much for your help. – Lacee
@Lacee you could join me in chat, I guess. I've not actually used the SE chat much before, but I've just set up a Talend room and I'm sitting in it now. We should be able to discuss your problem further there, and then I can update my answer if it might help other people with a similar problem. – Babiche

I think that @ydaetskcoR's answer is perfect from a theoretical point of view (divide the rows that need an INSERT from those that need an UPDATE) and gives you a working ETL solution, useful for small datasets (a few thousand rows).

Performing the lookup to decide whether a row has to be updated or not is costly in ETL, as all the data goes back and forth between the Talend machine and the DB server.

When you get to hundreds of thousands or even millions of records, you have to move from ETL to ELT: you just load your data into some temporary (staging) table, as suggested by @Balazs Gunics, and then use SQL to manipulate it.

In this case, after loading your data (INSERT only = fast, even faster using the BULK LOAD components), you issue a LEFT OUTER JOIN between the temp table and the destination one to separate the rows that are already there (and need an UPDATE) from the others.

This query will give you the rows you need to insert:

SELECT staging.* FROM staging
LEFT OUTER JOIN destination ON (destination.PK = staging.PK)
WHERE destination.PK IS NULL

This other one the rows you need to update:

SELECT staging.* FROM staging
LEFT OUTER JOIN destination ON (destination.PK = staging.PK)
WHERE destination.PK IS NOT NULL

This will be orders of magnitude faster than ETL, but you will need to use SQL to operate on your data, whereas in ETL you can use Java, since ALL the data is brought to the Talend server. So a common pattern is a first step on the local machine to pre-process the data in Java (to clean and validate it), then fire everything at the DB, where you use joins to load it the right way.

Here are the ELT job screenshots. (INSERT or UPDATE ELT job)

How to distinguish between rows to insert or update

Nutrient answered 31/10, 2014 at 10:47

I've found where this performance problem comes from.

I do an INSERT OR UPDATE; if I replace it with a simple INSERT, the speed goes up to 4000 rows/sec.

Does it seem like an acceptable pace?

Anyway, I need my INSERT OR UPDATE, so I guess I'm stuck.
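For what it's worth, SQL Server 2008 also supports a native set-based upsert, MERGE, which can replace row-by-row INSERT OR UPDATE once the data has been bulk-loaded into a staging table (e.g. fired from a t*SQLRow component). This is only a sketch; `staging`, `destination`, `PK`, `col1` and `col2` are hypothetical names:

```sql
-- Set-based upsert: one statement instead of one INSERT/UPDATE per row
MERGE destination AS d
USING staging AS s
    ON d.PK = s.PK
WHEN MATCHED THEN
    UPDATE SET d.col1 = s.col1,
               d.col2 = s.col2
WHEN NOT MATCHED THEN
    INSERT (PK, col1, col2)
    VALUES (s.PK, s.col1, s.col2);
```

Because the whole operation is one statement, the server can optimise it as a single joined pass over both tables rather than a round trip per row.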

Lacee answered 15/4, 2014 at 8:10
I'm not familiar with Talend, but can you split this into two separate INSERT/UPDATE steps/jobs? One wholly doing inserts, the other just doing updates (for records which you have pre-checked for update). – Broderic
What about using "Insert or update on duplicate key or unique index"? This executes the MySQL command INSERT ... ON DUPLICATE KEY UPDATE. Unfortunately I have not found a way (yet) to make this operate on more than one row at once, but at least it should execute a lot faster than "Insert or Update", since reads are not required. – Piracy

Based on your note that inserts are an order of magnitude faster than updates (4000 vs 17 rows/sec), it looks like you need to look at your DB indexes. Adding an index that matches your update parameters could speed up your updates significantly. Of course, this index may slow your inserts a bit.

You can also look at the query execution plan for your update query to see whether it is using any indexes. See: How do I obtain a Query Execution Plan?
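As a sketch, one quick way to see the plan in SQL Server is SET SHOWPLAN_TEXT, which prints the estimated plan instead of executing the statement (the table and column names below are hypothetical stand-ins for your own):

```sql
-- Show the estimated plan for the UPDATE without actually running it
SET SHOWPLAN_TEXT ON;
GO
UPDATE destination
SET    some_column = 'new value'
WHERE  lookup_field = 'key value';  -- a Table Scan in the plan means no usable index
GO
SET SHOWPLAN_TEXT OFF;
GO
```

If the plan shows a Table Scan or Clustered Index Scan on the WHERE columns, an index on those columns is the likely fix.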

Montserrat answered 16/4, 2014 at 11:32
The real problem is that you can't batch your UPSERTs, so you need to insert/update one row at a time. – Artless
True. But the 17/sec update performance itself can easily be increased by the right index, without increasing complexity. – Montserrat

You should create a staging table, where you insert the rows.

Based on this staging table, you do a DELETE query with t*SQLrow:

DELETE FROM target_table
WHERE target_table.id IN (SELECT id FROM staging_table);

So the rows you wanted to update no longer exist.

INSERT INTO target_table 
SELECT * FROM staging_table;

This will move all the new/modified rows.
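If both statements can be issued from the same session (e.g. a single t*SQLrow call), wrapping them in one transaction keeps the delete-then-insert atomic, so a mid-job failure cannot leave rows deleted but not re-inserted. A sketch, reusing the table names above:

```sql
BEGIN TRANSACTION;

-- Remove the old versions of the rows being replaced
DELETE FROM target_table
WHERE target_table.id IN (SELECT id FROM staging_table);

-- Re-insert the new/modified rows from staging
INSERT INTO target_table
SELECT * FROM staging_table;

COMMIT TRANSACTION;  -- or ROLLBACK on error, so no rows are lost mid-way
```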

Artless answered 17/4, 2014 at 7:26
It's a nice idea, but you lose ACID transactions this way. What happens if your job falls over for whatever reason after deleting the rows? For this to work you MUST commit the delete before you do your insert, so you can't roll back on a job failure. If you don't care about this then yeah, it's a simpler and, I think, nicer way to do this, but I would always want ACID transactions for things like this. – Babiche

I was having the same issue loading data into a DB2 server. I too had the commit interval set at 10,000, but once I selected the batch option (on the same component options screen), performance dramatically improved. When I moved the commit and batch sizes to 20,000, the job went from 5 hours to under 2 minutes.

Zigzagger answered 29/5, 2015 at 19:55

I had the same problem and solved it by defining an index on the target table.

Usually, the target table has an id field which is its primary key and hence indexed, so all sorts of joins against it work just fine. But the update from a flat file is done by some data fields, so each update statement has to do a full table scan.

The above also explains why it works fast with INSERT and becomes slow with INSERT OR UPDATE.
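A sketch of such an index, assuming the update matches rows on two data fields (`field_a` and `field_b` are placeholders for whatever your flat file keys on):

```sql
-- Index on the columns used in the UPDATE's WHERE clause,
-- so each update becomes an index seek instead of a full table scan
CREATE NONCLUSTERED INDEX IX_target_update_keys
    ON target_table (field_a, field_b);
```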

Zigmund answered 8/5, 2018 at 22:7

I recommend a couple of simple things:

  1. Where possible / reasonable, change from ETL to ELT.
  2. Set up a CDC process and only handle the changes. Depending on the database and needs, this can be handled (a) directly on the database, (b) through automated Talend functionality (needs a subscription), (c) manually via SQL (full outer join) and a custom Java function that generates an MD5 hash, or (d) manually via SQL (full outer join) and the tAddCRCRow component.
  3. Where possible, load multiple tables concurrently.
  4. Where possible, use bulk loading for tables.
  5. Sometimes, a clear and load is acceptable as an approach and faster than checking for updates.
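Option (c) above can be sketched in SQL alone. Assuming staging and destination tables that share a key `PK` and payload columns `col1`/`col2` (all hypothetical names, and note the concatenation below ignores NULL handling for brevity), HASHBYTES makes the comparison cheap:

```sql
-- Classify each row as INSERT, UPDATE or DELETE by comparing MD5 row hashes
SELECT COALESCE(s.PK, d.PK) AS PK,
       CASE
           WHEN d.PK IS NULL THEN 'INSERT'   -- only in staging
           WHEN s.PK IS NULL THEN 'DELETE'   -- only in destination
           ELSE 'UPDATE'                     -- in both, but the hash differs
       END AS change_type
FROM staging s
FULL OUTER JOIN destination d
    ON d.PK = s.PK
WHERE d.PK IS NULL
   OR s.PK IS NULL
   OR HASHBYTES('MD5', s.col1 + '|' + s.col2)
   <> HASHBYTES('MD5', d.col1 + '|' + d.col2);
```

Unchanged rows fail every WHERE condition and drop out, so only the actual changes are carried forward to the load step.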
Colporteur answered 2/11, 2018 at 15:40
