Slow insert and update commands during MySQL to Redshift replication

I am trying to build a replication server from MySQL to Redshift by parsing the MySQL binlog. For the initial replication I take a dump of the MySQL table, convert it into a CSV file, upload it to S3 and then run the Redshift COPY command. The performance of this step is good.
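For reference, the COPY for the initial load looks roughly like this (the table name, bucket path and IAM role below are placeholders):

    -- Initial bulk load: one COPY per table, reading the CSV dump from S3.
    -- "orders", the bucket path and the IAM role ARN are placeholders.
    COPY orders
    FROM 's3://my-bucket/initial-dump/orders.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-load-role'
    FORMAT AS CSV
    IGNOREHEADER 1;   -- drop this line if the dump has no header row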

After the initial replication, for the continuous sync, I read the binlog and the inserts and updates have to be run sequentially, which is very slow.

Is there anything that can be done to increase the performance?

One possible solution I can think of is to wrap the statements in a transaction and send the whole transaction at once, to avoid multiple network calls. But that would not address the problem that single UPDATE and INSERT statements in Redshift run very slowly; a single UPDATE statement is taking 6 s. Knowing the limitations of Redshift (it is a columnar database, so single-row insertion is slow), what can be done to work around them?
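For illustration, the set-based alternative would presumably look something like the following: accumulate a batch of binlog changes into a CSV on S3, COPY it into a staging table and merge it in one transaction, instead of issuing row-by-row statements (the table, its key column, the bucket path and the IAM role are all placeholders):

    -- Apply one micro-batch of binlog changes set-wise inside a transaction.
    BEGIN;

    -- Staging table with the same columns as the target.
    CREATE TEMP TABLE orders_staging (LIKE orders);

    -- Bulk-load the batch that was written to S3.
    COPY orders_staging
    FROM 's3://my-bucket/binlog-batches/orders-000123.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-load-role'
    FORMAT AS CSV;

    -- Remove existing versions of the changed rows, then insert the new ones.
    DELETE FROM orders
    USING orders_staging
    WHERE orders.id = orders_staging.id;

    INSERT INTO orders
    SELECT * FROM orders_staging;

    COMMIT;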


Edit 1: Regarding DMS: I want to use Redshift as a warehousing solution that simply replicates our MySQL continuously; I don't want to denormalise the data, since I have 170+ tables in MySQL. During ongoing replication DMS shows many errors multiple times a day and fails completely after a day or two, and it is very hard to decipher the DMS error logs. Also, when I drop and reload tables, it deletes the existing tables on Redshift, creates new ones and then starts inserting data, which causes downtime in my case. What I wanted was to create a new table, switch the old one with the new one and then delete the old table.
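In Redshift terms, the swap I have in mind would be something like this (table names are placeholders; the fresh copy is assumed to have been loaded into orders_new already):

    -- Switch the freshly loaded table into place, then drop the old one.
    BEGIN;
    ALTER TABLE orders RENAME TO orders_old;
    ALTER TABLE orders_new RENAME TO orders;
    COMMIT;
    DROP TABLE orders_old;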

Mesothorax answered 16/11, 2017 at 15:22 Comment(5)
Have you coded this yourself? You need to stream your binlog to S3 and then write something like a mini-batch script, or just use AWS DMS, which does all of that for you!Papaveraceous
You MUST NOT use insert/update/delete statements for this - it will be too slow in all circumstances!Papaveraceous
Earlier I was using AWS DMS only, but its performance was not satisfactory. I have to drop and reload the table every time there is an error. Also, I wanted to include an exception notifier for this.Mesothorax
Performance on DMS? What was the problem? It uses the binlog for ongoing replication. I can help with the DMS if you update your question accordingly. Tell me about your errors.Papaveraceous
Edited the question regarding the DMS performance issues.Mesothorax

Here is what you need to do to get DMS to work:

1) Create and run a DMS task with "migrate and ongoing replication" and "Drop tables on target".

2) This will probably fail; do not worry. "Stop" the DMS task.

3) On Redshift, make the following changes to the tables:

  • Change all date and timestamp columns to varchar (the COPY options DMS uses for Redshift cannot cope with the '0000-00-00 00:00:00' dates you get in MySQL).
  • Change all bool columns to varchar, due to a bug in DMS. (A DDL sketch of these changes follows the steps below.)

4) On DMS, modify the task to use "Truncate" in "Target table preparation mode".

5) Restart the DMS task (full reload).

Now the initial copy and ongoing binlog replication should work.
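For step 3, one way to apply those type changes is simply to recreate the DMS-created table with varchar columns before switching to Truncate mode; a rough sketch with a made-up table:

    -- Hypothetical table: redefine it so timestamps and booleans land in varchar.
    DROP TABLE IF EXISTS orders;
    CREATE TABLE orders (
        id         BIGINT,
        status     VARCHAR(64),
        is_paid    VARCHAR(8),    -- was BOOLEAN
        created_at VARCHAR(32),   -- was TIMESTAMP
        updated_at VARCHAR(32)    -- was TIMESTAMP
    );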

Make sure you are on the latest replication instance software version.

Make sure you have followed the instructions here exactly:

http://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MySQL.html

If your source is Aurora, also make sure you have set binlog_checksum to "none" (this is poorly documented).

Papaveraceous answered 17/11, 2017 at 10:21 Comment(2)
I have done all the steps you mentioned above, but for the date fields I cannot change the type to varchar, as I need to query and group by those dates.Mesothorax
What I do is write the DMS output to a staging area, then run some ETL to copy it to a "data warehouse" area. I convert the dates from varchar to timestamp using something like this: case when datetime='0000-00-00 00:00:00' then null else datetime::timestamp end as datetime,Papaveraceous
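Spelled out, that staging-to-warehouse copy is something like this (the "staging" and "dwh" schema names and the table are just examples):

    -- Copy from the DMS staging area into the warehouse area,
    -- converting the varchar date back to a real timestamp.
    INSERT INTO dwh.orders
    SELECT
        id,
        CASE WHEN created_at = '0000-00-00 00:00:00' THEN NULL
             ELSE created_at::timestamp
        END AS created_at
    FROM staging.orders;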
