How to do Real-time loading into Amazon Redshift?

We are evaluating Amazon Redshift for real-time data warehousing.

Data will be streamed and processed through a Java service and then stored in the database. We process row by row (in real time), so we will insert only one row per transaction.

What is the best practice for real-time data loading into Amazon Redshift?

Should we use JDBC and issue INSERT INTO statements, use Kinesis Firehose, or perhaps AWS Lambda?

I'm concerned about using the latter two services because both use Amazon S3 as a middle layer and run the COPY command, which is suited to bigger data sets, not to "one-row" inserts.

Auerbach asked 14/1, 2017 at 19:15 Comment(0)

It is not efficient to use individual INSERT statements with Amazon Redshift. It is designed as a Data Warehouse, providing very fast SQL queries. It is not a transaction-processing database where data is frequently updated and inserted.

The best practice is to load batches (or micro-batches) via the COPY command. Kinesis Firehose uses this method. This is much more efficient, because multiple nodes are used to load the data in parallel.
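
For illustration, here is a minimal sketch of a micro-batch COPY issued from a Java service over JDBC. The cluster endpoint, credentials, bucket, IAM role, and table name are all placeholders, and the batch file is assumed to have already been staged to S3:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class MicroBatchLoad {
        public static void main(String[] args) throws Exception {
            // Placeholder endpoint and credentials -- adjust for your cluster.
            String url = "jdbc:redshift://my-cluster.example.us-east-1.redshift.amazonaws.com:5439/dev";
            try (Connection conn = DriverManager.getConnection(url, "master", "password");
                 Statement stmt = conn.createStatement()) {
                // COPY loads the staged file from S3 in parallel across the node slices.
                stmt.execute(
                    "COPY events " +
                    "FROM 's3://my-bucket/staging/events-batch-0001.gz' " +
                    "IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole' " +
                    "GZIP FORMAT AS JSON 'auto'");
            }
        }
    }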

If you are seriously looking at processing data in real time, then Amazon Redshift might not be the best database to use. Consider a traditional SQL database (e.g. those provided by Amazon RDS), a NoSQL database (such as Amazon DynamoDB), or even Elasticsearch. Choose Redshift only if your focus is reporting across large volumes of data, typically involving many table joins.

As mentioned in the Amazon Redshift Best Practices for Loading Data:

If a COPY command is not an option and you require SQL inserts, use a multi-row insert whenever possible. Data compression is inefficient when you add data only one row or a few rows at a time.
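
If a COPY is truly not an option, a multi-row insert from the Java service might look like the following minimal sketch; the events table and its columns are hypothetical:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.util.List;

    public class MultiRowInsert {
        // Inserts a buffered batch in one statement instead of one row per transaction.
        static void insertBatch(Connection conn, List<String[]> rows) throws Exception {
            StringBuilder sql = new StringBuilder("INSERT INTO events (event_id, payload) VALUES ");
            for (int i = 0; i < rows.size(); i++) {
                sql.append(i == 0 ? "(?, ?)" : ", (?, ?)");
            }
            try (PreparedStatement ps = conn.prepareStatement(sql.toString())) {
                int p = 1;
                for (String[] row : rows) {
                    ps.setString(p++, row[0]);
                    ps.setString(p++, row[1]);
                }
                ps.executeUpdate();
            }
        }
    }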

Todtoday answered 15/1, 2017 at 8:18 Comment(5)
We have about 1TB of data in the fact table, with a few dimension tables (100k-600k rows), and AWS Redshift seems like the proper solution. However, real-time loading is a must.Auerbach
Another option is Amazon Athena (currently in limited roll-out) that uses Presto to scan data stored in Amazon S3. You simply store your data in S3, then run an Athena query across it. Works best if you can compress and partition data, or store it in ORC/Parquet format. Only good for queries, not updates -- you update data by storing it in S3 directly.Todtoday
Totally agree with your answer. I am trying to make 15-minute loading work, but the problem I face is that 400-500 tables need to be loaded every 15 minutes. If every load takes 10-35 seconds, it just is not possible with a total of 4 node slices. How can I make it possible?Aleras
@AlokKumarSingh Please create a new Question with your full details, rather than asking via a comment on an old question.Todtoday
We opensourced our solution: github.com/practo/tipoca-stream Aleras

The best option is Kinesis Firehose, which works on batches of events. You write the events into Firehose one by one, and it batches them in an optimal way, based on your definition. You can define how many minutes to buffer the events, or the size of the batch in MB. You might be able to insert a single event into Redshift faster with INSERT, but that method is not scalable. COPY is designed to work at almost any scale.
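
A minimal sketch of the producer side with the AWS SDK for Java (v1); the delivery stream name is a placeholder, and the stream itself is assumed to be configured with a Redshift destination and buffering hints (e.g. 60 seconds or a few MB, whichever is reached first):

    import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehose;
    import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehoseClientBuilder;
    import com.amazonaws.services.kinesisfirehose.model.PutRecordRequest;
    import com.amazonaws.services.kinesisfirehose.model.Record;

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    public class FirehoseWriter {
        private final AmazonKinesisFirehose firehose = AmazonKinesisFirehoseClientBuilder.defaultClient();

        // Write one event; Firehose buffers records and performs the load for you.
        public void write(String jsonEvent) {
            // Firehose concatenates record payloads, so a trailing newline keeps rows separate.
            Record record = new Record()
                    .withData(ByteBuffer.wrap((jsonEvent + "\n").getBytes(StandardCharsets.UTF_8)));
            firehose.putRecord(new PutRecordRequest()
                    .withDeliveryStreamName("my-delivery-stream")   // placeholder name
                    .withRecord(record));
        }
    }

Firehose then stages each batch on S3 and issues the COPY into Redshift on your behalf.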

Hamill answered 15/1, 2017 at 6:07 Comment(3)
Thanks, but it will not be real time; it will have a lag of up to 60s (minimum buffer), am I right?Auerbach
@Auerbach That is correct. Redshift is not the optimal DB for real-time data loading.Achilles
We opensourced our solution to this: github.com/practo/tipoca-stream. It is a cloud-native solution: you get a Kubernetes RedshiftSink Operator with masking support.Aleras
