Automatically offload a DynamoDB table to a CloudSearch domain

I'm using DynamoDB pretty heavily for a service I'm building. A new client request has come in that requires CloudSearch. I see that a CloudSearch domain can be created from a DynamoDB table via the AWS console.

My question is this:

Is there a way to automatically offload data from a DynamoDB table into a CloudSearch domain, via the API or otherwise, at a specified time interval?

I'd prefer this to manually offloading DynamoDB documents to CloudSearch. All help greatly appreciated!

Weal answered 12/5, 2015 at 23:8 Comment(1)
I don't use those tools. I use the API directly. @BMW – Weal

Here are two ideas.

  1. The official AWS way of searching DynamoDB data with CloudSearch

    This approach is described pretty thoroughly in the "Synchronizing a Search Domain with a DynamoDB Table" section of http://docs.aws.amazon.com/cloudsearch/latest/developerguide/searching-dynamodb-data.html.

    The downside is that it sounds like a huge pain: you have to either re-create new search domains or maintain an update table in order to sync, and you'd need a cron job or something to execute the script.

  2. The AWS Lambda way

    Use the newish AWS Lambda event processing service. It is pretty simple to set up an event stream based on DynamoDB (see http://docs.aws.amazon.com/lambda/latest/dg/wt-ddb.html).

    Your Lambda would then submit a search document to CloudSearch based on the DynamoDB event. For an example of submitting a document from a Lambda, see https://gist.github.com/fzakaria/4f93a8dbf483695fb7d5

    This approach is a lot nicer in my opinion, as it would continuously update your search index without any involvement from you; a rough sketch of what such a Lambda might look like follows below.
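
For illustration, here is a minimal sketch (mine, not the linked gist) of what such a Lambda handler could look like in Python with boto3. The CloudSearch document-service endpoint, the string hash key named "id", the flat string/number attributes, and a stream view that includes new images are all assumptions you would adjust for your own table and domain:

    # Minimal sketch: push DynamoDB Streams records into a CloudSearch domain.
    # Endpoint, key name ("id"), and field handling are placeholders.
    import json
    import boto3

    # Document-service endpoint of your CloudSearch domain (placeholder).
    CLOUDSEARCH_DOC_ENDPOINT = "https://doc-my-domain-xxxxxxxxxx.us-east-1.cloudsearch.amazonaws.com"

    cloudsearch = boto3.client("cloudsearchdomain", endpoint_url=CLOUDSEARCH_DOC_ENDPOINT)

    def _plain(attribute):
        """Flatten a DynamoDB-typed attribute ({'S': 'x'} / {'N': '1'}) to a plain value."""
        if "S" in attribute:
            return attribute["S"]
        if "N" in attribute:
            return attribute["N"]
        return json.dumps(attribute)  # fall back to the raw structure

    def handler(event, context):
        batch = []
        for record in event["Records"]:
            doc_id = record["dynamodb"]["Keys"]["id"]["S"]  # assumes a string key named "id"
            if record["eventName"] in ("INSERT", "MODIFY"):
                # Assumes the stream view type includes new images.
                fields = {k: _plain(v) for k, v in record["dynamodb"]["NewImage"].items()}
                batch.append({"type": "add", "id": doc_id, "fields": fields})
            elif record["eventName"] == "REMOVE":
                batch.append({"type": "delete", "id": doc_id})

        if batch:
            # CloudSearch expects a JSON batch of SDF documents.
            cloudsearch.upload_documents(
                documents=json.dumps(batch).encode("utf-8"),
                contentType="application/json",
            )
        return {"uploaded": len(batch)}

Whether you upload one small batch per invocation like this or accumulate larger batches is a throughput trade-off; see the comment thread below about the 5 MB batch guidance.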

Polley answered 13/5, 2015 at 3:41 Comment(8)
You should be able to find the pricing information pretty easily if you search for it. – Polley
I'd rather do it the "hard" way than have to pay for it. Especially for such a simple task, it makes no sense why this wouldn't be automatic. – Weal
Your first 1 million requests per month are free, and it's pretty cheap after that. No idea what kind of volume or budget you're dealing with, but having always-up-to-date results and avoiding a messy cron job is worth something. aws.amazon.com/lambda/pricing – Polley
Thanks a lot! But it seems that the DynamoDB Streams API isn't readily available yet :( – Weal
I've been looking for the best way to do this, and my research concurs with this answer by alexroussos. Which is a shame, because the first solution is a pain, and Streams and Lambda have been in preview for months and cannot be relied upon in production. Ideally this is a feature AWS could add; it's a fairly generic use case that would benefit all users of DynamoDB and CloudSearch. – Affirm
Correction: alexroussos is not talking about using DynamoDB Streams, and Lambda is out of preview; in that case, yes, this is probably the best solution for now, even in production. – Affirm
If I am not mistaken, your second way (triggering a Lambda on each update to DynamoDB) is not a good way to update the CloudSearch index, since their documentation states: "Make sure your [upload] batches are as close to the 5 MB limit as possible. Uploading a larger amount of smaller batches slows down the upload and indexing process." docs.aws.amazon.com/cloudsearch/latest/developerguide/… Triggering a Lambda on each update would cause lots of individual document updates instead of batched updates, which will not work at scale. – Enfilade
@NickolayKondratyev Waiting for a batch to fill up is also going to result in delays before your docs are indexed. Batching is an optimization that totally depends on the rate of updates in your system. Start simple, and you can always add batching later if you need to. – Polley

I'm not so clear on how Lambda would always keep the data in sync with the data in DynamoDB. Consider the following flow:

  1. The application updates a DynamoDB table's record A (say, to A1)
  2. Very shortly after that, the application updates the same table's same record A (to A2)
  3. The trigger for update 1 causes Lambda invocation 1 to start executing
  4. The trigger for update 2 causes Lambda invocation 2 to start executing
  5. Step 4 completes first, so CloudSearch sees A2
  6. Now step 3 completes, so CloudSearch sees A1

Lambda triggers are not guaranteed to start only after the previous invocation is complete (correct me if I'm wrong, and provide a link).

As we can see, the search index ends up out of sync.

The closest thing I can think of that would work is to use AWS Kinesis Streams, but with a single shard (1 MB/s ingestion limit). If that restriction works for you, then your consumer application can be written so that records are processed strictly sequentially, i.e., the next record is only put into CloudSearch after the previous record has been put (a rough sketch of such a consumer follows below).
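
To make the "strictly sequential" idea concrete, here is a rough sketch of such a single-shard Kinesis consumer in Python with boto3. The stream name, the CloudSearch endpoint, and the assumption that each Kinesis record's payload is already a CloudSearch SDF batch are placeholders, not anything prescribed by AWS:

    # Rough sketch: drain a single-shard Kinesis stream in order, and only move
    # on to the next record once the previous one has been applied to CloudSearch.
    import time
    import boto3

    STREAM_NAME = "my-table-changes"  # placeholder
    CLOUDSEARCH_DOC_ENDPOINT = "https://doc-my-domain-xxxxxxxxxx.us-east-1.cloudsearch.amazonaws.com"

    kinesis = boto3.client("kinesis")
    cloudsearch = boto3.client("cloudsearchdomain", endpoint_url=CLOUDSEARCH_DOC_ENDPOINT)

    def run():
        # Single shard, so there is exactly one shard ID to iterate over.
        shard_id = kinesis.describe_stream(StreamName=STREAM_NAME)["StreamDescription"]["Shards"][0]["ShardId"]
        iterator = kinesis.get_shard_iterator(
            StreamName=STREAM_NAME, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
        )["ShardIterator"]

        while True:
            response = kinesis.get_records(ShardIterator=iterator, Limit=100)
            for record in response["Records"]:
                # Assumes each record's Data is already a CloudSearch SDF batch,
                # e.g. [{"type": "add", "id": ..., "fields": {...}}].
                cloudsearch.upload_documents(
                    documents=record["Data"], contentType="application/json"
                )
                # Only after this upload succeeds do we move on, which preserves
                # the original update order.
            iterator = response["NextShardIterator"]
            time.sleep(1)  # avoid hammering an idle shard

    if __name__ == "__main__":
        run()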

Sisterly answered 5/8, 2017 at 10:13 Comment(11)
"Lambda triggers are not guaranteed to start ONLY after previous invocation is complete". I would also like to know if that's true because I think that I have this kind of problem at the moment.Pas
@sami_analyst: The answer I gave is pretty old, and I realized while having an exactly same use case as yours that there's something like DynamoDB Streams, which always ensures that items with a particular partition key would always go into a particular stream. I decided not to use Lambda, cause I preferred the dynamodb streams approach better. forums.aws.amazon.com/message.jspa?messageID=699134 So in all cases your data will be sharded by hash/partition key, and sorted by your range/sort key.Sisterly
So this means that with lambda there is also the possibility that a sequence of updates could be splitted in multiple lambda calls ? For now I solved my problem by using the records SequenceNumber property to order and afterwards merge the update records of the items with the same partition-key. If the records streams of the items with the same partition-key's are splitted in multiple lambda calls, I will have a problem in the near future ... And how do you process the DynamoDB Stream ? with lambda ? Thank you for the fast repsonse, this was really helpfull for me.Pas
@sami_analyst: With DynamoDB Streams, you can read data from any particular stream from any point onwards. So say there are two streams, and X and Y are the positions up to which you'd processed your data. You can run a daemon that, on startup, checks your checkpointing DB to find the streams and the point up to which each was processed. You then make API calls to fetch data from that point onwards on those streams. As you'd expect, the checkpointing has to be done maybe once every minute. More in the next comment... – Sisterly
However, assume that the stream had 1000 entries and you checkpoint after processing every 100 entries. Processing means reading the DynamoDB stream one record at a time and then, based on the type of record (ADD/EDIT/REMOVE), performing the corresponding CloudSearch operation. Now, say you had processed up to record 523, which means you had saved a checkpoint at the 500th record, but then the daemon crashed. So when the daemon restarts, it goes into the checkpoint DB and finds it needs to start from the 500th entry... now what? Continued... – Sisterly
Fear not. Even if the "processing" of records 501 to 523 is done a second time all over again, CloudSearch would ultimately be in its correct state. The concept is similar to an idempotent function in mathematics: if F(x) = y, then F(F(x)) = y too. So the same operation can be performed multiple times, as long as the operations are applied in the same order. – Sisterly
DynamoDB Streams ensures that the records you receive from a stream are always in the same order as the changes were made in DynamoDB. Also, all records for one partition key always go into a particular shard. Think carefully: this won't be a limitation in most cases, since it means that if there are multiple streams, a record A from partition key A and another record B from partition key B could be in different streams. Even if the change to record A happened before the change to record B in DynamoDB, it would not really matter in most applications if CloudSearch processes B before A. – Sisterly
Since in most situations you would design your DynamoDB table with a partition key based on, say, UserID / CenterID / StateID, etc. So even if modifications for User A and User B were applied in CloudSearch in a different order than in DynamoDB, it would not matter, since ultimately the data belongs to different "entities". However, if User A's data was changed multiple times, you'd want to make sure those entries are processed in the correct order. Get it? If not, let me know. – Sisterly (see the sketch after this comment thread)
@P.Prasad: I get it. Thank you for this precise description. This helped me a lot! – Pas
@sami_analyst: DynamoDB Streams by default stores data for 1 day (for a cost, you can make it store data for 7 days), but the service itself is free (GET/PUT), so I'd highly recommend you try it out. If you're more interested in this design pattern, I'd recommend this phenomenal read: engineering.linkedin.com/distributed-systems/… The same thing can be done in MySQL (surprise!) using binlog replication streams. This is the direction a lot of applications are moving towards. – Sisterly
@P.Prasad: thumbs up! – Pas
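
To make the checkpoint-and-replay loop from the comments above concrete, here is a rough sketch of such a daemon in Python with boto3. The stream ARN, the CloudSearch endpoint, the string key named "id", and the file-based checkpoint store are stand-ins for whatever you would actually use:

    # Rough sketch: resume a DynamoDB Streams shard from the last checkpointed
    # sequence number, apply each change to CloudSearch, checkpoint every N records.
    import json
    import os
    import time
    import boto3

    STREAM_ARN = "arn:aws:dynamodb:us-east-1:123456789012:table/MyTable/stream/..."  # placeholder
    CLOUDSEARCH_DOC_ENDPOINT = "https://doc-my-domain-xxxxxxxxxx.us-east-1.cloudsearch.amazonaws.com"
    CHECKPOINT_EVERY = 100

    streams = boto3.client("dynamodbstreams")
    cloudsearch = boto3.client("cloudsearchdomain", endpoint_url=CLOUDSEARCH_DOC_ENDPOINT)

    def load_checkpoint(shard_id):
        # Stand-in checkpoint store: one small file per shard.
        path = f"checkpoint-{shard_id}.txt"
        return open(path).read().strip() if os.path.exists(path) else None

    def save_checkpoint(shard_id, sequence_number):
        with open(f"checkpoint-{shard_id}.txt", "w") as f:
            f.write(sequence_number)

    def apply_to_cloudsearch(record):
        # Same idea as the Lambda sketch above: turn a stream record into a
        # one-document SDF batch. Assumes a string key named "id" and a stream
        # view that includes new images.
        doc_id = record["dynamodb"]["Keys"]["id"]["S"]
        if record["eventName"] == "REMOVE":
            batch = [{"type": "delete", "id": doc_id}]
        else:
            fields = {k: v.get("S", v.get("N", "")) for k, v in record["dynamodb"]["NewImage"].items()}
            batch = [{"type": "add", "id": doc_id, "fields": fields}]
        cloudsearch.upload_documents(documents=json.dumps(batch).encode("utf-8"),
                                     contentType="application/json")

    def process_shard(shard_id):
        checkpoint = load_checkpoint(shard_id)
        if checkpoint:
            # Resume just after the last record we know was fully processed.
            iterator = streams.get_shard_iterator(
                StreamArn=STREAM_ARN, ShardId=shard_id,
                ShardIteratorType="AFTER_SEQUENCE_NUMBER", SequenceNumber=checkpoint)["ShardIterator"]
        else:
            iterator = streams.get_shard_iterator(
                StreamArn=STREAM_ARN, ShardId=shard_id,
                ShardIteratorType="TRIM_HORIZON")["ShardIterator"]

        processed = 0
        while iterator:
            response = streams.get_records(ShardIterator=iterator, Limit=100)
            for record in response["Records"]:
                # Re-applying a few records after a crash is harmless: the upload
                # is effectively idempotent as long as per-key order is preserved.
                apply_to_cloudsearch(record)
                processed += 1
                if processed % CHECKPOINT_EVERY == 0:
                    save_checkpoint(shard_id, record["dynamodb"]["SequenceNumber"])
            if not response["Records"]:
                time.sleep(1)  # shard is idle; avoid a hot loop
            iterator = response.get("NextShardIterator")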
