Best way to store chat messages in elasticsearch

We are currently implementing an instant messaging system on our platform. We need to provide our users with a chat history and be able to show the last five conversations a user had (a preview, like on Facebook).

This means we need to think about how to store all of this data.

We are using Elasticsearch and we think that this could be a reliable solution to store chat messages and make them highly available for read operations.

Our question is: what would be the best data structure in Elasticsearch so that our read operations are fast and not too heavy?

We considered a number of solutions, and this may be the best one we came up with.

Our message representation could be:

 {
     "ID": 1,
     "sender": "john",
     "receiver": "doe",
     "content": "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
     "date": "timestamp"
 }

We could use a nested object to store the messages within a conversation:

 {
     "ID": 317,
     "participants": "john, doe",
     "date": "timestamp of the last received message",
     "messages": [
         {
             "ID": "49753",
             "sender": "john",
             "receiver": "doe",
             "content": "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
             "date": "timestamp"
         },
         {
             "ID": "49754",
             "sender": "doe",
             "receiver": "john",
             "content": "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
             "date": "timestamp"
         },
         ...
     ]
 }
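
For reference, this nested layout would need a mapping that declares messages as a nested field. A minimal sketch, assuming current Elasticsearch mapping syntax (the index name and field types are illustrative, not settled choices):

    # illustrative index name and field types
    PUT conversation
    {
        "mappings": {
            "properties": {
                "participants": { "type": "text" },
                "date": { "type": "date" },
                "messages": {
                    "type": "nested",
                    "properties": {
                        "sender": { "type": "keyword" },
                        "receiver": { "type": "keyword" },
                        "content": { "type": "text" },
                        "date": { "type": "date" }
                    }
                }
            }
        }
    }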

We would like your feedback on this solution, and any better solutions you may have.

Thanks in advance

Statute answered 14/9, 2016 at 10:57

Note: This suggested solution is aimed not only at fast reads (as requested by the OP), but also at minimizing indexing overhead. Nested documents and their parent are written as a single block, so each additional "message" in the nested proposal would cause all of the conversation's previous message and conversation data to be reindexed as well.

Here's my guess at Facebook's general approach to implementing Messages, if you were to do something similar using Elasticsearch:


Preview: (In Messages navbar dropdown, and on the left rail of the Messages page)

Shows a summary of the most recent conversations using:

  • Composite headshots of the three most recent conversation participants
  • Number of additional participants if > 3
  • Timestamp of most recent message in the conversation
  • Snippet of the latest message in the conversation

Message Pane: (Center column of the Messages page)

  • Shows all the messages in a conversation
  • The Message Pane is also repurposed for Message Search results, showing all messages containing the searched term.

Search Box:

  • Typeahead: (completes on conversations using matching participant names)
  • Search: (searches on messages using matching text from message body)

The data structure driving the preview would probably be a conversation index (containing one document per conversation). These documents would be updated each time a message is added to a conversation (much like the parent record of your nested example doc).

This conversation data source is only used to draw the previews (fast filtering on conversation participants to ensure that you only see conversations you are a part of).

 {
     "ID": 317,
     "participant_ids": [123456789, 987654321],
     "participant_names": ["John Doe", "Jane Doe"],
     "last_message_snippet": "Lorem ipsum dolor sit amet, consectetur adipiscing elit...",
     "last_message_timestamp": "timestamp of the last received message"
 }
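
Each new message would then only touch the summary fields of the conversation document. A minimal sketch using the _update API (current request syntax; the index name, document ID, and field values are illustrative):

    # partial update: overwrite only the summary fields
    POST conversation/_update/317
    {
        "doc": {
            "last_message_snippet": "Lorem ipsum dolor sit amet...",
            "last_message_timestamp": "2016-09-14T19:31:00Z"
        }
    }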

There would be no nesting here because only the up-to-date conversation summary is needed, not the individual messages.

Performance would be fast because no scoring needs to take place: just a filter on [current user] in participant_ids and a descending sort by last_message_timestamp.
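
A sketch of that preview query, assuming an index named conversation and a hypothetical current-user ID of 123456789, returning the five most recent conversations the question asks for:

    GET conversation/_search
    {
        "size": 5,
        "query": {
            "bool": {
                "filter": [
                    { "term": { "participant_ids": 123456789 } }
                ]
            }
        },
        "sort": [
            { "last_message_timestamp": "desc" }
        ]
    }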

You could replicate the typeahead functionality using the Elasticsearch Term Suggester on the participant_names field.
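
Following that suggestion, a term-suggest request might look like this (the suggester name and input text are placeholders; for strict prefix-style typeahead, the completion suggester would be the more common alternative):

    POST conversation/_search
    {
        "suggest": {
            "participant_suggest": {
                "text": "jhon",
                "term": {
                    "field": "participant_names"
                }
            }
        }
    }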

The lower number of conversation documents (vs. message documents) would help an index that is updated this frequently perform well at scale.

To further scale this functionality, an index-per-timeframe strategy could be used (with the timeframe determined by, say, the typical half-life of a conversation).
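
For example (all index names here are hypothetical), conversation documents could be written to a time-bucketed index, with preview queries spanning a wildcard pattern:

    # write to the current month's bucket
    PUT conversation-2016.09/_doc/317
    {
        "participant_ids": [123456789, 987654321],
        "last_message_timestamp": "2016-09-14T19:31:00Z"
    }

    # query across all buckets
    GET conversation-*/_search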


When displaying the messages within a particular conversation, you'd query a message index containing documents like your message example, but with a reference back to the conversation:

 {
     "ID": 4828274,
     "conversation_id": 317,
     "conversation_participant_ids": [123456789, 987654321],
     "sender_id": 123456789,
     "sender_name": "John Doe",
     "message": "Lorem ipsum dolor sit amet, consectetur adipiscing elit",
     "message_timestamp": "<timestamp>"
 }

Performance would be fast because no scoring needs to take place: just a filter on conversation_id and a descending sort by message_timestamp.
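
A sketch of that query, again with an illustrative index name:

    GET message/_search
    {
        "query": {
            "bool": {
                "filter": [
                    { "term": { "conversation_id": 317 } }
                ]
            }
        },
        "sort": [
            { "message_timestamp": "desc" }
        ]
    }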

When searching messages across conversations, you'd only need full-text search on the message field (following the Facebook implementation).

The search query would be the search term, filtered by [current user] in conversation_participant_ids, with a descending sort by message_timestamp.
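
A sketch of that search request (the search term and user ID are placeholders):

    GET message/_search
    {
        "query": {
            "bool": {
                "must": [
                    { "match": { "message": "lorem ipsum" } }
                ],
                "filter": [
                    { "term": { "conversation_participant_ids": 123456789 } }
                ]
            }
        },
        "sort": [
            { "message_timestamp": "desc" }
        ]
    }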

To minimize cross-talk in the search cluster when retrieving the messages for a conversation, you'd want to be sure to take advantage of Elasticsearch's routing parameter (on indexing requests) to explicitly co-locate all messages for a conversation on the same shard, using the conversation_id as the routing value when indexing new messages.
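
A sketch of routed indexing and retrieval (IDs are illustrative); passing the same routing value at search time keeps the query on the single shard holding that conversation:

    # route the message to the shard for its conversation
    PUT message/_doc/4828274?routing=317
    {
        "conversation_id": 317,
        "conversation_participant_ids": [123456789, 987654321],
        "sender_id": 123456789,
        "sender_name": "John Doe",
        "message": "Lorem ipsum dolor sit amet",
        "message_timestamp": "2016-09-14T19:31:00Z"
    }

    # query only that shard by passing the same routing value
    GET message/_search?routing=317
    {
        "query": {
            "bool": {
                "filter": [
                    { "term": { "conversation_id": 317 } }
                ]
            }
        }
    }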


Note: Elasticsearch may turn out to be overkill for a solution that could largely be built on another document store, or on a relational database with text-search functionality. By normalizing conversation and message as above, there is no longer any dependence on "nesting" in Elasticsearch.

Elasticsearch strengths for this implementation include efficient caching of filtered search results, fast autocomplete, and fast text search, but a weakness of Elasticsearch is the need for enough memory to comfortably accommodate all of the indexed data.

The access patterns of a messaging application dictate that only the most recent messages are likely to be accessed or searched with any frequency. At some point, if your application needs to scale, you should plan a way to archive older, not-recently-accessed messages in "cold storage", such that they require fewer application resources but can still be "thawed" quickly enough to serve a keyword search without excessive latency.
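
One hedged option, if you adopt time-bucketed message indices as above: a closed index holds on to almost no cluster resources but can be reopened on demand, so older buckets could be closed and "thawed" only when an archive search comes in (index names hypothetical):

    # free resources held by an old, rarely-searched month
    POST message-2015.01/_close

    # "thaw" it again when an archive search needs it
    POST message-2015.01/_open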

Triptolemus answered 14/9, 2016 at 19:31. Comments (7):
Thank you for the complete answer, I'll put it to use. The ensuing question is how to manage unread messages when the user is offline... – Statute
You're thinking about how to store chat messages and alert the recipient? Not sure that you'd want to do much differently than with messages received while online (maybe a Boolean viewed flag?) – Triptolemus
How do you suggest we do that? We were thinking about storing the unread messages in something like a Redis list and, when the user comes back online, popping the unread messages from the list and finally indexing them. I don't really see how we can get the same effect with a viewed flag; we would need a callback for when the user opens the conversation. – Statute
Why wouldn't you index them when sent? Do you expect the majority of messages to remain unread for long periods of time? – Triptolemus
We would, but we have to know one way or another whether the message was read, to mark it as viewed or not. I will probably understand your point when we put it to the test. – Statute
Why not stick with your Redis list idea for read/unread display state, but index unread messages immediately after they are sent, to avoid unnecessary latency around message availability/searchability when the user comes back online? – Triptolemus
This should do the trick, I will let you know. Thanks again for the help. – Statute
