Scaling of ElasticSearch
Asked Answered
S

3

9

I'm searching for information on how ElasticSearch would scale with the amount of data in its indexes and am surprised how little I can find on that topic. Maybe some experience from the crowd here can help me.

We are currently using CloudSearch to index ≈ 7 million documents; in CloudSearch this results in 2 instances of type m2.xlarge. We are considering switching to ElasticSearch instead to reduce the cost. But all I find on the scaling of ElasticSearch is that it does scale well, can be distributed over several instances etc.

But what kind of machine (memory, disc) would I need for this kind of data?

How would that change if I increased the amount of data by the factor of 12 (≈ 80 million documents)?

Slater answered 30/4, 2013 at 12:5 Comment(1)
I guess with the information you provided you'll only get the typical "it depends" answer :) but still...it depends on what your documents look like and what you want to do with them.Vaasa
R
9

As Javanna said, it depends. Mostly on: (1) rate of indexing; (2) size of documents; (3) rate and latency requirements for searches; and (4) type of searches.

Considering this, the best we can help is giving examples. On our site (news monitoring) we:

  1. Index more than 100 docs per minute. We have, currently, near 50 million documents. I've also heard of ES indexes with hundreds of millions of documents.
  2. Documents are news articles with some metadata, not short but not that large.
  3. Our search latency varies between ~50ms (for normal and rare terms) up to 800ms for common terms (stopwords, we index them). This variation is largely due to our custom scoring (thanks to Lucene/ES support for customizing it) and to the fact the dataset (inverted lists) do not fit entirely in memory (OS cache). So when it hits a cached inverted list, it's faster.
  4. We do OR queries with a lot of terms which are one of the hardest. Also we do faceting on two single-valued fields. And have some experiments with date facet (to show rate of publication through time).

We do all this with 4 EC2's m1.large instances. And now we're planning moving to ES, just released, 0.9 version to get all the goodies and performance improvements of Lucene 4.0.

Now leaving examples aside. ElasticSearch is pretty scalable. It is very simple to create an index with N shards and M replicas, and then create X machines with ES. It will distribute all shards and replicas accordingly. You can change the number of replicas anytime you want (for each index).

One downside is that you can't change the number of shards after the index creation. But you can still "overshard" it beforehand to leave room for scaling when needed. Or create a new index with the right number of shards and reindex everything (we do this).

Finally, ElasticSearch (and also Solr) uses, under the hood, the Lucene Search library, which is very mature and well known library.

Rudy answered 30/4, 2013 at 19:36 Comment(2)
can you share how many shards are you using? by oversharding do you mean 10, 100, 1000?Crat
@Crat We now have 8 m1.large instances. We've moved to a date-partitioned scheme, where we have one or two shards per month. Before this, we had 16 shards per index (which covered the ~60M documents). By oversharding I mean something like doubling. But you have to be careful, depending on the setup, too many shards is not good. For example, with 8 m1.large instances if each query has to go to all 200 shards, the overhead won't be worth it.Rudy
I
7

I've actually recently switched from using CloudSearch to a hosted ElasticSearch service at the company I work for. Our specific application has a little over 100 million documents and is growing daily. So far, our experience with ElasticSearch has been absolutely wonderful. Search performance averages at ~250ms, even with all the sorting, filtering, and faceting. Indexing documents is also relatively fast, despite the several MB load we pass through HTTP with the bulk API every couple of hours. Refresh rates seem to be near instant, as well.

For our ~100M doc / 12GB index, we used 4 shards / 2 replicas (will bump to 3 replicas if performance degrades) spread across 4 nodes. Prior to setting up the index, our team spent a couple of days researching ElasticSearch cluster deployment/maintenance, and opted to use http://qbox.io to save money and time. We were paralyzingly afraid of performance and scale issues choosing to host our index on a dedicated cluster like Qbox, but so far the experience has been seriously fantastic.

Since our index lives on a dedicated cluster, we don't have access to nuts-and-bolts node-level configuration settings, so my technical expertise with ES deployment is still pretty limited. That being said, I can't be sure of exactly what performance tweeks are needed for the performance we've experienced on our index. However, I do know Qbox's cluster uses SSD... so that could definitely have a significant impact.

Point in case, ElasticSearch has scaled seamlessly. I highly, highly recommend the switch (even if it's just to save $$, CloudSearch is crazy expensive). Hope this information helps!

Izettaizhevsk answered 30/4, 2013 at 19:11 Comment(2)
Hang on a sec, you "opted" to use qbox.io? Your profile name says you are chief architect at qbox.io. A COI disclaimer would be helpful with this answer.Corolla
Nathan, good point. I posted this earlier last year when I had just started using Elasticsearch through Qbox. I was working with another local company which fizzled out, so I hopped on board at Qbox due to having some close friends who worked there. Since joining, we've dropped and added several engineers, and I've wound up as the lead dev. Consequently, my above opinion has only solidified (I have no shame :)).Izettaizhevsk
N
1

CloudSearch recently dropped prices and may now be a cheaper alternative than maintaining your own Search infrastrcuture on EC2 - http://aws.amazon.com/blogs/aws/cloudsearch-price-reduction-plus-features/

Nelia answered 1/12, 2014 at 1:27 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.