What is the ideal bulk size formula in Elasticsearch?

I believe there should be a formula to calculate the bulk indexing size in Elasticsearch. The following are probably the variables of such a formula:

  • Number of nodes
  • Number of shards/index
  • Document size
  • RAM
  • Disk write speed
  • LAN speed

I wonder if anyone knows of or uses a mathematical formula. If not, how do people decide their bulk size? By trial and error?

Sarmiento asked 28/8, 2013 at 13:3 Comment(0)

There is no golden rule for this. Extracted from the doc:

There is no “correct” number of actions to perform in a single bulk call. You should experiment with different settings to find the optimum size for your particular workload.

Shed answered 28/8, 2013 at 13:57 Comment(1)
Ultimately, one does need to tune. But is there some idea of what order of magnitude? Are we talking 10s / 100s / 1000s? Any starter suggestions to go by? – Polyanthus

Read the ES bulk API docs carefully: https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-performance.html#_using_and_sizing_bulk_requests

  • Start with a bulk size of 1 KiB, try 20 KiB, then 10 KiB, and so on: bisect your way to the sweet spot
  • Measure bulk size in KiB (or equivalent), not in document count!
  • Send data in bulk (no streaming), and put redundant info (e.g. the index name) in the API URL if you can (see the sketch after this list)
  • Remove superfluous whitespace from your data if possible
  • Disable search index refreshing while you load, and re-enable it afterwards
  • Round-robin across all your data nodes
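
A minimal sketch of the two sizing points above, assuming a plain Java 11+ HTTP client, a node at localhost:9200, a hypothetical index name myindex, and an arbitrary 10 KiB starting budget: batches are capped by (approximate) byte size rather than document count, and the index name lives in the URL so each action line stays small.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.List;

    public class BulkBySize {
        // Starting byte budget for one bulk body; bisect up/down from here while measuring throughput.
        static final int TARGET_BYTES = 10 * 1024;

        public static void main(String[] args) throws Exception {
            HttpClient http = HttpClient.newHttpClient();
            List<String> docs = List.of("{\"f\":1}", "{\"f\":2}", "{\"f\":3}"); // your JSON documents
            StringBuilder body = new StringBuilder();
            for (String doc : docs) {
                // No index name in the action line: it is already in the URL below.
                String action = "{\"index\":{}}\n" + doc + "\n";
                if (body.length() > 0 && body.length() + action.length() > TARGET_BYTES) {
                    send(http, body.toString());
                    body.setLength(0);
                }
                body.append(action);
            }
            if (body.length() > 0) send(http, body.toString());
        }

        static void send(HttpClient http, String bulkBody) throws Exception {
            // length() counts chars, which equals bytes for ASCII JSON; close enough for sizing.
            HttpRequest req = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:9200/myindex/_bulk"))
                    .header("Content-Type", "application/x-ndjson")
                    .POST(HttpRequest.BodyPublishers.ofString(bulkBody))
                    .build();
            HttpResponse<String> resp = http.send(req, HttpResponse.BodyHandlers.ofString());
            System.out.println("bulk response: " + resp.statusCode());
        }
    }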
Chancy answered 8/11, 2016 at 10:34 Comment(0)

I derived this information from the Java API's BulkProcessor class. It defaults to 1,000 actions or 5 MB; it also allows you to set a flush interval, but that is not set by default. I'm just using the default settings.

I'd suggest using BulkProcessor if you are using the Java API.
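
For reference, a minimal sketch of those defaults using the transport-client-era BulkProcessor builder (roughly the API as it looked through ES 6.x; the flush interval line shows the optional setting the answer mentions, not a default):

    import org.elasticsearch.action.bulk.BulkProcessor;
    import org.elasticsearch.action.bulk.BulkRequest;
    import org.elasticsearch.action.bulk.BulkResponse;
    import org.elasticsearch.client.Client;
    import org.elasticsearch.common.unit.ByteSizeUnit;
    import org.elasticsearch.common.unit.ByteSizeValue;
    import org.elasticsearch.common.unit.TimeValue;

    public class BulkProcessorDefaults {
        static BulkProcessor build(Client client) {
            // Add IndexRequests with bulkProcessor.add(...); it batches and flushes them for you.
            return BulkProcessor.builder(client, new BulkProcessor.Listener() {
                    @Override public void beforeBulk(long id, BulkRequest request) { }
                    @Override public void afterBulk(long id, BulkRequest request, BulkResponse response) { }
                    @Override public void afterBulk(long id, BulkRequest request, Throwable failure) { }
                })
                .setBulkActions(1000)                               // flush after 1,000 actions (the default)
                .setBulkSize(new ByteSizeValue(5, ByteSizeUnit.MB)) // ...or after 5 MB of data (the default)
                .setFlushInterval(TimeValue.timeValueSeconds(5))    // time-based flush; NOT set by default
                .build();
        }
    }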

Alveraalverez answered 25/11, 2013 at 15:5 Comment(2)
That sounds a bit conservative; I've run indexing jobs via the HTTP API with batch sizes of 10k documents (files between ~25 MB and ~80 MB) on a modest vServer. – Constantan
It's very conservative. However, you can't determine the ideal settings without testing with actual data on the actual cluster. These days (5 years later) we have a much larger and more powerful cluster using MUCH larger batch sizes in MBs with no document limit. – Alveraalverez

I was searching for this and found your question :) I found the following in the Elastic documentation, so I will investigate the size of my documents.

It is often useful to keep an eye on the physical size of your bulk requests. One thousand 1KB documents is very different from one thousand 1MB documents. A good bulk size to start playing with is around 5-15MB in size
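
To put that starting range in document counts: with ~1 KB documents, 5-15 MB works out to roughly 5,000-15,000 documents per request; with ~1 MB documents, it is only 5-15.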

Travesty answered 28/3, 2016 at 9:55 Comment(1)
That sounds a bit conservative (probably the intention); I run indexing jobs with batch sizes of 10k documents (files between ~25 MB and ~80 MB) on a modest vServer (more below). – Constantan

In my case, I could not get more than 100,000 records to insert at a time. I started with 13 million, went down to 500,000, and after no success started from the other side: 1,000, then 10,000, then 100,000, which was my max.

Bozeman answered 21/8, 2019 at 1:8 Comment(0)

I haven't found a better way than trial and error (i.e. the traditional engineering process), as there are many factors beyond hardware influencing indexing speed: the structure/complexity of your index (complex mappings, filters or analyzers), data types, whether your workload is I/O or CPU bound, and so on.

In any case, to demonstrate how variable it can be, I can share my experience, as it seems different from most posted here:

Elastic 5.6 with 10GB heap running on a single vServer with 16GB RAM, 4 vCPU and an SSD that averages 150 MB/s while searching.

I can successfully index documents of wildly varying sizes via the HTTP bulk API (curl) using a batch size of 10k documents (20k lines, file sizes between 25 MB and 79 MB), each batch taking ~90 seconds. index.refresh_interval is set to -1 during indexing, but that's about the only "tuning" I did; all other configurations are the default. I guess this is mostly because the index itself is not too complex.
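
A minimal sketch of that refresh_interval toggle, assuming the same Java 11+ HTTP client as above, a node at localhost:9200, and a hypothetical index name myindex:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class RefreshToggle {
        static void setRefresh(HttpClient http, String interval) throws Exception {
            HttpRequest req = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:9200/myindex/_settings"))
                    .header("Content-Type", "application/json")
                    .PUT(HttpRequest.BodyPublishers.ofString(
                            "{\"index\":{\"refresh_interval\":\"" + interval + "\"}}"))
                    .build();
            http.send(req, HttpResponse.BodyHandlers.ofString());
        }

        public static void main(String[] args) throws Exception {
            HttpClient http = HttpClient.newHttpClient();
            setRefresh(http, "-1"); // disable refresh for the duration of the bulk load
            // ... run the bulk load here ...
            setRefresh(http, "1s"); // restore the default refresh interval afterwards
        }
    }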

The vServer sits at about 50% CPU, with the SSD averaging 40 MB/s and 4 GB of RAM free, so I could probably make it faster by sending two files in parallel (I tried simply increasing the batch size by 50% but started getting errors), but after that point it probably makes more sense to consider a different API or simply to spread the load over a cluster.

Constantan answered 12/11, 2018 at 22:15 Comment(0)

Actually, there is no clear way of finding the exact upper limit for a bulk update. An important factor to consider is the request data volume, not just the number of documents.

An excerpt from the docs:

How Big Is Too Big?

The entire bulk request needs to be loaded into memory by the node that receives our request, so the bigger the request, the less memory available for other requests. There is an optimal size of bulk request. Above that size, performance no longer improves and may even drop off. The optimal size, however, is not a fixed number. It depends entirely on your hardware, your document size and complexity, and your indexing and search load.

Fortunately, it is easy to find this sweet spot: Try indexing typical documents in batches of increasing size. When performance starts to drop off, your batch size is too big. A good place to start is with batches of 1,000 to 5,000 documents or, if your documents are very large, with even smaller batches.

It is often useful to keep an eye on the physical size of your bulk requests. One thousand 1KB documents is very different from one thousand 1MB documents. A good bulk size to start playing with is around 5-15MB in size.
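
A minimal sketch of the ramp-up test the excerpt recommends, where sendBatch() is a hypothetical stand-in for your actual bulk-indexing call: keep doubling the batch size and stop once throughput drops.

    public class BatchSizeProbe {
        // Hypothetical stand-in: index `size` typical documents in one bulk request.
        static void sendBatch(int size) { /* call your bulk API here */ }

        public static void main(String[] args) {
            double bestRate = 0;
            for (int size = 1_000; size <= 64_000; size *= 2) {
                long start = System.nanoTime();
                sendBatch(size);
                double rate = size / ((System.nanoTime() - start) / 1e9); // docs/second
                System.out.printf("batch=%d -> %.0f docs/sec%n", size, rate);
                if (rate < bestRate) break; // throughput dropped off: the batch is too big
                bestRate = rate;
            }
        }
    }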

Cowrie answered 29/7, 2021 at 16:33 Comment(0)

Actually, I'm facing some problems related to the bulk API. There is one parameter that impacts it: the number of index actions inside a bulk request.

Bahr answered 6/8, 2022 at 14:9 Comment(1)
As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center. – Phore
