I am trying to parse tens of millions of office documents with Tika: PDFs, Word documents, Excel spreadsheets, XML files, and so on. It is a wide assortment of types.
Throughput is very important. I need to be able to parse these files in a reasonable amount of time, but accuracy is also important. I hope to have fewer than 10% of the documents fail to parse. (By "fail" I mean fail because of Tika stability, such as a timeout while parsing. I do not mean fail because of the document itself.)
My question: how should I configure Tika Server in a containerized environment to maximize throughput?
My environment:
- I am using OpenShift.
- Each Tika parsing pod has CPU: 2 cores to 2 cores, and Memory: 8 GiB to 10 GiB.
- I have 10 Tika parsing pod replicas.
On each pod, I run a Java program with 8 parse threads.
Each thread:
- Starts a single Tika Server process (in spawn child mode)
- Tika Server arguments:
`-s -spawnChild -maxChildStartupMillis 120000 -pingPulseMillis 500 -pingTimeoutMillis 30000 -taskPulseMillis 500 -taskTimeoutMillis 120000 -JXmx512m -enableUnsecureFeatures -enableFileUrl`
- The thread will now continuously grab a file from the files-to-fetch queue and will send it to the tika server, stopping when there are no more files to parse.
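To make the launch step concrete, here is a rough sketch of how one thread could assemble that command line. The flags are exactly the ones listed above; the jar path (`tika-server.jar`) and the idea of giving each of the 8 servers its own port via `-p` (starting at Tika Server's default 9998) are assumptions for illustration, not details from my actual setup.

```java
import java.util.ArrayList;
import java.util.List;

final class TikaServerCommand {

    /**
     * Builds the command line for one Tika Server instance.
     * jarPath and port are hypothetical inputs; all other flags come from the post.
     */
    static List<String> build(String jarPath, int port) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(jarPath);
        cmd.add("-p");                       // assumption: one port per parse thread
        cmd.add(Integer.toString(port));
        cmd.addAll(List.of(
            "-s",
            "-spawnChild",
            "-maxChildStartupMillis", "120000",
            "-pingPulseMillis", "500",
            "-pingTimeoutMillis", "30000",
            "-taskPulseMillis", "500",
            "-taskTimeoutMillis", "120000",
            "-JXmx512m",
            "-enableUnsecureFeatures",
            "-enableFileUrl"));
        return cmd;
    }

    public static void main(String[] args) {
        // Print the command each of the 8 parse threads would run.
        for (int thread = 0; thread < 8; thread++) {
            System.out.println(String.join(" ", build("tika-server.jar", 9998 + thread)));
        }
        // To actually start a server: new ProcessBuilder(cmd).inheritIO().start();
    }
}
```

Each thread would then keep the returned `Process` handle so it can restart its own server if it stops responding.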
Each of these files is stored locally on the pod in a buffer, so the local file optimization (the fileUrl header) is used.
Each thread calls this Tika web service:
Endpoint: `/rmeta/text`
Method: `PUT`
Headers:
- writeLimit = 32000000
- maxEmbeddedResources = 0
- fileUrl = file:///path/to/file
Files are no larger than 100 MB, and the extracted text is capped at 32 MB by writeLimit.
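For anyone who wants to reproduce the request, here is a minimal sketch of what it looks like. I am actually using Apache HttpClient; this sketch uses the JDK's built-in `java.net.http` classes instead so it stays self-contained. The host, port, and file path are placeholders, and the class name `RmetaRequest` is made up for the example.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.file.Path;

final class RmetaRequest {

    /**
     * Builds the PUT /rmeta/text request with the three headers from the post.
     * The body is empty because the server reads the file itself via fileUrl.
     */
    static HttpRequest build(int port, Path localFile) {
        return HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:" + port + "/rmeta/text"))
            .header("writeLimit", "32000000")
            .header("maxEmbeddedResources", "0")
            .header("fileUrl", localFile.toUri().toString())
            .PUT(HttpRequest.BodyPublishers.noBody())
            .build();
    }

    public static void main(String[] args) {
        HttpRequest request = build(9998, Path.of("/path/to/file"));
        System.out.println(request.method() + " " + request.uri());
        request.headers().map().forEach((name, values) ->
            System.out.println(name + ": " + String.join(",", values)));
        // To send it: HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```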
Each pod is parsing about 370,000 documents per day. I've tried a lot of different settings.
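To put that number in per-thread terms, here is the back-of-the-envelope arithmetic as a small sketch. It uses only figures from this post (370,000 documents per day per pod, 8 threads, 10 pods, 2 cores, 512 MB child heap) and works out to roughly 4.3 documents per second per pod, about 1.9 seconds of wall-clock time per document per thread, and 3.7 million documents per day across all pods. The class and method names are made up for the example.

```java
final class ThroughputMath {

    static final double SECONDS_PER_DAY = 86_400.0;

    /** Documents per second handled by one pod. */
    static double docsPerSecondPerPod(long docsPerDayPerPod) {
        return docsPerDayPerPod / SECONDS_PER_DAY;
    }

    /** Average wall-clock seconds one thread spends per document. */
    static double secondsPerDocPerThread(long docsPerDayPerPod, int threads) {
        return threads * SECONDS_PER_DAY / docsPerDayPerPod;
    }

    /** Documents per day across all pod replicas. */
    static long docsPerDayCluster(long docsPerDayPerPod, int pods) {
        return docsPerDayPerPod * pods;
    }

    public static void main(String[] args) {
        long docsPerDayPerPod = 370_000; // from the post
        int threads = 8;                 // parse threads per pod
        int pods = 10;                   // pod replicas
        int cores = 2;                   // CPU per pod
        int childHeapMb = 512;           // -JXmx512m per Tika Server child

        System.out.printf("docs/s per pod:         %.2f%n", docsPerSecondPerPod(docsPerDayPerPod));
        System.out.printf("s per doc per thread:   %.2f%n", secondsPerDocPerThread(docsPerDayPerPod, threads));
        System.out.printf("docs/day, all pods:     %d%n", docsPerDayCluster(docsPerDayPerPod, pods));
        System.out.printf("parse threads per core: %d%n", threads / cores);
        System.out.printf("total child heap (MB):  %d%n", threads * childHeapMb);
    }
}
```

The last two lines are there because 8 parse threads share 2 cores, and the 8 child heaps add up to 4096 MB before counting the parent processes and my own Java program.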
I previously tried to use the actual Tika "ForkParser" but the performance was far worse than spawning tika servers. So that is why I am using Tika Server.
I don't hate the performance results of this.... but I feel like I'd better reach out and make sure there isn't someone out there who sanity checks my numbers and is like "woah that's awful performance, you should be getting xyz like me!"
Is anyone doing something similar? If so, what settings did you end up settling on?
Also, I'm wondering whether Apache HttpClient could be adding overhead when I call my Tika Server `/rmeta/text` endpoint. I am using a shared connection pool. Would there be any benefit in, say, using a separate HttpClients.createDefault() for each thread instead of sharing one connection pool between the threads?