How to manage page cache resources when running Kafka in Kubernetes
I've been running Kafka on Kubernetes without any major issue for a while now; however, I recently introduced a cluster of Cassandra pods and started having performance problems with Kafka.

Even though Cassandra doesn't use the page cache the way Kafka does, it does make frequent writes to disk, which presumably affects the kernel's underlying cache.

I understand that Kubernetes manages each pod's memory through cgroups, which can be configured by setting memory requests and limits, but I've noticed that Cassandra's use of the page cache can increase the number of page faults in my Kafka pods even when the pods don't seem to be competing for resources (i.e., there's memory available on their nodes).

In Kafka, more page faults lead to more disk IO, which undermines the benefits of sequential IO and compromises disk performance. If you use something like AWS's EBS volumes, this will deplete your burst balance and eventually cause catastrophic failures across your cluster.

My question is, is it possible to isolate page cache resources in Kubernetes or somehow let the kernel know that pages owned by my Kafka pods should be kept in the cache longer than those in my Cassandra pods?

Polypetalous answered 4/2, 2018 at 15:52 Comment(2)
Setting aside k8s, I don't see a way the precise sort of isolation described in the question can be accomplished sensibly either programmatically or with configuration, do you? mlock + mmap will keep unneeded pages in RAM. Cgroups can throttle both apps, but that's really not what one wants. Pointing the apps at their own I/O resources isn't going to alleviate host contention. Ordinarily one would just not run two i/O hungry services on a single host. Can you use taints to isolate them to distinct groups of nodes instead?Factitive
I'm using k8s anti-affinity rules to separate them now, but it seems limiting. If you're confident there's no way of doing this, write up an answer with specifics outlining why not, and I'll award you the bounty.Polypetalous

I thought this was an interesting question, so this is a posting of some findings from a bit of digging.

Best guess: there is no way with k8s OOB to do this, but enough tooling is available such that it could be a fruitful area for research and development of a tuning and policy application that could be deployed as a DaemonSet.

Findings:

Applications can use the posix_fadvise() system call to give the kernel guidance about which file-backed pages the application still needs and which are no longer needed and can be reclaimed.

http://man7.org/linux/man-pages/man2/posix_fadvise.2.html

Applications can also use O_DIRECT to attempt to avoid the use of page cache when doing IO:

https://lwn.net/Articles/457667/

There is some indication that Cassandra already uses fadvise in a way that attempts to optimize for reducing its page cache footprint:

http://grokbase.com/t/cassandra/commits/122qha309v/jira-created-cassandra-3948-sequentialwriter-doesnt-fsync-before-posix-fadvise

There is also some recent (Jan 2017) research from Samsung patching Cassandra and fadvise in the kernel to better utilize multi-stream SSDs:

http://www.samsung.com/us/labs/pdfs/collateral/Multi-stream_Cassandra_Whitepaper_Final.pdf

Kafka's architecture is page-cache-aware, though it doesn't appear to use fadvise directly. The knobs the kernel exposes are sufficient for tuning Kafka on a dedicated host:

  • vm.dirty_* for guidance on when dirty (written-to) pages should be flushed back to disk
  • vm.vfs_cache_pressure for how aggressively the kernel reclaims memory used for caching directory and inode objects

Support in the kernel for device-specific writeback threads goes way back to the 2.6 days:

https://www.thomas-krenn.com/en/wiki/Linux_Page_Cache_Basics

Cgroups v1 and v2 focus on pid-based IO throttling, not file-based cache tuning:

https://andrestc.com/post/cgroups-io/

That said, the old linux-ftools set of utilities has a simple example of a command-line knob for use of fadvise on specific files:

https://github.com/david415/linux-ftools

So there's enough there. Given specific Kafka and Cassandra workloads (e.g. read-heavy vs write-heavy), specific prioritizations (Kafka over Cassandra or vice versa), and specific IO configurations (dedicated vs shared devices), one could derive a specific tuning model, and those models could be generalized into a policy model.

Factitive answered 11/2, 2018 at 16:9 Comment(1)
Thank you, this is a really good answer. I was looking for generic page cache on k8s advice and it was very helpful.Melody
