How to manage page cache resources when running Kafka in Kubernetes
I've been running Kafka on Kubernetes without any major issue for a while now; however, I recently introduced a cluster of Cassandra pods and started having performance problems with Kafka.

Even though Cassandra doesn't use the page cache the way Kafka does, it does make frequent writes to disk, which presumably affects the kernel's underlying cache.

I understand that Kubernetes manages each pod's memory through cgroups, which can be configured by setting memory requests and limits, but I've noticed that Cassandra's use of the page cache can increase the number of page faults in my Kafka pods even when the pods don't seem to be competing for resources (i.e., there's memory available on their nodes).

In Kafka, more page faults lead to more disk IO, which undermines the benefits of sequential IO and compromises disk performance. If you use something like AWS's EBS volumes, this will deplete your burst balance and eventually cause catastrophic failures across your cluster.

My question is, is it possible to isolate page cache resources in Kubernetes or somehow let the kernel know that pages owned by my Kafka pods should be kept in the cache longer than those in my Cassandra pods?

Polypetalous answered 4/2, 2018 at 15:52 Comment(2)
Setting aside k8s, I don't see a way the precise sort of isolation described in the question can be accomplished sensibly either programmatically or with configuration, do you? mlock + mmap will keep unneeded pages in RAM. Cgroups can throttle both apps, but that's really not what one wants. Pointing the apps at their own I/O resources isn't going to alleviate host contention. Ordinarily one would just not run two i/O hungry services on a single host. Can you use taints to isolate them to distinct groups of nodes instead?Factitive
I'm using k8s anti-affinity rules to separate them now, but it seems limiting. If you're confident there's no way of doing this, write up an answer with specifics outlining why not, and I'll award you the bounty.Polypetalous

I thought this was an interesting question, so this is a posting of some findings from a bit of digging.

Best guess: there is no way with k8s OOB to do this, but enough tooling is available such that it could be a fruitful area for research and development of a tuning and policy application that could be deployed as a DaemonSet.

Findings:

Applications can use the posix_fadvise() system call to give the kernel guidance about which file-backed pages the application still needs and which are no longer needed and can be reclaimed.

http://man7.org/linux/man-pages/man2/posix_fadvise.2.html

Applications can also use O_DIRECT to attempt to avoid the use of page cache when doing IO:

https://lwn.net/Articles/457667/

There is some indication that Cassandra already uses fadvise in a way that attempts to optimize for reducing its page cache footprint:

http://grokbase.com/t/cassandra/commits/122qha309v/jira-created-cassandra-3948-sequentialwriter-doesnt-fsync-before-posix-fadvise

There is also some recent (Jan 2017) research from Samsung patching Cassandra and fadvise in the kernel to better utilize multi-stream SSDs:

http://www.samsung.com/us/labs/pdfs/collateral/Multi-stream_Cassandra_Whitepaper_Final.pdf

Kafka's architecture is page-cache-aware, though it doesn't appear to use fadvise directly. The knobs the kernel exposes are sufficient for tuning Kafka on a dedicated host:

  • vm.dirty_* for guidance on when dirty (written-to) pages should be flushed back to disk
  • vm.vfs_cache_pressure for how aggressively the kernel reclaims memory used for caching directory and inode objects

Support in the kernel for device-specific writeback threads goes way back to the 2.6 days:

https://www.thomas-krenn.com/en/wiki/Linux_Page_Cache_Basics

Cgroups v1 and v2 focus on pid-based IO throttling, not file-based cache tuning:

https://andrestc.com/post/cgroups-io/

That said, the old linux-ftools set of utilities has a simple example of a command-line knob for use of fadvise on specific files:

https://github.com/david415/linux-ftools

So there's enough there. Given specific Kafka and Cassandra workloads (e.g. read-heavy vs write-heavy), specific prioritizations (Kafka over Cassandra or vice versa), and specific IO configurations (dedicated vs shared devices), one could derive a specific tuning model, and those models could be generalized into a policy model.

Factitive answered 11/2, 2018 at 16:9 Comment(1)
Thank you, this is a really good answer. I was looking for generic page cache on k8s advice and it was very helpful.Melody
