We try to deploy APC user-cache in a high load environment as local 2nd-tier cache on each server for our central caching service (redis), for caching database queries with rarely changing results, and configuration. We basically looked at what Facebook did (years ago):
http://www.slideshare.net/guoqing75/4069180-caching-performance-lessons-from-facebook http://www.slideshare.net/shire/php-tek-2007-apc-facebook
It works pretty well for some time, but after some hours under high load, APC runs into problems, so the whole mod_php does not execute any PHP anymore. Even a simple PHP script with only does not answer anymore, while static resources are still delivered by Apache. It does not really crash, there is no segfault. We tried the latest stable and latest beta of APC, we tried pthreads, spin locks, every time the same problem. We provided APC with far more memory it can ever consume, 1 minute before a crash we have 2% fragmentation and about 90% of the memory is free. When it „crashes“ we don’t find nothing in error logs, only restarting Apache helps. Only with spin locks we get an php error which is:
PHP Fatal error: Unknown: Stuck spinlock (0x7fcbae9fe068) detected in Unknown on line 0
This seems to be a kind of timeout, which does not occur with pthreads, because those don’t use timeouts.
What’s happening is probably something like that: http://notmysock.org/blog/php/user-cache-timebomb.html
Some numbers: A server has about 400 APC user-cache hits per second and about 30 inserts per second (which is a lot I think), one request has about 20-100 user-cache requests. There are about 300.000 variables in the user-cache, all with ttl (we store without ttl only in our central redis).
Our APC-settings are:
apc.shm_segments=1
apc.shm_size=4096M
apc.num_files_hint=1000
apc.user_entries_hint=500000
apc.max_file_size=2M
apc.stat=0
Currently we are using version 3.1.13-beta compiled with spin locks, used with an old PHP 5.2.6 (it’s a legacy app, I’ve heard that this PHP version could be a problem too?), Linux 64bit.
It's really hard to debug, we have written monitoring scripts which collect as much data as we could get every minute from apc, system etc., but we cannot see anything uncommon - even 1 minute before a crash.
I’ve seen a lot of similar problems here, but by now we couldn’t find a solution which solves our problem yet. And when I read something like that:
http://webadvent.org/2010/share-and-enjoy-by-gopal-vijayaraghavan
I’m not sure if going with APC for a local user-cache is the best idea in high load environments. We already worked with memcached here, but APC is a lot faster. But how to get it stable?
best regards, Andreas