Apologies for the rather open nature of the question, but I think it's a very valuable area of discussion.
Following the recent AWS outage and the huge number of horror stories that followed it, I was really impressed by the Chaos Monkey 'technique' applied by Netflix (one of the few to survive pretty much without a scratch).
For those who don't know the concept, it is essentially a little bot that goes around your infrastructure deliberately causing failures (for example, randomly killing instances) as a way of continuously testing resilience.
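To make the idea concrete, here is a minimal sketch in Python of what I have in mind. It is not Netflix's implementation: it only simulates a fleet in memory rather than touching real infrastructure, and every name in it (`Instance`, `Fleet`, `chaos_round`, `run_chaos`, `min_healthy`) is a hypothetical one I made up for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """A simulated server (hypothetical; stands in for a real VM)."""
    name: str
    healthy: bool = True


@dataclass
class Fleet:
    """A simulated group of instances behind one service."""
    instances: list
    min_healthy: int  # assumed threshold below which the service is "down"

    def healthy_count(self):
        return sum(1 for inst in self.instances if inst.healthy)

    def is_serving(self):
        return self.healthy_count() >= self.min_healthy

    def heal(self):
        # Stand-in for whatever replaces dead instances (e.g. auto-scaling).
        for inst in self.instances:
            inst.healthy = True


def chaos_round(fleet, rng):
    """Kill one randomly chosen healthy instance; return it (or None)."""
    candidates = [inst for inst in fleet.instances if inst.healthy]
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.healthy = False
    return victim


def run_chaos(fleet, rounds, seed=0):
    """Run several kill/check/heal rounds; return how many caused an outage."""
    rng = random.Random(seed)
    outages = 0
    for _ in range(rounds):
        chaos_round(fleet, rng)
        if not fleet.is_serving():
            outages += 1
        fleet.heal()
    return outages
```

With three instances and `min_healthy=2`, losing any single instance never takes the simulated service down, so `run_chaos` reports no outages; with a single instance, every round is an outage. A real version would replace the kill and health-check steps with calls against your actual environment.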
Besides Jeff Atwood's Chaos Monkey post, I've been able to find little on this being employed anywhere else.
Whilst I appreciate that good test-driven development is a solid foundation, I think that this would be a great addition to the arsenal of any company/organisation that wants to stay up.
- Has anyone else approached this topic before?
- Are there particular areas, other than connectivity and security vulnerabilities, that you would see such a piece of code hitting?
- Any other thoughts/feelings on this approach?