Erlang's let-it-crash philosophy - applicable elsewhere?
75

Erlang's (or Joe Armstrong's?) advice NOT to use defensive programming and to let processes crash (rather than pollute your code with needless guards trying to keep track of the wreckage) makes so much sense to me now that I wonder why I wasted so much effort on error handling over the years!

What I wonder is: is this approach only applicable to platforms like Erlang? Erlang's VM has native support for process supervision trees, and restarting a process is really fast. Should I spend my development effort (when not in the Erlang world) on recreating supervision trees rather than bogging myself down with top-level exception handlers, error codes, null results and so on?

Do you think this change of approach would work well in (say) the .NET or Java space?

Strychnic answered 8/12, 2010 at 22:57 Comment(3)
I wrote this some time ago: mazenharake.wordpress.com/2009/09/14/let-it-crash-the-right-way - maybe you can find something useful there as well.Sheeree
Thanks Mazen. That's a good post! I get the point of the philosophy you're describing - what I wonder is whether the threads, processes or AppDomains of .NET (say) are up to the task of restarting as a form of control construct...?Strychnic
I think this can be applied everywhere. I'm out on thin ice here, however, because I can't prove it :) So for me this is just a feeling or a guess; I haven't tried it in another language to know :)Sheeree
43

It's applicable everywhere. Whether or not you write your software in a "let it crash" pattern, it will crash anyway, e.g., when hardware fails. "Let it crash" applies anywhere you need to withstand reality. Quoth James Hamilton:

If a hardware failure requires any immediate administrative action, the service simply won’t scale cost-effectively and reliably. The entire service must be capable of surviving failure without human administrative interaction. Failure recovery must be a very simple path and that path must be tested frequently. Armando Fox of Stanford has argued that the best way to test the failure path is never to shut the service down normally. Just hard-fail it. This sounds counter-intuitive, but if the failure paths aren’t frequently used, they won’t work when needed.

This doesn't precisely mean "never use guards," though. But don't be afraid to crash!
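
As a minimal sketch of what that simple, frequently exercised failure path could look like outside Erlang: assuming a worker program that is written fail-fast (it exits on any unrecoverable error), an external watchdog can just relaunch the process whenever it dies. The command line and class name below are hypothetical placeholders, and Java is used only for illustration.

    import java.util.concurrent.TimeUnit;

    // Minimal watchdog sketch: the child is written fail-fast and this loop is
    // the entire recovery path, so it gets exercised on every single crash.
    public class Watchdog {
        public static void main(String[] args) throws Exception {
            while (true) {
                Process worker = new ProcessBuilder("java", "-jar", "worker.jar")
                        .inheritIO()              // share stdout/stderr so crashes stay visible
                        .start();
                int exitCode = worker.waitFor(); // block until the worker dies
                System.err.println("worker exited with " + exitCode + ", restarting");
                TimeUnit.SECONDS.sleep(1);       // crude back-off to avoid a tight crash loop
            }
        }
    }

Because the watchdog never tries to repair the worker's state, the recovery path is the same every time, which is exactly what makes it testable by hard-failing.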

Serrato answered 8/12, 2010 at 23:1 Comment(9)
BUT: is it cheap enough to use hard failure as a control construct in VMs other than Erlang's?Strychnic
In mission-critical systems, a common solution is to have a "watchdog" process which monitors the primary application. The primary application is designed to fail-fast (thus avoiding problems re corruption of program state), and the watchdog can restart it fresh (or fail-over to another system if using a hot-backup design).Kalindi
@Andrew: I would say yes. I've used fail-fast on .NET and native Win32 code (my background is production-critical automation programming). Microsoft's Windows Error Reporting system is designed for fail-fast applications.Kalindi
@Stephen Cleary - yeah I do that myself in .NET, but it's so laborious compared to 'spawn_link'. Are there nice .NET/Java frameworks to replicate these nice features of Erlang (cleanly)?Strychnic
@Craig Stuntz - Very interesting read - and very pertinent to my situation. Thanks!Strychnic
@Andrew: Not in the BCL, but the Windows API Code Pack includes ApplicationRestartRecoveryManager.RegisterForApplicationRestart, which restarts your app automatically if it ends up in Windows Error Reporting (only available on Vista and later).Kalindi
@Stephen - sorry, your comment got hidden by SO. Similarly I use service recovery for NT services. But I was thinking more about AppDomains (which seems to be the nearest thing to an Erlang Process). There's no similar mechanism, and I'm not sure how long they take to set up...Strychnic
@Andrew: Unfortunately, AppDomains aren't as isolated as people like to pretend they are. :) For true isolation, you need to use an AppDomain and also write a cooperating CLR host; this is the approach that SQL Server and IIS take, but it's overkill for most people. AppDomain looks good on paper, but as soon as you allow p/Invoke (or overlapped I/O), your isolation is broken. In the last few years I've migrated away from AppDomains and just used Win32 processes as the isolation container. It's a bit less efficient but has built-in OS support.Kalindi
@PabloFernandez: Thanks; looks like the site is dead. I replaced it with an archive.org version.Serrato
28

Yes, it is applicable everywhere, but it is important to note the context in which it is meant to be used. It does not mean that the application as a whole crashes, which, as @PeterM pointed out, can be catastrophic in many cases. The goal is to build a system which as a whole never crashes but which can handle errors internally. In our case it was telecoms systems, which are expected to have downtimes on the order of minutes per year.

The basic design is to layer the system and isolate its central parts, which monitor and control the other parts that do the work. In OTP terminology we have supervisor and worker processes. Supervisors have the job of monitoring the workers, and other supervisors, with the goal of restarting them in the correct way when they crash, while the workers do all the actual work. Structuring the system properly in layers on this principle of strictly separating the functionality allows you to move most of the error handling out of the workers and into the supervisors. You try to end up with a small fail-safe error kernel which, if correct, can handle errors anywhere in the rest of the system. It is in this context that the "let-it-crash" philosophy is meant to be used.

You end up with the paradox of thinking about errors and failures everywhere, with the goal of actually handling them in as few places as possible.

The best approach to handling an error depends, of course, on the error and the system. Sometimes it is best to catch errors locally within a process and try to handle them there, with the option of failing again if that doesn't work. If you have a number of worker processes cooperating, then it is often best to crash them all and restart them again; it is a supervisor which does this.

You do need a language which generates errors/exceptions when something goes wrong so you can trap them or have them crash the process. Just ignoring error return values is not the same thing.
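
Outside OTP there is no supervisor behaviour to lean on, but the supervisor/worker split described above can be approximated. The following is a rough, hypothetical Java sketch (threads standing in, imperfectly, for Erlang processes; all class names invented) of a supervisor that restarts its whole group of workers whenever one of them dies, similar in spirit to OTP's one_for_all strategy:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.CountDownLatch;
    import java.util.function.Supplier;

    // Sketch of a one_for_all-style supervisor: if any worker dies from an
    // uncaught exception, the entire group is torn down and started afresh.
    public class Supervisor {
        private final List<Supplier<Runnable>> workerFactories;

        public Supervisor(List<Supplier<Runnable>> workerFactories) {
            this.workerFactories = workerFactories;
        }

        public void superviseForever() throws InterruptedException {
            while (true) {
                CountDownLatch someoneCrashed = new CountDownLatch(1);
                List<Thread> workers = new ArrayList<>();
                for (Supplier<Runnable> factory : workerFactories) {
                    Thread t = new Thread(factory.get());   // fresh worker state on every restart
                    t.setUncaughtExceptionHandler((thread, err) -> {
                        System.err.println(thread.getName() + " crashed: " + err);
                        someoneCrashed.countDown();          // wake the supervisor
                    });
                    t.start();
                    workers.add(t);
                }
                someoneCrashed.await();                      // wait for the first crash
                for (Thread t : workers) t.interrupt();      // crash the rest of the group
                for (Thread t : workers) t.join();           // wait until everyone is gone
                System.err.println("restarting worker group");
            }
        }
    }

A real OTP supervisor also enforces a restart intensity (at most N restarts in a given time window) and escalates to its own supervisor when that is exceeded; this sketch leaves that part out.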

Grubb answered 9/12, 2010 at 22:29 Comment(3)
I understand that throw and bomb-out is not what the philosophy is about. My question pertains to the performance implications of a properly implemented 'let it crash' approach in systems OTHER THAN Erlang. ;-) Erlang seems to be uniquely tuned/designed to exploit this philosophy, whereas .NET (e.g.) doesn't seem to be. I'm after counter examples/frameworks that can disprove this contention. Clearly targeted designs will always be needed to exploit 'fast fail'. If I take ages to load all the state, dependencies etc before I can retry or resume, then it's not a viable option.Strychnic
@Andrew Matthews: There are (at least) two different problems here. If you want to use processes for error handling the same way as in Erlang then the concurrency should be lightweight as in Erlang so as to minimise the time when part of the system is not working. You also have the problem you mentioned of handling state; I would say this is a design issue, but the design will most likely be language/system specific to exploit features of the language. For example in Erlang you could have a supervisor managing an ETS table so that when a worker crashes it will not have to be reloaded (a rough analogue of this idea is sketched after this comment thread).Grubb
I see. And because the supervisor now handles a shared resource, IT must be supervised? In practice where do you cut off the infinite regress? I see whole new vistas of trade-offs to be negotiated. ;-)Strychnic
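
A rough analogue of the supervisor-owned ETS table in Java (hypothetical names; a ConcurrentHashMap stands in for the ETS table): the supervisor holds the long-lived state and hands a reference to each freshly started worker, so a crash throws away the worker's in-progress computation but not the accumulated data.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Sketch: the supervisor owns the long-lived state; workers only borrow it.
    public class StatefulSupervisor {
        private final Map<String, Long> sharedState = new ConcurrentHashMap<>();

        public void superviseForever() throws InterruptedException {
            while (true) {
                Thread worker = new Thread(new CounterWorker(sharedState));
                worker.start();
                worker.join();   // returns when the worker finishes or crashes
                System.err.println("worker died; state survives with "
                        + sharedState.size() + " entries, restarting");
            }
        }
    }

    // Hypothetical worker that updates the shared table and may crash at any point.
    class CounterWorker implements Runnable {
        private final Map<String, Long> state;

        CounterWorker(Map<String, Long> state) { this.state = state; }

        @Override
        public void run() {
            state.merge("requests", 1L, Long::sum);   // mutate the supervisor-owned state
            // ... real work here; an uncaught exception simply ends this thread ...
        }
    }
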
6

It is called fail-fast. It's a good paradigm provided you have a team of people who can respond to the failure (and do so quickly).

In the NAVY, all pipes and electrical are mounted on the exterior of a wall (preferably on the more public side of a wall). That way, if there is a leak or issue, it is more likely to be detected quickly. In the NAVY, people are punished for not responding to a failure, so it works very well: failures are detected quickly and acted upon quickly.

In a scenario where someone cannot act on a failure quickly, it becomes a matter of opinion whether it is more beneficial to allow the failure to stop the system or to swallow the failure and attempt to continue onward.

Hideous answered 8/12, 2010 at 23:2 Comment(6)
The navy are experts at handling pipes apparentlyCracksman
Genuine question: Why do people spell "Navy" as "NAVY" - Navy is not an acronym?Wrathful
I would have thought that the Navy has multiple backup and redundant systems so if a ship gets hit, it can keep on fighting. I would also imagine that quite a number of these systems are automatic, i.e. reactor shutdowns etc. Isn't that the equivalent of defensive programming? Just a thought :-)Outlandish
The capitalization makes me sing "In the Navy, in the Navy" in my head all the time...damn you.Leggat
Automatic systems fail. To back them up the Navy uses men. Some things are more automatic than others, and to back those items up, more similar systems are used. For ship electrical, the cost of an electrical distribution system (which typically is computer controlled) and redundant wiring would be prohibitive (and not necessarily more reliable unless you added personnel). While the government contractors might let costs overrun with ease, the actual Navy is quite a penny pincher.Hideous
The all caps NAVY is common usage, but is probably incorrect. The Navy is heavily acronym driven NAVSTA (Naval Station), NAS (Naval Air Station), etc. I guess eventually people in the Navy become accustomed to the all caps acronyms and type NAVY. I have nothing more definitive to back this up, except my time in service.Hideous
5

I write programs that rely on data from real world situations and if they crash they can cause big $$ in physical damage (not to mention big $$ in lost revenue). I would be out of a job in a flash if I did not program defensively.

With that said, I think Erlang must be a special case in that not only can you restart things instantly, but a restarted program can pop up, look around and say "ahhh .. that was what I was doing!"

Halothane answered 8/12, 2010 at 23:3 Comment(8)
yeah - the point is not to go down permanently, but to 'flush' the corrupted state for the thread of execution by failing the process (not process in the conventional sense, BTW, more like a lightweight thread with extras). I suppose that that requires its own form of discipline - like the strategies described to cope with exceptions in C++ by Herb Sutter...Strychnic
@Andrew Matthews - Well, if you flush the corrupted state and the program restarts with the same inputs, aren't you setting yourself up for the same situation that caused the crash in the first place? Or is a crash considered a transient event, and hence not repeatable?Halothane
@Peter M : If your code is side-effect free and you feed it the same input then it will crash with the same error. Erlang supervisors have parameters that control how many times a failing process will be started in a given time period. If the process crashes outside of the supervisor's parameters then the supervisor will crash, and its supervisor will be notified. But this is still better than what you get in a mutable language. The difference in Erlang is that you can still have the process handle non-failing calls, and you can fix the bug and hot-load it without bringing the system down.Minnick
@Peter M: The goal is to build a robust system which can handle errors and crashes internally but never as a whole go down. In our case it was telecoms system which are expected to have downtimes in the order of minutes per year. The basic design is to layer the system in such way that each layer can handle errors in the next layer with the goal of having a small error kernel. This means that large parts of the system don't have to handle errors, which makes them safer and the system as a whole more robust. It is in this context which the "let-it-crash" philosophy is used.Grubb
@Grubb - WRT layering: (without wishing to grumble) I've seen in more than one place that 'exception neutral' code (that may carefully manage atomicity in the face of exceptions) mutates over time in the hands of support engineers et al from exception-neutral -> catch-and-log-then-rethrow -> catch-and-log-then-swallow. 'Mature' code bases can end up with mishmashes of error handling techniques, and most of those inadvertently destroy the atomicity of the original design. Not sure how to re-assert the proper layering in the face of entropy.Strychnic
@Andrew Matthews: That is a known problem for which there is no real known solution. :-) It is alleviated if most programmers work in the crashable part of the system and only the select few in the error handling part, but it is not solved.Grubb
@Grubb - was that an issue for you guys when producing the system software for those 'large' high-uptime switch systems that are mentioned in the various Erlang books? Is this approach (bit.ly/d03KUJ) part of your design process? (I assume from your last comment that it must be - how do you find that works for you day by day?)Strychnic
@Andrew Matthews: Yes, very much so. Generally speaking the smaller the critical core is the smaller the chance that there are fatal errors in the system and the more tolerant it is to errors. This is formalised in OTP with supervisors and workers and supervision trees.Grubb
5

My colleagues and I thought about the topic not so much technology-wise but more from a domain perspective and with a safety focus.

The question is "Is it safe to let it crash?" or better "Is it even possible to apply a robustness paradigm like Erlang’s “let it crash” to safety-related software projects?".

In order to find an answer we did a small research project using a close-to-reality scenario with industrial and especially medical background. Take a look here (http://bit.ly/Z-Blog_let-it-crash). There is even a paper for download. Tell me what you think!

Personally I think it is applicable in many cases and even desirable, especially when there is a lot of error handling to do (safety-related systems). You cannot always use Erlang (missing real-time features, no real embedded support, customer wishes ...), but I'm pretty sure you can implement it otherwise (e.g. using threads, exceptions, message passing). I haven't tried it yet, but I'd like to.
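
For what it is worth, here is a hedged sketch of the "threads, exceptions, message passing" route in Java (all names hypothetical, not taken from the paper above): the mailbox outlives the worker thread, the worker does no defensive checks and simply dies on a malformed message, and an uncaught-exception handler starts a replacement that keeps consuming the surviving mailbox.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Sketch: a crash-only worker fed by a mailbox that survives its crashes.
    public class LetItCrashDemo {
        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<String> mailbox = new LinkedBlockingQueue<>();
            startWorker(mailbox);

            mailbox.put("10");
            mailbox.put("oops");   // not a number: the worker will crash on it
            mailbox.put("20");     // picked up by the restarted worker
        }

        static void startWorker(BlockingQueue<String> mailbox) {
            Thread worker = new Thread(() -> {
                while (true) {
                    String msg;
                    try {
                        msg = mailbox.take();
                    } catch (InterruptedException e) {
                        return;                        // asked to stop
                    }
                    // Deliberately no validation: a bad message throws
                    // NumberFormatException and kills this worker.
                    System.out.println("worked on " + Integer.parseInt(msg));
                }
            });
            worker.setUncaughtExceptionHandler((t, err) -> {
                System.err.println("worker crashed (" + err + "), restarting");
                startWorker(mailbox);                  // replacement reuses the mailbox
            });
            worker.start();
        }
    }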

Concessive answered 14/9, 2013 at 12:10 Comment(2)
The link is broken. Can you fix it?Inoperable
Archive for your blog post: web.archive.org/web/20160308170353/http://blog.zuehlke.com/en/…Gloomy
3

IMHO some developers handle/wrap checked exceptions with code which adds little value. It is often simpler to allow a method to throw the original exception unless you are going to handle it and add some value.
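
A small illustration of the point (hypothetical method names, reading a config file): the first variant catches, wraps and rethrows without adding any information, which is the low-value handling described above; the second simply declares the original exception and leaves the decision to a caller that can actually do something about it.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class ConfigLoader {

        // Low-value handling: catch, wrap, rethrow - nothing is added but noise.
        static String loadWrapped(Path path) {
            try {
                return Files.readString(path);
            } catch (IOException e) {
                throw new RuntimeException("error reading config", e);
            }
        }

        // Often simpler: declare the original exception and let it propagate
        // to a caller that can genuinely handle it (retry, report, fall back).
        static String load(Path path) throws IOException {
            return Files.readString(path);
        }
    }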

Ardy answered 8/12, 2010 at 23:2 Comment(2)
It makes debugging substantially easier if you throw a descriptive exception instead of letting lower level details leak through your interface. But I agree that the trade-off is not always in favour of re-wrapping exceptions.Aspergillus
Totally agree. My experience is that I often find error handling which does nothing. My personal code catches only when I can do something about it. For example, I might catch a timeout on a connection so I can tell the user to try again. But I will not catch a runtime exception coming from inside a third-party API; I'll let it propagate and have a top-level generalised handler deal with it.Outlandish
1

Yes, even in economics - see this article: https://www.nytimes.com/2020/04/16/upshot/world-economy-restructuring-coronavirus.html. The world has become "spaghetti code" and is suffering from a "global state" issue.

Woollen answered 16/4, 2020 at 9:54 Comment(1)
Thanks Henry, sounds really thought provoking.Strychnic
