Architecture of Google's distributed supervision model

I read an interesting post online in which a Google employee argues that Google would not benefit from Erlang's supervision model, because an equivalent supervision model is already built into their infrastructure:

(full disclosure: I work at google and also like erlang) Erlang has fantastic facilities for robustness and concurrency. What it does not have is type safety and it's terrible at handling text in a performant fashion. So if you don't care about either of those things and only care about robustness and concurrency then Erlang is great. There were internal discussions about Erlang here but the upshot was. We had already basically duplicated Erlangs supervision model in our infrastructure, only we did it for all languages and Erlang didn't offer any benefits in performance for us.

Source: http://erlang.org/pipermail/erlang-questions/2013-August/075135.html

Despite searching online, I cannot find any information on their supervision model (most likely I am using the wrong search terms).

Questions:

What is the architecture of Google's supervision model?
Many of Google's published innovations have later been followed by open source software that provides the same functionality (e.g. Google Bigtable -> HBase, MapReduce -> Hadoop, etc.). Does Netflix's Exhibitor perform all the roles one would expect of the Google supervision infrastructure mentioned in the above quote?

We know relatively little about the internal infrastructure at Google. The only ways to glean anything are to be employed at Google or to read the papers they publish.

Google uses a model where distribution and supervision happen at the UNIX process level. This makes sense for a number of reasons:

Processes are isolated from one another in UNIX, thanks to the protection enforced by the memory management unit.
A crashing process can be restarted, perhaps on another machine.
UNIX is a well-known target.
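
As a rough illustration of supervision at the process level, the sketch below restarts a child process whenever it exits with a non-zero status. This is a minimal Python sketch and says nothing about how Google's infrastructure is actually built: the name supervise_process, the restart limit, and the backoff are all assumptions made for the example, and a real system would also be able to reschedule the work on another machine.

```python
import subprocess
import sys
import time


def supervise_process(cmd, max_restarts=3, backoff_seconds=0.1):
    """Run cmd as a child process and restart it after abnormal exits.

    Hypothetical helper for illustration only. Returns the list of exit
    codes observed, one per run.
    """
    exit_codes = []
    for attempt in range(max_restarts + 1):
        # Every run is a fresh OS process with its own address space,
        # so a crash cannot corrupt the supervisor's memory.
        completed = subprocess.run(cmd)
        exit_codes.append(completed.returncode)
        if completed.returncode == 0:
            break  # clean exit, nothing to restart
        if attempt < max_restarts:
            # Simple linear backoff before the next restart.
            time.sleep(backoff_seconds * (attempt + 1))
    return exit_codes


if __name__ == "__main__":
    # A stand-in worker that always crashes with exit status 1.
    crashing_worker = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print(supervise_process(crashing_worker, max_restarts=2))  # prints [1, 1, 1]
```

The supervisor only ever looks at the exit status, which is why this style of supervision works for a worker written in any language.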

On top of this, Google builds infrastructure which allows you to "plug in" sequential systems in order to easily make them distributed. The "Chubby lock manager" comes to mind here.

In contrast, Erlang's model is also about protection, but for lightweight processes running in the same memory space or communicating over TCP sockets. It provides its own ecosystem in which to handle supervision and distribution. Thus, while the problems are the same on the surface, the details are different.
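
To make the contrast concrete, here is a loose analogy of supervision inside a single OS process, sketched with Python's asyncio tasks. It is only an analogy: asyncio tasks are not isolated from each other the way Erlang processes are, and supervise_task and make_flaky_worker are hypothetical names invented for this example, not part of any library.

```python
import asyncio


async def supervise_task(make_worker, max_restarts=3):
    """Run the coroutine built by make_worker and restart it if it raises.

    Returns (result, restarts) on success. Re-raises the last error once
    max_restarts restarts have been used up.
    """
    restarts = 0
    while True:
        # Each attempt is a new lightweight task inside the same OS process.
        task = asyncio.create_task(make_worker())
        try:
            result = await task
            return result, restarts
        except Exception:
            if restarts >= max_restarts:
                raise
            restarts += 1


def make_flaky_worker(failures):
    """Build a worker whose first `failures` runs crash (demo only)."""
    state = {"runs": 0}

    async def worker():
        state["runs"] += 1
        if state["runs"] <= failures:
            raise RuntimeError("simulated crash")
        return "done"

    return worker


if __name__ == "__main__":
    print(asyncio.run(supervise_task(make_flaky_worker(2))))  # prints ('done', 2)
```

Note that the demo worker keeps its run counter in memory shared across restarts, which is exactly the kind of sharing that process-level supervision rules out.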

The quote also gets a number of things utterly wrong:

Erlang is a safe language in the sense that a program will either progress to compute a value or fault with an error, often resulting in a crash of the process in question. There is no way the program can "go wrong" in the sense of undefined behaviour. Erlang does support a variant of static typing, namely success typing, as checked by the Dialyzer tool. Type enforcement, however, happens entirely at run time. Erlang does not have a rich static type system of the kind some people call "strongly typed".
Erlang has very fast string processing. I don't know where that myth comes from. It takes more knowledge to work with Erlang's string processing, but it has the distinct advantage of ruling out many typical bugs that occur when processing strings in other languages.

The reason nobody answers this question is that it is hard. A Google employee probably can't, because doing so would leak IP. A non-Google employee can only point to the relevant papers about their infrastructure.

Suffice it to say, though, that any larger system setup today will need distribution capabilities. But the question is: "Do you get this by copying what Google did 5-10 years ago?"
