Architecture of Google's distributed supervision model
I have read an interesting post online in which a Google employee explains that Google would not benefit from Erlang's supervision model because they have built an equivalent supervision model into their infrastructure:

(full disclosure: I work at google and also like erlang) Erlang has fantastic facilities for robustness and concurrency. What it does not have is type safety and it's terrible at handling text in a performant fashion. So if you don't care about either of those things and only care about robustness and concurrency then Erlang is great. There were internal discussions about Erlang here but the upshot was. We had already basically duplicated Erlangs supervision model in our infrastructure, only we did it for all languages and Erlang didn't offer any benefits in performance for us.

Source: http://erlang.org/pipermail/erlang-questions/2013-August/075135.html

Despite searching online, I cannot find any information on their supervision model (most likely I'm using the wrong search terms).

Questions:

  1. What is the architecture of Google's supervision model?
  2. For many of Google's published innovations, there has later followed open source software that provides the same functionality (e.g. Google BigTable -> HBase, MapReduce -> Hadoop, etc). Does Netflix's Exhibitor perform all the roles one would expect of the Google supervision infrastructure mentioned in the above quote?
Ballon answered 7/4, 2014 at 16:29 Comment(4)
Questions 1 and 3 seem to be off-topic (asking for off-topic resources and software recommendations, respectively), and the overall topic seems a bit broad. - Predation
@LittleBobbyTables - thanks for the feedback, I've reordered the questions, and I'm thinking how I can reword those questions to keep them on-topic. If not, I'll remove them. - Ballon
Apache Mesos framework looks as though it may provide similar functionality to Netflix's Exhibitor for managing distributed applications. - Ballon
Apache Stratos (incubating) framework also performs the role of a Supervisor when the application it is monitoring is running in a Stratos managed cartridge (cartridge = application container). See here for the Mailing List Discussion. Disclaimer, I'm a committer on Apache Stratos (incubating). - Ballon

We know relatively little about the internal infrastructure at Google. What little you can glean comes either from being employed at Google or from reading their published papers.

Google uses a model where distribution and supervision happen at the UNIX process level. This makes sense for a number of reasons:

  • Processes are isolated from each other in UNIX because the memory management unit (MMU) protects each process's memory.
  • A crashing process can be restarted, perhaps on another machine.
  • UNIX is a well-known target.
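
To make the idea of process-level supervision concrete, here is a minimal sketch in Python. It is not Google's implementation, about which nothing is published in this thread; the function name `supervise` and the parameters `max_restarts` and `backoff_seconds` are hypothetical, chosen only for illustration. It shows the core loop only: run a child as a separate OS process, and restart it when it exits abnormally, up to a restart budget. Restarting on another machine is out of scope here.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=3, backoff_seconds=0.0):
    """Run cmd as a child process and restart it when it exits abnormally.

    Hypothetical helper for illustration. Returns the list of exit
    codes observed, one entry per run of the child.
    """
    exit_codes = []
    restarts = 0
    while True:
        # Each run is a separate OS process, so a crash in the child
        # cannot corrupt the supervisor's own memory.
        child = subprocess.Popen(cmd)
        code = child.wait()
        exit_codes.append(code)
        if code == 0:
            break  # normal termination: nothing to restart
        if restarts >= max_restarts:
            break  # restart budget exhausted: stop and let the caller decide
        restarts += 1
        time.sleep(backoff_seconds)
    return exit_codes


if __name__ == "__main__":
    # A stand-in worker that always fails with exit status 3.
    failing_worker = [sys.executable, "-c", "import sys; sys.exit(3)"]
    print(supervise(failing_worker, max_restarts=2))
```

The restart budget plays roughly the same role as the restart intensity limit of an Erlang supervisor: a child that keeps failing is eventually given up on rather than restarted forever.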

On top of this, Google builds infrastructure which allows you to "plug in" sequential systems in order to easily make them distributed. The "Chubby lock manager" comes to mind here.

In contrast, Erlang's model is about protection as well, but for lightweight processes that run in the same memory space or communicate over TCP sockets. It provides its own ecosystem in which to handle supervision and distribution. Thus, while the problems are the same on the surface, the details are different.

The quote also gets a number of things utterly wrong:

  • Erlang is a safe language in the sense that a program either progresses to compute a value or faults with an error, which often crashes the process in question. There is no way for the program to "go wrong" in the sense of undefined behaviour. Erlang does support a variant of static typing, namely success typing. Type enforcement, however, happens entirely at run-time. Erlang does not have a rich type system of the kind some people call "strongly typed".

  • Erlang has very fast string processing. I don't know where that myth comes from. It takes more knowledge to work with Erlang's string processing, but it has the distinct advantage that it rules out many typical bugs which occur when processing strings in other languages.

The reason nobody answers this question is that it is hard. A Google employee probably can't answer without leaking IP. A non-Google employee can only point to the relevant papers about their infrastructure.

Suffice it to say, though, that you will need distribution capabilities in any larger system setup today. But the question is "Do you get this by copying what Google did 5-10 years ago?"

Kistner answered 8/4, 2014 at 10:46 Comment(1)
At my company we were knee-deep in a supervision framework project that worked at the Unix (and partially Windows) process level -- extremely similar to what this question says Google is doing. Working at the OS level let us do some good things, but it carried more constraints than we initially realized. Then we (re)discovered Erlang and realized that we were reinventing the wheel. The constraint we now face is working with the Erlang VM, and it has turned out to be a good tradeoff. - Wahl
