Surviving Fire: Using Erlang as an OS to Achieve Massive Fault Tolerance by Sam Williams


By Erlang Central | Published: September 26, 2016



Slides and more info: http://www.erlang-factory.com/euc2016/sam-williams

This talk provides an overview of a new project that uses one BEAM instance per core, running directly on the metal, as the base of an operating system. The OS has been built from the ground up to provide software and hardware fault tolerance. Due to the operating system’s structure, it can withstand the failure of Erlang processes, of entire Erlang nodes, and of various hardware elements (such as CPU cores, RAM modules and hard drives) without incurring total loss of operation. The talk will provide a brief background on the related operating system concepts, so that it is accessible to those not familiar with operating system design.

As well as describing the project and the progress made so far, this talk will demonstrate how we are testing the hardware fault tolerance of the operating system. These tests consist of interrupting and damaging computers in various ways during operation, in order to catalogue how the operating system reacts.

Talk objectives:

To explain how Erlang can be used, by deploying one BEAM instance per core running on the metal, to build tolerance to hardware failure as well as software failure.

Target audience:

Anyone who is interested in fault tolerance, operating systems or fire!