Erlang Central

A Framework for Clustering Generic Server Instances

Revision as of 15:25, 18 April 2007 by Mole5star (Talk | contribs)


Author

Christoph Dornheim

Overview

This tutorial describes a simple framework developed for running servers in a cluster. The server to be clustered must be an instance of the gen_server behaviour, which is the common way of designing servers according to the Erlang OTP principles (see the gen_server design principle and API).

We first explain what server clustering means and why it makes sense to run a server in a cluster. After showing the key ideas of the framework, we describe in detail how server clustering with this framework works. This is illustrated by a short example. The framework consists of only one module gen_server_cluster. Before presenting the complete code, we make clear how to use its API functions. Finally, we give another application of gen_server_cluster: a small but highly available chat system.


Motivation

In general, clustering a server means running several instances of this server simultaneously in such a way that the cluster consisting of these servers appears to clients to be a single server process dispatching their requests. One of the main reasons for clustering is to increase service availability: if one of the servers in the cluster crashes or is no longer reachable due to network failures, the service is kept alive by some other server in the cluster. Thus, to provide high availability, the servers in the cluster are preferably located on different machines.

Thanks to Erlang's built-in mechanism for fault detection, reaching a certain level of availability for an Erlang gen_server instance is easy: just let the server process be monitored by some other process, e.g. an OTP supervisor, that restarts the server when it has terminated. If the node the server was running on has gone down, you can simply restart the server on a different node, if one is available. Obviously, the monitoring process needs to be observed as well.

There is, however, a problem when the server holds some state needed for servicing the client requests. To provide continuous service, the state of the restarted server process should be initialized to the last state of the terminated server. As one solution, the server can persist its state in a database or file system that the restarted server then reads, but this makes the service availability depend entirely on the availability of that database or file system.

Instead of restarting a server to make it available again, the framework we will describe allows a server to be run in a cluster. The cluster consists of a dynamically extensible set of server processes each running on a different Erlang node. The key idea is that exactly one server process is responsible for dispatching client requests and updating the states of all servers to keep them in sync. If the active process dies, all other background processes compete for becoming the new active one, but only one of them wins. Thus, the service is available as long as the cluster consists of at least one server at any time. Availability is increased just by adding background processes to the cluster.


The gen_server_cluster framework in action

The cluster framework consists of the single module gen_server_cluster which is built on top of the gen_server behaviour. To be precise, gen_server_cluster is a callback module for gen_server, thus exporting the callback functions handle_call, handle_cast etc. that are called by gen_server. On the other hand, gen_server_cluster is itself a behaviour requiring the target module, i.e. the module of the server to be clustered, to be a callback module of gen_server. This allows gen_server_cluster to delegate client requests received from gen_server to the target. The next section explains in more detail how gen_server_cluster works.

How it works

When a target server is started as a normal server using gen_server:start or gen_server:start_link, a process is locally or globally registered waiting for incoming requests that are sent from client processes using e.g. gen_server:call. To return the result, gen_server calls a matching clause of the handle_call callback function provided by the target module.

Now consider what happens when the target server is started to run in a cluster using the gen_server_cluster:start function. By internally calling gen_server:start with gen_server_cluster as the callback module, a server process is globally registered under some arbitrary name. When a client makes a request by calling gen_server:call, this process receives the request message and invokes gen_server_cluster:handle_call to get a result that can be delivered to the client. Since the target module is contained in the server's state, gen_server_cluster:handle_call can forward the request by calling the handle_call function of the target module to finally get the return value for the client. Note that, in particular, the server's state must hold the target state, i.e. the state that the target server would hold if it were started as a normal server. This target state is passed as an argument to any function call of the target module.
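The delegation step described above can be illustrated with a minimal sketch. It is not the gen_server_cluster code: the module name delegate_sketch, the wrapper record and the counter callback are hypothetical, and only the {reply, Reply, NewTargetState} return shape is handled.

```erlang
-module(delegate_sketch).
-export([wrap_call/3, demo/0]).
%% Callback of a hypothetical target; kept in this module for brevity.
-export([handle_call/3]).

%% Wrapper state: remembers the target module and the target's own state.
-record(wrapper, {target_module, target_state}).

%% Hypothetical target callback: a counter, similar in spirit to id_server.
handle_call(next_value, _From, Count) ->
    {reply, Count, Count + 1}.

%% Forward the request to the target module, passing the target state,
%% then store the new target state back into the wrapper state.
wrap_call(Request, From, #wrapper{target_module = Mod, target_state = TS} = W) ->
    {reply, Reply, NewTS} = Mod:handle_call(Request, From, TS),
    {reply, Reply, W#wrapper{target_state = NewTS}}.

%% Two consecutive requests against a wrapper whose target starts at 0.
demo() ->
    W0 = #wrapper{target_module = ?MODULE, target_state = 0},
    {reply, R1, W1} = wrap_call(next_value, undefined, W0),
    {reply, R2, _W2} = wrap_call(next_value, undefined, W1),
    {R1, R2}.
```

Calling delegate_sketch:demo() returns {0, 1}: the client sees only the target's replies, while the wrapper carries the target state from one call to the next.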

So far the cluster consists of only one server process running on some node, say node1. We call it the global server as it is globally registered. In order to extend the cluster with a background server running on a different node, say node2, just call gen_server_cluster:start at node2 (alternatively, call gen_server_cluster:start_local_server at node2; in the code given below it takes only the cluster name and starts the local server on the node that evaluates it). The start function detects that the global server is already running, so it starts a new process working as a background server. Since it is registered locally at node2, we call this process a local server. As long as the global server is alive, the only purpose of the local server is to keep its state equal to the global server state. This allows the local server to replace the global one once the global server or its node goes down. Therefore, at startup, the local server requests the current global server state. While the global server still handles every client request, it additionally has to send each state update to all local servers.
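As a rough illustration of this synchronization scheme, the following sketch uses plain processes on a single node instead of gen_server instances on several nodes. The module name replica_sketch and the message formats are assumptions made for this example only.

```erlang
-module(replica_sketch).
-export([demo/0]).

%% A follower holds a copy of the state and replaces it on every update.
follower(State) ->
    receive
        {update, NewState} -> follower(NewState);
        {get, From, Ref}   -> From ! {Ref, State}, follower(State);
        stop               -> ok
    end.

%% Ask a follower for its current copy of the state.
get_state(Pid) ->
    Ref = make_ref(),
    Pid ! {get, self(), Ref},
    receive
        {Ref, State} -> State
    after 1000 ->
        timeout
    end.

demo() ->
    Leader0 = 0,
    %% Followers start from the leader's current state.
    Followers = [spawn(fun() -> follower(Leader0) end) || _ <- [1, 2]],
    %% The leader handles a request, then pushes the new state to everyone.
    Leader1 = Leader0 + 1,
    [F ! {update, Leader1} || F <- Followers],
    Copies = [get_state(F) || F <- Followers],
    [F ! stop || F <- Followers],
    {Leader1, Copies}.
```

replica_sketch:demo() returns {1, [1, 1]}. Because messages between two given processes arrive in sending order, each follower sees the update before the query that follows it.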

Suppose now that in the presence of some more local servers the global server terminates due to some error, e.g., a breakdown of its node. Since all local server processes are linked to the global server process and are trapping exit signals, they are able to take the following action. Each local server tries to globally register itself to become the new global server. The global registry makes sure that only one server succeeds. All other local servers remain background processes. Note in particular that due to this registration strategy, choosing the new global server in the cluster is not deterministic.
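The registration race can be tried out in isolation. The sketch below is not part of the framework: the module name register_race_sketch is hypothetical, and the contenders are ordinary processes on one node rather than local servers on different nodes. It relies only on global:register_name/2 returning yes or no and on global:whereis_name/1.

```erlang
-module(register_race_sketch).
-export([demo/0]).

%% Each contender tries to claim the same global name, reports the
%% outcome to the parent, and then stays alive until told to stop.
contender(Parent, Name) ->
    Result = global:register_name(Name, self()),   % yes | no
    Parent ! {self(), Result},
    receive stop -> ok end.

demo() ->
    %% A fresh name, so that repeated runs do not interfere.
    Name = {register_race_sketch, make_ref()},
    Parent = self(),
    Pids = [spawn(fun() -> contender(Parent, Name) end)
            || _ <- lists:seq(1, 5)],
    Results = [receive {Pid, R} -> R after 5000 -> timeout end
               || Pid <- Pids],
    Winner = global:whereis_name(Name),
    [P ! stop || P <- Pids],
    %% Number of successful registrations, and whether the registered
    %% pid is one of the contenders.
    {length([yes || yes <- Results]), lists:member(Winner, Pids)}.
```

register_race_sketch:demo() should return {1, true}: exactly one contender obtains the name, but which one is not predictable.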

Example I: Running a simple server in a cluster

At this point it may be helpful to see gen_server_cluster in action. The module id_server given below is a minimal gen_server instance that just manages a single integer usable as an ID. Calling id_server:next_value returns the current value and increases it by 1.

-module(id_server).

-behaviour(gen_server).

%% API
-export([start_cluster/0, next_value/0, stop/0]).

%% gen_server callbacks
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
	 terminate/2, code_change/3]).

-record(state, {id=0}).

start_cluster() ->
    gen_server_cluster:start(?MODULE, ?MODULE, [], []).

next_value() ->
    gen_server:call({global,?MODULE}, nextValue).

stop() ->
    gen_server:call({global,?MODULE}, stop).

init([]) ->
    {ok, #state{}}.

handle_call(nextValue, _From, State) ->
    Value = State#state.id,
    {reply, Value, State#state{id=Value+1}};

handle_call(stop, _From, State) ->
    {stop, normalStop, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

handle_info(_Info, State) ->
    {noreply, State}.

terminate(_Reason, _State) ->
    ok.

code_change(_OldVsn, State, _Extra) ->
    {ok, State}.

To run id_server in a cluster of two connected Erlang nodes, we first start two shells using erl -sname node1 and erl -sname node2, respectively, and then establish a connection between them, e.g. using net_adm:ping.

We start the cluster at node1 by calling id_server:start_cluster:

Erlang (BEAM) emulator version 5.5.3 [async-threads:0]

Eshell V5.5.3  (abort with ^G)
(node1@hamburg)1> id_server:start_cluster().
id_server started as global server.
{ok,<0.47.0>}
(node1@hamburg)2> id_server:next_value().   
0

Next we enlarge the cluster by starting a second server at node2, which becomes a local server. To demonstrate the server availability provided by this cluster, we stop the current global server using a function of gen_server_cluster: gen_server_cluster:stop(id_server,global) (alternatively, we can simply shut down node1). As the output at node2 shows, the former local server at node2 is now the new global server. The value returned when calling id_server:next_value proves that the state of id_server was correctly synchronized.

Erlang (BEAM) emulator version 5.5.3 [async-threads:0]

Eshell V5.5.3  (abort with ^G)
(node2@hamburg)1> id_server:start_cluster().
id_server started as local server.
{ok,<0.47.0>}
(node2@hamburg)2> id_server:next_value().   
1
(node2@hamburg)3> gen_server_cluster:stop(id_server,global).
{stopByGenServerCluster,stopGlobalServer}
Global server on node1@hamburg terminated: stopGlobalServer.
New global server on node2@hamburg.
(node2@hamburg)4> id_server:next_value().                   
2

At node1, some stop information is displayed.

Server id_server stopped.
(node1@hamburg)3> 
=ERROR REPORT==== 14-Apr-2007::16:36:34 ===
** Generic server id_server terminating 
** Last message in was {stopByGenServerCluster,stopGlobalServer}
** When Server state == {state,id_server,
                               <0.47.0>,
                               [<3997.47.0>],
                               [],
                               id_server,
                               {state,2}}
** Reason for termination == 
** stopGlobalServer


The gen_server_cluster module

API functions

Before presenting the code of gen_server_cluster, we briefly describe its API functions. As shown in the example above, a cluster is started or increased with start(Name, TargetModule, TargetArgs, Options) where the name of the cluster, its module, the arguments for its startup and some gen_server options (see the gen_server API) are specified. Depending on whether the cluster already exists, either a global server or a local server is started on the node that evaluates this function. To add a local server to an existing cluster, simply use start_local_server(Name) where irrelevant arguments are omitted.

The function is_running(Name) checks if there is a cluster running that is registered under the name. Some information about the servers in the cluster is provided by get_all_server_pids(Name) and get_all_server_nodes(Name): whereas the former returns the global server pid and the list of all local server pids, the latter returns the global server node and the list of all local server nodes. If desired, the state of the target server can be obtained with get_target_state(Name).

The cluster is terminated if the target server returns a stop request. But you can also stop particular servers of the cluster: stop(Name, global) stops the current global server, stop(Name, all) stops the entire cluster, and stop(Name, Node) stops the local server running on the specified node.

The last two API functions are provided for target servers that are monitoring other processes by creating a link to them and trapping their exit signals. If a monitored process terminates, the target server is notified by a call of its handle_info function which allows it to take some appropriate action. Suppose that the target server creates a link to some other process using the usual erlang:link function. This means that the current global server process is linked to that process. However, this link gets lost if the global server dies. The new global server has no access to its link set, thus it is unable to continue monitoring the other process. This problem is avoided when using link(Name,Pid) in the target module instead of the usual link function. It sends an asynchronous request for creating a link to the specified process. This link is stored in the global server state and replicated in all local server states such that each of them can link itself to that process when becoming the new global server. The corresponding function for removing links is unlink(Name, Pid).

Increasing state update efficiency

Most values returned by the target server callback functions contain the new target state. Since this state is sent to all local servers, state updates can be an expensive operation if large states are transferred over the network. It may be more efficient to send a function instead that returns the new target state when applied to the old target state. This reduces transfer cost at the expense of state update performance, since every server has to evaluate the function itself.

The gen_server_cluster server can cope with both return types: if the target server returns a function term of arity 1, this is interpreted as a state update function. An update function is evaluated both by the global server and by every local server to get the new target state. Otherwise, the return value is handled as the new state value.
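The distinction between the two return types can be sketched as a single helper. The module name update_sketch and the function apply_update/2 are hypothetical and only mirror the is_function(Update, 1) check described above.

```erlang
-module(update_sketch).
-export([apply_update/2, demo/0]).

%% If Update is a fun of arity 1, apply it to the old state;
%% otherwise treat it as the complete new state.
apply_update(Update, OldState) when is_function(Update, 1) ->
    Update(OldState);
apply_update(NewState, _OldState) ->
    NewState.

%% The same logical update expressed once as a full state
%% and once as an update function.
demo() ->
    Old = [b, a],
    ByValue = apply_update([c, b, a], Old),
    ByFun = apply_update(fun(L) -> [c | L] end, Old),
    {ByValue, ByFun}.
```

update_sketch:demo() returns {[c,b,a],[c,b,a]}. One consequence of this convention is that a target whose state is itself a fun of arity 1 cannot return that state directly. Note also that a fun sent to another node can only be evaluated there if the module that defines it is loaded on that node in the same version.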

Implementation

The complete gen_server_cluster module is given below. The code contains some detailed comments that hopefully help in understanding what the functions actually do. It is particularly helpful to see which kind of server (global or local) the various callback functions are meant for.

-module(gen_server_cluster).
-author('Christoph Dornheim').

-behaviour(gen_server).

-export([start/4,
	start_local_server/1,
	get_target_state/1,
	get_all_server_pids/1,
	get_all_server_nodes/1,
	is_running/1,
	stop/2,
	link/2,
	unlink/2]).

%% gen_server callbacks
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
	 terminate/2, code_change/3]).

%% the state of gen_server_cluster
-record(state, {name,
		globalServerPid,
		localServerPidList=[],
		targetLinkedPidList=[],
	        targetModule,
	        targetState}).


%%====================================================================
%% API
%%====================================================================

%% Starts the global or local server under the given name where 
%% the target gen_server is given by the target module.
%% If some global server is already running, the server is started as a local
%% server, otherwise as the new global server. 
start(Name, TargetModule, TargetArgs, Options) ->
    ArgsGlobalInit = {initGlobal, Name, TargetModule, TargetArgs},
    %% try starting as the global server:
    case gen_server:start({global,Name}, ?MODULE, ArgsGlobalInit, Options) of
	{ok,_GlobalServerPid}=Result ->
	    io:format("~p started as global server.~n",[Name]), 
	    Result;
	{error,{already_started,_GlobalServerPid}} ->
	    %% global server already started, start as local server:
	    start_local_server(Name)
    end.

%% Starts a local server if a global server is already running,
%% otherwise noGlobalServerRunning is returned.
start_local_server(Name) ->
    case is_running(Name) of
	false -> 
	    noGlobalServerRunning;
	true ->
	    ArgsLocalInit = {initLocal, Name},
	    case gen_server:start({local,Name}, ?MODULE, ArgsLocalInit, []) of
		{ok, _LocalPid}=Result ->
		    io:format("~p started as local server.~n",[Name]),
		    Result;
		Else ->
		    Else
	    end
    end.

%% Returns the global server pid and the list of all local server pids:
%% {pid(), [pid()]}
get_all_server_pids(Name) ->
    gen_server:call({global,Name}, get_all_server_pids).

%% Returns the global server node and the list of all local server nodes:
%% {node(), [node()]}
get_all_server_nodes(Name) ->
    {GlobalServerPid, LocalServerPidList} = get_all_server_pids(Name),
    Fun = fun(Pid) ->
		  node(Pid)
	  end,
    {Fun(GlobalServerPid), lists:map(Fun, LocalServerPidList)}.

%% Returns true if there is some global server running, otherwise false.
is_running(Name) ->
    case catch get_all_server_pids(Name) of
	{Pid, PidList} when is_pid(Pid), is_list(PidList) ->
	    true;
	 _ ->
	    false
    end.

%% Stops the global server.
stop(Name, global) ->
    gen_server:call({global, Name}, {stopByGenServerCluster, stopGlobalServer});

%% Stops both the global server and all local servers.
stop(Name, all) ->
    {_, LocalServerNodeList} = get_all_server_nodes(Name),
    Request = {stopByGenServerCluster, stopAllServer},
    case LocalServerNodeList of
	[] ->
	    ok;
	_ ->
	    gen_server:multi_call(LocalServerNodeList, Name, Request)
    end,
    gen_server:call({global, Name}, Request);

%% Stops the local server running on the given node.
stop(Name, Node) ->
    gen_server:multi_call([Node], Name, {stopByGenServerCluster, stopLocalServer}).

%% Returns the target state.
get_target_state(Name) ->
    gen_server:call({global,Name}, get_target_state).

%% Links the pid with the global server. This link is transferred to 
%% a new global server if the current one dies. 
link(Name, Pid) ->
    %% must be cast since this can be called by target server 
    %% which hangs if we use call instead! 
    gen_server:cast({global,Name}, {link, Pid}). 

%% Removes the link between pid and the global server that was
%% established before.
unlink(Name, Pid) ->
    gen_server:cast({global,Name}, {unlink, Pid}).


%%====================================================================
%% gen_server callbacks
%%====================================================================

%%--------------------------------------------------------------------
%% Function: init(Args) -> {ok, State} |
%%                         {ok, State, Timeout} |
%%                         ignore               |
%%                         {stop, Reason}
%% Description: Initiates the server
%%--------------------------------------------------------------------

%% Called by global server at startup.
init({initGlobal, Name, TargetModule, TargetArgs}) ->
    process_flag(trap_exit, true),
    %% initialize callback server:
    TargetResult =  TargetModule:init(TargetArgs),
    case TargetResult of
	{ok, TargetState} ->
	    State = #state{name=Name,
			   globalServerPid=self(),
			   targetModule=TargetModule,
			   targetState=TargetState},
	    {ok, State};
	{ok, TargetState, _Timeout} ->
	    State = #state{name=Name,
			   globalServerPid=self(),
			   targetModule=TargetModule,
			   targetState=TargetState},
	    {ok, State};
	{stop, Reason} ->
	    {stop, Reason};
	ignore ->
	    ignore
    end;

%% Called by local server at startup.
init({initLocal, Name}) ->
    process_flag(trap_exit, true),
    State = gen_server:call({global,Name}, init_local_server_state),
    {ok,State}.
    
%%--------------------------------------------------------------------
%% Function: handle_call(Request, From, State) -> {reply, Reply, State} |
%%                                                {reply, Reply, State, Timeout} |
%%                                                {noreply, State} |
%%                                                {noreply, State, Timeout} |
%%                                                {stop, Reason, Reply, State} |
%%                                                {stop, Reason, State}
%% Description: Handling call messages
%%--------------------------------------------------------------------

%% Called by global server to get the target state.
handle_call(get_target_state, _From, State) ->
    Reply = State#state.targetState,
    {reply, Reply, State};

%% Called by global server to get the pids of the global and all local servers.
handle_call(get_all_server_pids, _From, State) ->
    Reply = {State#state.globalServerPid, State#state.localServerPidList},
    {reply, Reply, State};

%% Called by global server when a local server starts and wants to get the 
%% current state. The pid of this new local server is stored in the server's 
%% state and an update request is sent to all other local servers.
handle_call(init_local_server_state, {FromPid,_}, State) ->
    %% function to update state by storing new localserver pid:
    AddNewLocalServerFun = 
	fun(S) ->
		S#state{localServerPidList=[FromPid|S#state.localServerPidList]}
	end,
    NewState = update_all_server_state(State, AddNewLocalServerFun),
    %% link to this local server:
    link(FromPid),
    Reply=NewState,
    {reply, Reply, NewState};

%% Called by global or local server due to a client stop request.  
handle_call({stopByGenServerCluster, Reason}=Reply, _From, State) ->
    {stop, Reason, Reply, State};

%% Called by local server due to a target stop request.  
handle_call(stopByTarget=Reason, _From, State) ->
    {stop, Reason , State};

%% Called by global or local server and delegates the request to the target. 
handle_call(Request, From, State) ->
    delegate_to_target(State, handle_call, [Request, From, 
					    State#state.targetState]).


%%--------------------------------------------------------------------
%% Function: handle_cast(Msg, State) -> {noreply, State} |
%%                                      {noreply, State, Timeout} |
%%                                      {stop, Reason, State}
%% Description: Handling cast messages
%%--------------------------------------------------------------------

%% Called by local server to update its state.
%% The update function is sent by the global server and returns the new state
%% when applied to the old state. 
handle_cast({update_local_server_state, UpdateFun}, State) ->
    NewState = UpdateFun(State),
    {noreply, NewState};

%% Called by global server due to a client request for linking the pid
%% to the global server. The pid is stored in the state and a state update
%% request is sent to all local servers. 
handle_cast({link,Pid}, State) ->
    %% function to update state by storing new pid linked to target:
    AddNewTargetLinkedPidFun = 
	fun(S) ->
		%% remove Pid first to avoid duplicates:
		L = lists:delete(Pid, S#state.targetLinkedPidList),
		S#state{targetLinkedPidList=[Pid|L]}
	end,
    NewState = update_all_server_state(State, AddNewTargetLinkedPidFun),
    %% link the global server to the given process:
    link(Pid),
    {noreply, NewState};

%% Called by global server due to a client request for deleting the link
%% between the pid and the global server. The updated state is
%% sent to all local servers.
handle_cast({unlink,Pid}, State) ->
    %% function to update state by removing link to pid:
    RemoveTargetLinkedPidFun = 
	fun(S) ->
		S#state{targetLinkedPidList
			=lists:delete(Pid, S#state.targetLinkedPidList)}
	end,
    NewState = update_all_server_state(State, RemoveTargetLinkedPidFun ),
    %% remove the link between the global server and the given process:
    unlink(Pid),
    {noreply, NewState};

%% Called by global or local server and delegates the request to the target.
handle_cast(Msg, State) ->
    delegate_to_target(State, handle_cast, [Msg, State#state.targetState]).


%%--------------------------------------------------------------------
%% Function: handle_info(Info, State) -> {noreply, State} |
%%                                       {noreply, State, Timeout} |
%%                                       {stop, Reason, State}
%% Description: Handling all non call/cast messages
%%--------------------------------------------------------------------

%% Called by all local servers when the global server has died.
%% Each local server tries to register itself as the new global one,
%% but only one of them succeeds.
handle_info({'EXIT', Pid, Reason}, State) when Pid==State#state.globalServerPid ->
    io:format("Global server on ~p terminated: ~p.~n", [node(Pid), Reason]),
    NewGlobalServerPid = try_register_as_global_server(State#state.name, Pid),
    io:format("New global server on ~p.~n", [node(NewGlobalServerPid)]),
    %% update new global server pid and local server pid list:
    NewLocalServerPidList = lists:delete(NewGlobalServerPid,
					 State#state.localServerPidList),
    NewState = State#state{globalServerPid=NewGlobalServerPid,
			   localServerPidList=NewLocalServerPidList},
    case NewGlobalServerPid==self() of
	true ->
	    %% link to all local servers:
	    F = fun(P) ->
			link(P)
		end,
	    lists:foreach(F, NewLocalServerPidList),
	    %% link to all pids the target was linked to:
	    lists:foreach(F, NewState#state.targetLinkedPidList);
	false ->
	    ok
    end,
    {noreply, NewState};

%% Called by global server when some local server or another process has died.
%% If a local server has died, its pid is deleted from the local server list and 
%% a corresponding state update is sent to all other local servers.
%% If some other linked process has died, this info is delegated to the target.
%% In case it was linked using this module's link function, its pid is deleted
%% from the state and a corresponding state update is sent to all other local 
%% servers.
handle_info({'EXIT', Pid, _Reason}=Info, State) ->
    case lists:member(Pid, State#state.localServerPidList) of
	true ->
	    %% Pid is a local server that has terminated:
	    %% function to update state by deleting the local server pid:
	    RemoveLocalServerFun = 
		fun(S) ->
			S#state{localServerPidList
				=lists:delete(Pid,S#state.localServerPidList)}
		end,
	    NewState = update_all_server_state(State, RemoveLocalServerFun),
	    {noreply, NewState};
	false ->
	    case lists:member(Pid, State#state.targetLinkedPidList) of
		true ->
		    %% Pid was linked to target, so remove from link list
		    %% ({unlink,Pid} is handled by handle_cast, not handle_call):
		    {noreply, NewState} = handle_cast({unlink,Pid}, State);
		false ->
		    NewState = State
	    end,
	    %% the target created the link to pid, thus delegate:
	    delegate_to_target(NewState, handle_info, 
					 [Info, NewState#state.targetState])
    end;

%% Called by global or local server and delegates the info request to the target.
handle_info(Info, State) ->
    delegate_to_target(State, handle_info, [Info, State#state.targetState]).


%%--------------------------------------------------------------------
%% Function: terminate(Reason, State) -> void()
%% Description: This function is called by a gen_server when it is about to
%% terminate. It should be the opposite of Module:init/1 and do any necessary
%% cleaning up. When it returns, the gen_server terminates with Reason.
%% The return value is ignored.
%%--------------------------------------------------------------------

%% Called by global or local server when it is about to terminate.
terminate(Reason, State) ->
    TargetModule = State#state.targetModule,
    TargetModule:terminate(Reason,  State#state.targetState),
    io:format("Server ~p stopped.~n", [State#state.name]).


%%--------------------------------------------------------------------
%% Func: code_change(OldVsn, State, Extra) -> {ok, NewState}
%% Description: Convert process state when code is changed
%%--------------------------------------------------------------------

%% Currently not supported.
code_change(_OldVsn, State, _Extra) ->
    {ok, State}.


%%--------------------------------------------------------------------
%%% Internal functions
%%--------------------------------------------------------------------

%% Called by global server to delegate a gen_server callback function
%% (handle_call, etc.) to the target module. The result returned by the
%% target is transformed to a corresponding result for the gen_server_cluster
%% where the target state is updated.
%% This update is sent to all local servers unless the target result is a
%% stop request. In that case a stop request is sent to all local servers 
%% whereas the global server will stop due to the return value.
%% The term returned by the target for the new state can be an update function 
%% or the entire new state. When applied to the old target state this update 
%% function should give the new target state. (This is provided mainly for 
%% optimization as it avoids transmitting large state terms.)
delegate_to_target(State, TargetCall, Args) ->
    TargetModule = State#state.targetModule,
    TargetResult = apply(TargetModule, TargetCall, Args),
    %% index of state in tuple:
    IndexState = case TargetResult of
		     {reply, _Reply, TargetStateUpdate} ->
			 3;
		     {reply, _Reply, TargetStateUpdate, _Timeout}  ->
			 3;
		     {noreply, TargetStateUpdate} ->
			 2;
		     {noreply, TargetStateUpdate, _Timeout} ->
			 2;
		     {stop, _Reason, _Reply, TargetStateUpdate} ->
			 4;
		     {stop, _Reason, TargetStateUpdate} ->
			 3;
		     {ok, TargetStateUpdate} ->  %% for code change
			 2
		 end,
    %% function to update the target state:
    UpdateTargetStateFun = 
	fun(S) ->		
		%% update target state where TargetStateUpdate is 
		%% update function or new state:
		NewTargetState =
		    case is_function(TargetStateUpdate,1) of
			true ->
			    TargetStateUpdate(S#state.targetState);
			false ->
			    TargetStateUpdate
		    end,
		S#state{targetState=NewTargetState}
	end,
    NewState = update_all_server_state(State, UpdateTargetStateFun),
    %% return target reply tuple where state is substituted:
    Result = setelement(IndexState, TargetResult, NewState),

    %% if target stop, stop all local servers. 
    %% The global server will be stopped by returning the result.
    case (element(1,Result)==stop) and (State#state.localServerPidList/=[]) of
	true ->
	    Fun = fun(Pid) ->
			  node(Pid)
		  end,
	    LocalServerNodeList = lists:map(Fun, State#state.localServerPidList),
	    gen_server:multi_call(LocalServerNodeList,
	    			  State#state.name, stopByTarget);
	false ->
	    ok
    end,
    Result.

%% Called by global server to update the state of all servers.
%% The result of applying UpdateFun to State is returned, and
%% an update request to all local servers is sent. 
update_all_server_state(State, UpdateFun) ->
    NewState = UpdateFun(State),
    CastFun = fun(Pid) ->
		      gen_server:cast(Pid, {update_local_server_state,UpdateFun})
	      end,
    lists:foreach(CastFun, State#state.localServerPidList),
    NewState.

%% Called by each local server to try registering itself as the new global 
%% server. It waits until the old registration of OldGlobalServerPid is
%% removed and then tries to register the calling process globally under 
%% the given name. The new registered global server pid is returned which can
%% be this or one of the other local servers trying to register themselves.
try_register_as_global_server(Name, OldGlobalServerPid) ->
    case global:whereis_name(Name) of %% current global server pid in registry
	OldGlobalServerPid ->
	    %% sleep up to 1s until old global server pid is deleted 
	    %% from global registry:
	    timer:sleep(random:uniform(1000)),
	    try_register_as_global_server(Name, OldGlobalServerPid);
	undefined ->
	    %% try to register this process globally:
	    global:register_name(Name, self()),
	    timer:sleep(100), %% wait before next registration check
	    try_register_as_global_server(Name, OldGlobalServerPid);
	NewGlobalServerPid ->
	    %% new global server pid (self() or other local server):
	    NewGlobalServerPid
    end.


Example II: A highly available distributed chat system

Finally, we show that gen_server_cluster can be used for building applications beyond its primary purpose of increasing the availability of a given server. As an example, we present a pretty small chat application which nonetheless has some powerful features. Writing chat servers in Erlang is very popular; most of them are designed as classical client-server systems relying on the availability of a central server.

In contrast, our chat application works without such a central server. The chat system is a cluster of chat instances started from different Erlang shells. Each user has to start their own shell (on a different machine, for this to make sense) for communication with the chat system. The system is started by starting a first instance and terminates when no instance is alive any more. During this time, instances can dynamically be added to or removed from the cluster. Any chat cluster is registered under an arbitrary name, so that a user can participate in different chat sessions while using the same shell.

The chat module API is quite simple. Besides start and stop functions, it provides a function for sending a text message to the chat; the message is displayed on every user's shell together with the chat name and the name of the node it was sent from. The other two API functions list all messages sent so far and the nodes (representing the users) currently involved.

As can be seen in the callback function for the send request, a state update function is returned in place of a new state; it adds the new message to the message list. This function additionally has the side effect of writing the new message to standard output. Since all chat instances evaluate this update function, the message is displayed on the shell of every chat user. Thus, the local servers, which normally are only background processes, now play an active role in this application.

-module(chat).

-behaviour(gen_server).

%% API
-export([start/1,
	 stop/1,
	 send/2,
	 get_all_msg/1,
	 get_users/1]).

%% gen_server callbacks
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
	 terminate/2, code_change/3]).

-record(state, {msgList=[]}).

start(Name) ->
    gen_server_cluster:start(Name, ?MODULE, [], []).

send(Name, Text) ->
    gen_server:call({global, Name}, {send,{Name,node(),Text}}).

get_all_msg(Name) ->
    io:format("All messages of ~p:~n",[Name]),
    MsgList = gen_server:call({global,Name}, get_all_msg),
    F = fun({Node,Text}) ->
 		io:format("[~p]: ~p~n",[Node, Text])
 	end,
    lists:foreach(F, MsgList),
    ok.

get_users(Name) ->
    {GlobalNode, LocalNodeList}=gen_server_cluster:get_all_server_nodes(Name),
    [GlobalNode|LocalNodeList].

stop(Name) ->
    gen_server:call({global,Name}, stop).

init([]) ->
    {ok, #state{}}.

handle_call({send,{Name,Node,Text}}, _From, _State) ->
    F = fun(State) ->
		io:format("[~p,~p]: ~p~n",[Name, Node, Text]),
		NewMsgList = [{Node,Text}|State#state.msgList],
		State#state{msgList=NewMsgList}
	end,
    {reply, sent, F};

handle_call(get_all_msg, _From, State) ->
    List = lists:reverse(State#state.msgList),
    {reply, List, State};

handle_call(stop, _From, State) ->
    {stop, normalStop, State};

handle_call(_Request, _From, State) ->
    Reply = ok,
    {reply, Reply, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

handle_info(_Info, State) ->
    {noreply, State}.

terminate(_Reason, _State) ->
    ok.

code_change(_OldVsn, State, _Extra) ->
    {ok, State}.
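
The replication idea behind the send callback can be sketched in isolation. The module below is only a minimal illustration and not the gen_server_cluster implementation: the module name update_fun_sketch and the functions make_update/2, apply_update/2 and replicate/2 are hypothetical, the state is assumed to be a plain message list instead of the #state record, and the replicas are modeled as a list of states rather than as processes.

```erlang
-module(update_fun_sketch).
-export([make_update/2, apply_update/2, replicate/2]).

%% Build a state update fun similar to the one returned by the send
%% callback of the chat module. Assumption for this sketch: the state
%% is simply a list of {Node, Text} pairs, newest message first.
make_update(Node, Text) ->
    fun(MsgList) -> [{Node, Text} | MsgList] end.

%% Apply an update to an old state. The update is either a fun of
%% arity 1 that computes the new state, or the new state itself.
apply_update(F, State) when is_function(F, 1) -> F(State);
apply_update(NewState, _State) -> NewState.

%% Apply the same update to every replica state (hypothetical helper;
%% in a real cluster each server would do this in its own process).
replicate(Update, States) ->
    [apply_update(Update, S) || S <- States].
```

Because every replica evaluates the same fun on its own copy of the state, any side effect placed inside the fun (such as the io:format call in the chat module) would run once per replica, which is what makes the message appear on every user's shell.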

An example session is shown below where James joins in the conversation.

Erlang (BEAM) emulator version 5.5.3 [async-threads:0]

Eshell V5.5.3  (abort with ^G)
(james@hamburg)1> chat:start(chatroom).
chatroom started as local server.
{ok,<0.50.0>}
(james@hamburg)3> chat:get_users(chatroom).
[fred@hamburg,james@hamburg,peter@hamburg]
(james@hamburg)4> chat:get_all_msg(chatroom).
All messages of chatroom:
[fred@hamburg]: "Is there anybody out there?"
[peter@hamburg]: "Hi Fred."
[fred@hamburg]: "Hi Peter."
ok
(james@hamburg)5> chat:send(chatroom,"Hi all.").
[chatroom,james@hamburg]: "Hi all."
sent
[chatroom,peter@hamburg]: "Hi James."
[chatroom,fred@hamburg]: "Hi James."


Conclusion

We have presented a framework for running a server in a cluster to increase server availability. This framework makes use of background servers with state replication. It is best applied to servers that do not depend on where they are located. For example, a file server that makes local files available might not be a good candidate for being clustered with gen_server_cluster.