You Gotta Have Heart

It’s one of an embedded developer’s worst nightmares: You have devices in the field, on remote sites, where you don’t have access to local staff to serve as your hands. You’re far from the devices, maybe on a different continent, even, and now you have a device that’s no longer responding to network requests, or not sending event data as expected.

It is a terrible feeling.

If you’re using Nerves to build firmware, you have tools at your disposal to help keep your devices online.

Use Shoehorn

First things first, if you’re deploying Nerves firmwares, make sure you’re using Shoehorn in the most advantageous way. Shoehorn allows you to have a list of applications started on the BEAM VM prior to your application being started.

You want to make certain that you get :nerves_network started from the Shoehorn list. One of its dependencies is :nerves_firmware_ssh, which means both :nerves_network and :nerves_firmware_ssh should be initialized prior to your app being started, and more importantly, they should still be running if your application dies or is taken out by the supervisor.

# your Shoehorn config in ./config/config.exs might look similar to this

config :shoehorn,
  init: [:nerves_runtime, :nerves_network, :nerves_init_gadget, :nerves_time],
  app: Mix.Project.config()[:app]

That hopefully leaves you able to push a new firmware to your device via ssh should your application fail, which will then trigger a reboot to load the new firmware. (This assumes, of course, that you are able to SSH to your device, which may well not be the case due to network firewalls, NAT configs, etc.)

So, shoehorn is your first line of defense against a device going dark.

Watchdog

When I first began working with Nerves, I didn’t realize the Beagle Bone Black/Green and RaspberryPi boards I was working with had a hardware watchdog. Each has a file system interface to a hardware feature which will restart the system if the watchdog detects a likely system crash.

When used in this way, once you touch the designated file, the watchdog is activated and on alert, and that file must be touched again every N seconds, or the watchdog will reboot the board.

This is interesting as far as it goes, and by all means, tinker with this system to your heart’s content.

You might be tempted to implement your own watchdog-based scheme for keeping your devices online. I built one as an experiment, and it works ok, but there’s a better alternative to this approach for production Nerves firmware.

Straight to the :heart

The Nerves ecosystem is equipped to use a facility of the Erlang BEAM VM that helps ensure your app remains online.

To start with, there is an Erlang module named heart (that will of course be :heart in your Elixir/Nerves code). Heart runs in the BEAM as an optional feature, and when active will trigger a restart if it detects an application crash. There is also a port program, written in C, on which heart relies. The Nerves Project maintains a fork of Erlang’s heart code for Nerves users.

What does it take to use heart? First of all, make sure that heart is enabled in your vm.args file ($PROJECT_BASE_DIR/rel/vm.args). The option is present in the default vm.args created by Nerves’ mix template:

-heart -env HEART_BEAT_TIMEOUT 30

Then, as early as possible in your start sequence you want to register a function as a callback for heart.

:heart.set_callback(Module, :function)

The callback interface is very simple: if the callback function returns :ok, heart concludes that all is well. Any other return will trigger a system reboot.

Putting :heart to Use

How should you approach your :heart callback? There are certainly multiple ways to approach this. You have to think about the concerns and failure modes of your application. You might need to ensure that you are receiving data from a sensor, a network node is reachable, or any number of combinations of these or other cases.

My first callback relied solely on proof of the network being up. If the link to a control system was dropped, and failed to reconnect during a defined period, a timeout callback in a GenServer module would call a public function in the :heart callback’s GenServer, updating that server’s state so that the :heart callback would return something other than :ok. The next call from :heart to the callback then would trigger a reboot.

Here is an example of such an implementation. (You might wonder why there are two GenServers rather than just one, and the reason is that heart will call the callback every five or so seconds, and these calls to the GenServer with the timeout callback prevent the callback from ever occurring.)

defmodule MyApp.Heartbeat do
  use GenServer

  def check_status() do
    GenServer.call(:heartbeat, :return_status)
  end

  def system_down() do
    GenServer.cast(:heartbeat, :down)
  end

  def start_link(_ \\ []) do
    GenServer.start_link(__MODULE__, %{sys_status: :ok}, [name: :heartbeat])
  end

  def init(state) do
    :heart.set_callback(MyApp.Heartbeat, :check_status)
    {:ok, state}
  end

  def handle_call(:check_status, _, %{sys_status: status} = state) do
    {:reply, status, state}
  end

  def handle_cast(:down, _state) do
    {:noreply, %{sys_status: :down}}
  end
end


defmodule MyApp.SystemMonitor do
  use GenServer

  alias MyApp.Heartbeat

  def pulse(src) do
    # called from an event handler or similar to prove health
    GenServer.cast(:sysmon, :pulse)
  end

  def start_link(_ \\[]) do
    GenServer.start_link(__MODULE__, %{}, [name: :sysmon])
  end

  def init(state) do
    {:ok, state, 30_000}
  end

  def handle_info(:timeout, state) do
    Heartbeat.system_down()
    {:noreply, state, 30_000}
  end

  def handle_cast(:pulse, state) do
    {:noreply, state, 30_000}
  end
end

You will have to weigh what circumstances could arise with your application that would lead you to want to return something other than :ok from your :heart callback, and design a suitable scheme for assisting the current situation.

Once you test and deploy your solution, you can sleep a little easier, knowing that heart is helping to keep your devices online, doing the work you intended.


Doug Selph has been building software solutions for more than 25 years. He prefers functional languages, and in addition to Elixir, works in Clojure, ClojureScript, and Erlang. He also blogs here.

comments powered by Disqus