erlang, elixir, distillery

How to get Elixir nodes to connect automatically on startup?


Background

I'm trying to set up clustering between a few Elixir nodes. My understanding is that I can do this by modifying the release's vm.args. I'm using Distillery to build releases and am following the documentation here: https://hexdocs.pm/distillery/config/runtime.html.

My rel/vm.args file is as follows:

-name <%= release_name %>@${HOSTNAME}
-setcookie <%= release.profile.cookie %>
-smp auto
-kernel inet_dist_listen_min 9100 inet_dist_listen_max 9155
-kernel sync_nodes_mandatory '[${SYNC_NODES_MANDATORY}]'

I have a build server and two web servers, all running Ubuntu 18.04. I build the release on the build server, copy the archive to the web servers, then unpack it and start it there.

On the two web servers, the vm.args files are rendered as:

-name hifyre_platform@10.10.10.100
-setcookie wefijow89236wj289*PFJ#(*98j3fj()#J()#niof2jio
-smp auto
-kernel inet_dist_listen_min 9100 inet_dist_listen_max 9155
-kernel sync_nodes_mandatory '["\'my_app@10.10.10.100\'","\'my_app@10.10.10.200\'"]'

and

-name hifyre_platform@10.10.10.200
-setcookie wefijow89236wj289*PFJ#(*98j3fj()#J()#niof2jio
-smp auto
-kernel inet_dist_listen_min 9100 inet_dist_listen_max 9155
-kernel sync_nodes_mandatory '["\'my_app@10.10.10.100\'","\'my_app@10.10.10.200\'"]'

The releases are run via systemd with the following configuration:

[Unit]
Description=My App
After=network.target

[Service]
Type=simple
User=ubuntu
Group=ubuntu
WorkingDirectory=/opt/app
ExecStart=/opt/app/bin/my_app foreground
Restart=on-failure
RestartSec=5
Environment=PORT=8080
Environment=LANG=en_US.UTF-8
Environment=REPLACE_OS_VARS=true
Environment=HOSTNAME=10.10.10.100
SyslogIdentifier=my_app
RemainAfterExit=no

[Install]
WantedBy=multi-user.target

Problem

The releases start fine on both servers, but when I open a remote console and run Node.list(), the result is an empty list unless I manually connect the two nodes.

If I manually run Node.connect(:"my_app@10.10.10.200") I then see the other node when running Node.list() on each node, but this does not happen automatically on startup.


Solution

  • The vm.args file ends up getting passed to Erlang using the -args_file argument. I went to look at the documentation for -args_file, and found that it's actually not very well documented. It turns out that vm.args is like an onion, in that it has lots of layers, and the documentation seems to be all in the source code.

    Let's start with where we want to end up. We want sync_nodes_mandatory to be a list of atoms, and we need to write it in Erlang syntax. If we were using short node names, e.g. my_app@myhost, we could get away with not quoting the atoms, but atoms with dots in them need to be quoted using single quotes:

    ['my_app@10.10.10.100','my_app@10.10.10.200']
    

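    We want this to be the output of the function build_args_from_string in erlexec.c. This function has four rules: a backslash causes the following character to be passed through literally; double quotes are removed and the characters between them are passed through (backslash escapes still apply inside them); single quotes are removed and the characters between them are passed through; and whitespace separates arguments unless it is escaped or quoted by one of the other rules.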

    So since we want to pass the single quotes through to the parser, we have two alternatives. We can escape the single quotes:

    [\'my_app@10.10.10.100\',\'my_app@10.10.10.200\']
    

    Or we can enclose the single quotes in double quotes:

    ["'my_app@10.10.10.100','my_app@10.10.10.200'"]
    

    (In fact, it doesn't matter how many and where we put the double quotes, as long as every occurrence of a single quote is inside a pair of double quotes. This is just one possible way of doing it.)

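    BUT if we choose to escape the single quotes with backslashes, we encounter another layer! The function read_args_file is the function that actually reads the vm.args file from disk before passing it to build_args_from_string, and it imposes its own rules first! Namely: a backslash causes the next character to be passed through unconditionally (and the backslash itself is dropped), a # character starts a comment that runs to the end of the line, and any unescaped whitespace character is replaced by a space.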

    So if we were to write [\'my_app@10.10.10.100\',\'my_app@10.10.10.200\'] in vm.args, read_args_file would eat the backslashes, and build_args_from_string would eat the single quotes, leaving us with an invalid term and an error:

    $ iex --erl '-args_file /tmp/vm.args'
    2019-04-25 17:00:02.966277 application_controller: ~ts: ~ts~n
        ["syntax error before: ","'.'"]
        "[my_app@10.10.10.100,my_app@10.10.10.200]"
    {"could not start kernel pid",application_controller,"{bad_environment_value,\"[my_app@10.10.10.100,my_app@10.10.10.200]\"}"}
    could not start kernel pid (application_controller) ({bad_environment_value,"[my_app@10.10.10.100,my_app@10.10.10.200]"})
    
    Crash dump is being written to: erl_crash.dump...done
    

    So we could either use double backslashes:

    -kernel sync_nodes_mandatory [\\'my_app@10.10.10.100\\',\\'my_app@10.10.10.200\\']
    

    Or just stick with double quotes (a different, equally valid, variant this time):

    -kernel sync_nodes_mandatory "['my_app@10.10.10.100','my_app@10.10.10.200']"
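
To check how a candidate vm.args line comes out the other end, the two layers can be modelled in a few lines of Python. This is only a rough sketch of the behaviour described in this answer, not a port of erlexec.c: the names read_args_file_model, build_args_from_string_model and effective_args are made up for illustration, and the rules they implement (backslash escapes in both layers, # comments in the first, quote stripping in the second) are simplifying assumptions.

```python
def read_args_file_model(text):
    """Layer 1 (assumed rules): resolve backslashes, drop # comments,
    turn every whitespace character into a space."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            out.append(text[i + 1])  # escaped char passes through, backslash is dropped
            i += 2
        elif ch == "#":
            while i < len(text) and text[i] != "\n":
                i += 1  # skip the comment up to (not including) the newline
        elif ch.isspace():
            out.append(" ")
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def build_args_from_string_model(line):
    """Layer 2 (assumed rules): split on spaces, strip quote characters,
    keep the quoted text, honour backslashes outside single quotes."""
    args, cur, in_arg, quote = [], [], False, None
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line) and quote != "'":
            cur.append(line[i + 1])
            in_arg = True
            i += 2
            continue
        if quote:
            if ch == quote:
                quote = None  # closing quote is consumed
            else:
                cur.append(ch)
        elif ch in "\"'":
            quote = ch  # opening quote is consumed
            in_arg = True
        elif ch == " ":
            if in_arg:
                args.append("".join(cur))
                cur, in_arg = [], False
        else:
            cur.append(ch)
            in_arg = True
        i += 1
    if in_arg:
        args.append("".join(cur))
    return args


def effective_args(raw_line):
    """Run one raw vm.args line through both modelled layers."""
    return build_args_from_string_model(read_args_file_model(raw_line))


if __name__ == "__main__":
    candidates = [
        r"-kernel sync_nodes_mandatory [\'my_app@10.10.10.100\',\'my_app@10.10.10.200\']",
        r"-kernel sync_nodes_mandatory [\\'my_app@10.10.10.100\\',\\'my_app@10.10.10.200\\']",
        r'''-kernel sync_nodes_mandatory "['my_app@10.10.10.100','my_app@10.10.10.200']"''',
    ]
    for raw in candidates:
        # The last argument is what the kernel application would try to parse.
        print(effective_args(raw)[-1])
```

Under these assumptions, the single-backslash line loses its quotes and yields the unparseable [my_app@10.10.10.100,my_app@10.10.10.200] from the error above, while the other two lines keep the quoted atoms intact.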
    

    As noted in the documentation for the kernel application, you also need to set sync_nodes_timeout to a time in milliseconds or infinity:

    Specifies the time (in milliseconds) that this node waits for the mandatory and optional nodes to start. If this parameter is undefined, no node synchronization is performed.

    Add something like:

    -kernel sync_nodes_timeout 10000
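
If the node list comes from a deployment script, the two lines can be generated rather than typed by hand. The helper below is a small sketch along those lines. The function name render_sync_nodes_lines and its parameters are hypothetical, and it always uses the double-quoted variant with every atom single-quoted, which is one of the valid forms shown above.

```python
def render_sync_nodes_lines(nodes, timeout="10000"):
    """Build the two -kernel lines for vm.args from a list of node names.

    timeout is a number of milliseconds or the string "infinity".
    """
    for node in nodes:
        # Characters that would interfere with the quoting layers are rejected
        # instead of escaped, to keep the sketch simple.
        if any(c in node for c in "'\"\\# \t\n"):
            raise ValueError("unsupported character in node name: %r" % node)
    atoms = ",".join("'" + node + "'" for node in nodes)
    return [
        '-kernel sync_nodes_mandatory "[' + atoms + ']"',
        "-kernel sync_nodes_timeout " + str(timeout),
    ]


if __name__ == "__main__":
    for line in render_sync_nodes_lines(
        ["my_app@10.10.10.100", "my_app@10.10.10.200"]
    ):
        print(line)
```

The node names passed in must match what each node uses for -name, otherwise the nodes will wait for peers that never appear.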