Greybeards

A lifetime ago in Internet years I had the good fortune of writing code for what was, at that point, the largest data migration ever undertaken in the UK, between two of the largest banks in the country. The code was a mixture of IBM S/390 mainframe assembler and an obscure IBM language called PL/1.

PL/1 defined data structures using the DCL statement. A typical array declaration might look like this:

DCL THINGS (
  1,
  2,
  3,
) CHAR(1);

Given the nature of the systems that I was working on (bank core accounting and ledgers), code reviews were detailed, per-statement and per-line, and undertaken multiple times by multiple groups before code made it anywhere near the production systems.

Looking at the declaration above, one might be forgiven for thinking that a change request to add another member to the array, leaving the rest of the program untouched, would be a simple affair. I submitted my completed code, which promptly failed the code review:

DCL THINGS (
  1,
  2,
  3,
  4
) CHAR(1);

Spotted it yet?

In the original declaration, the last member of the array had a trailing comma, and yet I had neglected to include one in my change. This should not be a big deal, right? The function of the program would be identical and the change has the required effect. So why was the change thrown out? What is so important about that syntactically insignificant trailing comma?

Including the trailing comma is actually a guard against future confusion in change requests. How so?

We will pretend that my change made it to live, and at some point in the future (which may be decades hence, in a banking environment) somebody else is required to add another member to that array, and appends the number 5 as a new member:

DCL THINGS (
  1,
  2,
  3,
  4,
  5,
) CHAR(1);

A diff between my incorrect change and the new change will look like this:

--- a   2013-09-03 11:17:49.118734712 +0100
+++ b   2013-09-03 11:18:15.374735686 +0100
@@ -2,5 +2,6 @@
   1,
   2,
   3,
-  4
+  4,
+  5,
 ) CHAR(1);

Our supposed one-character, one-line change has produced one removal and two additions as far as a unified diff is concerned, because we have had to amend a line not directly related to our change to ensure the new form is syntactically correct. By reading the diff metadata we can quickly work out that the net effect is the addition of the 5 member. However, for systems that must be scrutinized in detail when being amended, like core banking systems, each of these additional cognitive steps increases the likelihood of error, and of something untoward making it to live and ruining the savings of millions.

What we would really like the diff to show us is just the change we made. Now consider the diff between the correctly modified file, including the trailing comma after the digit 4, and our new change:

--- a   2013-09-03 11:28:04.686757559 +0100
+++ b   2013-09-03 11:18:15.374735686 +0100
@@ -3,4 +3,5 @@
   2,
   3,
   4,
+  5,
 ) CHAR(1);

Much better: there is only one delta and it directly reflects the change we have made and nothing more. We can confidently and categorically marry the specification ("append 5 to the THINGS declaration") and the change.
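The difference between the two diffs can also be checked mechanically. What follows is a minimal Ruby sketch, not a reimplementation of the diff tool: it counts the lines a line-based diff would mark as removed and added, using a longest-common-subsequence table. The helper names delta and declaration are invented for this illustration.

```ruby
# Count the lines a line-based diff would mark as removed (-) and added (+).
# Toy model: lines that belong to the longest common subsequence are treated
# as unchanged context; everything else is a removal or an addition.
def delta(old_lines, new_lines)
  rows = old_lines.length
  cols = new_lines.length
  # lcs[i][j] holds the LCS length of old_lines[i..] and new_lines[j..]
  lcs = Array.new(rows + 1) { Array.new(cols + 1, 0) }
  (rows - 1).downto(0) do |i|
    (cols - 1).downto(0) do |j|
      lcs[i][j] = if old_lines[i] == new_lines[j]
                    lcs[i + 1][j + 1] + 1
                  else
                    [lcs[i + 1][j], lcs[i][j + 1]].max
                  end
    end
  end
  common = lcs[0][0]
  { removed: rows - common, added: cols - common }
end

# Hypothetical helper: wrap the array members in the DCL statement.
def declaration(members)
  ["DCL THINGS (", *members, ") CHAR(1);"]
end

appended = declaration(["  1,", "  2,", "  3,", "  4,", "  5,"])

# Starting point without the trailing comma after 4.
sloppy = delta(declaration(["  1,", "  2,", "  3,", "  4"]), appended)
puts "no trailing comma: -#{sloppy[:removed]} +#{sloppy[:added]}"  # -1 +2

# Starting point with the trailing comma after 4.
tidy = delta(declaration(["  1,", "  2,", "  3,", "  4,"]), appended)
puts "trailing comma:    -#{tidy[:removed]} +#{tidy[:added]}"      # -0 +1
```

The same request ("append 5") costs three changed lines in one history and one changed line in the other, which is the whole argument for the trailing comma.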

Chef Resource Notifications

At this point you might be asking what any of this arcane history has to do with Chef resource notifications.

The Opscode documentation tells us there are two ways to have a change in one resource trigger an action on another. The changing resource can notify the other resource, or the related resource can subscribe to the changing resource and 'watch' for changes. A common pairing is a service resource needing to restart after changes to a configuration file built dynamically from a template.

We also learn that there are two timings that can be specified that dictate when the related resource should act after learning of a change that needs a response: :delayed and :immediately.

Despite the distinction between notification and subscriptions seeming at first glance to be a bit arbitrary, much like our trailing comma, the decision as to which form to use actually carries quite a bit of weight for managing change, complexity and bit rot in a codebase over time. Likewise for :delayed versus :immediately.

Consider the canonical example:

service "amazingd" do
  supports :restart => true, :reload => true
  action [:start, :enable]
end

template "/etc/amazing.conf" do
  source "amazing.conf.erb"
  owner "root"
  notifies :restart, "service[amazingd]", :delayed
end

These two resources define a service that we want Chef to ensure is both running and configured to start after a reboot, and a configuration file built dynamically from a template. The template resource, when it has cause to amend the contents of the configuration file, notifies the service resource that it should restart in order to effect the new configuration.

Presented like this, it is fair to say there is no good reason why the service could not subscribe to the template. It would be perfectly clear what was happening. Likewise, if :delayed was changed to :immediately the daemon would still bounce (albeit, somewhat obviously, immediately rather than at the end of the chef-client run) and effect the change to the configuration.

Organising Chef Recipes

As described in a fairly recent talk, bundling all of your resources into a single file (or worse, default.rb) is a bad idea in our opinion, once you get beyond a trivial amount of code and more than one or two contributors. We will split the example above into two files, plus the default, and also add some additional configuration files.

# default.rb
include_recipe "amazingd::configuration"
include_recipe "amazingd::service"

# service.rb
service "amazingd" do
  supports :restart => true, :reload => true
  action [:start, :enable]
end

# configuration.rb
template "/etc/amazing.conf" do
  source "amazing.conf.erb"
  owner "root"
  notifies :restart, "service[amazingd]", :delayed
end

template "/etc/amazingd.d/users.conf" do
  source "amazingd.d/users.conf.erb"
  owner "root"
  notifies :restart, "service[amazingd]", :delayed
end

template "/etc/amazingd.d/security.conf" do
  source "amazingd.d/security.conf.erb"
  owner "root"
  notifies :restart, "service[amazingd]", :delayed
end

Subscription Produces Additional Deltas

Keeping in mind what we learned about commas in PL/1, what can we say about this code now and the options open to us? Suppose we were to swap from having the templates notifying the service resource, to having the service resource subscribe to the templates:

# default.rb
include_recipe "amazingd::configuration"
include_recipe "amazingd::service"

# service.rb
service "amazingd" do
  supports :restart => true, :reload => true
  action [:start, :enable]
  subscribes :restart, "template[/etc/amazing.conf]", :delayed
  subscribes :restart, "template[/etc/amazingd.d/users.conf]", :delayed
  subscribes :restart, "template[/etc/amazingd.d/security.conf]", :delayed
end

# configuration.rb
template "/etc/amazing.conf" do
  source "amazing.conf.erb"
  owner "root"
end

template "/etc/amazingd.d/users.conf" do
  source "amazingd.d/users.conf.erb"
  owner "root"
end

template "/etc/amazingd.d/security.conf" do
  source "amazingd.d/security.conf.erb"
  owner "root"
end

Immediately we can see some problems:

  • A change specification like: "generate the /etc/amazingd.d/groups.conf file from a template by querying the Chef index" could easily lead somebody who is less familiar with your Chef implementation to amend configuration.rb, add the additional resource, and omit the subscription from service.rb.

  • Somebody properly understanding the system must now amend two files, configuration.rb and service.rb, in order to add an additional configuration file. A one-file specification generates a two-file change delta and amends code not directly related to the matter in hand - code that we are now obliged to test, given it is no longer the same code that we had previously proven working.
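For comparison, under the notification style the same change request is a single new resource in configuration.rb, and the restart wiring travels with it. The sketch below shows that resource. So that it runs under plain Ruby, it is preceded by a toy stand-in for the recipe DSL (ToyResource, RESOURCES and this template method are invented for the illustration and only record what is declared; they are not Chef). The groups.conf.erb source path is an assumption that follows the naming of the other templates.

```ruby
# Toy stand-in for the recipe DSL: records declarations, converges nothing.
RESOURCES = []

class ToyResource
  attr_reader :name, :notifications

  def initialize(name)
    @name = name
    @notifications = []
  end

  # Attributes accepted and ignored by the toy model.
  def source(_path); end
  def owner(_user); end

  def notifies(action, target, timing = :delayed)
    @notifications << [action, target, timing]
  end
end

def template(path, &block)
  resource = ToyResource.new("template[#{path}]")
  resource.instance_eval(&block)
  RESOURCES << resource
  resource
end

# The entire change under the notification style, added to configuration.rb.
# service.rb is not touched.
template "/etc/amazingd.d/groups.conf" do
  source "amazingd.d/groups.conf.erb"
  owner "root"
  notifies :restart, "service[amazingd]", :delayed
end

RESOURCES.each do |resource|
  resource.notifications.each do |action, target, timing|
    puts "#{resource.name} -> #{action} #{target} (#{timing})"
  end
end
```

A maintainer copying any existing template resource as a starting point picks up the notifies line for free, which is the hint that the subscription style fails to give.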

Immediate Notification, Indeterminate State and Flapping Daemons

Notifications or subscriptions that use the :delayed timing are collected together and acted upon at the end of the Chef run. An additional nice side-effect is that multiple notifications to the same resource are collected and acted upon only once. This is worth noting for two reasons:

Firstly, the first time chef-client runs on a machine every resource is likely to be wrong. Using the :delayed timing in this example allows all of the configuration to converge into the correct state before the service is started or restarted. Also note the order of the recipes in default.rb: we are ensuring that the configuration is properly built before we start the service on the first run.

This is worthy of note as you can very easily break a "build time" run by using :immediately if you have multiple configuration files that contain boilerplate, for example. The first file will be configured, the rest will contain trash, and your daemon will promptly restart due to :immediately and bork on invalid configuration. Or worse still, the daemon will be running briefly with indeterminate configuration as valid defaults are retained and used as Chef works its way through setting up the configuration, restarting the daemon at every step.

Secondly, especially in the example we have constructed here, your daemon will "flap" under :immediately, as a change to the underlying data in the index used to build the three templates dynamically causes a restart for each configuration file. This can be diminished as a concern if your service in question supports :reload, as ours does here. However, concerns about breaking build runs and indeterminate configuration still apply. Flapping is generally a thing to avoid if you can: it complicates marrying process timing statistics with logging, for example, and can also lead to glitchy, edge-case monitoring and graphing anomalies.
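The effect of the two timings on our three templates can be sketched with a toy model of a run. This is an illustration of the behaviour described above under simplifying assumptions (every template changes, and every change sends :restart to the same service); the converge and restarts helpers are invented here and are not how chef-client is implemented.

```ruby
# Toy model of a run: each changed template sends :restart to the service.
# :immediately acts on the spot; :delayed queues the notification, drops
# duplicates, and acts once at the end of the run.
def converge(changed_templates, timing)
  log = []     # ordered record of what happened during the run
  queue = []   # delayed notifications, held until the end of the run

  changed_templates.each do |path|
    log << "write #{path}"
    notification = [:restart, "service[amazingd]"]
    if timing == :immediately
      log << "restart service[amazingd]"
    else
      queue << notification unless queue.include?(notification)
    end
  end

  queue.each { |action, target| log << "#{action} #{target}" }
  log
end

# Hypothetical helper: how many times did the daemon bounce?
def restarts(log)
  log.count { |entry| entry.start_with?("restart") }
end

paths = [
  "/etc/amazing.conf",
  "/etc/amazingd.d/users.conf",
  "/etc/amazingd.d/security.conf"
]

puts "delayed:     #{restarts(converge(paths, :delayed))} restart(s)"      # 1
puts "immediately: #{restarts(converge(paths, :immediately))} restart(s)"  # 3
puts converge(paths, :immediately)
```

In the :immediately log the first restart lands after only one of the three files has been written, which is exactly the window in which the daemon runs with indeterminate configuration.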

Testing

At this point you may be saying to yourself: "this is why we test!". Whilst you absolutely should be writing tests with one or more assertions for every resource in your cookbook, I argue that taking care over your change deltas still matters.

Tests are usually run at the conclusion of a chef-client run as report handlers, asserting that the end state is what we expect it to be. From that vantage point we cannot observe what happened during the run: indeterminate configuration and flapping, for example.

When unit testing end state ("Is the service running? Do the configuration files contain the correct information?") it is also common to neglect asserting that state changes such as notifications and subscriptions have occurred.

Conclusion

Whilst there are no hard rules, we can learn something from the Greybeards working in obscure languages and apply it to Chef.

When presented with multiple configuration syntaxes, as we are in this case, careful selection of a standard, whilst being mindful of future code maintenance, varying skill levels within a team and careful isolation of changes, can help keep code clean and also help when tracking down changes for auditing or debugging. It can also reduce the likelihood of bugs introduced by change, providing hints to future maintainers as to what they need to do to replicate some current behaviour for a new object.

It is also good trade-craft. Many system administrators moving to a configuration management world have comparatively little experience of both collaboration on a single codebase and source control in general, beyond committing and retrieving files. Being careful to isolate changes can help those less confident understand exactly what changed and when, as they become familiar with branching, merging, features and hotfixes.