An Ansible safeguard
Monday, July 18th, 2022
At my work, we use Ansible to provision all kinds of things, from servers to monitoring. Ansible is very powerful, but with great power comes great responsibility. One downside of automating many things with Ansible is that you could also accidentally destroy a lot of things with a single wrong command.
In a perfect world where everything is managed by Ansible, this wouldn’t be much of a problem. However, we rarely, if ever, live in a perfect world, and real deployments can drift from what is configured in Ansible. The platforms we manage are not always fully under our control and can get pretty non-homogeneous.
So we were looking for something like a safeguard to protect us from ourselves; something to prevent us from accidentally invoking Ansible in a destructive way. This article describes our solution to this problem.
Our Ansible setup
First, a little bit about how our Ansible is set up. The safeguard described in this article may not work for setups that differ from ours.
We have a single, overarching site.yml playbook that determines which tasks to run for which machines, tags, etc. I’m not going to show the whole thing, but the gist of it looks like this:
- hosts: all
  roles:
    - role: common
      tags: ["common"]
    - role: firewall
      tags: ["firewall"]
    - role: certificate
      tags: ["certificate", "webserver"]
      when: "'certificate' in group_names"
Basically, this says: apply the common and firewall roles to all machines, and only apply the certificate role to machines that are in the certificate group. We then have a hosts file that looks a little like this:
db1.example.com
web1.example.com

[certificate]
web1.example.com
So db1 and web1 will get the common and firewall roles, and web1 will get the certificate role in addition.
We then execute Ansible like so:
$ ansible-playbook -b -K site.yml -t certificate -l web1.example.com
I’m not entirely sure, but I think this is a pretty common setup.
The potentially dangerous situations
We have a few roles that are potentially dangerous. For example, the webserver role will deploy a webserver. However, as time progresses, the actually deployed configuration of the webserver in production can drift from what is configured in Ansible. Yes, this is something that, in theory, should never happen. Unfortunately, we live in the real world, where things are not always perfect for various reasons. In our situation, we can’t always be fully in control, and it’s something we just have to be pragmatic about.
There are also roles and tags that are just inherently dangerous. For instance, we have a few tags that always require restarts of services, which may cause disruptions if done during office hours.
Then there’s the problem of overly broad host specifications. For example, if we accidentally forget to specify a host limit or a tag, or we make a typo, we may inadvertently roll out way too many changes.
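To illustrate, this is just the invocation from earlier with the tag and limit left off; it would apply every matching role to every host in the inventory:

$ ansible-playbook -b -K site.yml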
We wanted a way to prevent us from accidentally making these mistakes, but still allow us to overrule any safeguards if we were sure it was the right thing to do.
Our solution
What we came up with is a special task at the top of site.yml that always runs, regardless of what tags or limits you specify:
- name: Safeguard
  # Always run, regardless of what tags or limits the user specifies.
  hosts: all
  connection: local
  become: no
  gather_facts: false
  tasks:
    # Call a local script in the repo that will perform some safety checks.
    - name: Check hosts and tags
      ansible.builtin.shell:
        cmd: tools/safeguard.py
      delegate_to: localhost
      run_once: true
      # Pass some information of this ansible run to the script via the
      # environment.
      environment:
        safeguard_limit: "{{ ansible_limit|default('') }}"
        safeguard_hosts: "{{ ansible_play_hosts }}"
        safeguard_tags: "{{ ansible_run_tags }}"
        # The user can override safety guards by setting these variables
        # using '-e sg_nolimit=yes'
        sg_nolimit: "{{ sg_nolimit|default('BREAKBAD') }}"
        sg_notag: "{{ sg_notag|default('BREAKBAD') }}"
        sg_dangertag: "{{ sg_dangertag|default('BREAKBAD') }}"
        sg_manyhosts: "{{ sg_manyhosts|default('BREAKBAD') }}"
      # The 'always' tag is special in ansible and will always match regardless
      # of which tags you specify (including none at all).
      tags:
        - always
      changed_when: False
I’m not going to explain this task in detail; the comments in it should make clear what it does. Basically, it passes some information about the current Ansible run, such as the user-specified tags and limits, to a script, which checks for potentially dangerous things, such as not specifying a limit. The user can override these checks by setting various variables using -e sg_XXXX.
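To give a rough idea (the exact values depend on your inventory and invocation), for the example run from earlier the script would see environment variables along these lines. Note that the list-valued variables are rendered as Python-style literals, which is why the script shown below parses them with ast.literal_eval:

safeguard_limit=web1.example.com
safeguard_hosts=['web1.example.com']
safeguard_tags=['certificate']
sg_nolimit=BREAKBAD
sg_notag=BREAKBAD
sg_dangertag=BREAKBAD
sg_manyhosts=BREAKBAD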
So, for example, the user must specify a host limit using -l. Otherwise, the playbook would run on every host in the play, which may not be what you intended. You can override this safeguard like so:
$ ansible-playbook -b -K site.yml -e sg_nolimit=yes -t certificate
This will probably also trigger the “manyhosts” safeguard, which checks that you’re not specifying too many hosts at the same time. So you’d also have to override that safeguard:
$ ansible-playbook -b -K site.yml -e sg_nolimit=yes -e sg_manyhosts=yes -t certificate
The safeguard script
The safeguard.py script looks like this:
#!/usr/bin/env python3

# Safeguard script, executed by the first task in site.yml.

import ast
import os
import sys


def check_constraint(cb, override, err_msg):
    """
    Wrapper function around constraint_ functions. Does some boilerplate
    such as checking for overrides.
    """
    if override in os.environ and os.environ[override] == 'yes':
        # User has overridden this constraint with an extra var.
        return

    # Call the callback. If it doesn't return True, abort.
    if cb() is not True:
        sys.stderr.write("{}. Override with '{}=yes'.\n".format(
            err_msg, override)
        )
        sys.exit(1)


def constraint_nolimit():
    """
    The user should specify a limit with '-l' or '--limit'. If not, this var
    will be empty.
    """
    # If this is not empty, it's fine
    if os.environ["safeguard_limit"] != "":
        return True


def constraint_notags():
    """
    The user should specify a tag. If not, the value here becomes 'all'.
    Stop if it is.
    """
    tags = ast.literal_eval(os.environ["safeguard_tags"])
    if len(tags) > 0 and "all" not in tags:
        return True


def constraint_dangertags():
    """
    Some tags are a bit dangerous
    """
    # FIXME: Hardcoded
    danger_tags = ["common", "webserver"]
    tags = ast.literal_eval(os.environ["safeguard_tags"])
    for tag in tags:
        if tag in danger_tags:
            return False
    return True


def constraint_manyhosts():
    """
    Executing stuff on many hosts may not be a good idea.
    """
    if os.environ["sg_nolimit"] == "yes":
        return True

    hosts = ast.literal_eval(os.environ["safeguard_hosts"])
    if len(hosts) < 4:
        return True

    return False


if __name__ == "__main__":
    check_constraint(constraint_nolimit, "sg_nolimit", "No limit specified")
    check_constraint(constraint_notags, "sg_notag", "No tag(s) specified")
    check_constraint(constraint_dangertags, "sg_dangertag", "Dangerous tags specified")
    check_constraint(constraint_manyhosts, "sg_manyhosts", "Too many hosts specified")
I've reduced the script a bit for clarity. Again, I'm not going to fully explain how it works; if you can read a little bit of Python, its workings should be self-evident. There's a bit of dynamic dispatch magic in it to call the various constraint_ functions. That's not something I usually recommend, as it can lead to unclear call stacks pretty quickly, but in such a small script it's not much of a problem.
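To give an idea of what a tripped safeguard looks like: if you run the playbook without -l, the nolimit check fails, the script writes its error to stderr and exits non-zero, and the failed task stops the run before any real role is applied. The message is built from the err_msg and override arguments to check_constraint, so it reads:

No limit specified. Override with 'sg_nolimit=yes'.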
Conclusion
This safeguard construction has been working well for us. While actually fixing the dangerous situations is always preferable, real life sometimes gets messy, and an extra hurdle can prevent accidental damage. This solution, coupled with --check and various protections in the roles themselves, has so far prevented us from causing accidental production disruptions.