An Ansible safeguard
Monday, July 18th, 2022
At my work, we use Ansible to provision all kinds of things, from servers to monitoring. Ansible is very powerful, but with great power comes great responsibility. One downside of automating many things with Ansible is that you could also accidentally destroy a lot of things with a single wrong command.
In a perfect world where everything is managed by Ansible, this wouldn’t be much of a problem. However, we rarely, if ever, live in a perfect world, and real deployments can drift from what is configured in Ansible. The platforms we manage are not always fully under our control and can get pretty non-homogeneous.
So we were looking for something like a safeguard to protect us from ourselves; something to prevent us from accidentally invoking Ansible in a destructive way. This article describes our solution to this problem.
Our Ansible setup
First, a little bit about how our Ansible is set up. The safeguard described in this article may not work for setups that differ from ours.
We have a single, overarching site.yml playbook that determines which tasks to run for which machines, tags, etc. I’m not going to show the whole thing, but the gist of it looks like this:
- hosts: all
  roles:
    - role: common
      tags: ["common"]
    - role: firewall
      tags: ["firewall"]
    - role: certificate
      tags: ["certificate", "webserver"]
      when: "'certificate' in group_names"
Basically, this says: apply the common and firewall roles to all machines, and only apply the certificate role to machines that are in the certificate group. We then have a hosts file that looks a little like this:
db1.example.com
web1.example.com

[certificate]
web1.example.com
So db1 and web1 will get the common and firewall roles, and web1 will get the certificate role in addition.
We then execute Ansible like so:
$ ansible-playbook -b -K site.yml -t certificate -l web1.example.com
I’m not entirely sure, but I think this is a pretty common setup.
The potentially dangerous situations
We have a few roles that are potentially dangerous. For example, the webserver role will deploy a webserver. However, as time progresses, the actually deployed configuration of the webserver in production can drift from what is configured in Ansible. Yes, this is something that, in theory, should never happen. Unfortunately, we live in the real world, where things are not always perfect for various reasons. In our situation, we can’t always be fully in control, and it’s something we just have to be pragmatic about.
There are also roles and tags that are just inherently dangerous. For instance, we have a few tags that always require restarts of services, which may cause disruptions if done during office hours.
Then there’s the problem of overly broad host specifications. For example, if we accidentally forget to specify a host limit or a tag, or we make a typo, we may inadvertently roll out way too many changes.
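To illustrate, this is just the invocation from earlier with the tag and limit left off; it would apply every matching role to every host in the inventory:

$ ansible-playbook -b -K site.yml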
We wanted a way to prevent us from accidentally making these mistakes, but still allow us to overrule any safeguards if we were sure it was the right thing to do.
Our solution
What we came up with is a special task at the top of site.yml that always runs, regardless of what tags or limits you specify:
- name: Safeguard
  # Always run, regardless of what tags or limits the user specifies.
  hosts: all
  connection: local
  become: no
  gather_facts: false
  tasks:
    # Call a local script in the repo that will perform some safety checks.
    - name: Check hosts and tags
      ansible.builtin.shell:
        cmd: tools/safeguard.py
      delegate_to: localhost
      run_once: true
      # Pass some information of this ansible run to the script via the
      # environment.
      environment:
        safeguard_limit: "{{ ansible_limit|default('') }}"
        safeguard_hosts: "{{ ansible_play_hosts }}"
        safeguard_tags: "{{ ansible_run_tags }}"
        # The user can override safety guards by setting these variables
        # using '-e sg_nolimit=yes'
        sg_nolimit: "{{ sg_nolimit|default('BREAKBAD') }}"
        sg_notag: "{{ sg_notag|default('BREAKBAD') }}"
        sg_dangertag: "{{ sg_dangertag|default('BREAKBAD') }}"
        sg_manyhosts: "{{ sg_manyhosts|default('BREAKBAD') }}"
      # The 'always' tag is special in ansible and will always match regardless
      # of which tags you specify (including none at all).
      tags:
        - always
      changed_when: False
I’m not going to explain this task in detail; the comments in it should make clear what it does. Basically, it passes some information about the current Ansible run, such as the user-specified tags and limits, to a script, which checks for potentially dangerous things, such as not specifying a limit. The user can override these checks by setting various variables using -e sg_XXXX.
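To give a rough idea (the exact values depend on your inventory and invocation), for the example run from earlier the script would see environment variables along these lines. Note that the list-valued variables are rendered as Python-style literals, which is why the script shown below parses them with ast.literal_eval:

safeguard_limit=web1.example.com
safeguard_hosts=['web1.example.com']
safeguard_tags=['certificate']
sg_nolimit=BREAKBAD
sg_notag=BREAKBAD
sg_dangertag=BREAKBAD
sg_manyhosts=BREAKBAD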
So, for example, the user must specify a host limit using -l. Otherwise, the playbook would run on every host in the play, which may not be what you intended. You can override this safeguard like so:
$ ansible-playbook -b -K site.yml -e sg_nolimit=yes -t certificate
This will probably also trigger the “manyhosts” safeguard, which checks that you’re not specifying too many hosts at the same time. So you’d also have to override that safeguard:
$ ansible-playbook -b -K site.yml -e sg_nolimit=yes -e sg_manyhosts=yes -t certificate
The safeguard script
The safeguard.py script looks like this:
#!/usr/bin/env python3

# Safeguard script, executed by the first task in site.yml.

import ast
import os
import sys


def check_constraint(cb, override, err_msg):
    """
    Wrapper function around constraint_ functions. Does some boilerplate
    such as checking for overrides.
    """
    if override in os.environ and os.environ[override] == 'yes':
        # User has overridden this constraint with an extra var.
        return

    # Call the callback. If it doesn't return True, abort.
    if cb() is not True:
        sys.stderr.write("{}. Override with '{}=yes'.\n".format(
            err_msg, override)
        )
        sys.exit(1)


def constraint_nolimit():
    """
    The user should specify a limit with '-l' or '--limit'. If not, this var
    will be empty.
    """
    # If this is not empty, it's fine
    if os.environ["safeguard_limit"] != "":
        return True


def constraint_notags():
    """
    The user should specify a tag. If not, the value here becomes 'all'.
    Stop if it is.
    """
    tags = ast.literal_eval(os.environ["safeguard_tags"])
    if len(tags) > 0 and "all" not in tags:
        return True


def constraint_dangertags():
    """
    Some tags are a bit dangerous
    """
    # FIXME: Hardcoded
    danger_tags = ["common", "webserver"]
    tags = ast.literal_eval(os.environ["safeguard_tags"])
    for tag in tags:
        if tag in danger_tags:
            return False
    return True


def constraint_manyhosts():
    """
    Executing stuff on many hosts may not be a good idea.
    """
    if os.environ["sg_nolimit"] == "yes":
        return True

    hosts = ast.literal_eval(os.environ["safeguard_hosts"])
    if len(hosts) < 4:
        return True

    return False


if __name__ == "__main__":
    check_constraint(constraint_nolimit, "sg_nolimit", "No limit specified")
    check_constraint(constraint_notags, "sg_notag", "No tag(s) specified")
    check_constraint(constraint_dangertags, "sg_dangertag", "Dangerous tags specified")
    check_constraint(constraint_manyhosts, "sg_manyhosts", "Too many hosts specified")
I've reduced the script a bit for clarity. Again, I'm not going to fully explain how it works; if you can read a little bit of Python, its workings should be self-evident. There's a bit of dynamic dispatch magic in it to call the various constraint_ functions. That's not something I usually recommend, as it can lead to unclear call stacks pretty quickly, but in such a small script it's not much of a problem.
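To give an idea of what a tripped safeguard looks like: if you run the playbook without -l, the nolimit check fails, the script writes its error to stderr and exits non-zero, and the failed task stops the run before any real role is applied. The message is built from the err_msg and override arguments to check_constraint, so it reads:

No limit specified. Override with 'sg_nolimit=yes'.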
Conclusion
This safeguard construction has been working well for us. While actually fixing the dangerous situations is always preferable, real life sometimes gets messy, and an extra hurdle can prevent accidental damage. This solution, coupled with --check and various protections in the roles themselves, has so far prevented us from causing accidental production disruptions.