Lessons from working and playing with things like AWS, Ansible, and others
Debugging Ansible for fun and no profit
28 Apr 2014
A colleague reported some strange behaviour regarding Ansible, in particular with pgrep and pkill in the shell module.
I created a simplish test case (I could have made it simpler but wanted to make it safish for other people to use).
Running this with ansible-playbook pkilldemo.yml -v shows the problem:
So running pkill -IO -f pattern_to_search_for || true returns -29
(where SIGIO is signal 29 on my machine).
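The negative number is how Python's subprocess machinery reports a child that was terminated by a signal: the return code is the negated signal number. A minimal sketch, assuming a Linux host with sh on the PATH (the helper name run_and_report is made up for illustration, and the signal number is looked up rather than hard-coded because it can differ between platforms):

```python
import signal
import subprocess

def run_and_report(script):
    """Run a script via sh -c and return Python's view of its exit status."""
    return subprocess.run(["sh", "-c", script]).returncode

# Make a shell send SIGIO to itself, which is roughly what happens
# when a signal is delivered to the shell that is running the task.
sigio = int(signal.SIGIO)
rc = run_and_report("kill -%d $$" % sigio)

# A negative return code means "terminated by signal number -rc".
print("return code:", rc)
```

On a machine where SIGIO is 29 this should print a return code of -29, the same value the playbook run reported.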
Adding the following to the above playbook allows us to see different
success and failure scenarios:
After running the resulting playbook (that playbook and output are
available as a gist),
the known facts are these:
- Running pkill -IO -f pattern_to_search_for || true on the host itself, outside of Ansible, exits with status 0.
- Running pkill -IO pattern_to_search_for || true (without -f) through Ansible returns 0.
- Running pgrep -f pattern_to_search_for || true through Ansible returns 0.
- Running pkill -IO -f pattern_to_search_for (without || true) through Ansible returns exit status 1, a known exit code according to man pkill.
Setting up debugging in Ansible
I tend to use pdb for debugging, with the Python script that actually gets generated by Ansible. The creator of
Ansible, Michael DeHaan (@laserllama), suggests
epdb for Ansible debugging.
As Michael suggests in that blogpost, it’s very important to set forks to 1. You can do this
temporarily using --forks=1 on the command line but I tend to do so little stuff that needs parallelism
that I have forks = 1 in my ansible config file!
To obtain the module Python script, you can set the ANSIBLE_KEEP_REMOTE_FILES environment variable, either
using export ANSIBLE_KEEP_REMOTE_FILES=1 or by prefixing a single ansible-playbook command with ANSIBLE_KEEP_REMOTE_FILES=1.
So I can then run python /home/will/.ansible/tmp/ansible-tmp-1401256918.15-50781745619374/command to repeat the task
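The timestamped directory name changes on every run, so it can be handy to look up the most recent kept file rather than copy the path by hand. A small sketch, assuming the files are kept under ~/.ansible/tmp in ansible-tmp-* directories as in the path above (newest_kept_module is a hypothetical helper, not part of Ansible):

```python
import glob
import os

def newest_kept_module(tmp_root=os.path.expanduser("~/.ansible/tmp")):
    """Return the most recently modified file under tmp_root/ansible-tmp-*/, or None."""
    pattern = os.path.join(tmp_root, "ansible-tmp-*", "*")
    candidates = [p for p in glob.glob(pattern) if os.path.isfile(p)]
    return max(candidates, key=os.path.getmtime) if candidates else None

# Prints None when nothing has been kept yet.
print(newest_kept_module())
```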
At this point it’s probably worth a brief exposition of how the shell
module works in Ansible.
Really it just kicks off the command module with USE_SHELL set (via the #USE_SHELL marker
appended to the module arguments). So I’m going to be
looking at the command module source
from the latest commit at time of writing.
What Ansible does with that module is to replace line 162 with the contents of
the module_utils/basic code
and then swap a few template tokens (e.g. "<<INCLUDE_ANSIBLE_MODULE_ARGS>>" gets replaced with 'pkill -IO -f Ki5ViZNY4EIRaf2JDqvQ || true #USE_SHELL').
Anyway, the end result is that the 213-line command module gets expanded to 1419 lines.
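The expansion step can be pictured roughly like this. It is an illustration only, not Ansible's actual code: the args token is the one quoted above, but the include marker, the expand_module function and the tiny sample module are all invented for the sketch.

```python
ARGS_TOKEN = "<<INCLUDE_ANSIBLE_MODULE_ARGS>>"   # token named in the post
INCLUDE_MARKER = "# HYPOTHETICAL_INCLUDE_LINE"   # stand-in for the line that gets replaced

def expand_module(source, common_code, module_args):
    """Inline common_code at the marker line, then substitute the args token."""
    out = []
    for line in source.splitlines():
        if line.strip() == INCLUDE_MARKER:
            out.extend(common_code.splitlines())  # one line becomes many
        else:
            out.append(line)
    return "\n".join(out).replace(ARGS_TOKEN, module_args)

module = (
    'MODULE_ARGS = "<<INCLUDE_ANSIBLE_MODULE_ARGS>>"\n'
    "# HYPOTHETICAL_INCLUDE_LINE\n"
    "print(MODULE_ARGS)"
)
common = "# shared helper code\n# would be pasted here"
expanded = expand_module(
    module, common, "pkill -IO -f Ki5ViZNY4EIRaf2JDqvQ || true #USE_SHELL"
)

print(expanded)
print(len(module.splitlines()), "lines ->", len(expanded.splitlines()), "lines")
```

Because the shared code is pasted in after the line of interest, anything above the marker keeps its line number in the expanded script.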
The line I’m interested in is the same either way: it’s line 140,
where the command gets executed. So I open up the script, add
a pdb breakpoint just before line 140, and run it again.
The python docs have some good instructions on how to use the debugger.
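For anyone who has not used pdb this way, a breakpoint placed just before the call of interest looks something like the sketch below. The run_command function and the MODULE_DEBUG environment switch are hypothetical stand-ins (the switch only exists so the snippet also runs unattended); in the real generated script the breakpoint would simply be pasted in unconditionally.

```python
import os
import pdb
import subprocess

def run_command(args):
    """Stand-in for the spot in the module where the command gets executed."""
    if os.environ.get("MODULE_DEBUG"):  # hypothetical switch, not an Ansible setting
        pdb.set_trace()                 # drop into the debugger right before the call
    return subprocess.call(args, shell=True)

print(run_command("true"))
```

At the (Pdb) prompt, s steps into the next call, n steps over it, p expression prints a value, and c continues to the end.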
After stepping through the debugger enough, I realise the eventual
result is that python is going to kick off a shell that looks like
And, sure enough, that matches itself, and kills itself before it
gets to the true:
where 157 is 128 + 29.
When a command terminates on a fatal signal N, bash uses the value of 128+N as the exit status
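The 128+N convention is easy to check without involving pkill at all. A minimal sketch, assuming a Linux host where sh is bash or dash: an inner shell sends SIGIO to itself, and the outer shell prints the exit status it observed.

```python
import signal
import subprocess

sigio = int(signal.SIGIO)

# The inner shell kills itself with SIGIO; the outer shell survives
# and echoes the exit status it saw for the inner one.
script = "sh -c 'kill -%d $$'; echo $?" % sigio
reported = int(subprocess.check_output(["sh", "-c", script]).decode().strip())

print(reported, "= 128 +", reported - 128)
```

On a machine where SIGIO is 29, the reported status should be 157.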
This explains why pgrep does not fail, and why pkill without -f
does not fail.
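The difference that -f makes can be simulated without signalling anything. The sketch below uses a made-up two-entry process table and assumes the wrapper shell's command line has the form /bin/sh -c followed by the task's command; matches is a hypothetical helper that mimics matching on the process name versus the full command line.

```python
import re

pattern = "Ki5ViZNY4EIRaf2JDqvQ"
shell_cmd = "pkill -IO -f %s || true" % pattern

# Hypothetical process table: (process name, full command line).
processes = [
    ("sh", "/bin/sh -c %s" % shell_cmd),  # wrapper shell; its arguments contain the pattern
    ("sshd", "/usr/sbin/sshd -D"),
]

def matches(pattern, processes, full=False):
    """Return names of matched processes; full=True mimics -f (whole command line)."""
    return [
        name
        for name, cmdline in processes
        if re.search(pattern, cmdline if full else name)
    ]

print(matches(pattern, processes, full=False))  # without -f: []
print(matches(pattern, processes, full=True))   # with -f: ['sh']
```

With -f the wrapper shell is the only match, so it is the process that receives the signal. pgrep -f matches the same shell but only lists it, which is why that variant exits cleanly.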
It’s not really an Ansible bug (one could argue that maybe process
management is a missing module), but it was instructive to use Ansible
debugging techniques to get to the cause.
And the actual solution to the problem was to use ignore_errors rather
than using true in the shell to hide the error.