How to debug Ansible issues?

9

42

Sometimes Ansible doesn't do what you want, and increasing verbosity doesn't help. For example, I'm currently trying to start the coturn server, which ships with an init script, on a systemd OS (Debian Jessie). Ansible considers it running, but it isn't. How do I look into what's happening under the hood: which commands are executed, and with what output/exit code?

Dibs answered 23/2, 2017 at 13:25 Comment(0)
55

Debugging modules

  • The most basic way is to run ansible/ansible-playbook with an increased verbosity level by adding -vvv to the execution line.

  • The most thorough way for modules written in Python (Linux/Unix) is to run ansible/ansible-playbook with the environment variable ANSIBLE_KEEP_REMOTE_FILES set to 1 (on the control machine).

This causes Ansible to leave an exact copy of the Python scripts it executed (successfully or not) on the target machine.

The path to the scripts is printed in the Ansible log, and for regular tasks they are stored under the SSH user's home directory: ~/.ansible/tmp/.

The exact logic is embedded in the scripts and depends on each module. Some use Python with standard or external libraries, some call external commands.
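
For example, a typical session might look like this (a sketch only; site.yml stands for your playbook, and the AnsiballZ_*.py wrapper with its explode/execute subcommands exists in recent Ansible versions, while older releases leave a plain module script instead):

# on the control machine: keep the copies of the executed modules
ANSIBLE_KEEP_REMOTE_FILES=1 ansible-playbook -vvv site.yml

# on the target machine: inspect and re-run what Ansible actually executed
ls ~/.ansible/tmp/                     # one ansible-tmp-* directory per task
cd ~/.ansible/tmp/ansible-tmp-*/
python AnsiballZ_service.py explode    # unpack the embedded module source
python AnsiballZ_service.py execute    # re-run it and see the raw JSON result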

Debugging playbooks

  • Similarly to debugging modules, increasing the verbosity level with the -vvv parameter causes more data to be printed to the Ansible log.

  • Since Ansible 2.1, the Playbook Debugger lets you interactively debug failed tasks: inspect and modify the data, then re-run the task (see the sketch below).
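
A minimal way to enable it is via the play-level strategy (a sketch; the play below is made up, reusing the coturn service from the question):

- hosts: all
  strategy: debug          # drop into the interactive debugger when a task fails
  tasks:
    - name: Start coturn
      service:
        name: coturn
        state: started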

Debugging connections

  • Adding the -vvvv parameter to the ansible/ansible-playbook call causes the log to include debugging information for the connections.
Swatter answered 23/2, 2017 at 13:33 Comment(10)
Are you sure adding a fourth -v makes a difference? From ansible-playbook's man page it doesn't.Dibs
If you run ansible-playbook without any parameter you will see: -v, --verbose : verbose mode (-vvv for more, -vvvv to enable connection debugging)Swatter
Can you elaborate on "connection debugging" part? I can see that -q in ssh options changes to -vvv, but no other changes. What's the difference?Dibs
So extra messages will appear in the systemd log on target system... Can you also suggest further actions when using ANSIBLE_KEEP_REMOTE_FILES=1? Module source is zipped, right? So no easy way to modify it.Dibs
Nothing is zipped. Don't distort reality.Swatter
Here's what I'm seeing at /root/.ansible/tmp/ansible-tmp-1487858817.39-188009196848514/service. I'm running ansible-2.1.2.0. And one more thing, is it possible to filter out ansible messages in systemd journal? journalctl -u ansible or journalctl -u ansible-basic.py doesn't cut it.Dibs
Which version of Ansible do you use? Ansible started to zip modules as early as 2.0, if I'm not mistaken.Dibs
@Dibs Does Python interpret zipped scripts?Swatter
Surely not. The files that are stored are files with python code. But the code is the loader plus module code zipped. The payload is zipped. That's what I meant. Please, expand on using ANSIBLE_KEEP_REMOTE_FILES=1. Feel free to use this for inspiration :) Or just add the link to your answer, which might be even better.Dibs
I see that the four -vvvv also runs ssh at the -vvv level, if you are facing ssh connection issues.Trisomic
17

Debugging Ansible tasks can be almost impossible if the tasks are not your own, contrary to what the Ansible website states:

No special coding skills needed

In reality, Ansible requires highly specialized programming skills, because it is not YAML or Python; it is a messy mix of both.

The idea of using markup languages for programming has been tried before. XML was very popular in the Java community at one time; XSLT is another fine example.

As Ansible projects grow, the complexity grows exponentially as a result. Take for example the OpenShift Ansible project, which has the following task:

- name: Create the master server certificate
  command: >
    {{ hostvars[openshift_ca_host]['first_master_client_binary'] }} adm ca create-server-cert
    {% for named_ca_certificate in openshift.master.named_certificates | default([]) | lib_utils_oo_collect('cafile') %}
    --certificate-authority {{ named_ca_certificate }}
    {% endfor %}
    {% for legacy_ca_certificate in g_master_legacy_ca_result.files | default([]) | lib_utils_oo_collect('path') %}
    --certificate-authority {{ legacy_ca_certificate }}
    {% endfor %}
    --hostnames={{ hostvars[item].openshift.common.all_hostnames | join(',') }}
    --cert={{ openshift_generated_configs_dir }}/master-{{ hostvars[item].openshift.common.hostname }}/master.server.crt
    --key={{ openshift_generated_configs_dir }}/master-{{ hostvars[item].openshift.common.hostname }}/master.server.key
    --expire-days={{ openshift_master_cert_expire_days }}
    --signer-cert={{ openshift_ca_cert }}
    --signer-key={{ openshift_ca_key }}
    --signer-serial={{ openshift_ca_serial }}
    --overwrite=false
  when: item != openshift_ca_host
  with_items: "{{ hostvars
                  | lib_utils_oo_select_keys(groups['oo_masters_to_config'])
                  | lib_utils_oo_collect(attribute='inventory_hostname', filters={'master_certs_missing':True}) }}"
  delegate_to: "{{ openshift_ca_host }}"
  run_once: true

I think we can all agree that this is programming in YAML. Not a very good idea. This specific snippet could fail with a message like

fatal: [master0]: FAILED! => {"msg": "The conditional check 'item != openshift_ca_host' failed. The error was: error while evaluating conditional (item != openshift_ca_host): 'item' is undefined\n\nThe error appears to have been in '/home/user/openshift-ansible/roles/openshift_master_certificates/tasks/main.yml': line 39, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: Create the master server certificate\n ^ here\n"}

If you hit a message like that, you are doomed. But we have the debugger, right? Okay, let's take a look at what is going on.

[master0] TASK: openshift_master_certificates : Create the master server certificate (debug)> p task.args
{u'_raw_params': u"{{ hostvars[openshift_ca_host]['first_master_client_binary'] }} adm ca create-server-cert {% for named_ca_certificate in openshift.master.named_certificates | default([]) | lib_utils_oo_collect('cafile') %} --certificate-authority {{ named_ca_certificate }} {% endfor %} {% for legacy_ca_certificate in g_master_legacy_ca_result.files | default([]) | lib_utils_oo_collect('path') %} --certificate-authority {{ legacy_ca_certificate }} {% endfor %} --hostnames={{ hostvars[item].openshift.common.all_hostnames | join(',') }} --cert={{ openshift_generated_configs_dir }}/master-{{ hostvars[item].openshift.common.hostname }}/master.server.crt --key={{ openshift_generated_configs_dir }}/master-{{ hostvars[item].openshift.common.hostname }}/master.server.key --expire-days={{ openshift_master_cert_expire_days }} --signer-cert={{ openshift_ca_cert }} --signer-key={{ openshift_ca_key }} --signer-serial={{ openshift_ca_serial }} --overwrite=false"}
[master0] TASK: openshift_master_certificates : Create the master server certificate (debug)> exit

How does that help? It doesn't.

The point here is that it is an incredibly bad idea to use YAML as a programming language. It is a mess. And the symptoms of the mess we are creating are everywhere.

Some additional facts: the prerequisites provisioning phase of OpenShift Ansible on Azure takes 50+ minutes, and the deploy phase takes more than 70 minutes. Each time! First run or subsequent runs. And there is no way to limit provisioning to a single node. This limitation was part of Ansible in 2012 and it is still part of Ansible today. That fact tells us something.

The point here is that Ansible should be used as it was intended: for simple tasks, without the YAML programming. It is fine for lots of servers, but it should not be used for complex configuration management tasks.

Ansible is not an Infrastructure as Code (IaC) tool.

If you ask how to debug Ansible issues, you are using it in a way it was not intended to be used. Don't use it as an IaC tool.

Zollie answered 20/7, 2018 at 13:49 Comment(4)
Regarding your example code, you can simplify it by evaluating parts of it before executing the command (set_fact). And there is no way to limit provision to a single node. I provision a single node most of the time. Does this have anything to do with your playbook? it should not be used for complex configuration management tasks Can you recommend any other tool? Ansible is a not Infrastructure as Code ( IaC ) tool. I think it's still IaC, even if only for simple configuration management tasks.Dibs
Provision of prerequisites phase on Azure of Openshift Ansible takes on +50 minutes. It's not clear what makes it take so long. Are there other solutions which are significantly faster? This fact tells us something. Also, it takes second place on CodeTriage. What I personally don't like is that there's generally no easy way to change one particular setting in a configuration file. You're forced to replace the whole configuration file occasionally, or write complex code.Dibs
If you ask how to debug Ansible issues, you are using it in a way it was not intended to be used. There are times when ansible behaves in a way you don't understand. Maybe you just failed to find it in the documentation, or it may just be a bug. But you've got a valid point. Generally, I suppose ansible is just imperfect. The idea is good, but it lacks features, plugins, modules, roles to make it come true.Dibs
That isn't any kind of "advanced programming". It's simply templating the parameters to a command, very much what templates are for. Nonetheless, if you're using the command module, then you've exposed a missing piece of functionality in ansible (not an architectural issue, though it certainly has those as well). And no, of course it isn't IaC... it's very much the opposite: infrastructure as DATA (which you can see clearly in your example: a bunch of nested dictionaries + a template).Raul
10

Here's what I came up with.

Ansible sends modules to the target system and executes them there. Therefore, if you change a module locally, your changes will take effect when you run the playbook. On my machine modules are at /usr/lib/python2.7/site-packages/ansible/modules (ansible-2.1.2.0), and the service module is at core/system/service.py. Ansible modules (instances of the AnsibleModule class declared in module_utils/basic.py) have a log method, which sends messages to the systemd journal if available, or falls back to syslog. So, run journalctl -f on the target system, add debug statements (module.log(msg='test')) to the module locally, and run your playbook. You'll see the debug statements under the ansible-basic.py unit name.
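
For instance, on the target host you might follow the journal while the playbook runs (a sketch; the identifier Ansible logs under can differ between versions):

journalctl -f                        # follow everything while the task runs
journalctl -f -t ansible-basic.py    # or filter by the syslog identifier mentioned above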

Additionally, when you run ansible-playbook with -vvv, you can see some debug output in the systemd journal: at least invocation messages, and error messages if any.

One more thing: if you try to debug code that's running locally with pdb (import pdb; pdb.set_trace()), you'll most likely run into a BdbQuit exception. That's because Python closes stdin when creating a thread (the Ansible worker). The solution is to reopen stdin before running pdb.set_trace(), as suggested here:

import sys
sys.stdin = open('/dev/tty')
import pdb; pdb.set_trace()
Dibs answered 23/2, 2017 at 13:55 Comment(2)
I haven't commented, well, it's a very nice write-up. I really like the fact you indeed had to actually run the debugger :), and luckily only the local code :)Eightfold
Thank you for the stdin tip! This worked with pdb but didn't work with ipdb. When running my playbook on localhost, though, I see that FDs 0, 1, and 2 are all still pointing at my TTY, so I used sys.stdin = os.fdopen(0, 'r') instead and ipdb was happy with that.Counterfeit
5

Debugging roles/playbooks

Basically, debugging Ansible automation over a big inventory across large networks is nothing other than debugging a distributed network application. It can be very tedious and delicate, and there are not enough user-friendly tools.

Thus I believe the answer to your question is a union of all the answers before mine, plus a small addition. So here:

  • absolutely mandatory: you have to want to know what's going on, i.e. what you're automating and what you expect to happen. E.g. Ansible failing to detect a service with a systemd unit as running or stopped usually means a bug in the service unit file or in the service module, so you need to 1. identify the bug, 2. report the bug to the vendor/community, 3. provide your workaround with a TODO and a link to the bug, 4. delete your workaround once the bug is fixed

  • to make your code easier to debug, use modules as much as you can

  • give all tasks and variables meaningful names.

  • use static code analysis tools like ansible-lint. This saves you from really stupid small mistakes.

  • utilize verbosity flags and log path

  • use debug module wisely

  • "Know thy facts" - sometimes it is useful to dump target machine facts into file and pull it to ansible master

    • use strategy: debug - in some cases you can fall into the task debugger on error. You can then evaluate all the params the task is using and decide what to do next

    • the last resort would be using a Python debugger, attached to the local Ansible run and/or to the remote Python executing the modules. This is usually tricky: you need an additional port open on the machine, and what if the code opening the port is the one causing the problem?
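
As an illustration of the facts dump mentioned above (a minimal sketch; the task names and the /tmp path are made up for this example):

- name: Dump the gathered facts on the target host
  copy:
    content: "{{ hostvars[inventory_hostname] | to_nice_json }}"
    dest: /tmp/facts_dump.json

- name: Pull the dump back to the control machine
  fetch:
    src: /tmp/facts_dump.json
    dest: "facts/{{ inventory_hostname }}.json"
    flat: true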

Also, sometimes it is useful to "look aside": connect to your target hosts and increase their debuggability (more verbose logging).

Of course, log collection makes it easier to track changes happening as a result of Ansible operations.

As you can see, as with any other distributed applications and frameworks, debuggability is still not what we'd wish for.

Filters/plugins

This is basically Python development; debug it as you would any Python app.

Modules

Depending on the technology, and complicated by the fact that you need to see both what happens locally and what happens remotely, you had better choose a language that is easy enough to debug remotely.

Eightfold answered 11/7, 2017 at 0:23 Comment(1)
@Dibs you can try using an IDE as well, e.g. PyCharm works pretty OK. But of course take care to point it at the right Python, open a port, etc.Eightfold
4

You could use the register keyword and the debug module to print return values. For example, I want to know the return code of my script "somescript.sh", so I will have tasks inside the play such as:

- name: my task
  shell: "bash somescript.sh"
  register: output

- debug:
    msg: "{{ output.rc }}"

For the full list of return values you can access in Ansible, check this page: http://docs.ansible.com/ansible/latest/common_return_values.html

Hyponitrite answered 22/11, 2017 at 15:7 Comment(0)
2

There are multiple levels of debugging that you might need, but the easiest one is to set the ANSIBLE_STRATEGY=debug environment variable, which will enable the debugger on the first error.
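
For example (a sketch; site.yml stands for whatever playbook you are running, and the prompt commands shown are the standard task-debugger ones):

ANSIBLE_STRATEGY=debug ansible-playbook site.yml

# at the (debug) prompt, once a task fails:
#   p task.args                       print the arguments the task was called with
#   p task_vars['ansible_host']       print a variable available to the task
#   task.args['state'] = 'started'    modify an argument in place
#   redo                              re-run the failed task
#   continue / quit                   resume the play or give up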

Crotchety answered 16/11, 2017 at 14:18 Comment(0)
2

1st approach: debug the Ansible module via the q module, printing debug logs with calls like q('Debug statement'). Please check the q module's page to see where in the tmp directory the logs get generated; in the majority of cases they end up at $TMPDIR/q or /tmp/q, so one can do tail -f $TMPDIR/q to follow the logs once the play running the module executes (ref: q module).
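
A minimal sketch of that approach (it assumes the q package is installed wherever the module code actually runs; the result variable is made up for illustration):

# inside the module code you are inspecting
import q

result = dict(changed=False)
q('about to return:', result)   # appended to $TMPDIR/q (usually /tmp/q)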

2nd approach: if the play is running on localhost, one can use the pdb module to debug the play, following the respective doc: https://docs.ansible.com/ansible/latest/dev_guide/debugging.html

3rd approach: use the Ansible debug module to print the play result and debug the module (ref: Debug module).

Rhody answered 18/3, 2019 at 6:14 Comment(4)
Regarding the 1st approach, there's a lot you omit. You've got to make it leave the task on the server, then you explode it, and at that point you can just go with pdb.Dibs
I am not sure what you meant by leave the task on the server; as Ansible is agentless, the q module debug statement should run on the system from which the Ansible play is fired, and the debug statement is also generated in that same system's tmp dir, because pdb won't help if you are trying to debug over connections other than localhost.Rhody
When ansible executes a task it usually copies a script that performs the task to the server (to ~/.ansible/tmp), then executes it. And normally after executing a task, it removes the script. By "leave the task on the server" I mean making it not remove the script. See ANSIBLE_KEEP_REMOTE_FILES environment variable. Then, you're saying that q is to be run on the control node (not on managed ones). That it will write to the tmp dir on the control node. Which means it's only able to debug code executing on the control node. As is the case with pdb. What's the difference?Dibs
...Meaning, the way it seems, both pdb and q won't let you debug code that is executed remotely. If that's not the case with q, how is it able to achieve that? Achieve debugging code on a managed node, and write to tmp on the control node. Please correct me where I'm wrong.Dibs
0

You can try using aiansible to debug: https://github.com/sunnycloudy/aiansible

DEBUG INFO:
/root/.nujnus/test_suite/K8s_v2_22_2/install_k8s_v2_22_2/install/kubespray/roles/kubespray-defaults/tasks/main.yaml:2


    2|- name: Configure defaults
    3|  debug:
    4|    msg: "Check roles/kubespray-defaults/defaults/main.yml"
    5|  tags:
    6|    - always
    7|
    8|# do not run gather facts when bootstrap-os in roles
    9|- name: set fallback_ips
   10|  import_tasks: fallback_ips.yml
   11|  when:


Saturday 25 May 2024  23:07:13 +0800 (0:00:00.101)       10:20:04.700 ********* 

TASK [kubespray-defaults : Configure defaults] *****************************************************************************************************************************************************************
ok: [test1] => {
    "msg": "Check roles/kubespray-defaults/defaults/main.yml"
}
Aiansible(CN) => result._result
{'msg': 'Check roles/kubespray-defaults/defaults/main.yml', '_ansible_verbose_always': True, '_ansible_no_log': False, 'changed': False}
Aiansible(CN) => bt
0:/root/.nujnus/test_suite/K8s_v2_22_2/install_k8s_v2_22_2/install/kubespray/playbooks/ansible_version.yml:11=>Check 2.11.0 <= Ansible version < 2.13.0
1:/root/.nujnus/test_suite/K8s_v2_22_2/install_k8s_v2_22_2/install/kubespray/playbooks/ansible_version.yml:20=>Check that python netaddr is installed
2:/root/.nujnus/test_suite/K8s_v2_22_2/install_k8s_v2_22_2/install/kubespray/playbooks/ansible_version.yml:28=>Check that jinja is not too old (install via pip)
3:/root/.nujnus/test_suite/K8s_v2_22_2/install_k8s_v2_22_2/install/kubespray/roles/download/tasks/prep_download.yml:2=>download : prep_download | Set a few facts
4:/root/.nujnus/test_suite/K8s_v2_22_2/install_k8s_v2_22_2/install/kubespray/roles/download/tasks/prep_download.yml:8=>download : prep_download | On localhost, check if passwordless root is possible
5:/root/.nujnus/test_suite/K8s_v2_22_2/install_k8s_v2_22_2/install/kubespray/roles/download/tasks/prep_download.yml:23=>download : prep_download | On localhost, check if user has access to the container runtime without using sudo
6:/root/.nujnus/test_suite/K8s_v2_22_2/install_k8s_v2_22_2/install/kubespray/roles/download/tasks/prep_download.yml:38=>download : prep_download | Parse the outputs of the previous commands
7:/root/.nujnus/test_suite/K8s_v2_22_2/install_k8s_v2_22_2/install/kubespray/roles/download/tasks/prep_download.yml:48=>download : prep_download | Check that local user is in group or can become root
8:/root/.nujnus/test_suite/K8s_v2_22_2/install_k8s_v2_22_2/install/kubespray/roles/download/tasks/prep_download.yml:59=>download : prep_download | Register docker images info
9:/root/.nujnus/test_suite/K8s_v2_22_2/install_k8s_v2_22_2/install/kubespray/roles/download/tasks/prep_download.yml:68=>download : prep_download | Create staging directory on remote node
10:/root/.nujnus/test_suite/K8s_v2_22_2/install_k8s_v2_22_2/install/kubespray/roles/download/tasks/prep_download.yml:78=>download : prep_download | Create local cache for files and images on control node
11:/root/.nujnus/test_suite/K8s_v2_22_2/install_k8s_v2_22_2/install/kubespray/roles/download/tasks/main.yml:10=>download : download | Get kubeadm binary and list of required images
12:/root/.nujnus/test_suite/K8s_v2_22_2/install_k8s_v2_22_2/install/kubespray/roles/download/tasks/main.yml:19=>download : download | Download files / images
13:/root/.nujnus/test_suite/K8s_v2_22_2/install_k8s_v2_22_2/install/kubespray/roles/kubespray-defaults/tasks/main.yaml:2=>kubespray-defaults : Configure defaults
Aiansible(CN) => a
msg: Check roles/kubespray-defaults/defaults/main.yml


Wonted answered 25/5, 2024 at 15:5 Comment(0)
0
TASK [set_fact] **********************************************
fatal: [172.27.1.180]: FAILED! => {"msg": "template error while templating string: Could not load \"lib_utils_oo_select_keys\": 'lib_utils_oo_select_keys'. String: {{ hostvars | lib_utils_oo_select_keys(hostvars,openshift_master_etcd_hosts_group) | lib_utils_oo_collect('openshift.common.ip') | default([]) | join(',') }}. Could not load \"lib_utils_oo_select_keys\": 'lib_utils_oo_select_keys'"}

I'm running the Ansible prerequisites.yml file from the release-3.11 branch to set up an OpenShift cluster. What is the issue with this? Ansible version 2.16.6.

Syndicalism answered 21/6, 2024 at 10:36 Comment(1)
If you have a new question, please ask it by clicking the Ask Question button. Include a link to this question if it helps provide context. - From ReviewAfb
