How to detect why an Ansible playbook hangs during execution

Some of the tasks I wrote start and never end. Ansible does not produce any errors or logs that would explain this, even with the -vvvv option. The playbook just hangs, and waiting for hours doesn't change anything.

When I try to run my tasks manually (by entering the commands via SSH), everything is fine.

Example task that hangs:

- name: apt upgrade
  shell: apt-get upgrade

Is there any way to see stdout and stderr? I tried:

- name: apt upgrade
  shell: apt-get upgrade
  register: hello
- debug: msg="{{ hello.stdout }}"
- debug: msg="{{ hello.stderr }}"

but nothing changed.

I do have the required permissions and I pass the correct sudo password; other tasks that require sudo execute correctly.
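
One way to watch the output while a task is still running, since register only exposes stdout and stderr after the task has finished, is to redirect the command's output to a file on the remote host and tail that file over a separate SSH session. A minimal sketch, with an illustrative log path:

- name: apt upgrade (sketch - log output so it can be tailed during the run)
  shell: apt-get -y upgrade > /tmp/apt-upgrade.log 2>&1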

Gulosity answered 27/12, 2013 at 10:12 Comment(3)
You are passing the -K option? – Rolandorolandson
Yes. But my problem was solved here: groups.google.com/forum/#!topic/Ansible-project/mm99yAPVrfc – Gulosity
Ok, cool. FYI, you should add the solution as an answer and accept it yourself, which will help others when they view this question. – Rolandorolandson

The most probable cause of your problem is the SSH connection: when a task requires a long execution time, the SSH connection times out. I faced such a problem once. In order to overcome the SSH timeout, create an ansible.cfg in the current directory from which you are running Ansible and add the following:

[ssh_connection]
ssh_args = -o ServerAliveInterval=n

Here n is the ServerAliveInterval, in seconds, used while connecting to the server through SSH; set it somewhere between 1 and 255. This causes the SSH client to send null packets to the server every n seconds, which keeps the connection from timing out.
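
For example, a minimal ansible.cfg in the playbook directory, with 30 seconds as an illustrative interval:

# ansible.cfg
[ssh_connection]
ssh_args = -o ServerAliveInterval=30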

Playbook answered 15/7, 2015 at 22:13 Comment(2)
The following fixed my woes: [ssh_connection]\n ssh_args = -o ServerAliveInterval=30 -o ControlMaster=auto -o ControlPersist=60s – Storyteller
A small note: ServerAliveInterval=100 on its own slows down the execution of Ansible tasks. You have to combine it with ControlMaster=auto -o ControlPersist=10m – Gusto

I was having the same problem with a playbook.

It ran perfectly until some point and then stopped, so I added the async and poll parameters to avoid this behavior:

- name: update packages full into each server
  apt: upgrade=full
  ignore_errors: True
  async: 60
  poll: 60

and it worked like a charm! I really don't know what happened, but it seems Ansible now keeps track of what's going on and doesn't freeze anymore!

Hope it helps

Twirp answered 10/4, 2015 at 14:5 Comment(2)
What's going on is that instead of sitting waiting on the command (and timing out on the SSH connection), Ansible will check back on the command, in this case every 60 seconds up to a maximum of 60 seconds (in other words, once). This sidesteps the issue of SSH timing out. – Consequently
Mine ends up doing fatal: [fto-tctest03]: FAILED! => {"async_result": {"ansible_job_id": "j167156773363.3704", "finished": 0, "invocation": {"module_args": {"_async_dir": "/root/.ansible_async", "jid": "j167156773363.3704", "mode": "status"}}, "results_file": "/root/.ansible_async/j167156773363.3704", "started": 1, "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}, "changed": false, "msg": "async task did not complete within the requested time - 60s"} – Breach
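
For jobs that genuinely need more time than that (the "did not complete within the requested time" failure in the comment above), a sketch of the fire-and-forget variant, with illustrative timeouts: give the job a generous async window, detach with poll: 0, and check on it separately with async_status.

- name: run the upgrade in the background
  apt: upgrade=full
  async: 3600          # allow the job up to an hour
  poll: 0              # detach; do not hold the SSH connection open
  register: upgrade_job

- name: wait for the upgrade to finish
  async_status:
    jid: "{{ upgrade_job.ansible_job_id }}"
  register: job_result
  until: job_result.finished
  retries: 120         # 120 checks...
  delay: 30            # ...every 30 seconds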

I had the same issue, and after a bit of fiddling around I found the problem to be in the fact-gathering step. Here are a few tips for resolving any similar issue.

Disable fact-gathering in your playbook:

---
- hosts: myservers
  gather_facts: no
..

Rerun the playbook. If it works, then the culprit is not SSH itself but rather the script that gathers the facts. We can debug that issue quite easily.

  1. SSH to the remote box.
  2. Find the setup file somewhere in the .ansible folder.
  3. Run it with ./setup or python -B setup

If it hangs, then we know the problem is here for sure. To find exactly what makes it hang, you can simply open the file with an editor and add print statements, mainly in the populate() method of Facts. Rerun the script and see how far it gets.

For me the issue seemed to be the hostname lookup at the line self.facts['fqdn'] = socket.getfqdn(), and with a bit of googling it turned out to be an issue with resolving the remote hostname.
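
To confirm whether that lookup is what hangs on a given host, a quick check over SSH is the one-liner below; if it stalls, making sure the machine's hostname resolves (for example via an /etc/hosts entry) usually clears it up.

python -c 'import socket; print(socket.getfqdn())'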

Covarrubias answered 9/2, 2016 at 10:14 Comment(1)
What if my .ansible directory doesn't have a setup file, only a ./tmp directory which is also empty? – Inebriate

In my case, ansible was "hanging forever" because apt-get was trying to ask me a question! How did I figure this out? I went to the target server and ran ps -aef | grep apt and then did a kill on the appropriate "stuck" apt-get command.

Immediately after I did that, my ansible playbook sprang back to life and reported (with ansible-playbook -vvv option given):

    " ==> Deleted (by you or by a script) since installation.",
    " ==> Package distributor has shipped an updated version.",
    "   What would you like to do about it ?  Your options are:",
    "    Y or I  : install the package maintainer's version",
    "    N or O  : keep your currently-installed version",
    "      D     : show the differences between the versions",
    "      Z     : start a shell to examine the situation",
    " The default action is to keep your current version.",
    "*** buildinfo.txt (Y/I/N/O/D/Z) [default=N] ? "

After reading that helpful diagnostic output, I immediately realized I needed some appropriate dpkg options (see, for example, this DevOps post). In my case, I chose:

- apt:
    name: '{{ item }}'
    state: latest
    update_cache: yes
    # Force apt to always update to the newer config files in the package:
    dpkg_options: 'force-overwrite,force-confnew'
  loop: '{{ my_packages }}'

Also, don't forget to clean up after your killed ansible session with something like this, or your install will still likely fail:

sudo dpkg --configure -a
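
If the upgrade has to stay in a shell task, as in the original question, a sketch of a fully non-interactive run (the Dpkg option shown keeps the currently installed config files; swap in --force-confnew to take the package maintainer's version):

- name: apt upgrade without prompts
  shell: apt-get -y -o Dpkg::Options::="--force-confold" upgrade
  environment:
    DEBIAN_FRONTEND: noninteractive   # suppresses debconf prompts; the Dpkg option handles conffile prompts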
Reproachless answered 2/11, 2020 at 22:43 Comment(1)
Similar situation here. In my case, I added delegate_to: localhost to a task that simply created a temporary directory. After seeing your post, I poked around and saw sudo in the ansible processes. Added become: false to the task and it no longer hangs. – Taler

A totally different workaround for me. I had this happen going from a Debian Jessie box (Linux PwC-Deb64 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt25-2+deb8u3 (2016-07-02) x86_64 GNU/Linux) to another Debian image I was trying to build in AWS.

After many of the suggestions here didn't work for me, I became suspicious of the SSH "shared" connection. I went to my ansible.cfg, found the ssh_args line, and set ControlMaster=no. This may now perform more slowly, because I've lost the SSH performance boost that connection sharing is supposed to give, but it seems there is some interaction between it and apt-get that causes the issue.

Your ansible.cfg could be in the directory that you run ansible from, or in /etc/ansible. If the latter, you may like to take a copy of it into a local directory before you start changing it!
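
A minimal sketch of that change (add the [ssh_connection] section if your ansible.cfg does not already have one):

# ansible.cfg - disable SSH connection sharing
[ssh_connection]
ssh_args = -o ControlMaster=no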

Carcinomatosis answered 8/9, 2016 at 4:57 Comment(0)

My situation was that the commands Ansible was attempting to run required an additional tty for input. I was able to run ps -aef on the remote machine and find the last command that went through. I reran that command and found that it did indeed require an additional sudo password on top of what Ansible had already supplied. Luckily, I was able to do without the component, so I just removed it from the script.

A more robust solution would be to disable requiretty in the /etc/sudoers file. However, if you do that then make sure you revert it after the script runs because there are security implications that come with disabling that setting.
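
For reference, the sudoers lines in question look like the sketch below (edit with visudo; the "deploy" user name is illustrative). Exempting only the connecting user is a narrower change than removing the setting globally.

Defaults    requiretty          # forces sudo to run with a terminal attached
Defaults:deploy    !requiretty  # exempt just the user Ansible connects as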

Agonistic answered 14/3, 2023 at 1:21 Comment(0)

A one-liner that applies tasks to thousands of hosts (/tmp/hosts) under the control of the timeout utility. The script splits the list of hosts into chunks of 25 and controls the execution of each chunk until the timeout (600 s) expires. If the timeout is exceeded, the timeout utility kills the ansible-playbook run for that iteration. This keeps ansible-playbook from getting stuck on defunct, frozen hosts.

ANSIBLE_HOST_KEY_CHECKING=False /bin/bash -c 'while mapfile -n 25 ary && ((${#ary[@]})); do echo "${ary[@]}" | tr -d " " > /tmp/hosts.chunk; timeout 600 /usr/bin/ansible-playbook -v -b -i /tmp/hosts.chunk -u ansible -e "var_url_deb=https://server/files/pkg.deb" /var/www/html/git/remediations-gendbuntu/utils/unique-tasks/install-url-deb.yml; done < /tmp/hosts'

Repo link: https://github.com/skosachiov/remediations-gendbuntu/blob/main/utils/unique-tasks/at-template.sh

Gripe answered 25/6, 2023 at 7:33 Comment(0)

Removing the passphrase from my SSH key fixed it for me, e.g.:

ssh-keygen -p
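
An alternative that keeps the passphrase on the key is to load it into ssh-agent before running the playbook (the key path is illustrative):

eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_rsa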
Graniah answered 1/4, 2018 at 14:33 Comment(2)
Answered on April 1? Well played, sir. – Paramecium
It's a shame, but this really was the solution for me. Even with the -k option and sshpass, it was still hanging. Will have to look for something more appropriate at a later time. – Deherrera

I was using Ansible to install a cluster of OpenDaylight SDN controllers on Ubuntu 20.04 VMs. Gathering facts was reporting a Python version warning and hanging. Installing Python 3.8 on my three VM worker nodes resolved the issue.
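
If the right interpreter is already installed, pinning it explicitly in the inventory avoids interpreter-discovery surprises during fact gathering (the host name and path below are illustrative):

node1 ansible_python_interpreter=/usr/bin/python3.8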

Sammiesammons answered 9/3, 2022 at 9:19 Comment(0)

Env: Ansible running on Ubuntu / WSL

In my case I ran the playbook against "all" hosts, but my inventory also contained the controller host itself, which uses ansible_connection: local.

Targeting my playbook properly fixed my problem.
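
A sketch of what proper targeting can look like (the "controller" group name is illustrative): exclude the local controller entry from the play's host pattern.

- hosts: "all:!controller"   # run on everything except the controller group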

Rh answered 13/6, 2024 at 15:15 Comment(0)
