How do you make Terraform wait for cloud-init to finish?
In my Terraform AWS Docker Swarm module I use cloud-init to initialize the EC2 instance. However, Terraform says the resource is ready before cloud-init finishes. Is there a way of making it wait for cloud-init to finish, ideally without SSHing or checking for a port to be up using a null resource?

Testate answered 31/5, 2020 at 13:5 Comment(1)
This is a great question. I was thinking about this the other day for Packer. Likely a similar solution exists for both. I'll do some thinking and try to answer this. - Koheleth

Your managers and workers both use template_cloudinit_config, and both already have the ec2:CreateTags permission.

You can use an EC2 resource tag such as trajano/terraform-docker-swarm-aws/cloudinit-complete to signal that cloud-init has finished.

You could add this as the final part of each cloud-init config to invoke a tagging script:

part {
  filename     = "tag_complete.sh"
  content      = local.tag_complete_script
  content_type = "text/x-shellscript"
}

And declare tag_complete_script as follows:

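locals {
  tag_complete_script = <<-EOF
  #!/bin/bash
  # Fetch an IMDSv2 session token, then use it to look up this instance's ID.
  # Plain $(...) is used because Terraform would try to interpolate a dollar-brace expression.
  TOKEN="$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")"
  instance_id="$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id)"
  # Assumes the AWS CLI is installed on the instance and a region is configured for it.
  aws ec2 create-tags --resources "$instance_id" --tags 'Key=trajano/terraform-docker-swarm-aws/cloudinit-complete,Value=true'
  EOF
}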

Then with a null_resource, you wait for the tag to appear (wrote this on my phone, so use it for a general idea, but I don't expect that it will work without testing and edits):

resource "null_resource" "wait_for_cloudinit" {
  provisioner "local-exec" {
    # [[ ]] needs bash; the default interpreter on Unix is /bin/sh.
    interpreter = ["/bin/bash", "-c"]
    command     = <<-EOF
    # Number of managers that must carry the tag (assumes aws_instance.managers uses count).
    expected=${length(aws_instance.managers)}
    poll_tags() {
      aws ec2 describe-tags \
        --filters 'Name=resource-id,Values=${join(",", aws_instance.managers[*].id)}' \
                  'Name=key,Values=trajano/terraform-docker-swarm-aws/cloudinit-complete' \
        --output text --query 'length(Tags)'
    }
    # Poll every 5 seconds until every manager has been tagged.
    until [[ "$(poll_tags)" == "$expected" ]]; do
      sleep 5
    done
    EOF
  }
}

This way you can have dependencies on null_resource.wait_for_cloudinit on any resources that need to run after cloudinit has completed.
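As a rough sketch of what such a dependency could look like (the resource name after_cloudinit and the echo command are placeholders, not part of the module):

```hcl
# Hypothetical downstream resource: the name and command are placeholders.
resource "null_resource" "after_cloudinit" {
  # Terraform will not create this resource until the polling provisioner has exited.
  depends_on = [null_resource.wait_for_cloudinit]

  # Re-run whenever the set of manager instances changes.
  triggers = {
    manager_ids = join(",", aws_instance.managers[*].id)
  }

  provisioner "local-exec" {
    command = "echo cloud-init has finished on all managers"
  }
}
```

Note that a null_resource without triggers only runs its provisioners once, when it is first created. Adding the same triggers map to wait_for_cloudinit itself would probably be needed if the wait should repeat when instances are replaced.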

Hesitation answered 31/5, 2020 at 15:34 Comment(8)
This presumes your local machine is UNIX, but I like the idea. - Testate
Good point! It might be better to invoke Python in the local-exec in a cross-platform friendly way and just note that Python is a prerequisite. It would be pretty easy to write this poller using boto3 directly rather than the AWS CLI. - Koheleth
Yup, what I was thinking was using an SSH remote-exec as per my OP, but I wanted to avoid that as well because it assumes the private key for the service is available on the machine running Terraform. - Testate
Indeed. I was wracking my brain for a way around SSH or port checks. Tags seemed like a decent option. Running the Terraform build in a Docker container or WSL seems like a tolerable requirement for Windows users, but I haven't used Windows for work in several years, so I probably shouldn't speak for that user base. - Koheleth
There may be a flaw with the tag implementation: if the tag is applied to a Terraform-managed resource, Terraform will strip it off. - Testate
If the Terraform state has the exact tag key in it for those EC2 instances, it would conflict with the unmanaged tag applied by cloud-init. AFAIK, resource tags are handled by the provider individually by key, not as a group. I might have a bad assumption there. - Koheleth
I think a good way of testing is to run apply twice and see if there are any changes. - Testate
Did what I suggested break after you ran it twice? If not, can you accept it as the answer here? - Koheleth
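On the tag-stripping concern raised in these comments: one possible way to sidestep it, assuming an AWS provider version that supports the ignore_tags block, is to tell the provider to ignore the completion tag entirely so that it never shows up as drift. This is a sketch for the root module's provider configuration, not something taken from the module itself:

```hcl
provider "aws" {
  # Tags listed here are neither reported as drift nor removed on apply.
  ignore_tags {
    keys = ["trajano/terraform-docker-swarm-aws/cloudinit-complete"]
  }
}
```

Running terraform apply twice, as suggested above, is still the simplest way to confirm that the second run reports no changes.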

Another possible approach is using AWS Systems Manager Run Command, if available on your AMI.
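Run Command also needs the instance to be allowed to talk to Systems Manager, which is normally done by attaching the AWS managed policy AmazonSSMManagedInstanceCore to the instance role. A minimal sketch follows; the resource and role names are hypothetical, and if your instances already have a role, only the policy attachment should be needed:

```hcl
# Hypothetical role for the instances; skip this if a role already exists.
resource "aws_iam_role" "ssm" {
  name = "cloud-init-wait-ssm"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}

# AWS managed policy that lets the SSM Agent register and receive commands.
resource "aws_iam_role_policy_attachment" "ssm" {
  role       = aws_iam_role.ssm.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}

# Instance profile wrapping the role, for use on the EC2 instances.
resource "aws_iam_instance_profile" "ssm" {
  name = "cloud-init-wait-ssm"
  role = aws_iam_role.ssm.name
}
```

The profile would then be referenced from the instance with iam_instance_profile = aws_iam_instance_profile.ssm.name.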

You create an SSM document with Terraform that runs cloud-init status --wait, trigger it from a local-exec provisioner, and wait for the command to complete. This way you don't have to deal with tags, and the command only returns once cloud-init has finished.

This is an example of the document you can create with Terraform:

resource "aws_ssm_document" "cloud_init_wait" {
  name = "cloud-init-wait"
  document_type = "Command"
  document_format = "YAML"
  content = <<-DOC
    schemaVersion: '2.2'
    description: Wait for cloud init to finish
    mainSteps:
    - action: aws:runShellScript
      name: StopOnLinux
      precondition:
        StringEquals:
        - platformType
        - Linux
      inputs:
        runCommand:
        - cloud-init status --wait
    DOC
}

You can then use a local-exec provisioner, either inside the EC2 instance block or in a null resource, depending on what you need to do with it.

The provisioner would be more or less like this:

provisioner "local-exec" {
  interpreter = ["/bin/bash", "-c"]

  command = <<-EOF
  set -Ee -o pipefail
  export AWS_DEFAULT_REGION=${data.aws_region.current.name}

  command_id=$(aws ssm send-command --document-name ${aws_ssm_document.cloud_init_wait.arn} --instance-ids ${self.id} --output text --query "Command.CommandId")
  if ! aws ssm wait command-executed --command-id $command_id --instance-id ${self.id}; then
    echo "Failed to start services on instance ${self.id}!";
    echo "stdout:";
    aws ssm get-command-invocation --command-id $command_id --instance-id ${self.id} --query StandardOutputContent;
    echo "stderr:";
    aws ssm get-command-invocation --command-id $command_id --instance-id ${self.id} --query StandardErrorContent;
    exit 1;
  fi;
  echo "Services started successfully on the new instance with id ${self.id}!"
  EOF
}
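One thing to watch for: send-command can fail with InvalidInstanceId if it is called before the SSM Agent on the new instance has registered with Systems Manager. Rather than a fixed sleep, a separate resource can poll until the instance reports as Online. This is only a sketch under assumptions: it presumes a single instance named aws_instance.manager (a hypothetical name), a Unix machine running Terraform, and an arbitrary timeout of about five minutes.

```hcl
resource "null_resource" "wait_for_ssm" {
  # Assumes a single instance named aws_instance.manager; adjust to your resource.
  triggers = {
    instance_id = aws_instance.manager.id
  }

  provisioner "local-exec" {
    interpreter = ["/bin/bash", "-c"]
    command     = <<-EOF
    # Poll every 5 seconds, up to 60 times, until the SSM Agent reports Online.
    for attempt in $(seq 1 60); do
      status=$(aws ssm describe-instance-information \
        --filters "Key=InstanceIds,Values=${aws_instance.manager.id}" \
        --query "InstanceInformationList[0].PingStatus" --output text)
      if [ "$status" = "Online" ]; then exit 0; fi
      sleep 5
    done
    echo "Instance never registered with SSM" >&2
    exit 1
    EOF
  }
}
```

With this in place, the send-command provisioner would live in another null resource that lists null_resource.wait_for_ssm in its depends_on, using the instance ID directly instead of self.id.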
Extensometer answered 19/8, 2021 at 14:40 Comment(4)
I get "An error occurred (InvalidInstanceId) when calling the SendCommand operation: Instances [[i-09eb8edc7df904bd9]] not in a valid state for account 71********30" and I'm not really sure why. (I checked the ID, it's valid.) - Otherdirected
Manually running the aws ssm send-command against the instance after it's created works though... - Otherdirected
I added a sleep 30 at the beginning of your command block to make sure the instance was running before it's executed, and it did the trick. - Otherdirected
Also for reference, if someone else needs this: the SSM commands are run by the machine running Terraform, not the instance. So if you're using AWS profiles, you want to add --profile your_profile to all aws ssm commands. - Otherdirected
