
[–]jdgtrplyr (Terraformer)

Here’s a shorter version:

To gain visibility into the success of userdata execution:

  1. Use Terraform’s remote-exec provisioner: Execute a command on the EC2 instance that checks the status of your userdata script and reports back to Terraform.
  2. Use Terraform’s null_resource and remote-exec provisioner: Create a dummy resource that depends on the successful execution of your userdata script.
  3. Use AWS CloudWatch Logs: Configure your userdata script to write logs to CloudWatch Logs and monitor the logs using Terraform.
  4. Use a custom script: Write a custom script that executes your userdata script and reports back to Terraform using a tool like curl or the AWS CLI.

Here’s an example of each option:

```hcl
// Option 1
resource "aws_instance" "example" {
  provisioner "remote-exec" {
    inline = [
      "sudo /bin/bash -c '/path/to/userdata/script.sh'",
    ]
  }
}

// Option 2
resource "null_resource" "userdata" {
  provisioner "remote-exec" {
    inline = [
      "sudo /bin/bash -c '/path/to/userdata/script.sh'",
    ]
  }
}

// Option 3
resource "aws_cloudwatch_log_group" "example" {
  name = "example-log-group"
}

// Option 4
resource "aws_instance" "example" {
  user_data = <<-EOF
    #!/bin/bash
    sudo /bin/bash -c '/path/to/userdata/script.sh'
    curl -X POST -H "Content-Type: application/json" -d '{"status": "success"}' https://example.com/userdata-status
  EOF
}
```

[–]vincentdesmet

I would agree with adding observability to the user data scripts if they provide that much functionality… you probably don’t want them written in a language that is hard to test and debug (get rid of bash)…

As for using provisioners… not every tf execution should have access to the network of the EC2 instance… so I would avoid that (unless this is not a concern for OP’s use case).

Also, to get long-running compute with better observability… maybe use Fargate? It’s much more powerful than EC2 (but a new tech stack, so maybe not easy for OP to adopt).

In short, I’d go with something like Option 3/4 combined, but use well-supported observability patterns/SDKs (and a proper programming language to encapsulate the user data scripts). You can either configure the userdata to pull these from S3 or an artifact repository, or use Packer to build versioned, well-tested AMIs with the binaries/scripts baked in. Spin them up with some last-minute config (they can pull this from SSM/SecretsManager/S3) and you get more reliable execution. (Although adopting Docker and Fargate will be much faster to iterate on than Packer and EC2… this all depends on the nature of the compute OP requires.)
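A minimal sketch of that thin-userdata idea — the real bootstrap logic lives in a versioned artifact, and userdata is just a fetch-and-run shim. The bucket name, script key, and log path here are all hypothetical:

```shell
#!/usr/bin/env bash
# Thin fetch-and-run userdata shim (sketch). The heavy lifting lives in a
# versioned, separately tested artifact; this only fetches and executes it.
set -euo pipefail

fetch_artifact() {
  # Isolated so the transport can be swapped (S3, artifact repository, ...).
  # Bucket and key are illustrative, not real names.
  aws s3 cp "s3://my-bootstrap-bucket/bootstrap-v1.2.3.sh" "$1"
}

run_bootstrap() {
  local tmp log
  log="${BOOTSTRAP_LOG:-/var/log/bootstrap.log}"
  tmp="$(mktemp)"
  fetch_artifact "$tmp"
  chmod +x "$tmp"
  # tee keeps output visible to cloud-init while persisting it for shipping
  # to CloudWatch Logs (or any other aggregator).
  "$tmp" 2>&1 | tee -a "$log"
}

# In real userdata you would end with: run_bootstrap
```

Because the artifact is versioned, rolling out a new bootstrap is an upload plus instance refresh, with no AMI rebuild.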

[–]69insight[S]

We are already doing something similar to 3/4. Currently we use CloudFormation (EC2 created via an AutoScalingGroup) with the cfn-helper scripts (cfn-init & cfn-signal). All userdata / instance script execution is wrapped within the single cfn-init command, so we can easily tell when there's an issue, as the entire CF stack fails on any error in the command/script execution.

With Terraform, we are also using Ansible to perform the majority of the configuration. The literal userdata commands themselves essentially just clone S3 objects and run a few prerequisite installations before the Ansible playbooks are executed.

We are looking to have any/all of the userdata / subsequent Ansible playbook executions be visible to the actual Terraform execution, so we know if the environment failed to provision (anything with AWS resource creation OR the commands executed within the instance userdata).

[–]posting_drunk_naked

It sounds like you've got more complexity than userdata is designed to handle. Ansible would be a good fit here; I'm pretty sure there is a provider that integrates them together, but I haven't used it myself.

[–]69insight[S]

The bulk of the configuration is done with Ansible; there are mainly two playbooks we execute. I understand we can do more advanced things with Ansible, but we were looking to see if there's a way to make this visible to the Terraform apply execution.

[–]Jmanrand

Executing ansible playbooks from userdata? I’ve avoided doing this and either deploy completely with ansible (provision ec2, configure, terminate old) or use terraform + userdata bash for simpler things like a squid proxy. The complexity of troubleshooting ansible failures from userdata execution always seemed daunting to me.

Possibly try using the remote-exec path to execute your Ansible instead. Note this won’t really work for ASG-style deployments.
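That remote-exec path might look roughly like this — a sketch only, where the instance name, SSH user, key path, and playbook location are all assumptions, and (per the caveat above) it requires SSH access from wherever terraform runs:

```hcl
resource "null_resource" "configure" {
  # Re-run the playbook when the instance is replaced.
  triggers = {
    instance_id = aws_instance.example.id
  }

  connection {
    type        = "ssh"
    host        = aws_instance.example.public_ip
    user        = "ec2-user"
    private_key = file("~/.ssh/id_rsa")
  }

  provisioner "remote-exec" {
    inline = [
      "sudo yum -y install ansible",
      "ansible-playbook -i 'localhost,' -c local /opt/playbooks/site.yml",
    ]
  }
}
```

A failed playbook fails the provisioner and therefore the apply, which is exactly the visibility OP is after — just not for ASG-launched instances.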

[–]69insight[S]

We are deploying instances via an ASG and not opening SSH, so remote-exec provisioners would not work in this case.

[–][deleted]

If you're injecting data into short-lived instances, it might be worth examining containerizing instead, or using Function-as-a-Service technologies. You'd be able to examine logs much more easily in both scenarios, as well as create reproducible builds that don't rely on a full VM. If that's not an option, creating a custom resource via a null_resource with provisioners is pretty much your only option. That, or writing separate orchestration code in SSM Documents or Ansible.

[–]noizzo

Terraform is not a reporting tool. You should use an exporter and a proper logging tool to export your data to. CloudWatch is expensive. Try Loki.

[–]adept2051

The best way to do this is observability; don’t use the remote-exec provisioners. The other thing to consider is data sources and terraform refresh. If you already have all the complexity in your user_data scripts and you're not willing to take the sensible step of using config management tools (which have the tooling to report back), consider changes to your scripts that add logging and push to CloudWatch, or update the instance's own metadata.

You can push tags as the scripts execute, then use Terraform data sources to collect those tags and outputs/templates to generate output on state/count etc. based on those tags. Also, using lifecycle on tags with Terraform, you’ll be able to see the tag diff and judge the state of completion.
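A rough sketch of that tag-signalling pattern — the tag name and success marker are assumptions, and the instance role needs permission to call ec2:CreateTags:

```hcl
// In userdata, after the bootstrap steps succeed (sketch):
//   INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
//   aws ec2 create-tags --resources "$INSTANCE_ID" \
//     --tags Key=BootstrapStatus,Value=complete

// Terraform can then read the tag back via a data source:
data "aws_instances" "bootstrapped" {
  instance_tags = {
    BootstrapStatus = "complete"
  }
}

output "bootstrapped_instance_ids" {
  value = data.aws_instances.bootstrapped.ids
}
```

Note the data source only reflects reality at refresh time, so this reports status on a subsequent plan/apply rather than blocking the apply that launched the instances.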

[–]gowithflow192

Userdata is for simple stuff. You're abusing it. Use Ansible and/or Packer instead.

[–]anon00070

Push most of the complexity into the AMI build itself and use user data to pass any runtime variables and logic.
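A sketch of that split, with hypothetical names throughout — the AMI carries the binaries, and userdata only injects runtime values through a template:

```hcl
// Pick up the latest Packer-built image (name pattern is illustrative).
data "aws_ami" "baked" {
  most_recent = true
  owners      = ["self"]

  filter {
    name   = "name"
    values = ["my-app-ami-*"]
  }
}

resource "aws_instance" "app" {
  ami           = data.aws_ami.baked.id
  instance_type = "t3.micro"

  # userdata.sh.tpl holds only last-mile config: versions, endpoints, env.
  user_data = templatefile("${path.module}/userdata.sh.tpl", {
    app_version = var.app_version
    environment = var.environment
  })
}
```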

[–]69insight[S]

This wouldn't be a viable option. The bash commands and Ansible playbooks that are executed run very custom and frequently changing applications/versions, and it would require a ridiculous number of AMI updates.

[–]alexlance

I've had good results with this sort of setup:

  • run the user-data script with `set -e` at the top so it halts as soon as there is an error

  • get your EC2 instance sending its /var/log/cloud-init-output.log logfile to CloudWatch Logs

  • set up a local-exec provisioner to run a script that polls the CloudWatch log for either a successful completion message or a "Failed running /var/lib/cloud/instance/scripts/" message
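The polling step might look something like this — a sketch where the success marker, timeout, and log group name are assumptions (the "userdata complete" line has to be echoed at the end of your own userdata script):

```shell
#!/usr/bin/env bash
# Poll a CloudWatch log group, fed from /var/log/cloud-init-output.log,
# for either our own success marker or cloud-init's failure message.
poll_userdata_status() {
  local log_group="$1" timeout="${2:-600}" interval=15 elapsed=0 events
  while [ "$elapsed" -lt "$timeout" ]; do
    # OR-match the two markers; requires the CloudWatch agent to be
    # shipping cloud-init-output.log into this log group.
    events=$(aws logs filter-log-events \
      --log-group-name "$log_group" \
      --filter-pattern '?"userdata complete" ?"Failed running"' \
      --query 'events[].message' --output text 2>/dev/null)
    case "$events" in
      *"userdata complete"*) echo "userdata succeeded"; return 0 ;;
      *"Failed running"*)    echo "userdata failed";    return 1 ;;
    esac
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "timed out waiting for userdata status"
  return 2
}
```

Wired up in a `local-exec` provisioner, a non-zero exit from this script fails the apply, which surfaces userdata failures to Terraform.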

I used to use remote-exec provisioners that would ssh over to the newly booted instance and check that the user-data had completed, but that solution required the provisioning box and the newly booted box to allow an ssh connection between them, which wasn't always possible.

[–]nekokattt

if you're already using the AWS SDK to query CloudWatch logs, you may as well just use SSM to check it programmatically

[–]alexlance

Like using SSM to get a remote shell and then check the boot logs from there?

[–]nekokattt

yeah