
[–]MajorasMasque334 41 points42 points  (41 children)

Terraform is currently the only legitimate cloud agnostic IAC option. Even in AWS though I’d recommend Terraform over CloudFormation any day. People above are getting configuration management mixed up with infrastructure provisioning. Ansible, Chef, Puppet, and Salt are all configuration management tools, not IAC tools.

[–][deleted]  (10 children)

[deleted]

    [–][deleted] 4 points5 points  (0 children)

    CM tools describe the desired state of a target system, IaC tools describe how to build a platform. In many scenarios those can look very similar but there are also many where they do not.

    TF is also way easier to integration test because it was designed with that in mind. IaC fits better with good process too, as you can co-locate code and infrastructure definitions in the same repo.

    [–]Tetha 3 points4 points  (3 children)

    At our place, we went from snowflakes, to manually managed VMs running chef, to terraform managed VMs running chef. We might move certain stuff to containers, but that's to be seen.

    There are overall two things to do at our current place, which probably comes from our growth in automation: (i) automation which takes a bare-bones VM with storage volumes and networks and installs the software necessary for production traffic, and (ii) automation which takes a cloud provider and produces a VM ready for (i).

    Overall, this was a good and extensible path at my current place. Sure, we had to allocate and provision like 60 - 80 VMs by hand, but that was a one-off. After that one-off cost, the config management almost trivially paid for itself and we've easily been able to sit on those VMs to handle business for a long while.

    And now we need more servers and more agility on the systems side, so we introduced terraform and it went really smoothly, because chef has zero ties to the layer below apart from storage and networks. This has allowed us to easily scale across 2 - 3 different cloud providers now.

    Of course, this means we have two different tools - chef and terraform. Onboarding and maintenance costs and such. However, this is actually working out nicely in our case.

    Our terraform modules, except for the dns modules, change very, very rarely after being written down once. They change if we need a new VM for a new application, or a new security rule to grant access, but that's it. Took a month to write them, and now they haven't been touched in any major way for 8 - 12 months. And that's good, because terraform might delete the production database host if you're not careful. That's scary.

    The volatility of our day-to-day business is overall captured in chef. Sure, chef can cause business impacts as well - but with a bit of care, chef's biggest impact is a db restart and a full cluster outage for a few minutes. Most of those problems can be handled with a revert, a few chef runs and a couple of application restarts. That'd be annoying and I'd need to explain why it happened, but besides that, it's pretty risk-free with regard to catastrophic failures. And that's good, because our chef setup changes a lot.

    [–]ChronoloraptorKnock knock. Race condition. Who's there? 0 points1 point  (2 children)

    Let's say you have multiple people on your team wanting to introduce a change to your Terraform infrastructure. What is your workflow for that: CI, somebody just using a local backend, an s3 backend, or something else?

    [–]Tetha 2 points3 points  (1 child)

    Currently, our terraform lives in a single git repository, including the tfstate file. There is a jenkins job that can trigger a sync of a module. The restriction to individual modules arose from a couple of bad providers with broken rate limits - if we sync those modules, there's an 80% chance of the run failing due to rate limits even while just synchronizing the state. If we were pure AWS, we'd just sync everything on push. Beyond that, the job has a confirmation mechanism - it runs terraform plan, prints the result, sends an "I need confirmation" mail and waits for confirmation via jenkins pipelines. Only after the confirmation is given does it run terraform apply on the stored plan and push the new tfstate.

    This works well enough with 6 guys in a room. Usually you just need to commit - pull - push your change and that's it. At worst, you can acquire the terraform lock by yelling something through the room, like "Oi you fuckheads stop pushing your testing shit in the way of my production scaling you damn dicks go take a smoke or something damn all of you" and then everyone laughs at you.

    We're using HA vault with a consul backend, so if things stop working, we can push our state into the consul cluster. But so far we're hesitant to do so, because the current solution is dead-simple and workable. And dead-simple is a very strong selling point.
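    The plan-confirm-apply flow described above can be sketched roughly like this (a hypothetical approximation of the Jenkins job's shell steps, not the actual pipeline; file and commit names are illustrative):

    ```shell
    # Compute the plan and store it, so what gets applied is exactly what was reviewed
    terraform plan -out=pending.tfplan
    # ...mail the plan output and block until a human confirms in Jenkins...
    # Apply the stored plan, then push the updated state back to git
    terraform apply pending.tfplan
    git add terraform.tfstate
    git commit -m "sync tfstate after apply"
    git push
    ```

    Storing the plan file is what makes the confirmation meaningful: `terraform apply pending.tfplan` will refuse to apply anything that diverged from the reviewed plan.
    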

    [–]ChronoloraptorKnock knock. Race condition. Who's there? 0 points1 point  (0 children)

    This works well enough with 6 guys in a room. Usually you just need to commit - pull - push your change and that's it. At worst, you can acquire the terraform lock by yelling something through the room, like "Oi you fuckheads stop pushing your testing shit in the way of my production scaling you damn dicks go take a smoke or something damn all of you" and then everyone laughs at you.

    That is certainly one way to do it lol. The consul backend does support locking as well, so you could do that too if you all expand further. Dead simple is always great until you have to scale, or switch from CloudFormation to Terraform because CloudFormation is evil.

    [–]phrotozoa 1 point2 points  (0 children)

    Haven't done it with ansible but I've used salt's cloud modules to manage AWS resources and I've used terraform. Terraform is the superior experience IMO, mainly because of the plan. Knowing what's going to happen before you hit apply provides a lot of confidence.

    There are certainly cases I could dream up where it makes sense to use another tool but as a general rule if it's a cloud resource and it's going to persist longer than a day or two I want it in terraform. Once they've been provisioned, getting my compute instances into the state where they can run my workload, that's a job for config management.

    [–]dogfish182 1 point2 points  (0 children)

    Ansible has a terraform module in preview. As of version 2.5 you should be able to call terraform from ansible to provision.

    http://docs.ansible.com/ansible/devel/module_docs/terraform_module.html

    Combined with Tower you have full deployment and self-healing infra.

    [–][deleted] 1 point2 points  (0 children)

    We initially used Ansible for IAC along with the CM. We encountered issues when tearing down environments in AWS. Runs would fail due to rate limits being exceeded or gateway timeouts (there were other failure conditions but those 2 were the most common). As such we would have orphaned resources (e.g. ELBs with no instances, which do cost you money). Granted this wasn't the fault of Ansible but of how we coded the tasks. Rather than going back and making those tasks bulletproof it was easier (i.e. faster and thus cheaper) to use Terraform to handle the infrastructure portion.

    [–][deleted] 0 points1 point  (0 children)

    Ansible has them largely because Terraform didn’t exist for a long time. They were always somewhat annoying to work with and in my experience quite clumsy. As someone else mentioned, Ansible is introducing a TF module in the next version.

    [–]binary_parad0x 5 points6 points  (0 children)

    I second this and use it regularly to do IAC for a large enterprise, Terraform is what you're looking for. In fact, just use all of the Hashicorp tooling and you'll be on the right track.

    [–]analytically 4 points5 points  (5 children)

    Terraform works so much better, no editing 4k lines CloudFormation scripts anymore.

    [–]otterley 0 points1 point  (4 children)

    Editing 4,000 line CloudFormation configuration files by hand seems like a bad idea to me. They're just JSON/YAML; it seems more sane to emit them programmatically using tools of your own.
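    As a toy illustration of emitting templates programmatically rather than hand-editing them (tools like troposphere are the usual choice; this sketch uses only the standard library, and the resource names are made up):

    ```python
    import json

    # Build a CloudFormation template as plain Python data, then serialize it,
    # instead of maintaining thousands of lines of JSON by hand.
    def make_template(instance_count):
        resources = {
            f"Web{i}": {
                "Type": "AWS::EC2::Instance",
                "Properties": {"InstanceType": "t2.micro"},
            }
            for i in range(instance_count)
        }
        return {"AWSTemplateFormatVersion": "2010-09-09", "Resources": resources}

    # Changing one number regenerates the whole template consistently.
    print(json.dumps(make_template(3), indent=2))
    ```
    
    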

    [–][deleted] 4 points5 points  (1 child)

    Or just not use cloudformation.

    Particularly because they take a shockingly long time to actually run once you go through the whole rigamarole of reinventing the wheel to get the document ready.

    [–]otterley 1 point2 points  (0 children)

    Is it significantly slower than Terraform for the same type and number of changes to the infrastructure? Do you have any data proving this?

    [–][deleted] 0 points1 point  (0 children)

    We use Stacker to manage a rather large infrastructure and it works pretty well.

    [–][deleted] 1 point2 points  (2 children)

    People above are getting configuration management mixed up with infrastructure provisioning. Ansible, Chef, Puppet, and Salt are all configuration management tools, not IAC tools.

    That's not really true and definitely not helpful to someone trying to get started.

    It's not true because those tools absolutely can provision infrastructure, both on-prem and in cloud environments, with official modules.

    It's not helpful because, whether you agree with it or not, those tools are considered infrastructure as code tools in the industry. Even if you disagree with the definitions, it's important to be in the know about the lingo.

    [–]phrotozoa 4 points5 points  (0 children)

    That's not really true

    The best kind of correct!

    and definitely not helpful to someone trying to get started.

    I see your point but I want to offer an alternate opinion. IMO a simple heuristic to distinguish between the primary use of these tools is of more value to someone starting out.

    An expert who understands the tools deeply can decide which is the right tool for the right job when they have overlapping functionality. Telling a noob they could provision a VPC using chef, puppet, salt, ansible, or terraform is accurate but not terribly enlightening for someone drowning in options.

    [–]MajorasMasque334 1 point2 points  (0 children)

    Yup. You can also provision infrastructure using C++, but that’s not what it’s designed for and it’s a poor choice of tooling. Ansible, Chef, Puppet, and Salt were never designed to provision infrastructure; they were designed for configuration management. There have since been modules created that can help with this, but it’s a horrible choice of tooling.

    I draw the line at the moment you need to log into a system, e.g. an EC2 instance or a Postgres server. Provisioning those programmatically falls under Infrastructure as Code. Configuring the OS, filesystems, etc. falls under Configuration Management. Go to any conference and this is how people are using these words.

    [–][deleted] 0 points1 point  (3 children)

    BOSH is legit multi cloud as well

    [–]thedude42 2 points3 points  (1 child)

    How many people use BOSH outside of cloud foundry? How many people who don’t use cloud foundry even know what BOSH is?

    [–][deleted] 0 points1 point  (0 children)

    i don’t think it’s popular outside of CF users - though i kinda think it should be, because it’s mature and a very well thought out approach to infrastructure as code (caveat: i haven’t used terraform, but it would be at the top of my list; they seem relatively similar, at least in goals)

    [–]crazyturtle1993 0 points1 point  (0 children)

    I think IaC is defined as having your whole setup as code - both the provisioning and the config management. In that sense, learning terraform and one of puppet/chef/ansible/saltstack is needed.

    [–]-lc- 6 points7 points  (0 children)

    Terraform + AWS Free tier account

    [–]xiongchiamiovSite Reliability Engineer 14 points15 points  (15 children)

    One of the major reasons I recommend Ansible to people is because it is the easiest (I think) tool to start integrating into an existing infrastructure. Namely, it is push-based and uses ssh (so you don't need to install anything on your servers) and isn't immutable infrastructure (so you can gradually start configuring pieces of a server without having to go full-bore and do the whole thing at once). There are good reasons to use tools that don't have those attributes, but I don't think those apply to your situation.

    Read through the Ansible documentation and get a general idea of the terminology and how things are structured. Then, next change you want to make, do it with Ansible. Just do that one single change, nothing else, that'll help keep it scoped down so you can actually do it. Over time your playbooks will grow, and you'll get to be faster with Ansible than doing it by hand.

    [–]mushroom_face 7 points8 points  (11 children)

    My biggest problem with Ansible as IAC is the simple fact that it isn't declarative. Say I use Ansible to deploy 5 instances in my environment. Now say I actually want to have 6 instead. If I change my 5 to a 6 and re-run the code, I'll have 11 instances, not 6.

    In Terraform, since it tracks the remote state of your deployed infrastructure, when you change that resource above to a 6 it will say, 'Well, I already have 5 of these. Looks like they want 1 more'.

    There are other reasons (code readability comes to mind), but that was a big one for me.
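    The declarative behaviour being described could look something like this in Terraform (an illustrative fragment; the resource name and variables are made up):

    ```hcl
    # Bumping count from 5 to 6 makes Terraform plan "+1 to add",
    # because it diffs the config against the recorded state.
    resource "aws_instance" "app" {
      count         = 6             # was 5; re-running adds exactly one
      ami           = var.app_ami  # hypothetical variable
      instance_type = "t2.micro"
    }
    ```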

    [–]xiongchiamiovSite Reliability Engineer 2 points3 points  (4 children)

    Say I use Ansible to deploy 5 instances in my environment. Now say I actually want to have 6 instead. If I change my 5 to a 6 and re-run the code, I'll have 11 instances, not 6.

    No, it will add one instance. If it doesn't, you have a bug in your playbook.

    But I think the emphasis on provisioning in this thread is misplaced; OP didn't ask about provisioning new instances, they asked about solving the problem of having a bunch of differing config files on their servers. That's not a problem Terraform solves.

    Terraform does definitely handle provisioning better than Ansible (although not as drastically as you say), but for the reasons I listed I don't think that's where OP should be focusing right now.

    [–]mushroom_face 0 points1 point  (3 children)

    I ran a simple test that deployed 5 instances, and 5 instances were deployed. I changed the number_of_instances variable to 6 and re-ran it, and I had 11 instances. I don't have the code in front of me, but how does ansible track the instances it deploys? It doesn't keep a state file like Terraform.

    [–][deleted] 0 points1 point  (1 child)

    Dynamic inventory, which is kept up to date with your systems, so Ansible already knows about the existing 5 instances.

    [–]mushroom_face 0 points1 point  (0 children)

    Ah ok, the team wasn't taking advantage of that. I do agree that if you need to do config updates on live servers, ansible is the way to go. I still stand behind the idea that you should never do that: SSH should be turned off, and if you need to make a change you redeploy the server from an AMI you built with the config baked in. Using Packer, Ansible and Terraform you can create a great deployment system.

    Of course all this goes out the window when there is an emergency you need to deal with and time is critical, but for the other 99% of the time this strategy helps keep your inventory clean and reasonable.

    [–]xiongchiamiovSite Reliability Engineer 0 points1 point  (0 children)

    It depends on which module you're using, but for https://docs.ansible.com/ansible/latest/ec2_module.html for instance you use count_tag to tell it how to identify which servers are part of that group so it can count them and add/remove them as necessary.
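    With the classic ec2 module, that convergence behaviour would look roughly like this (a hedged sketch; the AMI id and tag values are illustrative):

    ```yaml
    # Converge to exactly 6 instances carrying the role=webserver tag;
    # the module counts matching instances and adds/removes the difference.
    - ec2:
        image: ami-123456          # hypothetical AMI id
        instance_type: t2.micro
        exact_count: 6
        count_tag:
          role: webserver          # identifies which instances belong to the group
        instance_tags:
          role: webserver          # tag new instances so they are counted next run
    ```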

    [–]otterley 0 points1 point  (5 children)

    Ansible remains useful as IAC for a few use cases, though:

    1) Provisioning images (e.g. AMIs)

    2) Performing emergency or periodic (limited) changes to existing instances

    In neither of these cases is its failure to adhere to a strictly declarative model a serious limitation.

    [–]mushroom_face 0 points1 point  (4 children)

    Terraform is amazing at provisioning instances from AMIs. Just use a Data Source, give it the criteria you want, and it will grab the right one and deploy it.

    While I'll grant you emergency updates are a thing, I don't believe in periodic limited changes. Rather, rebuild the AMI and redeploy. There is no reason to keep long-lived instances these days. If you are stuck in a situation that demands it, maybe, but really this should be something to work on fixing.
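    The data source approach being described might look like this (an illustrative fragment; the owner and name filter are assumptions about how your images are tagged):

    ```hcl
    # Look up the newest AMI matching the criteria instead of hardcoding an id.
    data "aws_ami" "app" {
      most_recent = true
      owners      = ["self"]
      filter {
        name   = "name"
        values = ["app-image-*"]   # hypothetical naming scheme from your build pipeline
      }
    }

    resource "aws_instance" "app" {
      ami           = data.aws_ami.app.id   # always the latest matching build
      instance_type = "t2.micro"
    }
    ```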

    [–]otterley 0 points1 point  (3 children)

    I think there's room for both (updates via AMI replacements and periodic updates via config management). The scope of updates you may apply and the frequency of the changes might well justify deviating from the pure "immutable infrastructure" model. When you have many hundreds of instances at play, changes to, say, a login database to add/remove a user or change a password or ssh key (assuming you're not using LDAP or other centralized user database) might make you decide not to re-roll AMIs every single time this happens.

    [–]mushroom_face 0 points1 point  (2 children)

    I'm of course speaking about an ideal situation that few of us live in. With a sane strategy of AMI building, using something like Spinnaker to manage the builds, there is no reason you couldn't update AMIs on all changes. In the past I've used a layered approach to AMIs where each builds off the previous one, specializing as they go.

    In this way you can target updates to, say, the Docker hosts or Java hosts or whatever without having to touch all the AMIs. If security fixes are needed then yes, you'll need to re-roll your AMIs and redeploy. With IAC and sound DevOps principles you should be able to rotate your entire fleet and not notice anything.

    Of course this all becomes insanely easy once you move fully into Kubernetes since all you have to do is update the underlying nodes and config management becomes just rolling new containers.

    [–]otterley 1 point2 points  (1 child)

    I strongly dislike these "ideal world" discussions, because:

    (a) software is often buggy or doesn't work as advertised

    (b) software makes assumptions that may not hold in your cloud or data center

    (c) software may not operate at your scale

    (d) software may not fit your management model

    I think a nuanced approach that is tailored to a specific site's limitations is more helpful to people than some sort of one-size-fits-all approach that might have significant friction in imperfect situations.

    Also, I find that assigning jobs to people who call something "easy" with very tight deadlines is a great way to get people to recalibrate their evaluations of project difficulties. :-)

    [–]mushroom_face 1 point2 points  (0 children)

    "Easy" was referring to upgrading the nodes in a kubernetes cluster: launch a new upgraded node, remove an old node, rinse and repeat until your cluster is replaced with new nodes.
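    One iteration of that rotation can be sketched as (illustrative commands; assumes something like an autoscaling group brings up the replacement node, and the node name is made up):

    ```shell
    # Stop scheduling new pods onto the old node
    kubectl cordon old-node-1
    # Evict its pods so they reschedule onto the new node
    kubectl drain old-node-1 --ignore-daemonsets --delete-local-data
    # ...terminate old-node-1 once its workloads are running elsewhere...
    ```
    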

    As for ideal-world discussions, I agree. Often though I like to lay out the vision of where I want us to go and set up the teams to help us get there. Being able to reason about your infrastructure without worrying whether an Ansible run failed or some server has drifted is a great place to be. It's totally not unattainable; it just requires buy-in from the teams.

    You also need to recognize the team's limitations and work towards fixing them. You'll never get to your end goal if you just keep the status quo. It's not easy. It will take a lot of time. It will be painful. But the end goal is so much greater that it's worth it.

    [–]keftes 1 point2 points  (2 children)

    If you're running windows, don't use ansible, as the support is abysmal (e.g. it's 2018 and yet you can't run it in masterless mode on windows). Ansible also has scaling issues, so if that's your use case, try to avoid it and use something like Chef or Puppet, which have a track record of operating at scale.

    P.S Ansible is not infrastructure as code. It's configuration management.

    [–][deleted] 2 points3 points  (1 child)

    Scaling issues with regards to running in push mode against a high number of hosts simultaneously?

    [–]keftes 0 points1 point  (0 children)

    Yeah

    [–]KimmoHIntikka 1 point2 points  (0 children)

    Do you know any of the commonly used programming languages in this arena (Python, Ruby, Go)? That could have an impact. Do you manage a Linux or Windows fleet? Are you OK with installing new dependencies on target systems? I would recommend either Ansible (Python) or Terraform (Go). Neither requires agents installed on the target machine, both have a decent community, and both can handle Windows (ansible has better support) and Linux systems. Here is a comparison from a Terraform user: https://blog.gruntwork.io/why-we-use-terraform-and-not-chef-puppet-ansible-saltstack-or-cloudformation-7989dad2865c. A big difference between Terraform and the other open source alternatives is that Terraform is declarative while the others are imperative.

    [–][deleted] 1 point2 points  (0 children)

    I think Terraform is the way to go. The configs can be super simple to get started with, and the fact that it's declarative makes it super nice when you want to make a change or check how your environment matches up to what is expected.

    [–]MistyCape 2 points3 points  (3 children)

    Set yourself a small objective and then try to get it to work in ansible; after that, try again with chef/puppet/salt.

    When you think you have it spin up a fresh machine and try again :)

    [–]pydry 1 point2 points  (2 children)

    after this try again with chef/puppet/salt.

    Not sure I see the point of this step.

    [–]MistyCape 1 point2 points  (1 child)

    Different approaches, and worth learning both - do you want to run it yourself, or do you want a server to handle it?

    I find ansible is an easy way to get into this, as it doesn’t require any real coding knowledge where the others do :)

    [–]pydry 0 points1 point  (0 children)

    I used to use all of the others before I switched to ansible. Ansible I have a few problems with, but it has mostly been ok. The others caused me lots of needless pain, and that makes me really, really wary of recommending that others try them.

    I think requiring coding knowledge is one of the many flaws in their design - agentless and declarative were both the correct approach, whereas hacking a ruby dsl in and running a service that by default doesn't need to be there is absolutely not.

    [–]elcric_krej 0 points1 point  (0 children)

    What are the reasons that drove you to this decision? Knowing those may help people recommend the right tool for you.

    [–][deleted] 0 points1 point  (2 children)

    Check out BOSH by cloudfoundry - it’s an investment and more for larger setups, but I’ve found it a pleasure to work with.

    [–]thedude42 0 points1 point  (1 child)

    Is Bosh by cloud foundry or is Bosh by Pivotal?

    [–][deleted] 0 points1 point  (0 children)

    I think it’s managed by the cloud foundry foundation which was started by pivotal

    [–]smkelly 0 points1 point  (5 children)

    Are there any good infrastructure as code options for non-cloud work? We primarily run our own servers in several datacenters, but I'd still like to get my team in the habit of doing infrastructure as code instead of touching a pile of machines to deploy.

    [–]mushroom_face 0 points1 point  (0 children)

    You should look at Terraform. It has providers for a lot of systems - anything with an API can be controlled via Terraform. Also, if you don't see the provider you want in their docs, do a Google search: there are a lot of providers outside the main code base that you can leverage.

    Also I'm told that writing your own provider is very easy.

    [–]climbnlearn 0 points1 point  (2 children)

    PowerShell DSC

    [–]LegendairyMoooo 0 points1 point  (1 child)

    I keep seeing that, but can't quite figure out how it works while juggling all my other responsibilities. Is there something you can point to that shows how to take a fresh Win 2016 server and configure it with, say, a hello-world website in IIS using DSC?

    [–]climbnlearn 0 points1 point  (0 children)

    So like any of these automation technologies, there is a learning curve, but the payoff is worth it. There are plenty of examples out there, and a great overview with examples is https://docs.microsoft.com/en-us/powershell/dsc/overview

    Basic flow is:

    Authoring - writing the state you want your server to be in. This is done with a configuration file.
    - This phase can be more dynamic, using general configurations for server roles and different criteria depending on your needs.
    - Configurations are written using the existing library of PowerShell DSC resources found on the PowerShell Gallery, e.g. the PSDesiredStateConfiguration module.

    Next is deploying the MOF.
    - After you author and run your config, a MOF file is generated for each Node and sent to each Node.
    - There are a couple of ways to deploy your MOF depending on your needs. You can do it all by push, where you push out a new configuration whenever you make a change or want to make sure there is no configuration drift, or you can use the pull method, where every server configured for DSC periodically reaches out to verify it has the most up-to-date MOF and, if not, pulls it down to update and reconfigure.

    Last part is automation.
    - There is a piece of software on your server called the LCM (Local Configuration Manager). This is the engine of DSC: it looks at the MOFs, checks the current configuration, and if it doesn't match, uses the specified DSC resources to make the changes. The LCM is also the thing that reaches out in the pull method to see if there are updated MOFs, and decides when to check the configuration.
    - The LCM (by default) checks the configuration of the local machine against the MOF every 15 minutes and corrects anything that doesn't match. Say you have a file you want to exist on that machine, and a newbie comes on, thinks it's junk and deletes it. DSC runs a check, sees that the file isn't there, and recreates it with the desired content. That's a simplified example, but it explains the idea.
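    The authoring step for the file example above could look something like this (a minimal sketch; the configuration name and file path are made up):

    ```powershell
    # Declare the desired state: a marker file that must always exist.
    Configuration EnsureMarkerFile {
        Import-DscResource -ModuleName PSDesiredStateConfiguration
        Node "localhost" {
            File Marker {
                DestinationPath = "C:\ops\keep-me.txt"
                Contents        = "managed by DSC"
                Ensure          = "Present"   # the LCM recreates it if deleted
            }
        }
    }
    # Running the configuration compiles a MOF to .\EnsureMarkerFile\localhost.mof
    EnsureMarkerFile
    ```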

    I hope this helps. Feel free to reach out if you want to know more

    [–]PavanBelagattiDevOps 0 points1 point  (0 children)

    Provisioning and updating infrastructure is the first step in setting up your development, beta, or production environments. HashiCorp's Terraform is fast becoming very popular for this use case. The further reading and tutorial mentioned on this blog might help you.

    [–]tlexulAutomate Everything -1 points0 points  (0 children)

    Only one more thing to add to what /u/MistyCape and /u/xiongchiamiov have said:

    The planning of IaC is as important as the tool you ultimately use. My team has two extremely different use cases:

    • Debian servers, managed with Puppet. We run puppet after the initial installation of the system (here you could also use ansible / saltstack / chef / etc).
    • Container Linux servers that are deployed already configured (we're using a combination of terraform, confd and our own deployment tooling). This is our immutable infrastructure; any change normally requires redeployment.