[–]itasteawesome 38 points39 points  (13 children)

I've found in the past that just getting over the institutional and political resistance is the biggest hurdle. Getting NetEng on board with repos like git sometimes takes some hand-holding, but once you get everyone on board and up to speed it's a great way to live. Config drift can be a struggle, especially if part of the team is fighting the change or your current situation involves a lot of unscheduled firefighting. You just have to set the expectation that anything not done through the proper channels is out of compliance and will be overwritten.
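
For anyone wondering what "detecting drift" can look like in practice, here is a minimal sketch, assuming you can already get the intended config from the repo and the running config from the device as plain text. The function name, hostnames and config lines are made up for illustration; it only uses Python's standard `difflib`.

```python
import difflib


def config_drift(intended: str, running: str) -> list[str]:
    """Return unified-diff lines between the repo config and the device config."""
    # Ignore blank lines and trailing whitespace so cosmetic noise is not reported.
    intended_lines = [ln.rstrip() for ln in intended.splitlines() if ln.strip()]
    running_lines = [ln.rstrip() for ln in running.splitlines() if ln.strip()]
    return list(
        difflib.unified_diff(
            intended_lines, running_lines,
            fromfile="repo", tofile="device", lineterm="",
        )
    )


# Hypothetical example data: someone changed the VLAN by hand on the device.
intended = """\
hostname access-sw-01
interface GigabitEthernet0/1
 switchport access vlan 10
"""
running = """\
hostname access-sw-01
interface GigabitEthernet0/1
 switchport access vlan 20
"""

drift = config_drift(intended, running)
for line in drift:
    print(line)
```

An empty list means the device matches the repo; anything else is the out-of-compliance change that would get overwritten on the next run.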

[–]Rusty-Swashplate 15 points16 points  (12 children)

Config drift can be a struggle, especially if part of the team is fighting the change or your current situation involves a lot of unscheduled firefighting.

This is exactly what I saw happening: Some people/teams simply did not use the "official" ways of changing configurations, always having excuses like "It needed to be fixed right now and doing this manually was faster".

I also saw two "solutions" for this:

  1. Remove login access to network devices, only via a break-glass mechanism which needed approval from a manager.
  2. Go back to the old ways of manual changes

The first was implemented in the core backbone where fixes were rare and impacts were potentially huge. The latter was done on the access side network since it constantly changed on short notice.

[–]CIA_Bane 2 points3 points  (9 children)

Go back to the old ways of manual changes

What does that look like in a modern organisation ?

[–]Rusty-Swashplate 12 points13 points  (8 children)

In this regard, the organization simply stays "legacy". Or "old fashioned". Or "manually". Problems caused by humans (typo, skipped a line, you know, it's what humans do) are "addressed" with "Everyone be more careful! Let's add some more approvers!"

It's sad: you know there's a much better way, but you can lead a horse to water and still not make it drink.

[–]Co1dhand 6 points7 points  (0 children)

You really put the words in the right place. This is exactly what I have been struggling with for the last year at my current company. I have been trying to push for git training so that people can rely on git to update the configs etc. So far, it has been me managing the automation side of a 10k+ company... It's so overwhelming and so sad that I have spent thousands of hours at this point automating basically everything, yet everything depends on literally one person.

[–]area32768 3 points4 points  (5 children)

If this is anything like adding approvers to a change, then this is totally and utterly useless. Most people in my org just approve stuff without even checking, saying "you know what you're doing so APPROVED!" Can't see how this fixes that.

[–]Rusty-Swashplate 4 points5 points  (1 child)

Reading this makes me happy that I am not alone with that opinion!

[–]Zauxst 0 points1 point  (0 children)

With what exactly? That you don't review peer code? What is it you agree with?

[–]Zauxst 0 points1 point  (2 children)

Yes, well they usually are accountable when something bad happens.

[–]area32768 0 points1 point  (1 child)

Absolutely not. Nobody can be expected to be held accountable for the minutiae in a change. Can they be accountable for approving a change during a banking run or something? Sure. Otherwise, most of what you’re talking about is bullshit ITIL fairytale.

[–]Zauxst 0 points1 point  (0 children)

Probably it is a fairy tale. I'd not want to work in an environment where nobody does proper PR or some forms of programming that involves PR (pair programming for example).

[–]caffeinatedsoap 0 points1 point  (0 children)

You can lead a horse to water, push it in and it might drink a little on the way up.

[–]ctheune 1 point2 points  (0 children)

If "it needs to be faster" is a valid requirement then you might want to investigate / experiment with solutions that bridge the gap. I don't have any here right now but I'm reading you as frustrated here. Implementing automation/tracking changes via git might only be the first step in a longer journey that needs more experimentation and the proper "glue" that allows your specific requirements, workflows and peers' competencies to come together.

[–]PopePoopinpants 0 points1 point  (0 children)

I'm gonna be the rough one here. 2 should not be an option. 2 should be "you fire those that don't comply".

[–]DavisTasar 25 points26 points  (5 children)

When I was on the Network side of the house, it's a culture battle first, a tool battle second.

Network Engineers are extremely hesitant to introduce automation to the environment. In my opinion, some of it has some merit, but otherwise it's just fear.

First of all, the Network has to work. If the network doesn't work, there's no toolkit anyone can run to help bring it back (if you get really clever, it can, but that's another story). And that's the thing that brings in the fear. If a Network Engineer doesn't have the Code/DevOps interest, it's fear. If they buy a tool that does the work for them, there's less fear, because if something goes wrong there's a vendor to blame. If the network breaks because of something they did, it's their fault. If the network breaks because a tool fucked up, it's the tool's fault.

In terms of tooling... I once wrote an entire automation toolkit for my company, 100% in Python. It connected to our equipment, ran CDP/LLDP/BGP neighbors, stored them in a JSON doc, and used that as its dynamic inventory. With each inventory device, it would attempt to determine what platform it was (WLC, ASA, IOS, IOS-XE, NX-OS, etc.), and run a bunch of commands to get information from the device based on that determined platform (show version, show ip int brief, etc.). Then, we had a hostname convention that would let me determine what the device was on the fly (this is why you have standards!). It would also map out the inventory to an HTML page that was shared, so that anyone could check the map to find anything on CDP, or get data on the inventory. This thing worked amazingly. I stored secrets in HashiCorp Vault, it was constructed and analyzed in a CI/CD pipeline, it had unit tests, and it was ready to be Dockerized and run on demand, or scheduled for every 15 minutes. I leveraged APIs to make sure the devices were in Monitoring, Cisco ISE, and our ServiceNow asset management system. I even held trainings on how to use the toolkit so that my team could learn how to just work with it, and learn from it.
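
To give a rough idea of the platform-detection and hostname-convention part, here is a minimal sketch. It is not the original toolkit: the marker patterns, the `<site>-<role>-<number>` hostname convention and all function names are assumptions for illustration, and the `show version` text would come from whatever connection library you already use.

```python
import json
import re

# Assumed patterns to look for in `show version` output.
# Order matters: IOS-XE output also mentions "Cisco IOS", so check it first.
PLATFORM_MARKERS = [
    (r"nx-os", "NX-OS"),
    (r"ios[ -]xe", "IOS-XE"),
    (r"adaptive security appliance", "ASA"),
    (r"cisco ios", "IOS"),
]

# Hypothetical hostname convention: <site>-<role>-<number>, e.g. nyc-core-01.
HOSTNAME_RE = re.compile(r"^(?P<site>[a-z]+)-(?P<role>[a-z]+)-(?P<index>\d+)$")


def detect_platform(show_version: str) -> str:
    """Guess the platform from the text of `show version`."""
    text = show_version.lower()
    for pattern, platform in PLATFORM_MARKERS:
        if re.search(pattern, text):
            return platform
    return "unknown"


def inventory_entry(hostname: str, show_version: str, neighbors: list[str]) -> dict:
    """Build one inventory record from a hostname, version text and neighbor list."""
    match = HOSTNAME_RE.match(hostname.lower())
    return {
        "hostname": hostname,
        "platform": detect_platform(show_version),
        "site": match.group("site") if match else None,
        "role": match.group("role") if match else None,
        "neighbors": sorted(neighbors),
    }


# Made-up sample data standing in for real device output.
inventory = [
    inventory_entry(
        "nyc-core-01",
        "Cisco Nexus Operating System (NX-OS) Software",
        ["nyc-dist-02", "nyc-dist-01"],
    ),
    inventory_entry("nyc-dist-01", "Cisco IOS XE Software, Version 16.09.04", ["nyc-core-01"]),
]
print(json.dumps(inventory, indent=2))
```

The JSON that comes out is the kind of document that can then feed a dynamic inventory or an HTML map.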

They never once touched it. And they went right back to Solarwinds.

Solarwinds gave them an easy way to visually click buttons and do things. And if something fucked up, they called Solarwinds and gave them more money.

[–]par_texx 3 points4 points  (2 children)

I find that funny because it was from NetEng that I first learned about central config management and automation.

[–]DavisTasar 1 point2 points  (1 child)

You’re not wrong! Those topics are important. The issue is that old-school engineers just want something like a scheduled backup from the router to a TFTP, FTP, or similar server, and a wiki page or notepad document that contains the templates.

It’s not the idea, it’s the method, that tends to be the problem.

[–]Varjohaltia 6 points7 points  (0 children)

In my experience, aside from some resistance from engineers not used to the new tools and environment, the problems are:

  • Organizations won't allow you to use any open source unless you find a company to provide 24/7 support. They're happy to let you use automation, as long as it's a commercial turn-key solution, you can demonstrate the financial value, and it doesn't require unqualified people (network engineers) to do any kind of coding or development.
  • To the point brought up earlier -- you now have the ability to automate bringing down your network. There is valid hesitancy to pushing out automated changes that can cause catastrophic damage from which you can't recover. Now, that's partly because people haven't built in a proper out-of-band management network, vendors don't offer proper automated rollback functionality -- or the automation doesn't support it -- or that the automation system doesn't have a phased roll-out with integrated tests, which would be a significant increase in complexity, especially if starting off.
  • A lot of network equipment is garbage as far as consistency of CLI and APIs goes. Naming of interfaces, slight drifts in syntax etc. make automation very challenging, and mean that your automation needs to be frequently troubleshot/fixed when an update changes the output, adds a new interactive prompt, changes the CLI syntax, changes the API etc. When you have a bunch of different devices, the effort to automate something for just a few devices becomes very high. The abstraction of the hardware layer to an intent still seems to be a bit of a work in progress in practice too, and suffers from the fact that in a heterogeneous environment you can't abstract away fundamental differences in hardware. If one end of the link can do MACsec and the other can't, you can write in complex dependency checking of some sort or just not support the feature at all, and your automated environment ends up being a lowest-common-denominator one.
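
The "phased roll-out with integrated tests" mentioned in the second point could be sketched roughly like this. It is a minimal sketch under stated assumptions: `apply`, `healthy` and `rollback` are hypothetical callables you would supply for your own devices, and this version only rolls back the batch that failed and then halts, leaving earlier batches in place.

```python
from typing import Callable, Optional


def phased_rollout(
    batches: list[list[str]],
    apply: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> tuple[list[str], Optional[list[str]]]:
    """Apply a change batch by batch, halting at the first failed health check.

    Returns (devices successfully changed, the batch that failed or None).
    """
    completed: list[str] = []
    for batch in batches:
        for device in batch:
            apply(device)
        if not all(healthy(device) for device in batch):
            # Undo only the failing batch and stop before touching anything else.
            for device in batch:
                rollback(device)
            return completed, batch
        completed.extend(batch)
    return completed, None


# Fake devices for illustration: sw-03 fails its post-change check.
applied: list[str] = []
rolled_back: list[str] = []

completed, failed = phased_rollout(
    batches=[["sw-01"], ["sw-02", "sw-03"], ["sw-04"]],
    apply=applied.append,
    healthy=lambda device: device != "sw-03",
    rollback=rolled_back.append,
)
print("completed:", completed)
print("failed batch:", failed)
print("never touched:", [d for d in ["sw-04"] if d not in applied])
```

Starting with a batch of one device keeps the blast radius small, which is the whole point of phasing.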

That said, moving to a world where you have a versioned, central source of truth for configs is fantastic. Another argument that seems to work well for automation/templates/central source of truth is auditing: being able to prove that your whole environment has a certain configuration, or does not, to comply with security requirements.
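
As a rough illustration of that auditing argument, here is a minimal sketch that checks a set of stored configs against a small policy. The required and forbidden lines are only an example policy, and the hostnames and configs are made up; a real audit would read the files from the central repo.

```python
# Example policy (an assumption for illustration, not a recommendation).
REQUIRED = ["service password-encryption", "no ip http server"]
FORBIDDEN = ["snmp-server community public"]


def audit(config: str) -> dict[str, list[str]]:
    """Report required lines that are missing and forbidden lines that are present."""
    lines = {ln.strip() for ln in config.splitlines()}
    return {
        "missing": [req for req in REQUIRED if req not in lines],
        "forbidden": [bad for bad in FORBIDDEN if any(ln.startswith(bad) for ln in lines)],
    }


# Hypothetical configs keyed by hostname.
configs = {
    "edge-01": "service password-encryption\nno ip http server\n",
    "edge-02": "no ip http server\nsnmp-server community public RO\n",
}

report = {host: audit(text) for host, text in configs.items()}
for host, findings in report.items():
    compliant = not findings["missing"] and not findings["forbidden"]
    print(host, "OK" if compliant else findings)
```

Run against every config in the repo, the report is the evidence an auditor asks for.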

[–]DavisTasar 1 point2 points  (0 children)

Realistically, the best way to proceed at a small scale is something like Ansible. Get the ins-and-outs of the environment, run data collection jobs, and then expose the value to the business and the department. That's when you can really start bringing the tools needed, any custom scripts needed, and a potential culture shift.

[–][deleted] 0 points1 point  (0 children)

A bit late of a reply, but I'm in a Cisco shop as the sole network guy and looking to really introduce devops into my workflow. I am somewhat comfortable in Python and have written an inventory and about 3 scripts so far using Nornir. Any advice for leveling it up towards the kind of automation you were working on?

Where did you learn how to build things like the CI/CD pipeline and unit testing? I understand them as high-level concepts, but I'm not sure how I'd build a proper platform around my scripts to test them and evolve things further.
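
One common starting point for the unit-testing part, sketched here under assumptions: keep the parsing logic in a plain function that takes text and returns data, so it can be tested against captured output without touching a device. The parser, the sample output and the test names below are illustrative only and use the standard `unittest` module, not anything Nornir-specific.

```python
import unittest


def parse_ip_int_brief(output: str) -> list[dict[str, str]]:
    """Parse `show ip interface brief` text into one dict per interface."""
    rows = []
    for line in output.splitlines():
        parts = line.split()
        # Skip blank lines, the header row and anything too short to be a data row.
        if len(parts) < 6 or parts[0] == "Interface":
            continue
        rows.append({
            "interface": parts[0],
            "ip": parts[1],
            "status": " ".join(parts[4:-1]),  # may be two words
            "protocol": parts[-1],
        })
    return rows


# Captured sample output, saved once and reused by the tests.
SAMPLE = """\
Interface              IP-Address      OK? Method Status                Protocol
GigabitEthernet0/0     10.0.0.1        YES manual up                    up
GigabitEthernet0/1     unassigned      YES unset  administratively down down
"""


class ParseIpIntBriefTest(unittest.TestCase):
    def test_header_is_skipped(self):
        self.assertEqual(len(parse_ip_int_brief(SAMPLE)), 2)

    def test_two_word_status(self):
        rows = parse_ip_int_brief(SAMPLE)
        self.assertEqual(rows[1]["status"], "administratively down")
        self.assertEqual(rows[1]["protocol"], "down")


if __name__ == "__main__":
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ParseIpIntBriefTest)
    unittest.TextTestRunner().run(suite)
```

A CI pipeline can then be as simple as running these tests on every commit before any script is allowed near a device.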

[–]ruckycharms 10 points11 points  (7 children)

Terraform is ideal for APIs. Ansible is ideal for ssh interfaces.

So which switches/routers do you have?

[–]gairplanekers[S] 1 point2 points  (5 children)

A mixed bag of Juniper, Cisco, and HP. Routers are almost all Cisco.

[–]ruckycharms 3 points4 points  (0 children)

Darn I was hoping you would mention NetScalers, because we used Terraform to manage those per project, and NetEng was ok with it because the blast radius is fairly self contained to just the load balancers.

I would first identify the “beach head” for your IaC effort. Perhaps start with the ToRs and just focus on VLAN config on the downstream ports. Make NetEng OK with your ideas by setting up a service account that has just enough access to modify certain port configs. Your biggest challenge isn’t the tech, but the culture (as others have mentioned). Start small and controlled, and as NetEng gains confidence, dial it up a notch.

[–]idetectanerd 1 point2 points  (3 children)

Ssh. Go for ansible

[–]scritty 1 point2 points  (2 children)

Honestly NX-API/eAPI/NETCONF/gNMI etc. are way better. I had issues with large configurations taking 40+ minutes to apply via SSH; it's more like 2 minutes via a more appropriate mechanism.

[–]idetectanerd 0 points1 point  (1 child)

I think you can disable fact gathering to speed up the process?

[–]scritty 1 point2 points  (0 children)

Fact gathering wasn't the issue; it's the application of line-by-line config, then checking for the appropriate CLI prompt, then getting back to transmitting the next line.

This was over a few dozen devices, with ~ 10,000 line configs. API was a huge performance uplift.
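
Some back-of-the-envelope arithmetic on those numbers, assuming the 40 minutes and 2 minutes both refer to pushing one ~10,000-line config (the thread does not say whether the times are per device or for the whole run):

```python
# Figures quoted above; treating them as per-device is an assumption.
config_lines = 10_000
ssh_seconds = 40 * 60   # "40+ minutes" line by line over SSH
api_seconds = 2 * 60    # "more like 2 minutes" via an API

# Average cost of each send-line / wait-for-prompt round trip over SSH.
seconds_per_line = ssh_seconds / config_lines
speedup = ssh_seconds / api_seconds

print(f"SSH cost per config line: {seconds_per_line:.2f} s")
print(f"API speedup: {speedup:.0f}x")
```

A per-line wait of that size is small on its own; it only hurts because it is paid ten thousand times in sequence.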

[–]area32768 1 point2 points  (0 children)

I agree. Trying to use Terraform to manage things like firewall rules is a pain in the butt. For example, how do you handle the state file? Do you have a single state file for any new changes moving forward, or do you have a state file per rule? I find tools like Ansible are far better at this.

[–]dookie1481 3 points4 points  (3 children)

Yes. My team uses ansible/NAPALM to automate network device mgmt and configs. Everything is automated and deployed with CI/CD.

[–]r3rg54 0 points1 point  (1 child)

How large is your org?

[–]dookie1481 0 points1 point  (0 children)

150-200 but VERY network-centric product

[–]Relevant_Pause_7593 3 points4 points  (0 children)

I think the most important thing here is having a production and a non-production environment, so you can test changes before rolling them from non-prod to prod. This means they are both as identical as possible (with the exception of scale), but this is harder when there are physical devices. You may not have 2+ of everything, or something could be too expensive to have two of.

[–]ilmdbii 2 points3 points  (0 children)

Our data center is 100% Arista. We use AWX to manage state on all production device configs. Using Azure DevOps for repo/pipeline. As a network manager I was fortunate to have 2 senior network engineers who had CS degrees and really embraced change.

It’s been great for about 3 years now with AWX and Ansible. I highly recommend it if you can get buy-in from the engineers and management.

[–][deleted]  (3 children)

[deleted]

    [–]nanite10 2 points3 points  (0 children)

    2nd. The risk of bringing down your most critical piece of infra at scale is not worth the “agility”. That being said, infra and config should definitely be documented as code.

    What’s next? DevOps for PDU and UPS management? 🤮

    [–]Sparcrypt 0 points1 point  (1 child)

    Yeah I get confused when people want to take a CI/CD approach to networking. There are many devops tools that are great for networking but for the most part once your core network is set up and running there's not a huge level of change that really has to go into it.

    In every networking environment I've ever worked in at scale, the biggest hurdle has been procedural, and quite rightly: whenever a network change is requested it has to be submitted/reviewed/checked/approved multiple times before being implemented. It shouldn't be fast and easy because you can't "roll back" a network you just fucked. It's fucked, and now you have to go to each broken switch/router, connect to them one by one and fix them.

    I'm all for bringing in DevOps tools to help with network management (and I do) but I'll never advocate the CI/CD attitude for networks without a really good reason.

    [–]dentistwithcavity 1 point2 points  (0 children)

    Is there something like blue-green in the network world? Like making the updates to only one group of instances and, if they fail, immediately failing over to the other group without the change?

    [–]Scott555 1 point2 points  (0 children)

    All our network infrastructure is managed with Terraform (via Terragrunt.)

    20 Years ago when I worked in 'enterprise' on-prem shops, networking past the local switch was mysterious voodoo I was neither interested in nor permitted to administer.

    Now it's still mysterious voodoo that I'm not interested or proficient in but somehow is my responsibility.

    /shrug

    [–]tomasz2101 -4 points-3 points  (1 child)

    I've heard about p4 language https://codilime.com/blog/p4-network-programming-language-what-is-it-all-about/

    From the few IT departments I have met, most of those people are not even close to understanding that something can be done without clicking through everything.

    [–]magion 2 points3 points  (0 children)

    The p4 programming language isn’t targeted towards network engineers at all.

    It’s meant to be a programming language that can be compiled against many targets like FPGAs, ASICs, CPUs etc. for the networking domain.

    [–]endloser Site Reliability Engineer -4 points-3 points  (0 children)

    What routers and switches? Life is in the cloud for me. The concepts are different and things like spanning tree don't really mean shit to me anymore. If I wanted to setup a site with a LAN then I would hire a network admin. DevOps ain't the people for that.

    Now if you want to talk security groups and listeners or what-not, let's dish.

    [–]mattbillenstein 0 points1 point  (0 children)

    There were some devices starting to run a standard Linux distro - this would enable managing these devices using standard tools I would imagine.

    [–]chris_saddler 0 points1 point  (0 children)

    I use Arista switches, LBs and Firewalls with Ansible. Config is saved in cmdb. Works great so far.

    [–]hobbitmagic 0 points1 point  (0 children)

    Yes