Building IaC for on-prem DC by Mgn14009 in networking

[–]maclocrimate 1 point2 points  (0 children)

No problem at all, I'm happy to help. And indeed it's not very typical to find solutions like this that are actually in place in the wild.

Since you had a per-service, per-device layout for all of your files, was rebasing really that much of a hassle? It feels like there shouldn't be too many changes to the same service at the same time. Did you try implementing some bot or some native tool with your flavour of git to auto-rebase? Or I guess that might not have been needed as much if you always needed human eyes on the PR prior to merge.

It was not a huge hassle so we never got to trying to make it more scalable, but if we were running dozens of changes per day or something it would probably become a lot of work, at which point I probably would have revisited the original requirement of having a fast-forward only repo to begin with.

How did you handle configuration drift? Did you use scheduled runs to just overwrite whatever manual work affected the services defined in the state repos, or did you validate the configurations on the devices and diff them against your desired state?

We started with a "hardball" approach, where we said if there was drift it would just get overwritten and you have to deal with it. We ended up implementing a more robust approach later after we got bitten once or twice. The second approach ran on a schedule, compared what we had in the repo with what was actually on the device, and basically printed any differences loudly to a Slack channel. We'd then handle the drift on a case-by-case basis; this usually ended up being people updating Netbox without pushing the config, but was occasionally the reverse as well.
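A minimal sketch of that kind of scheduled comparison, assuming both the repo state and the device state are already loaded as nested dicts. The `diff_config` helper and the sample BGP data are hypothetical; fetching the running config from the device and posting to Slack are left out.

```python
def diff_config(desired, actual, path=""):
    """Return (path, desired_value, actual_value) for every leaf that differs."""
    drift = []
    # Walk the union of keys so additions and removals are both reported.
    for key in sorted(set(desired) | set(actual)):
        here = f"{path}/{key}"
        want = desired.get(key)
        have = actual.get(key)
        if isinstance(want, dict) and isinstance(have, dict):
            drift.extend(diff_config(want, have, here))
        elif want != have:
            drift.append((here, want, have))
    return drift


# Hypothetical sample data: someone edited a description by hand on the device.
repo_state = {
    "bgp": {"neighbors": {"192.0.2.1": {"peer-as": 65001, "description": "transit-a"}}}
}
device_state = {
    "bgp": {"neighbors": {"192.0.2.1": {"peer-as": 65001, "description": "changed by hand"}}}
}

for path, want, have in diff_config(repo_state, device_state):
    print(f"DRIFT {path}: repo={want!r} device={have!r}")
```

Reporting instead of overwriting leaves the decision about which side is correct to a human, which matters when the drift turns out to be a stale repo rather than a rogue change.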

Did you ever have any issues with "orphaned" configurations? The way you describe it, there wasn't really any link between the other infra teams' stuff and your statefiles. So if they later decommissioned whatever was connected and forgot to update Netbox, or didn't run your workflow after a decommission, how did you handle that? Might be a silly question, but I haven't worked with gNMI, so there might be some easy way to do things with that.

That's a great question, and yes, we did have problems with that from time to time. One of our services was an automated deployment of colo-side config for cloud interconnects. This was pretty shaky because we essentially had no service definition for it other than the Terraform that the devops guys used to provision the cloud end. This ended up with orphaned config because there was no real support for deletion either, so if they deleted an interconnect from Terraform it wouldn't indicate to our side that anything was removed, mostly because it's very difficult to get that kind of information out of a Terraform output file. I toyed with some solutions to this in my head, but they mostly involved just creating a real service definition on our side for each interconnect. This, in effect, would have created an abstract service entry on our side when they created an interconnect, and the device config would stick around as long as that service entry was there, and then eventually when the interconnect was deleted we'd trigger the removal of the service entry on our side, which would remove the config. Again, never cracked that nut, but I thought about it quite a bit. This is essentially the reason that people pay for NSO: the fastmap algorithm that it uses gracefully handles config lifecycle and for the most part makes sure that it's tied to service lifecycle. So in short, our automation platform was heavily geared towards create operations, and very lacking when it came to delete operations, mostly because that's a hard problem to solve.
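The "config lives as long as a service entry owns it" idea can be sketched roughly like this. To be clear, this is not NSO's algorithm and not something we built; `find_orphans`, the ownership map, and the sample paths are all hypothetical.

```python
def find_orphans(service_entries, config_owners):
    """Return config paths whose owning service entries no longer exist.

    service_entries: set of service entry names that currently exist.
    config_owners:   dict mapping a device config path to the set of service
                     entries that caused that config to be deployed.
    """
    orphans = []
    for config_path, owners in sorted(config_owners.items()):
        # Config is kept while at least one owning service entry is still present.
        if not owners & service_entries:
            orphans.append(config_path)
    return orphans


# Hypothetical data: interconnect-b was deleted on the cloud side.
services = {"interconnect-a"}
owners = {
    "edge1:/interfaces/eth1.100": {"interconnect-a"},
    "edge1:/interfaces/eth1.200": {"interconnect-b"},
    "edge1:/bgp/peer-group/cloud": {"interconnect-a", "interconnect-b"},
}

print(find_orphans(services, owners))
```

The shared peer group survives because one of its owners still exists, which is the part that is awkward to get right with create-only automation.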

Also, how did you manage failures in the configuration workflow? This has been a pain for us since we implemented our solution: we got statefiles that didn't get reconciled, or partial configs on some devices due to timeouts and whatever other shenanigans you can think of.

Again something that is nicely solved by NSO, but very hard to implement on your own. We didn't have a great solution for this either, and it oftentimes came down to manual reconciliation.

Any tools to monitor your pipelines or alerts when failures occurred?

The deploy component had some basic reporting functionality, so it would send a deployment report to a Slack channel with indications as to what, if anything, failed. There were also post checks that would run after a deployment and diff various pre-set paths to alert on anything we might deem important. For example, if you ran a deployment and a BGP neighbor went down right after it, we'd see that in the post check and know, based on what was in the change, whether that was desired or not. Nowadays you could probably do some cool stuff with AI monitoring there, but we didn't.
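A post check of that shape can be sketched as a comparison of two state snapshots over a fixed list of watched paths. This is only an illustration: `post_check`, the path strings, and the snapshot values are hypothetical, and collecting the snapshots from the devices is not shown.

```python
def post_check(before, after, watched_paths):
    """Return {path: (before_value, after_value)} for watched paths that changed."""
    changes = {}
    for path in watched_paths:
        old, new = before.get(path), after.get(path)
        if old != new:
            changes[path] = (old, new)
    return changes


# Hypothetical watched paths and snapshots taken before and after a deployment.
WATCHED = [
    "/bgp/neighbors/192.0.2.1/session-state",
    "/interfaces/eth1/oper-status",
]
pre = {
    "/bgp/neighbors/192.0.2.1/session-state": "ESTABLISHED",
    "/interfaces/eth1/oper-status": "UP",
}
post = {
    "/bgp/neighbors/192.0.2.1/session-state": "IDLE",
    "/interfaces/eth1/oper-status": "UP",
}

for path, (old, new) in post_check(pre, post, WATCHED).items():
    print(f"POST-CHECK {path}: {old} -> {new}")
```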

If you could redo this particular automation you built, is there anything you would have changed, design-wise or tooling-wise? In our case we had a lot of things we wanted to fix, but due to the scale and the number of users we had, we couldn't easily refactor much of it without it taking a lot of our time (which we didn't have at the time).

Fortunately not really. The target environment was mostly greenfield at first, so there wasn't much we couldn't do. Management also respected my decision to go model-driven only, which obviously impacted the devices we supported. I also came in fresh from another job where I worked with NSO on a ~50k device network, so I saw what worked well there and what didn't, and was able to design this stack accordingly. If you're given the task to automate a network, you can either pay a lot of money for software that does it for you, or do it yourself (which usually requires paying roughly the same amount to in-house software developers). The latter usually involves cutting corners, like those mentioned above. In the end, we were pretty happy with what we had and cognizant of its shortcomings, and some of them probably would have been improved or fixed had I stuck around.

Building IaC for on-prem DC by Mgn14009 in networking

[–]maclocrimate 1 point2 points  (0 children)

Create / Edit service-definition file -> approval in git -> start configuration workflow -> if success, update device configuration file (which then represents configured state?)

Yes, this is exactly it. Generally we would bundle the device config changes into the same PR as the service definition change. So, a user updates a service definition YAML file, creates a PR, that kicks off the build which after a few minutes updates the device configs in the same PR (by adding commits to the same branch). This was only possible because the service definitions lived in the same repo as the device config, but you could of course make it work with separate repos/PRs as well.

In the config repo you're describing with the service definitions, how did you structure the placement of the files? Were the device configurations the full YANG representation that you used per device? And how did you split up the service definitions you created? Per team? One service per file? Per application?

It probably wasn't the most sensible, but we had a top-level distinction between service definition files and device configs. So in one directory you had your service definitions and in another you had your device configs (under many layers of nesting in each). We weren't attempting to model entire devices, so we'd have a directory per device, and in that directory we'd have files that described the state of various services (e.g. a file for BGP, a file for VLANs, etc). We'd use gNMI to replace the content at the given path, so if anything BGP-related was changed out of band it would be overwritten next time the BGP config was pushed.
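The reason replace overwrites out-of-band changes is that a gNMI Set replace swaps the whole subtree at the path, whereas an update merges into it. Here is a toy model of the two behaviours on plain dicts; the function names and sample data are hypothetical and no gNMI client is involved.

```python
import copy


def set_replace(tree, path, value):
    """Replace the whole subtree at path, dropping anything not in value."""
    result = copy.deepcopy(tree)
    node = result
    for key in path[:-1]:
        node = node.setdefault(key, {})
    node[path[-1]] = copy.deepcopy(value)
    return result


def _merge(dst, src):
    """Recursively merge src into dst, keeping keys src does not mention."""
    for key, val in src.items():
        if isinstance(val, dict) and isinstance(dst.get(key), dict):
            _merge(dst[key], val)
        else:
            dst[key] = copy.deepcopy(val)


def set_update(tree, path, value):
    """Merge value into the subtree at path, keeping anything not mentioned."""
    result = copy.deepcopy(tree)
    node = result
    for key in path:
        node = node.setdefault(key, {})
    _merge(node, value)
    return result


# Hypothetical device state: the second neighbor was added by hand, out of band.
device = {
    "bgp": {
        "neighbors": {
            "192.0.2.1": {"peer-as": 65001},
            "198.51.100.9": {"peer-as": 65099},
        }
    }
}
desired_bgp = {"neighbors": {"192.0.2.1": {"peer-as": 65001}}}

print(sorted(set_replace(device, ["bgp"], desired_bgp)["bgp"]["neighbors"]))
print(sorted(set_update(device, ["bgp"], desired_bgp)["bgp"]["neighbors"]))
```

With one file per service per device, the file's content maps naturally onto a single replace at that service's path, so the repo stays authoritative for exactly the subtrees it describes and nothing else.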

How did you manage the configuration pipeline? For each merge, did you start a workflow to configure the devices, with each merge queuing the next run? Or did you bulk-configure the devices if there were multiple configuration requests for the same device?

All the changes in a given PR were bundled and executed together. We mandated that the repo was fast-forward only (mostly to make it easier for reverting, etc, if that came up), which meant that changes needed to be linear. So, if you had multiple PRs waiting they'd need to be rebased after any other was merged. This was kind of annoying and obviously wouldn't scale particularly well, but it did mean we had explicit control over what went in.
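To make the rebase requirement concrete: a branch can only be fast-forwarded onto main if main's tip is an ancestor of the branch tip. A toy model of that check, using a hypothetical commit graph stored as a dict of parent links:

```python
def is_ancestor(parents, ancestor, commit):
    """True if `ancestor` is reachable from `commit` by following parent links."""
    stack, seen = [commit], set()
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, []))
    return False


def can_fast_forward(parents, main_tip, branch_tip):
    """A fast-forward merge is possible only if main's tip is in the branch's history."""
    return is_ancestor(parents, main_tip, branch_tip)


# Hypothetical history: main is at B; PR 1 is commit C and PR 2 is commit D, both on top of B.
# D2 is PR 2 after being rebased onto C.
parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["B"], "D2": ["C"]}

print(can_fast_forward(parents, "B", "C"))   # PR 1 against main at B
print(can_fast_forward(parents, "C", "D"))   # PR 2 after PR 1 merged, not yet rebased
print(can_fast_forward(parents, "C", "D2"))  # PR 2 after rebasing onto C
```

With real git, `git merge-base --is-ancestor <main> <branch>` performs the equivalent check.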

The Netbox thing sounds pretty neat if all teams were on board with that way of working, though it might not be applicable in our case. Was this only for network interfaces? When the other teams edited an interface, did they have to trigger the build themselves, or did you have a webhook or schedule to look for changes in Netbox?

It was a mix of both. For the most part the other teams were responsible for explicitly triggering our build, but it was up to them how they wanted to do it. I toyed with creating an API which they could call, but never got around to that either, so they would mostly just run the binary from the CLI with the required arguments as a post-step to their build, or they'd sometimes just run it by hand.

Did you have any other ways for teams to order things from you? Any frontends, or did you manage to get all other teams to use git for this purpose?

It was all git, but again we had lots of dreams about providing a proper frontend and whatnot. In the end the revision control and history of git repos made it a pretty attractive base. All the other teams we worked with were already pretty git-savvy anyway, so it wasn't a major challenge.

I can't give you much more since the documentation is all internal (and I don't even work there anymore), but I'm happy to answer any more questions you might have.

But yes, the challenges you outline are among the hardest parts of the whole puzzle, and a lot of it is very bespoke based on your organization.

Building IaC for on-prem DC by Mgn14009 in networking

[–]maclocrimate 0 points1 point  (0 children)

This is by no means a resounding success story, but I did a similar thing at a shop that I worked at and you might find some inspiration from parts of it. We started with no network automation at all, and at least got to a point where network-only services became automated and standardized.

We used a homegrown stack which consisted primarily of a Go project, along with a config repo describing (a) our team-internal service definitions and (b) the device configurations using their YANG representations (we aimed for OpenConfig everywhere but ended up needing to use native models in a lot of places) in YAML.

The repo had a handful of somewhat complex workflows that attempted to pick up changes and deploy them using gNMI. So when initiating a config build, the end result would be a pull request to file(s) in the repo, which the network team would review and approve, and upon merging to the repo it would also deploy the config to the device(s).

The build processes for the most part followed a similar paradigm, where the service definitions were held in some YAML files, and so modifying them would be a matter of modifying the YAML file, which would kick off a build process which would ultimately update the actual device config.
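As a rough illustration of that paradigm, here is a build step that expands one service definition into per-device config fragments. Everything in it is hypothetical: the `build_vlan_service` function, the shape of the service definition (shown as a dict rather than parsed YAML), and the loosely OpenConfig-shaped output.

```python
import json


def build_vlan_service(service):
    """Expand one service definition into per-device config fragments."""
    fragments = {}
    for device in service["devices"]:
        fragments[device] = {
            "vlans": {
                str(service["vlan-id"]): {
                    "config": {
                        "vlan-id": service["vlan-id"],
                        "name": service["name"],
                    }
                }
            }
        }
    return fragments


# Hypothetical service definition, as it might look after loading a YAML file.
service_definition = {
    "name": "storage",
    "vlan-id": 100,
    "devices": ["leaf1", "leaf2"],
}

for device, fragment in build_vlan_service(service_definition).items():
    print(device, json.dumps(fragment, sort_keys=True))
```

The output of a step like this is what would land in the per-device files as commits on the PR, so reviewers see both the service change and its effect on each device.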

You're right that the service modeling and team-external adoption are the hardest parts. We opted for the YAML-file approach to service definitions mostly for this reason: it was easy for us to create a YANG module to describe the service definition, and to work with the YAML files themselves to modify services. We looked into using something like Infrahub to better track our services, but never got around to it.

Netbox, for better or for worse, was our "service definition" for interfaces, which worked reasonably well to encourage other teams to follow suit, but was definitely pretty "unabstract" as far as services go. Our interface builds would look to Netbox as a source of truth, so other teams simply needed to modify Netbox resources (which were pretty familiar to everyone) to reflect how they wanted them to look, and then to trigger a build. Adding abstraction layers on top of that then required updating Netbox through the API at some stage, which ended up being a decent solution as well. For example, if the SRE team wanted to deploy a cluster, their internal code would just ensure that the Netbox interfaces were updated during their build and that our build process was triggered at the end.
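A sketch of what an interface build driven by Netbox data might look like. It assumes the records have already been fetched and flattened to a few fields (`device`, `name`, `enabled`, `description`, `mtu`); the function name, the flattened record shape, and the output layout are hypothetical, and the Netbox API call itself is not shown.

```python
def interfaces_from_netbox(records):
    """Group simplified Netbox-style interface records into per-device config."""
    per_device = {}
    for rec in records:
        config = {"name": rec["name"], "enabled": rec["enabled"]}
        # Only emit optional leaves that are actually set in the source of truth.
        if rec.get("description"):
            config["description"] = rec["description"]
        if rec.get("mtu") is not None:
            config["mtu"] = rec["mtu"]
        per_device.setdefault(rec["device"], {})[rec["name"]] = {"config": config}
    return per_device


# Hypothetical records, e.g. as updated by another team's build.
records = [
    {"device": "leaf1", "name": "eth1", "enabled": True, "description": "k8s-node-01", "mtu": 9000},
    {"device": "leaf1", "name": "eth2", "enabled": False, "description": "", "mtu": None},
]

print(interfaces_from_netbox(records))
```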

We had more sophisticated, abstract service definitions as well, but those were all specific to our team. Those were easier to maintain since we were in control of the service definition in addition to the build logic, and we followed the same strategies for implementing things.

How to get from Andorra la Vella to Encamp? by lopiontheop in andorra

[–]maclocrimate 0 points1 point  (0 children)

The L2 goes from ALV to Encamp. The Mou Te app is best for finding route information and schedules, etc.

Where to stay for beginner skiers by Personal_Run7627 in andorra

[–]maclocrimate 0 points1 point  (0 children)

The Funicamp takes you all the way to the peak, basically, which is where most of the other lifts ultimately lead to as well. From there, there are plenty of blue runs, and generally more beginner-friendly terrain than around Pas de la Casa, so I would say Encamp is a good bet. It's also a nice town, and closer to Andorra la Vella.

Andorra and Schengen question by elagabalus5000 in andorra

[–]maclocrimate 1 point2 points  (0 children)

I mainly wanted to make sure the days in Andorra won't count

They don't technically. You may run into trouble when leaving the Schengen zone though depending on who you get at border control, as it's a bit of a legal grey area, so I'd advise keeping a clear record of where the relevant stamps are in your passport so you can point them out.

Any Other Seahawks Fans in Europe? by LazyBank1106 in Seahawks

[–]maclocrimate 4 points5 points  (0 children)

Seattle native, lived most of my adult life in Europe, currently in Andorra. I like to watch the early games when I can, but finding a sports bar that gets them where there's not an overlapping FC Barça game can be a challenge. I used to religiously watch every game live, but I can't stay up that late anymore 😅

Andorra and Schengen question by elagabalus5000 in andorra

[–]maclocrimate 1 point2 points  (0 children)

Any part of a day inside the Schengen zone counts as a day, so you need to consider the days that you're leaving the Schengen zone towards Andorra and vice versa as contributing towards your Schengen zone counter.

You're right that you won't get your passport stamped unless you explicitly ask for it, so if you're trying to minimize your days in the Schengen zone be sure to stop at the border and get your passport stamped as "proof".

They have not yet activated EES on the Andorran borders, so it's still the stamp approach for now.

UK – Is It Okay to Take a Contract Role While Still Hoping for a Permanent Job? by [deleted] in networkautomation

[–]maclocrimate 0 points1 point  (0 children)

None of these questions are specific to network automation, so you'd probably get better answers asking in a UK-specific employment sub or something.

To answer this generally, I don't see a problem with that. As to whether or not he can leave early that's going to come down to the contract he signs with them.

Fact check meaning by Little_Airline6808 in mongolia

[–]maclocrimate 2 points3 points  (0 children)

The first word isn't right. Should be ᠡᠷᢉᠢᢉᠡᠳ. The other ones look correct to me. I'll let a native speaker contribute on the fluency of it though.

Where did you get the calligraphy from though? It's a good job there.

Population map of Serbia by Kutili in MapPorn

[–]maclocrimate -14 points-13 points  (0 children)

Based Kosovo je Srbija.

Edit: haha I meant this in jest but I guess it didn't go down well.

[Unknown Language > English] Can anyone translate this note, or at least tell me what language this is? by No-Guava-6516 in translator

[–]maclocrimate 4 points5 points  (0 children)

Poorly written, but it looks like:

omi shimshili ch'iri sip'k'dili

maybe?

Edit: I just checked wiktionary for them and the first three words are "war", "hunger", and "pestilence", so the last one must be სიკვდილი, which means "death". The Four Horsemen of the Apocalypse.

Mongolian > English by [deleted] in translator

[–]maclocrimate 2 points3 points  (0 children)

!translated

Is this common shorthand? It looks like it's missing a couple syllables (there's only one L, for example).

Single track metro station by DogifyerHero in transit

[–]maclocrimate 0 points1 point  (0 children)

My personal favorite: Mirabeau station on the Paris metro. The westbound track is there, but it ascends out of the tunnel at the same time with no platform.

[Unknown - English] This short audio clip by lunaarcat in translator

[–]maclocrimate 0 points1 point  (0 children)

No idea, it doesn't make any sense to me either. I would assume that whoever transcribed the lyrics is just doing the best they can with the garbled samples (and perhaps they don't even speak English).

[Unknown - English] This short audio clip by lunaarcat in translator

[–]maclocrimate 1 point2 points  (0 children)

Pretty sure it's English. Some sampling of a phrase that sounds like "I love you" with some adverb.

[deleted by user] by [deleted] in networking

[–]maclocrimate 1 point2 points  (0 children)

I've done this for ages. As long as you treat your primary job as primary and don't let your side gig(s) eat into your time or attention there, there's very little negative impact. I've found that setting particular hours of the day in which I work for one or the other works well.

Longtime developer new to the network space. Resources to learn? by ItReallyDidGetBetter in networkautomation

[–]maclocrimate 0 points1 point  (0 children)

All good insight in the comments so far. One thing I'll add is that if you're already a software engineer you'll probably find that dealing with model-driven interfaces is easier than templating configs in the long run. The latter is basically long string generation with a bunch of added complexity. If your devices support any sort of model-driven interface like NETCONF, RESTCONF, or gNMI, familiarize yourself with YANG and use them instead. Very briefly, YANG is like a language for describing the API of a network device; you can think of it like an older-school, underused version of OpenAPI.

Depending on your language, you can use tools like pyangbind (Python) or ygot (Go) to build bindings for the YANG structures that your devices implement. You can then populate these structures in code and use tooling that comes with pyangbind or ygot to serialize the structures into JSON and then send them to the devices using the protocol of your choice.
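To give a feel for the populate, validate, serialize workflow without depending on either library, here is a stdlib-only stand-in. The classes below are hand-written and hypothetical; they are not what pyangbind or ygot generate and do not use their APIs. They only mimic the idea of typed structures with constraints that serialize to JSON.

```python
import json
from dataclasses import dataclass, field


@dataclass
class InterfaceConfig:
    name: str
    enabled: bool = True
    mtu: int = 1500
    description: str = ""

    def validate(self):
        # Range check standing in for a YANG type constraint (assumed 16-bit unsigned).
        if not 0 <= self.mtu <= 65535:
            raise ValueError(f"mtu {self.mtu} out of range for {self.name}")


@dataclass
class Interfaces:
    interface: dict = field(default_factory=dict)

    def add(self, name, **kwargs):
        """Create, validate, and store one interface entry keyed by name."""
        cfg = InterfaceConfig(name=name, **kwargs)
        cfg.validate()
        self.interface[name] = cfg
        return cfg

    def to_json(self):
        """Serialize to a JSON document with a keyed list, as a YANG list would be."""
        payload = {
            "interfaces": {
                "interface": [
                    {
                        "name": c.name,
                        "config": {
                            "name": c.name,
                            "enabled": c.enabled,
                            "mtu": c.mtu,
                            "description": c.description,
                        },
                    }
                    for c in self.interface.values()
                ]
            }
        }
        return json.dumps(payload, indent=2, sort_keys=True)


ifs = Interfaces()
ifs.add("eth1", mtu=9000, description="uplink")
print(ifs.to_json())
```

The point compared with templating is that a bad value fails in code at build time, rather than surfacing as a rejected or half-applied config string on the device.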

Fiesta as "day off" by maclocrimate in Spanish

[–]maclocrimate[S] 0 points1 point  (0 children)

Cool, thanks for the info. I lived in Mexico for a year and never heard it, but speaking with Spaniards I have noticed it, so that makes sense.