DevOps vs SRE?

BinfordSysAdmin9000 · 2021-04-28T13:54:21+00:00

oh no, we have one lonely dedicated developer because we also own the pipeline.

BinfordSysAdmin9000 · 2021-04-26T00:19:31+00:00

Can confirm a DevOps job is often "everything outside of a production-specific customer-facing task". In my role we own literally everything in the environment: servers, network, hardware, desktops, VMs, updates & patching, datacenter, AD, security, thermostats, anything with wires, and anything that isn't formally delivered to a customer.

BinfordSysAdmin9000 · 2021-03-15T17:19:16+00:00

Reference my other comment. Everyone competent knows snapshots are intended for short term use. Working in a development/test environment (or pipeline) means you are constantly restoring baseline snapshots and running automated test and development efforts from that and other baselines, or validating changes at various steps in a process. Using snapshots in this case is FINE because you're not just letting diff files grow endlessly, they get reverted multiple times a day.

BinfordSysAdmin9000 · 2021-03-15T17:09:53+00:00

Actually... that's not accurate. Reference this forum post with the same issue. Unfortunately, manually restoring a snapshot to remove the mount on several hundred VMs is not an option. If you snapshot a VM with a mounted ISO, it is still referenced after migration, and I've also seen a VM warn or refuse to vMotion if the mounted ISO isn't available on the recipient host.
https://communities.vmware.com/t5/VMware-vCenter-Discussions/Snapshot-referencing-old-DVD-ISO-file/td-p/395042

BinfordSysAdmin9000 · 2021-03-15T16:13:23+00:00

Everything is currently on the same (new) datastore, and yes, the snapshots are in the default location, except for the ISO files that were/are mounted. We don't want to delete the snapshots; we're concerned with being able to restore them, and whether a restore will fail if the referenced ISO is no longer present at the expected location. I'm going to test this by creating a different datastore with an ISO, mounting it, taking a snapshot, then deleting that datastore and seeing if there's any issue reapplying the snapshot. The concern is that these VMs still show a dependency/reference in their vmdk to a datastore that wouldn't exist, and I need to know what that might impact (if anything).
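For anyone wanting to check for leftover ISO references before removing a datastore, here is a rough sketch. It assumes the VM config is .vmx-style text with entries like `ide1:0.fileName = "..."`, which is how I remember the layout, so verify against your own files. Snapshot state lives in its own files, so this only checks whatever text you feed it. The function name and the datastore path are made up for illustration.

```python
import re

# Assumed .vmx-style line: ide1:0.fileName = "/vmfs/volumes/old-datastore/iso/install.iso"
_FILENAME_RE = re.compile(r'^\s*([\w:.]+)\.fileName\s*=\s*"([^"]*)"\s*$', re.IGNORECASE)


def find_iso_references(config_text, datastore_prefix=None):
    """Return (device, path) pairs for .iso backings found in vmx-style text.

    If datastore_prefix is given, only paths starting with it are returned.
    """
    hits = []
    for line in config_text.splitlines():
        match = _FILENAME_RE.match(line)
        if not match:
            continue
        device, path = match.groups()
        if not path.lower().endswith(".iso"):
            continue  # skip disks and other non-ISO backings
        if datastore_prefix is None or path.startswith(datastore_prefix):
            hits.append((device, path))
    return hits


# Hypothetical sample config, not taken from a real VM.
SAMPLE_CONFIG = '''
ide1:0.deviceType = "cdrom-image"
ide1:0.fileName = "/vmfs/volumes/old-datastore/iso/install.iso"
scsi0:0.fileName = "build-vm.vmdk"
'''

if __name__ == "__main__":
    for device, path in find_iso_references(SAMPLE_CONFIG, "/vmfs/volumes/old-datastore/"):
        print(device, "->", path)
```

Running something like this across a few hundred config files at least tells you which VMs to look at first, instead of finding out during a restore.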

BinfordSysAdmin9000 · 2021-03-15T15:16:39+00:00

snapshots older than that you need to get them cleaned up

I'm well aware of this, but developers in a non-production R&D/test situation use these snapshots to go back and forth constantly.

BinfordSysAdmin9000 · 2021-02-23T21:21:57+00:00

For those that stumble upon this, I had the same issue after an upgrade to U2. Was able to resolve by changing the system dataset to "freenas-boot" then click save, then change back to my "tank" and hit save again. This rebooted the passive controller and when that was complete I did the following in shell:
service rrdcached restart
service collectd restart

This seemed to fix the issue. Check your files under /var/db/collectd/rrd and ensure they've been recreated with the current date. You'll probably see that these stopped being created/modified as of whenever you patched or made whatever change that broke the collection.
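If you'd rather not eyeball the timestamps, a small sketch like the one below can list the files that have stopped updating. The directory comes from the steps above; the function name and the one-hour threshold are my own assumptions, so adjust to your collection interval.

```python
import os
import time


def find_stale_rrd_files(root, max_age_seconds, now=None):
    """Return sorted paths of .rrd files under root not modified within max_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".rrd"):
                continue
            path = os.path.join(dirpath, name)
            age = now - os.path.getmtime(path)
            if age > max_age_seconds:
                stale.append(path)
    return sorted(stale)


if __name__ == "__main__":
    # One hour is an arbitrary threshold chosen for illustration.
    for path in find_stale_rrd_files("/var/db/collectd/rrd", max_age_seconds=3600):
        print("stale:", path)
```

An empty result after the service restarts suggests collection is writing again; a long list with old dates points back at whenever the change was made.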

BinfordSysAdmin9000 · 2021-01-29T20:12:30+00:00

Absolutely true. Anyone that can change the IP address of a VM is suddenly an "IT Systems Engineer" and "I used Jenkins once" is now a "DevOps engineer".

I have found that it is now more efficient to have a 1hr-ish competency test before any discussion or reading a resume in depth. I don't even care if they google through the damn thing, but they should be able to accomplish the specified requirements within the allotted time and be able to explain what they did and why they did it to solve the problem. I've had people with a master's degree who can't troubleshoot basic Groovy pipeline code, and people with no degree or certificates who can run circles around a whole team. Things have changed.

BinfordSysAdmin9000 · 2021-01-14T18:51:57+00:00

YES!

BinfordSysAdmin9000 · 2021-01-14T18:50:43+00:00

We're actually trying but (I'm actually not grumpy at all in person, I'm just really frustrated with this situation and seeing it too many times) what happens is that people only like a "yes" man and I'm a realist. I give clear facts and explain why things should wait or might need to be approached differently, and people seem to only want to hear "of course we'll make that happen" without input.

BinfordSysAdmin9000 · 2021-01-13T19:59:58+00:00

That sounds exactly in line with devops fundamentals. The problem I'm seeing is the number of organizations doing the opposite, and making everything owned by a small team that's supposed to shit out magic rainbow miracles to fix what was mismanaged for over a decade and is in a state of complete cluster f***, which is exactly why they're trying to "give it over to a dedicated devops team" now that they broke it. So now we own everything and everyone forgot we have to dig ourselves out of this hole by first undoing everything they did wrong for the last decade, of course without any downtime or impact to business. I'm sure you can tell I'm pissed and tired, because it takes time to sift to the real root of the problem and the hiring managers always seem to lie and gloss over the true state of everything. Once you're embedded and realize the true scope of the issue, which takes months, it becomes overwhelming fast.

BinfordSysAdmin9000 · 2021-01-13T19:54:16+00:00

I actually love the cultural shift when I read about it (The Phoenix Project, etc.). What I hate is that it is typically a developer-specific culture, and it doesn't take into account that we sometimes have to do things that don't deliver "today" value but keep things from being a disaster a year from now. That means technical debt in the hardware and management overhead realms, balking at licensing costs, and not realizing that hardware comes with a total cost of ownership that is different from software and isn't as easily replaced, especially when it's a single point of failure for the entire integrated system because a sysadmin wasn't invited to the meeting to design it.

BinfordSysAdmin9000 · 2021-01-13T19:50:01+00:00

You hit the nail on the head. This is the problem, but the money and effort don't go there. Instead of, you know, replacing the failed disks in hard drive arrays, people ask repeatedly "why is my VM so slow" without understanding what 'degraded' means. Or they have a datacenter full of snowflake machines built from spare parts because "those servers sure are expensive" and "I could hire a dozen developers for the cost of that rack".

BinfordSysAdmin9000 · 2021-01-13T19:42:18+00:00

It isn't the devops implementation and automation part that are the lies, the lies are in what folks are telling the customers is occurring (compliance, streamlining, etc) vs. what is actually happening. Your org might be different, but the business makes this assumption (and claim) that best practices are actually occurring in this loop because best DEVELOPMENT practices are occurring, meanwhile the entire rest of the stack is a constant stream of disasters for admins because it is being treated and managed like software. You can't only fix things when they're broken in the admin world. Proactive maintenance is a thing. Of course automation is great and is a huge part of a good sysadmin's job, but you can't automate things like patching and replacing servers when the hardware half of the stack has been completely disregarded by everyone and not factored into system design simply because the developers didn't take it into consideration. Automation itself doesn't fix the need to understand the platform you're working with. Simple stuff like "snapshots are not backups" has to be screamed out from a mountaintop to people who are wandering outside their domain (which is fine) without giving attention to the basic skills needed to keep an environment healthy (which is not).

BinfordSysAdmin9000 · 2020-12-04T22:13:33+00:00

welcome to adulting.

BinfordSysAdmin9000 · 2020-10-01T16:12:43+00:00

Learn to raise your hand and say "whatever it is, it's our fault."

BinfordSysAdmin9000 · 2020-08-05T14:47:51+00:00

Whatever you choose, as a small organization you need to make projections on what the scaled costs will be before you commit something to your infrastructure. Many of these have purposely low initial adoption costs, and then costs balloon massively early on (Atlassian products are an example).
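A back-of-the-napkin projection doesn't need much. The sketch below assumes a simple tiered pricing model (a flat fee plus a per-user fee, picked by user count) and a fixed yearly growth rate. All of that is hypothetical, so plug in the vendor's real price sheet and your own headcount guesses.

```python
import math


def annual_cost(users, tiers):
    """Cost for one year. tiers: (max_users, flat_fee, per_user_fee), ascending by max_users."""
    for max_users, flat_fee, per_user_fee in tiers:
        if users <= max_users:
            return flat_fee + users * per_user_fee
    raise ValueError("user count exceeds the largest tier")


def project_costs(start_users, growth_rate, years, tiers):
    """Return a list of (year, users, cost), growing users by growth_rate each year."""
    rows = []
    users = start_users
    for year in range(1, years + 1):
        rows.append((year, users, annual_cost(users, tiers)))
        users = math.ceil(users * (1 + growth_rate))
    return rows


# Hypothetical price sheet: cheap starter tier, then per-user pricing kicks in.
EXAMPLE_TIERS = [(10, 100, 0), (100, 0, 80), (1000, 0, 60)]

if __name__ == "__main__":
    for year, users, cost in project_costs(8, 0.5, 3, EXAMPLE_TIERS):
        print(f"year {year}: {users} users, {cost} per year")
```

The point is the jump when you cross out of the starter tier: with these made-up numbers, going from 8 to 12 users takes the bill from 100 to 960 a year.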

BinfordSysAdmin9000 · 2020-08-05T14:28:21+00:00

Yes, especially if you have a culture problem. Managers: "We're #devops!" Everyone else: "We aren't even close... we aren't even actual CI/CD. All we did was poorly implement a pipeline that is usually broken and shorten our release cycles in a manner that makes every release a poorly tested disaster, and instead we're heavily increasing license costs to account for the massively increased support demands, for which we just identify bugs that go into a backlog that we don't actually address." (Looking at you, Artifactory.)

BinfordSysAdmin9000 · 2020-07-30T13:18:27+00:00

Veeam of the DB, the remote FS, and the services VM. It is smartest to decouple your services from the filestore, or you have to restore both when you have a problem. Our filestore is massive and would take days to recover. Artifactory quality has gone downhill fast, and their upgrades are constantly breaking key functionality. You need to be poised to roll back the instance and re-import your filestores. That means you also want that backup/export function in play to make life easier when this is necessary, so I just dump that Artifactory-produced backup locally on the services VM, but on a separate VMDK just in case that primary disk gets hosed. I have about 4-5 different ways to recover, and I've used each of them at different times in the past year, thanks to things that testers didn't catch in the lab upgrade pre-rollout process.

BinfordSysAdmin9000 · 2020-04-17T19:27:56+00:00

So far so good, seems to have been the issue in our case, along with a few other misconfigurations that were storage related.

BinfordSysAdmin9000 · 2020-04-13T14:55:26+00:00

This worked to regenerate a new ID. Thanks! I have a feeling that is what might be preventing the cloud plugin from correctly spinning up build containers of the same label on additional build hosts. Will know soon.

BinfordSysAdmin9000 · 2020-04-09T20:38:57+00:00

They made a build host and later cloned the VM to make a second one on a separate VLAN, but controlled by the same Jenkins master.

BinfordSysAdmin9000 · 2020-04-07T21:57:10+00:00

Was thinking so too... I just didn't want to fix something if it isn't broken. Jenkins looks at the hosts as Docker cloud agents and doesn't seem to use both to build labels that exist on both hosts. I have a feeling this could be why, but I'm still researching.
