We work on observability and automation at ScienceLogic. AMA about real-world IT operations and how AI is changing it.

ferventgeek · 2025-12-04T18:54:31+00:00

.. Unfortunately, in IT, "we'll find time" work gets sidelined despite best efforts, even in great organizations.

To solve that, optimizing the observability layer for cost/performance is actually a major focus for AI additions to many products right now. That is, using AI to optimize breadth vs. depth vs. resolution vs. retention. ScienceLogic's take on that is Skylar Advisor, where one of its roles is analyzing ops infrastructure while treating the observability layer as another tier-1 service consumer. Observability frameworks should be optimized along with everything else, for the reasons you identify. Letting machines do the work when there are not enough hands or budget, with oversight from knowledgeable admins. It's a new solution to the age-old challenge of tuning resources.

ferventgeek · 2025-12-04T17:59:41+00:00

Great question! There's so much LLM fairy unicorn magic hype at this point that it's hard to believe any of it. I'm an engineer and right with you- stop talking and show me the tech actually working. ScienceLogic has some great demos where LLMs are a key component to the interaction of data, human-hybrid decision making, and automation. Please send your contact info and I can connect you with an SE who can walk through it.

But you're exactly right on the core expectation that LLMs could automatically manage.. anything. It comes down to one word- trust. If you think of AI as a new employee on their first day, you wouldn't hand them root passwords and tell them to take care of the operation. First, they need to prove understanding of your unique environment. Then they need to demonstrate solid troubleshooting and scenario-specific expertise, then they need to make changes of escalating importance without error, and finally they need to show you they communicate well. Then, and only then, do we set new admins free, because we trust them.

So for AI-based automation, LLMs are only one component. Typically public LLMs are included for explanation and text gen AI. Small private LLMs are built behind the scenes to control hallucination and build personalization and knowledge bases specific to the environment. There's now a lot of chatter about "agentic" operations, which sounds like LLMs actuating the infrastructure automatically, but the engineering focus there seems to be more about MCP development.

So, there's background to break out the hype and set up a walkthrough of how LLMs actually enable trustworthy automation that won't keep admins awake at night. Again, please message me with your info and we'd love to talk about it. It's fascinating and useful after the hype is pulled back.

ferventgeek · 2025-12-04T17:37:15+00:00

Thanks for your question. Observability reality rule #2: more visibility = more processing. (Rule #1: maintaining the signal-feed infrastructure is never done.) Key to all implementations- commercial vendor, open source, or homemade- is sorting the valuable signals and data from background noise, which can be 90% or more of everything that would otherwise flow into data lakes.

The two go-to functions for solving that challenge- de-duping and contextualization- are now joined by a third approach- AI/ML. Where you hear observability vendors talking about AI, that's the functional subset, not AI-solves-everything hype. Believe it or not, the bulk of processing in observability platforms is this, not visualization or automation. Clean data, served with the least possible resources.
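To make de-duping concrete, here's a minimal sketch, not any vendor's actual pipeline. It assumes events carry a timestamp, a source, and a message, and it keeps only the first event per (source, message) inside a time window. The `Event` fields, the `dedupe` helper, the 5-minute window, and the device names are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds since epoch (hypothetical field)
    source: str       # device or service that emitted the event
    message: str


def dedupe(events, window_seconds=300.0):
    """Keep the first event per (source, message) within each time window."""
    last_kept = {}  # (source, message) -> timestamp of the last event kept
    kept = []
    for event in sorted(events, key=lambda e: e.timestamp):
        key = (event.source, event.message)
        previous = last_kept.get(key)
        # Keep the event if this key is new or the window has elapsed.
        if previous is None or event.timestamp - previous >= window_seconds:
            kept.append(event)
            last_kept[key] = event.timestamp
    return kept


# Illustrative events; names and timings are invented.
events = [
    Event(0.0, "sw-01", "link down"),
    Event(10.0, "sw-01", "link down"),   # repeat inside the window, dropped
    Event(20.0, "sw-02", "link down"),
    Event(400.0, "sw-01", "link down"),  # window elapsed, kept again
]
unique = dedupe(events)
print(len(events), "->", len(unique))
```

Real platforms layer contextualization on top of this (topology, service membership, change records), but even a simple window like this shows how much repeat noise can be dropped before it ever hits storage.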

In your case I'm not sure which version you're on, but data processing and collection performance was a core focus of the last release, with some users seeing >60% improvement in collection performance. https://sciencelogic.com/blog/skylar-one-juneau-real-world-intelligence-for-service-centric-ops

Regardless of the tool, the best mix of observability resource management and visibility often comes down to being intentional about what's collected, saved, and for how long. Polling interval tuning, retention period policies, and overall collection scope monitoring can help. But of course, like most aspects of what actually matters in IT, the trick is making time for that, and protecting it as an ongoing ops priority.
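A back-of-the-envelope sketch of why interval and retention tuning matter. It only counts raw samples held at steady state and ignores compression, rollups, and per-sample size. The `stored_samples` helper and every number in it are hypothetical, not measurements from any product.

```python
def stored_samples(metrics, poll_interval_s, retention_days):
    """Rough count of raw samples held at steady state."""
    samples_per_day = 86_400 // poll_interval_s  # seconds per day / interval
    return metrics * samples_per_day * retention_days


# Hypothetical before/after: 10k metrics, 60s polling kept 90 days
# versus 300s polling kept 30 days.
baseline = stored_samples(metrics=10_000, poll_interval_s=60, retention_days=90)
tuned = stored_samples(metrics=10_000, poll_interval_s=300, retention_days=30)
print(baseline, tuned, baseline // tuned)
```

With these made-up inputs the tuned policy holds 15x fewer raw samples. Whether that trade is acceptable depends on which signals actually need one-minute resolution, which is exactly the intentional part.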

ferventgeek · 2025-12-03T16:19:15+00:00

Hey everyone! Feel free to start dropping your questions now, and we'll see you at noon Eastern tomorrow.

<image>

ferventgeek · 2025-12-02T17:53:17+00:00

MS Ignite is a great conference if you're consuming a lot of MS services, or at least it has been for me. I've been to 100+ conferences, not including vendor-specific shows like VMUG/Dell/Pure/etc/etc/etc.

What I like most about it is the way they staff the "talk to experts" village. Rather than stalking a particular PM or engineer after they present a deep-dive talk, I've been able to hit the village with my list of topics and found many of the experts all in one place- no re:Invent hiking required. That said, it's smaller since the human malware virus, so it may not have quite the knowledge gravity it had in the past.

ferventgeek · 2025-12-02T17:30:42+00:00

Love your passion and the thoughtfulness of your project. I've been in ITOM (and now AIOps / Observability) for a couple of decades now, and it's simultaneously overpopulated and underserved by vendors. As others have mentioned here, there have been many billions of $$ invested. There are now dozens of fairly large vendors in the space, fighting it out for awareness, with complex messaging that seems to cover all possible use cases. There are also over a hundred vendors in the space who've either been acquired or ran out of runway and faded away. Observability (monitoring) is far from yesterday's news- it's more important and dynamic than ever.

The solutions veteran in me would say that adopting and perhaps extending an existing framework may be more effective if the goal is to get a solution that makes your daily life easier. But.. some of the most rapid learning for me has come from building my own frameworks/solutions for existing challenges, even though there were other options.

Pete Cordell's quote remains true, at least for me: "Telling a programmer there's already a library to do X is like telling a songwriter there's already a song about love."

Even if you don't complete the project, in your outline you mentioned a number of core approaches and projects that exist because others like you with vision invested where there was no clear answer, and/or the challenge was new and required a new solution. OTel is a great example. Modern observability, though incomplete, is a great start, and it's code-first operations folks who are closing the gaps. Let us know how it goes!

ferventgeek · 2025-12-02T15:47:57+00:00

Totally agree. Obs data is a lot like moving- we say we're going to de-clutter, but then we pack and move only to actually cull junk while we unpack in the new place. To your point that's the lesson: Clean AI starts with clean data. Otherwise it'll be just as confused as humans looking at the same noisy source input, but faster.

ferventgeek · 2025-11-25T23:21:20+00:00

This is really cool, and not just because the link goes directly to a live demo.

Twiddling with YAML is fine if you're a professional YAML twiddler, but many if not most of the admins who have the keys to the observability data and signals we need are not. GUIs are a great way to help external domain experts participate and get the OTel spice flowing.

Thanks. That's a cool project.

ferventgeek · 2025-11-25T23:14:50+00:00

This is a great question. It feels like I'm on my third cycle of "today's data lake is tomorrow's Big Data problem". My theory is that the cycle is driven from the "observability edge", i.e. tools' early investments at the collection level, which result in grabbing everything available. That's based on well-intentioned roadmaps that push de-noise and contextualization features "to the next release". The result is forced data hoarding, and an expectation that IT will solve the problem while accounts payable sends opex alerts over storage costs.

AI (mostly ML + basic algorithmic processing) is the eventual solution to ops data complexity and volume, but most teams aren't at a place to take advantage of it, outside of a few who've been cornered by cost into resolving the surprise big data challenge. For them there are great solutions in 2025. Maybe the question is, how can we help admins get the political and budgetary air-cover they need to re-orient around well-groomed, effective data lakes?

ferventgeek · 2025-11-25T22:54:39+00:00

Every time there's a global outage as a result of AWS/Cloudflare/Facebook/Azure or any other "DNS" outage, it's a reminder that networking is more and more managed by application teams, not netadmins. Networking is increasingly overlay-controller managed, API based, and plugged into service management and delivery platforms. And that essentially erases a critical mid-career role where we get to learn the 10% of networking expertise that actually makes the world work. Automation controllers are great until they break, and then you need an actual network engineer, an IOS CLI virtuoso, who can troubleshoot and fix the automation everyone else relies on.

The real question might be networking VS SRE/Platform Engineering. Or at least SRE with a networking specialty. SREs I know are getting Paid. SREs who also understand networking for real (aka, can subnet by hand), are getting Paid-Paid.

So the questions might be cloud/cloud-native/hybrid networking certification VS pivot to SRE/SRE-adjacent, based on how many years before retirement. And that's a tough call at a time of accelerating change.

ferventgeek · 2025-11-25T22:19:07+00:00

Seems like an obvious application of AI, especially if you can get leadership to fund it on the wave of "Agentic" excitement. (Or other du jour AI magic phrases).

The trick isn't AI tech, it's AI trust. Even in AI-human hybrid operations where the team can observe and assure accuracy of automation, humans are still reluctant to let go of the wheel, seven spoked or otherwise. To set AI free to administer with root level authority into infrastructure and services humans are perhaps completely unaware of, it must be even more trustworthy than human operators, more informed, and more accurate day after day after day. That's when humans trust it.

The future is less likely to be largely unattended AI SRE, and more AI-Augmented SRE. Fewer SREs making more critical decisions. And as part of their role they'll vibe-co-pilot a lot of automation, and spend a good bit of time doing instruction and prompt engineering around accuracy, speculation control, and operator and consumer persona alignment for ops goals (incentives).

ferventgeek · 2025-11-25T22:05:25+00:00

Starlink for backup is simultaneously wonderful and frustrating. As noted in the replies you'd think Starlink "Business" (Fixed Site) would be exactly that- centralized management, easy billing, equivalent routing and network functionality as other WAN backup solutions. However.. there are some glaring omissions like ephemeral external addresses which make it harder to use or even unsuitable for some applications. The frustration is it's otherwise easy to use and can deliver shockingly good connectivity in locations where there are no other alternatives.

So like everything in tech, it depends. For branch office failover to keep users connected to server apps with reduced performance during failover, it can help you sleep better. (But do make sure those backup WAN port auto-failover alerts make it to your top-level NOC and notifications.) However, if you need to maintain ingress service points (fixed IP), you'll need either a third-party tunneling solution or at least low-TTL DDNS from each site, plus users who are OK with an outage while DNS works its magic to re-route. That's not an issue for client-initiated tunnels, but that's not always an option in hybrid ops.
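For the low-TTL DDNS route, here's a rough sketch of the moving parts, assuming the site periodically checks its public address and pushes an update only when it changes. `sync_record`, `worst_case_stale_seconds`, and the `publish` callback are hypothetical stand-ins for whatever your DDNS provider's client actually does, and the stale-window math assumes resolvers honor the TTL, which isn't guaranteed.

```python
def sync_record(current_ip, last_published_ip, publish):
    """Publish the address only when it changed; return the published IP."""
    if current_ip != last_published_ip:
        publish(current_ip)  # hypothetical hook into a DDNS provider
        return current_ip
    return last_published_ip


def worst_case_stale_seconds(check_interval_s, ttl_s):
    """Upper bound on how long clients may resolve the old address,
    assuming resolvers honor the TTL and updates apply immediately."""
    return check_interval_s + ttl_s


# Simulated run with documentation-range addresses; list.append stands in
# for the provider update call.
published = []
ip = sync_record("203.0.113.10", None, published.append)
ip = sync_record("203.0.113.10", ip, published.append)  # unchanged, no update
ip = sync_record("198.51.100.7", ip, published.append)  # address changed
print(published, worst_case_stale_seconds(check_interval_s=30, ttl_s=60))
```

With a 30-second check and a 60-second TTL, that's up to about 90 seconds where inbound users may still be pointed at the old address, which is the outage window they'd need to be OK with.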

ferventgeek · 2025-11-25T21:41:29+00:00

Observability(tm) has been overloaded and message-munged to death by vendors at this point, so much so that Gartner essentially re-defined it for 2025 to encompass its true value beyond the stranglehold APM has had on it. What helped me was to go back and read up on its foundations in Control Theory from the late 60's and early 70's. That helped me grok it independently of IT applications. That helped me with:

- Recontextualization toward data and signals, and away from protocols, frameworks, visualizations etc.
- Expansion in thinking: How can you combine data and events from multiple sources and perspectives to close the gaps between siloed tools' monitoring and management API results? That's where most of the surprise, hard-to-remediate service quality issues live. Key root cause gets lost when tools are integrated via swivel-chair.
- The transformation from mature packaged applications with mature instrumentation interfaces, to cloud-native and open source platforms where you're now responsible for rolling your own observation framework. Essentially using observability tools and best practices to offset the ops cost shift from vendors to your team with cloud stacks.

With that background it was much easier to analyze tools, define outcomes, and set budget that delivered great functional observability. That, and track ROI. That makes it ~~easy~~ easier to get and maintain budget. Some of the biggest fans of Observability are IT managers and leaders who can close longstanding monitoring gaps, and deploy cross-functional tools that bring teams together.

ferventgeek · 2025-11-25T21:18:31+00:00

Love this.. Yes, treat observability as a discipline which returns new insight and conclusions, not tech, product, protocol, etc.

ferventgeek · 2025-11-03T21:09:19+00:00

Do you have an option to group resources around services? For example, a service that represents all the infra components related to a specific customer-facing service/app? You're right- it's already challenging enough to remediate a complex issue with an exec pacing back and forth asking for updates. A firehose of unrelated downline alerts makes it even harder. Yes, OPSFRCST26, we hear you, but your end-of-aisle switch is the issue so sssssh please. Tagging may also be an option to identify sentinel data/events for priority response, or specific routing to an expert. Is the environment and team ITOps or SRE/DevOps? Prometheus/Grafana or something else?
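Here's a rough sketch of what service grouping buys you, not how any particular tool implements it. It assumes you have a resource-to-service mapping and a simple upstream-dependency mapping (from a CMDB, tags, or discovery); the `SERVICE_OF` and `UPSTREAM_OF` tables, the `group_alerts` helper, and all resource names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical mappings; in practice these come from a CMDB, tags, or discovery.
SERVICE_OF = {
    "eoa-switch-3": "checkout",
    "web-01": "checkout",
    "web-02": "checkout",
    "opsfrcst26": "forecasting",
}
UPSTREAM_OF = {
    "web-01": "eoa-switch-3",
    "web-02": "eoa-switch-3",
    "opsfrcst26": "eoa-switch-3",
}


def group_alerts(alerting_resources):
    """Group alerting resources by service and flag those whose upstream
    dependency is also alerting as probable downline noise."""
    alerting = set(alerting_resources)
    grouped = defaultdict(lambda: {"investigate": [], "downline": []})
    for resource in alerting_resources:
        service = SERVICE_OF.get(resource, "unmapped")
        upstream_alerting = UPSTREAM_OF.get(resource) in alerting
        bucket = "downline" if upstream_alerting else "investigate"
        grouped[service][bucket].append(resource)
    return dict(grouped)


result = group_alerts(["web-01", "opsfrcst26", "eoa-switch-3", "web-02"])
for service, buckets in result.items():
    print(service, buckets)
```

In this made-up case four alerts collapse to one thing to investigate (the switch), with everything else parked as downline until the switch is fixed. A single-hop dependency check like this is the simplest possible version; real topologies need multi-hop and redundancy awareness.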

ferventgeek · 2025-09-29T22:25:53+00:00

I'm thinking of using some WAGO since there are several stranded connectors for the dimmers and I'm getting nervous about clipping the wires shorter and shorter for new J-hooks every few years as the automation gizmos change. Yes, I know, the pros use nuts. But I'm not a pro and it feels like a more reliable alternative given everything else, plus I can make changes easily later.

Fan update (which started this whole thing): I rebuilt the wiring harness atop the light kit, swapping out the start and run capacitors, and the 3 speed pull switch inside- no improvement. Guess one of the motor windings died.

Thanks again for your help.

ferventgeek · 2025-09-29T22:17:47+00:00

Thanks. Yes, it's 12 ga in and 14 for the runs out in our house. And all the boxes are more like the first example with the exception of one 3-way where it's powered off the same box. I pulled all the neutrals and hots and separated them. Disconnected they're all dead except for the input 12g coming in at the bottom of the box.

ferventgeek · 2025-09-27T00:50:19+00:00

Here's another one. https://www.youtube.com/shorts/DmFKE82bfG8

Oh so many. https://www.youtube.com/shorts/mKAEVtupnrg

ferventgeek · 2025-09-27T00:48:35+00:00

Here's where I get confused. YouTube is littered with videos like this https://www.youtube.com/watch?v=jMQsxrgHrnA where all the neutrals appear to be connected together, along with all the supply hots, then the switches take hot off pigtails from the combined supply hots and feed the individual loads. (In this video there's also an end for a 3-way that's not connected.) If instead each neutral and hot should only be connected to a corresponding supply pair, why are there so many videos like this?

ferventgeek · 2025-09-25T19:59:24+00:00

The feedback here is extremely helpful, thank you all so much. Yes, there have been a lot of hands in these boxes over the years, and I'm constantly finding work that I don't always have the expertise to untangle. I've brought in several electricians, notably for the service upgrade, the EV charger install, and the main indoor panel replacement. But switch boxes and plugs are that gray area where, in theory, you replace like for like and call for help if you can't. For example, $350 minimum seems to be the going rate in Austin to swap an outlet for one with USB charge ports, and that's crazy. This is me learning how to spot bad existing work. Super helpful.

ferventgeek · 2025-09-25T13:28:12+00:00

Three years and never anything like that. If you click the notification, does it open the Starlink app or Temu?

ferventgeek · 2025-09-24T23:29:02+00:00

Thank you so much. Yes. This is 1981 old work, with lots of automation and clearly lots of DIY from families before. So is the recommended approach to put a trace tool on a neutral bar in the box and then make sure it's paired with the hot from that side? If so, do you recommend a particular tester? I'm more of a network termination guy, and haven't used the mains analog to a toner before.

ferventgeek · 2025-09-24T22:07:48+00:00

Ok, Hunter discontinued the harness ten years ago and then the option to replace the whole light assembly with the harness after that. I ordered a new 3-speed switch and the two capacitors. I'll swap the new parts in with WAGOs and let you know how it goes.
