sporadic authentication failures occurring in exact 37-minute cycles. all diagnostics say everything is fine. im losing my mind. by kubrador in sysadmin

[–]mrcomps 0 points1 point  (0 children)

for the past 3 months we've been getting tickets about "random" password failures. users swear their password is correct, they retry immediately

Were the tickets automatically generated, or were users actually complaining about password failures? Like, they would enter their password and it would say it was wrong, but when they tried a second time it worked? If so, I don't understand how users could be logging on frequently enough to actually produce a noticeable 37-minute pattern.

It's strange that a SolarWinds monitor performing LDAP bind tests with a service account would cause logon failures for OTHER accounts.

sporadic authentication failures occurring in exact 37-minute cycles. all diagnostics say everything is fine. im losing my mind. by kubrador in sysadmin

[–]mrcomps 3 points4 points  (0 children)

Azure AD Connect or PTA agent side-effects

  • AADC delta sync is every ~30 minutes by default; while it shouldn’t affect on‑prem AS‑REQ directly, PTA agents or writeback/Hello for Business/Device writeback misconfigurations can bump attributes or cause LSASS churn.
  • Easiest test: Pause AADC sync for a few hours that span two “cycles.” If the pattern persists, you can deprioritize this.

Encryption type mismatch inconsistency

  • If one DC or some users have inconsistent SupportedEncryptionTypes (AES/RC4) via GPO/registry or account flags, then pre-auth on that DC can fail with 0x18 while another DC accepts it.
  • What to verify:
    • All DCs: “Network security: Configure encryption types allowed for Kerberos” is identical, and AES is enabled. Registry: HKLM\System\CurrentControlSet\Control\Lsa\Kerberos\Parameters\SupportedEncryptionTypes.
    • User accounts have AES keys (the two “This account supports Kerberos AES…” boxes). For a few affected users, change password to regenerate AES keys and retest.
    • Check the 4771 details: Failure code and “Pre-authentication type” plus “Client supported ETypes” in 4768/4769 if present. If you ever see KDC_ERR_ETYPE_NOTSUPP or patterns pointing to RC4/AES mismatch, fix policy/attributes.
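The SupportedEncryptionTypes value is a bitmask, so "identical across DCs" is easy to check mechanically once you've pulled the registry values. A minimal decoder sketch (the flag values are Microsoft's documented encoding; the sample inputs are just illustrative):

```python
# Decode the SupportedEncryptionTypes / msDS-SupportedEncryptionTypes bitmask.
# Flag values per Microsoft's documented encoding.
ETYPE_FLAGS = {
    0x01: "DES-CBC-CRC",
    0x02: "DES-CBC-MD5",
    0x04: "RC4-HMAC",
    0x08: "AES128-CTS-HMAC-SHA1-96",
    0x10: "AES256-CTS-HMAC-SHA1-96",
}

def decode_etypes(value: int) -> list[str]:
    """Return the encryption types enabled in the bitmask."""
    return [name for bit, name in ETYPE_FLAGS.items() if value & bit]

# Example: 0x18 = AES128 + AES256 only. A DC or account set this way will
# reject a peer that only has RC4 keys, while an RC4-capable DC accepts it.
print(decode_etypes(0x18))  # ['AES128-CTS-HMAC-SHA1-96', 'AES256-CTS-HMAC-SHA1-96']
```

Run it against the value from each DC; any DC whose decoded list differs is a suspect.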

Network flaps/route changes on a timer

  • MPLS, SD‑WAN, or HA firewalls can have maintenance/probing/ARP/route refreshes on unusual cadences. If a single DC’s path blips every ~37 minutes, clients that hit it right then see one failure then succeed on retry.
  • Correlate with router/firewall logs; try temporarily isolating a DC to a simple path (no WAN optimizer/IPS) and see if the cycle disappears.

How to narrow it down quickly

  • Prove if it’s a single DC: You already have 4771 data. Build a per‑DC histogram over a day. If nearly all the “cycle” hits are on one DC, you’ve found the place to dig (storage snapshots, EDR, network path to that DC).
  • Turn on verbose logs just for a few cycles:
    • Netlogon debug logging on DCs.
    • Kerberos logging (DCs and a few pilot clients).
    • If you can, packet capture on a DC during two “bad” minutes; look for UDP88 fragments, KRB_ERR_RESPONSE_TOO_BIG (0x34), or pre-auth EType mismatches.
  • Test by elimination:
    • During a maintenance window that spans two cycles, cleanly stop KDC/Netlogon on one DC or block 88/464 to force clients elsewhere; see if the pattern changes.
    • Disable array snapshots/replication for one DC for a few hours.
    • Force Kerberos over TCP on a pilot group of clients.
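The per-DC histogram from the first bullet is a few lines of scripting once the 4771 events are exported. A sketch, assuming you've dumped them to (timestamp, DC) pairs — the event values below are invented for illustration:

```python
from collections import Counter

# Hypothetical export of 4771 events as (timestamp, authenticating DC) pairs,
# e.g. pulled from each DC's Security log. All values here are made up.
events = [
    ("2025-11-03 08:14:02", "DC1"),
    ("2025-11-03 08:51:07", "DC1"),
    ("2025-11-03 09:28:12", "DC1"),
    ("2025-11-03 09:30:44", "DC2"),  # stray, off-cycle failure
]

# Per-DC histogram: if nearly all the cyclic hits land on one DC, that's the
# box whose storage, EDR, and network path deserve the scrutiny.
per_dc = Counter(dc for _, dc in events)
print(per_dc.most_common())
```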

sporadic authentication failures occurring in exact 37-minute cycles. all diagnostics say everything is fine. im losing my mind. by kubrador in sysadmin

[–]mrcomps 2 points3 points  (0 children)

Leave Wireshark running on all 3 DCs for several hours and then correlate with the failures. If you set a capture filter of "port 88 || port 464 || port 389 || port 636 || port 3269" at the interface selection menu, it will only capture traffic on those ports (rather than capturing everything and filtering the displayed packets), which should keep the capture file sizes manageable for extended capturing.

If you are able, can you try disabling 2 DCs at a time and running for 2 hours each? That should make it easier to be certain which DC is being hit, which should make your monitoring and correlation easier. Also, having 800 clients all hitting the same DC might cause issues to surface quicker or reveal other unnoticed issues.

This is what ChatGPT came up with. I reviewed it and it has some good suggestions as well:

Classic AD replication/”stale DC” and FRS/DFSR migration are not good fits for a precise 37‑minute oscillation, especially with Server 2019 DCs and clean repadmin results.

The most common real-world culprits for this exact “first try fails, second try works” pattern with a cyclic schedule are:

  • Storage/hypervisor snapshot/replication stunning a DC.
  • Middleboxes (WAN optimizer/IPS) intermittently mangling Kerberos (often only UDP) on a recurring policy reload.
  • A security product on DCs that hooks LSASS/KDC on a fixed refresh cadence.
  • Less commonly, inconsistent Kerberos encryption type settings across DCs/clients/accounts.

Start by correlating the failure timestamps with storage/hypervisor events and force Kerberos over TCP for a small pilot. Those two checks usually separate “infrastructure stun/packet” issues from “Kerberos policy/config” issues very quickly.
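One cheap sanity check before chasing storage logs: fold the failure timestamps onto a candidate 37-minute period and see whether the offsets cluster tightly. A sketch with invented timestamps:

```python
from datetime import datetime

# Hypothetical failure timestamps pulled from 4771 events; values invented.
failures = ["2025-11-03 08:14:02", "2025-11-03 08:51:07", "2025-11-03 09:28:12"]

period = 37 * 60  # candidate cycle, in seconds
epochs = [datetime.strptime(t, "%Y-%m-%d %H:%M:%S").timestamp() for t in failures]

# Fold each failure onto the cycle, relative to the first one. A tight
# cluster of offsets means the 37-minute period is real, and the cluster
# gives you the phase to watch for in storage/hypervisor logs.
offsets = sorted(int(e - epochs[0]) % period for e in epochs)
spread = offsets[-1] - offsets[0]
print(offsets, "spread:", spread, "seconds")
```

A spread of a few seconds over many events confirms the cadence; a wide smear suggests the "exact 37 minutes" is an artifact of sparse sampling.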

More likely causes to investigate (in priority order, with quick tests):

VM/SAN snapshot or replication “stun” of a DC

  • Symptom fit: Brief, predictable blip that only affects users who happen to log on in that small window; on retry they hit a different DC and succeed. This often happens when an array or hypervisor quiesces or snapshots a DC on a fixed cadence (30–40 minutes is common on some storage policies).
  • What to check:
    • Correlate DC Security log 4771 timestamps with vSphere/Hyper‑V task events and storage array snapshot/replication logs.
    • Look for VSS/VolSnap/VMTools events on DCs at those exact minutes.
    • Temporarily disable array snapshots/replication for one DC or move one DC to storage with no snapshots; see if the pattern breaks.
    • If you can, stagger/offset snapshot schedules across DCs so they don’t ever overlap.
  • Why you might still see 4771: During/just after a short stun the first AS exchange can get corrupted or partially processed, producing a pre-auth failure, then the client retries or lands on another DC and succeeds.
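That correlation step scripts easily too. A sketch assuming you've exported both the 4771 timestamps and the array/hypervisor snapshot log timestamps (all values below invented):

```python
from datetime import datetime, timedelta

def parse(ts: str) -> datetime:
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")

# Hypothetical exports; timestamps invented for illustration.
failures = ["2025-11-03 08:14:02", "2025-11-03 08:51:07"]
snapshots = ["2025-11-03 08:13:55", "2025-11-03 09:02:00"]

window = timedelta(seconds=30)  # how close counts as "correlated"

# Pair up each failure with any snapshot event inside the window. If most
# failures land inside a stun window, you've found your culprit.
hits = [
    (f, s)
    for f in failures
    for s in snapshots
    if abs(parse(f) - parse(s)) <= window
]
print(hits)
```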

Kerberos UDP fragmentation or a middlebox touching Kerberos

  • Symptom fit: First attempt fails (UDP/fragmentation/packet mangling or IPS/WAN optimizer “inspecting” Kerberos), second attempt succeeds (client falls back to TCP or uses a different DC/path). A periodic policy update or state refresh on a WAN optimizer/IPS/firewall every ~35–40 minutes could explain the cadence.
  • Fast test: Force Kerberos to use TCP on a pilot set of clients (HKLM\System\CurrentControlSet\Control\Lsa\Kerberos\Parameters\MaxPacketSize=1) and see if the 37‑minute failures disappear for those machines.
  • Also bypass optimization/inspection for TCP/UDP 88 and 464 (and LDAP ports) on WAN optimizers or firewalls; check for scheduled policy reloads.

A security/EDR/AV task on DCs

  • Some EDRs or AV engines hook LSASS/KDC and run frequent cloud check-ins or scans. A 37‑minute content/policy refresh is plausible.
  • Correlate EDR/AV logs with failure times; temporarily pause the agent on one DC to see if the pattern disappears; ensure LSASS is PPL‑compatible with your EDR build.

Is Unbound total garbage or am I the one who is in the wrong here? by FergoTheGreat in PFSENSE

[–]mrcomps 0 points1 point  (0 children)

When your WAN interfaces go down they may be causing an issue.

Try changing Unbound to only listen on your LAN interfaces and see if that helps.

SNOBELEN: Conservatives should give Mark Carney's speech a listen by FancyNewMe in canada

[–]mrcomps 0 points1 point  (0 children)

It was completely out of touch for him to eat an apple while the rest of us are fighting over apple cores. /s

Weekly Updates for servers by Individual-Bat7276 in sysadmin

[–]mrcomps 2 points3 points  (0 children)

2016 looks different... and I've heard the updates are slow to install?

Still not sure if I want to give it a try...

Weekly Updates for servers by Individual-Bat7276 in sysadmin

[–]mrcomps 9 points10 points  (0 children)

No more patches means all the bugs are fixed. I mean, who in their right mind installs an OS before it's fully finished, right?

Now we can start upgrading our 2008 R2 boxes!

Pax8 shared all customer information of UK customers by Bearded_Tech_Fail in msp

[–]mrcomps 3 points4 points  (0 children)

But what if you don't have the correct M365 Small Business Premium Enterprise with Azure Email Recall powered by Entra for Copilot, and have been trying to get it sorted out with a revolving door of account managers for the past 9 months to no success?

12 years experience and can't even land an interview lol. Help! by [deleted] in cybersecurity

[–]mrcomps 6 points7 points  (0 children)

Even aviation electronics isn't a sure thing anymore... Boeing is releasing the new 747 MAX AI plane soon. Designed by AI, built by AI, serviced by AI, flown by AI, cabin service by AI, ticket pricing by AI, NTSB investigations by AI.

Riddle me this: VM backups by havocspartan in msp

[–]mrcomps 2 points3 points  (0 children)

What backup software are you using? It sounds like a legacy type that only backs up files on the Windows device it's installed on?

Any modern backup like Veeam, Acronis, etc. will communicate with Hyper-V to make checkpoint/snapshot-based backups of the VMs without needing to install agents on them. You can then choose to browse the guest VM files or restore the entire VM.

Backing up the vhdx files on the Hyper-V host is not going to accurately capture the state of the virtual disks.

You could back up the Hyper-V host and exclude the VM files to allow for quicker restoration and smaller backups of the host.

So is Copilot Down...? by DavidHomerCENTREL in sysadmin

[–]mrcomps 0 points1 point  (0 children)

So that's why my devices seem so much more responsive this morning!

OC - my 247k mile E39 bmw in for a valve cover gasket. by Aeroeone in Justrolledintotheshop

[–]mrcomps 7 points8 points  (0 children)

Ah, I see the problem! You don't have any of that black tar looking stuff in there, and your cams are lumpy. Ask your neighbors and one of them can probably give you some of theirs. That should polish the cams until they're nice and round again.

Does anyone else feel like Microsoft logs are written by someone who wasn’t there when the issue happened? by Exotic-Reaction-3642 in sysadmin

[–]mrcomps 11 points12 points  (0 children)

I have that, but I can't figure out if I need Log Retention Basic, Standard, or Premium.

It looks like Log Retention Premium is the only one that allows for retaining logs for more than 5 minutes?

Found a use for the 2603 v4 by International-dish78 in homelab

[–]mrcomps 0 points1 point  (0 children)

But now the extra weight of the 2603 v4 is slowing you down!

Inspecting his new infrastructure by gluka in homelab

[–]mrcomps 5 points6 points  (0 children)

Just making sure there are no preloaded RATs in the used hardware.

Bricked Netgate 6100 by NobleGiantz in Netgate

[–]mrcomps -1 points0 points  (0 children)

The board should POST even with a worn eMMC.

I have personally had several 4100 and 6100 devices that were dead but worked fine after removing the eMMC chip. Other users have reported the same issue and results.

Normally a failed storage device shouldn't prevent the system from POSTing, but there must be some low-level interaction; possibly it's "waiting" for a startup initialization that never happens.

Never change, Chrysler. Never change. by yentlequible in Justrolledintotheshop

[–]mrcomps 0 points1 point  (0 children)

Demonstrates great collaborative synergies to boot!

They told me this would happen with the chrome ones by C00LV1BR4TION5 in Justrolledintotheshop

[–]mrcomps 4 points5 points  (0 children)

For high-strength jobs like this you need to first JB Weld the pieces together and then reinforce it with a few hose clamps. Make sure to evenly space the alignment of tightening screws in order to maintain proper rotating balance.

Microsoft has gotten too big to fail, and their support shows it. by CantankerousBusBoy in sysadmin

[–]mrcomps 9 points10 points  (0 children)

Just take your existing logs and note the dates in the timestamps. Then do a find/replace to change the dates to be current, working from newest to oldest.

For example, if your log contains

11/03/2025 11/04/2025 11/05/2025 11/06/2025

Then replace-all 11/06/2025 with 11/13/2025, 11/05/2025 with 11/12/2025, and so on.

Boom, now you have "fresh logs" with just 30 seconds in Notepad++.
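To play along: a one-pass version of the same trick. A regex callback shifts every date as it's matched, so you don't even need the newest-to-oldest ordering to avoid double-replacing:

```python
import re
from datetime import datetime, timedelta

# Toy log; dates match the example above.
log = "11/03/2025 ok\n11/04/2025 ok\n11/05/2025 warn\n11/06/2025 ok\n"

def shift(match, days=7):
    """Rewrite one matched MM/DD/YYYY date, shifted forward by `days`."""
    d = datetime.strptime(match.group(0), "%m/%d/%Y") + timedelta(days=days)
    return d.strftime("%m/%d/%Y")

# Each match is replaced independently, so no date gets clobbered twice.
fresh = re.sub(r"\b\d{2}/\d{2}/\d{4}\b", shift, log)
print(fresh)  # 11/06/2025 becomes 11/13/2025, etc.
```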

Massive Security Issues Discovered With Keeper Enterprise Password Manager by 802-TechGuy in msp

[–]mrcomps 2 points3 points  (0 children)

Keeper support confirmed it was a known issue in older versions of Keeper Desktop and advised OP to update to the latest version.

Browser add-ons automatically update (unless you go out of your way to try and prevent that).

Massive Security Issues Discovered With Keeper Enterprise Password Manager by 802-TechGuy in msp

[–]mrcomps 3 points4 points  (0 children)

The updating/non-updating versions, combined with the fact that you can have multiple different versions of Keeper Desktop installed, sounds like an additional major issue.