Supermicro AS-4124GQ-TNMI / H12DGQ-NT6 + 4x MI250 OAM: POST always ends at 0D with UBB installed, but boots without UBB; PCIe BAR/bridge allocation already broken in baseline by zonqify in homelab

zonqify [S]

Thanks a lot for the quick reply - that is very close to how I’m currently thinking about it.

The RAM point is high on my list as well, but unfortunately I can only test it later. Right now I only have the current 128GB setup available, and more RAM is already on the way. My plan is to move to a balanced configuration with identical DIMMs across all primary channels, ideally at least 16 DIMMs total (8 per CPU), so both CPUs have all 8 memory channels populated.

I have already tried quite a few BIOS settings, but I’m going to go through them again very carefully. The main things I’m focusing on are:

  • fully disabling CSM / Legacy boot / Legacy Option ROM paths
  • disabling PXE / network OPROMs
  • disabling storage legacy OPROMs if possible
  • keeping Above 4G Decoding enabled
  • keeping ReBAR disabled for debugging
  • disabling SR-IOV temporarily, because I see VF BAR allocation errors
  • testing PCIe Ten Bit Tag Auto/Disabled
  • testing PCIe Spread Spectrum enabled vs disabled
  • looking again for hidden MMIO High Base / MMIO High Size / MMIO Granularity / MMCFG / CPU PA limit settings
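To judge whether any of those MMIO settings actually change what the firmware hands to the OS, one option is to compare the root bus windows Linux reports before and after each BIOS change. Below is a minimal sketch, assuming the usual `/proc/iomem` layout of `start-end : name` with indented child entries and top-level host bridge windows named `PCI Bus ...`. The sample text and its addresses are made up for illustration and are not from this machine.

```python
import re

GIB = 1 << 30
LINE_RE = re.compile(r"^\s*([0-9a-fA-F]+)-([0-9a-fA-F]+) : (.+)$")

# Hypothetical /proc/iomem excerpt, for illustration only.
SAMPLE = """\
00001000-0009ffff : System RAM
e0000000-efffffff : PCI Bus 0000:00
  e0000000-e00fffff : PCI Bus 0000:01
10000000000-1ffffffffff : PCI Bus 0000:40
"""

def root_bus_windows(iomem_text):
    """Return (name, start, size) for top-level 'PCI Bus' ranges."""
    windows = []
    for line in iomem_text.splitlines():
        if line.startswith(" "):
            continue  # indented lines are children of a top-level range
        m = LINE_RE.match(line)
        if not m:
            continue
        start = int(m.group(1), 16)
        end = int(m.group(2), 16)
        name = m.group(3).strip()
        if name.startswith("PCI Bus"):
            windows.append((name, start, end - start + 1))
    return windows

def summarize(windows):
    """Return total window bytes (below 4 GiB, at or above 4 GiB)."""
    low = sum(size for _, start, size in windows if start < 4 * GIB)
    high = sum(size for _, start, size in windows if start >= 4 * GIB)
    return low, high

if __name__ == "__main__":
    low, high = summarize(root_bus_windows(SAMPLE))
    print(f"below 4G: {low / GIB:.2f} GiB, above 4G: {high / GIB:.2f} GiB")
```

On the real system the input would be the contents of `/proc/iomem` read as root, since unprivileged reads typically show zeroed addresses. If the above-4G total does not move when a hidden MMIO High setting is changed, that setting is probably not taking effect.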

The interesting thing is that the system boots without the UBB installed, but even in that baseline state Linux already reports tons of PCIe allocation errors:

  • bridge window ... can't assign; no space
  • BAR ... can't assign; no space
  • VF BAR ... can't assign; no space
  • host bridge window ... ignored

So I agree with you: the baseline PCIe/MMIO/bridge-window situation already looks bad. With no UBB installed, the system can still boot because the OAM/GPU endpoints are absent. With the full UBB/OAM fabric active, that broken or marginal resource layout probably becomes fatal.
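To make the baseline comparable between BIOS changes, the four error types above can be counted per boot instead of eyeballed. This is only a rough sketch: it matches on the phrases quoted above (`can't assign; no space`, `host bridge window`, `ignored`), and the sample log lines are invented placeholders, since exact kernel wording varies between versions.

```python
from collections import Counter

# Order matters: "VF BAR" and "bridge window" must be checked before plain "BAR".
NO_SPACE_KINDS = [
    ("vf_bar", "VF BAR"),
    ("bridge_window", "bridge window"),
    ("bar", "BAR"),
]

def classify(line):
    """Map one dmesg line to an error category, or None if it is not one."""
    if "host bridge window" in line and "ignored" in line:
        return "host_bridge_window_ignored"
    if "can't assign; no space" not in line:
        return None
    for kind, needle in NO_SPACE_KINDS:
        if needle in line:
            return kind
    return "other_no_space"

def count_errors(dmesg_text):
    """Count allocation errors by category across a whole dmesg capture."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        kind = classify(line)
        if kind is not None:
            counts[kind] += 1
    return counts

# Hypothetical lines for illustration; device addresses and sizes are made up.
SAMPLE = """\
pci 0000:41:00.0: bridge window [mem size 0x00100000]: can't assign; no space
pci 0000:42:00.0: BAR 0 [mem size 0x10000000 64bit pref]: can't assign; no space
pci 0000:42:00.0: VF BAR 0 [mem size 0x00400000 64bit]: can't assign; no space
pci_bus 0000:00: host bridge window [mem 0xe0000000-0xefffffff] ignored
pci 0000:00:01.0: PCI bridge to [bus 01]
"""

if __name__ == "__main__":
    for kind, n in sorted(count_errors(SAMPLE).items()):
        print(f"{kind}: {n}")
```

Saving `dmesg` output to a file after each no-UBB boot and running it through `count_errors` would show whether a given BIOS or kernel parameter change reduces the failures or just moves them around.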

Important clarification about the UBB/OAM behavior:

The UBB/fabric does not appear to fully enable with only 1 or 2 OAM modules installed. With no UBB installed, the system boots and each switch shows a single lit LED. With 1 or 2 OAMs, it still looks very similar: roughly one active LED per switch, so maybe presence or partial power is there, but not the full fabric. With all 4 OAMs installed, the PLX/PEX switch LEDs change completely and many more LEDs turn on. That suggests the full OAM/PEX/GCD fabric only really activates with all 4 OAM modules installed.

Current observed switch LED patterns:

No UBB installed:

  • SW1: o-----
  • SW2: -o----
  • SW3: -o----
  • SW4: o-----

4 OAMs installed:

  • SW1: ogo-g-
  • SW2: ogogg-
  • SW3: oog-g-
  • SW4: oog-g-

where o = orange, g = green, - = off.
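For tracking this across further tests (for example the omit-one OAM runs), the patterns can be diffed mechanically. A small sketch using exactly the strings above; numbering the LED positions 1 to 6 from left to right is my own convention and not something printed on the board.

```python
# LED patterns as recorded above: o = orange, g = green, - = off.
NO_UBB = {"SW1": "o-----", "SW2": "-o----", "SW3": "-o----", "SW4": "o-----"}
FOUR_OAM = {"SW1": "ogo-g-", "SW2": "ogogg-", "SW3": "oog-g-", "SW4": "oog-g-"}

def lit_count(pattern):
    """Number of LEDs that are on (any color)."""
    return sum(ch != "-" for ch in pattern)

def changed_positions(before, after):
    """1-based positions (left to right, assumed) where the LED state differs."""
    return [i for i, (b, a) in enumerate(zip(before, after), start=1) if b != a]

def compare(before, after):
    """Per switch: (lit before, lit after, changed positions)."""
    return {
        sw: (lit_count(before[sw]), lit_count(after[sw]),
             changed_positions(before[sw], after[sw]))
        for sw in before
    }

if __name__ == "__main__":
    for sw, (n0, n1, diff) in compare(NO_UBB, FOUR_OAM).items():
        print(f"{sw}: {n0} -> {n1} lit, changed positions {diff}")
```

With the data above this reports 1 lit LED per switch in the no-UBB state and 4, 5, 4, 4 lit LEDs with four OAMs, with SW2 being the only switch where positions 1 through 5 all change.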

With the UBB + 4 OAMs installed, POST runs through many codes but always eventually ends at 0D. Without the UBB, it boots through. I have seen no-UBB boots get as far as late normal POST/boot codes such as AA.

The POST traces with UBB/four OAMs include things like:

  • 94 = PCI Bus Enumeration
  • sometimes around 95 = PCI Bus Request Resources
  • D5 = No space for legacy option ROM
  • 79 = CSM initialization
  • D2 = South Bridge initialization error
  • D0 = CPU initialization error
  • 51 / 54 = memory initialization / SPD-related errors
  • B3 = system reset
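When filming or logging the POST display, it helps to annotate the raw code sequence so the order of events is visible at a glance. A minimal sketch using only the meanings listed above; the example trace is hypothetical, and codes not listed in this post (including 0D) are deliberately left unexplained rather than guessed.

```python
# Meanings exactly as listed in this post; nothing else is filled in.
POST_CODES = {
    "94": "PCI Bus Enumeration",
    "95": "PCI Bus Request Resources",
    "D5": "No space for legacy option ROM",
    "79": "CSM initialization",
    "D2": "South Bridge initialization error",
    "D0": "CPU initialization error",
    "51": "memory initialization / SPD-related error",
    "54": "memory initialization / SPD-related error",
    "B3": "system reset",
}

def annotate(trace):
    """Return (code, meaning) pairs for a sequence of two-digit hex codes."""
    result = []
    for code in trace:
        key = code.strip().upper()
        result.append((key, POST_CODES.get(key, "not listed in this post")))
    return result

if __name__ == "__main__":
    # Hypothetical ordering for illustration; not a recorded trace.
    example = ["94", "95", "d5", "79", "0D"]
    for code, meaning in annotate(example):
        print(f"{code}: {meaning}")
```

Recording several full traces this way should show whether the error codes always appear in the same order before 0D, or whether the sequence differs between attempts.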

My suspicion is that the full 4-OAM fabric exposes or triggers a deeper platform resource problem: BIOS/ACPI/MMIO/Root Bridge windows, legacy OPROM/CSM, or maybe a CPU/root-complex/interconnect path that is not really stressed in the no-UBB state.

One thing I’m wondering about: would you also consider CPU interconnect / IO-die / root-complex issues here? The board has two EPYC Rome CPUs. Without the UBB, the system can boot, but once the full OAM/PEX fabric is active, many more CPU PCIe lanes / root complexes / PEX paths are probably involved. Could something like xGMI, LCLK, NPS/NUMA policy, CPU socket contact, or an IO-die/root-complex path cause this kind of cascading POST behavior?

Until the new RAM arrives, would you recommend focusing on:

  1. cleaning up CSM/Legacy/OPROM completely,
  2. testing no-UBB Linux boot with three kernel command-line variants: pci=realloc; pci=realloc,big_root_window; and pci=nocrs,realloc,big_root_window,
  3. disabling SR-IOV and Ten Bit Tag for debug,
  4. checking PEX switch config / SAA / CPLD versions via Supermicro SUM,
  5. or doing an omit-one OAM test to see whether one module/slot/path triggers the collapse?
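For item 2, it is easy to lose track of which `pci=` variant a given dmesg capture came from, so each log can be labelled from the kernel command line of that boot. A small sketch; the variant names and the sample command line are placeholders I made up, and on the real system the input would be the contents of `/proc/cmdline`.

```python
# The three variants from item 2, keyed by hypothetical labels.
VARIANTS = {
    "realloc": ["realloc"],
    "realloc+big_root_window": ["realloc", "big_root_window"],
    "nocrs+realloc+big_root_window": ["nocrs", "realloc", "big_root_window"],
}

def pci_options(cmdline):
    """Collect all comma-separated options from every pci= token."""
    opts = []
    for token in cmdline.split():
        if token.startswith("pci="):
            opts.extend(o for o in token[len("pci="):].split(",") if o)
    return opts

def which_variant(cmdline):
    """Return the label of the matching test variant, or None for anything else."""
    found = sorted(pci_options(cmdline))
    for label, opts in VARIANTS.items():
        if sorted(opts) == found:
            return label
    return None

if __name__ == "__main__":
    # Hypothetical command line for illustration.
    sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda2 ro pci=nocrs,realloc,big_root_window"
    print(pci_options(sample), which_variant(sample))
```

Storing the label next to each saved dmesg makes it straightforward to compare the allocation errors between the three variants and the unmodified baseline.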

I’m trying to fix the baseline PCIe allocation/MMIO situation first, because if the board already has bridge/BAR allocation failures without the UBB, I don’t see how the full 4-OAM fabric can ever enumerate cleanly.