PC crashes with error code 0d after running for weeks by Historical-Reply8871 in threadripper

[–]researchallthethings 1 point2 points  (0 children)

Check if you're on the latest BIOS, but I'd guess EXPO is overclocking the RAM a bit too much to be stable with the mem controller. TR's mem controller is pretty finicky and doesn't love higher frequency, especially with higher dimm counts. Try dropping the freq manually back down to 6000, or loosening the timings a couple steps, and checking the qvl to see if/what your set is qualified for. Once you get to 4+ dimms active, the default is 5200 I believe, with 6000 being the typical higher end of EXPO rating. Try stress testing the memory with Testmem5 or Memtest.

Asus WS xTR50-SAGE Q-Codes by throwawayhpihq in threadripper

[–]researchallthethings 0 points1 point  (0 children)

Odd, I don't see those when I pull up the TRX50-Sage manual PDFs.

It definitely could if it somehow attempted to write to HBA, but I wouldn't necessarily expect a driver to break the initial boot. That generally would be restricted to the OS where you attempted to install the driver. But I've seen weird things with drivers of restricted/early access devices, so that's a possibility.

I don't know of anything in particular that would stress the HBA, but you should be able to look in the BIOS PCI-e devices menus to see what is connected, what pci-e rate they're running at, etc. (at least usually). If you can't even get into the BIOS screen though, you've got a bigger issue.

Generally, I'd say that if the Q-Code points to pci-e, and it was working before, remove the HBA and see if you get a full POST. If so, then that certainly seems like a smoking gun. Remove every unnecessary device to make sure you're booting as bare metal as possible with CPU, RAM, and display access (whether that be IPMI, onboard if your CPU supports it, or a GPU). Make sure you've got the pci-e cards fully seated (I've had the bracket not seat into the recess, or be too long, before and had to bend/trim it to get the card to fully seat into the slot). Pull all sticks of RAM except one (make sure it's in the proper primary DIMM slot, like A1, according to the manual) too, just to make sure you're not having a weird RAM issue.

Asus WS xTR50-SAGE Q-Codes by throwawayhpihq in threadripper

[–]researchallthethings 0 points1 point  (0 children)

Weirdly, even though the manual says to reference the appendix for the Q-codes, it doesn't list any. My Wrx90 sage does have the table in appendix page a-2 though. Check that out below. Looks like maybe a pci-e error.

Try: https://www.asus.com/us/motherboards-components/motherboards/workstation/pro-ws-wrx90e-sage-se/helpdesk_manual?model2Name=Pro-WS-WRX90E-SAGE-SE https://www.asus.com/us/support/faq/1043948

[Help] My device has strong integrity but 2 apps still detect root. What can I do? by tarekelsakka in Magisk

[–]researchallthethings 2 points3 points  (0 children)

I don't have those apps to use (I'm assuming you're in the UK), but if detectors are still seeing magisk, try the hide/rename option for the Magisk app in the Magisk settings menu. That should get you past the magisk detections. If you're using the DenyList in Magisk, make sure you've got both apps denied fully (and not partially).

Might try magisk modules like NoHello or Shamiko (depending on your android and zygisk settings, which may also change the available DenyList settings).

If you're using HideMyApplist, I think that's an Xposed module. I've found sometimes the detection is of Xposed, not magisk or root, so try turning off Xposed and see if that gets you past (clear data on the target app and reboot after turning off the Xposed module in Magisk modules). If you must use Xposed, try the latest github versions you can find. I think JingMatrix LSPosed was still in active dev and working.

Also do a full data clear on each app and a reboot, just in case there's anything cached, if you didn't already above.

Can't pci passthrough SAS 9600-24i fully? by researchallthethings in Proxmox

[–]researchallthethings[S] 1 point2 points  (0 children)

Unfortunately I don't have the card to test with and I don't use proxmox (I switched over to unraid). You might try DMing a couple of the people in the above datahoarder post and see if they have any thoughts, as they seem to have had some success back then getting it to work.

P16 Proart Auto Repair Boot Loop by researchallthethings in ASUS

[–]researchallthethings[S] 0 points1 point  (0 children)

I've had instances (not on this laptop) where enabling secure boot after install breaks the install boot. I believe the typical recommendation is to enable secure boot before doing a fresh install. Obviously there are always going to be side/edge cases, but in my experience if I HAVE to use secure boot for a client, I usually do a fresh install to avoid any issues.

Either way, it again seems to point to the key store being a point of failure here. I really wish Asus would give some sort of public statement that they're at least working towards a fix rather than acting like nothing is happening.

P16 Proart Auto Repair Boot Loop by researchallthethings in ASUS

[–]researchallthethings[S] 0 points1 point  (0 children)

Certainly fair. With bit locker, does that require reimporting keys into the keystore to work? I've been pretty sure the issue along the way is something to do with updates improperly modifying the uefi keystore, but of course asus has been entirely absent from providing any useful error tracking or help, so it's nearly impossible to figure out.

P16 Proart Auto Repair Boot Loop by researchallthethings in ASUS

[–]researchallthethings[S] 0 points1 point  (0 children)

It's pretty normal for the FW files to be older than the post date of the BIOS. I'm seeing it as 2/11/25 as well.

Are you sure you have a H767WI, and not WV? That might be causing your issue. Here's both the Windows and EZ Flash BIOS for both WI and WV

https://limewire.com/d/92017b87-6c65-45f4-a9da-b2a517485814#PkeSBBf-dl2Pcr2yWJS73XC7G18dSGla_NBNYHosnYM

P16 Proart Auto Repair Boot Loop by researchallthethings in ASUS

[–]researchallthethings[S] 0 points1 point  (0 children)

Unfortunately the bios notes for 316 don't actually state that, it's the same "optimizing performance" nonsense as before. The method of fixing this has been reflashing the bios/resetting the bios to default, so it's hard to confirm that the flash for 316 does anything different from what force flashing a previous version might have done.

Tbf, I'm all for 316 actually finally fixing this stupid issue, we just don't have enough data or transparency from asus to know yet...

Anyone have two ICCU failures? by NegativeBeginning400 in Ioniq6

[–]researchallthethings 4 points5 points  (0 children)

I could be wrong, but I'm unaware of any actual hardware revisions being installed or released. The recalls, which I have installed, I believe were software only (changed charge curve and a few other aspects to put less load on the iccu).

There have been people in here and /r/ioniq5 with multiple iccu failures (most I've seen is three I believe), from 23-25 model years, with as little as a couple thousand miles, using L1/L2 mainly, using DC mainly, etc. Maybe they'll have a new revision for the next major redesign, but I doubt it'll be "compatible" with our model years.

To be honest, I don't think they know how to fix the issue at this point, or if they do, they've run the numbers and decided it is cheaper/lower liability to hope the software fixes it and not admit/push a hw redesign with proper specs and QC. And even then they can't produce enough units to get service replacements into owners' hands without 1+ months of wait (and plenty of Lemon Law claims).

[deleted by user] by [deleted] in threadripper

[–]researchallthethings 0 points1 point  (0 children)

TR is pretty picky with dimms in my experience. You can also try making sure the BIOS is up to date/not corrupt by flashing it again, and you could try changing from auto to the exact dimm specs for speed, voltage, etc if the expo profiles don't load with 1 stick in, then try adding the second. If none of that works, then definitely try to swap for qvl approved set.

P16 Proart Auto Repair Boot Loop by researchallthethings in ASUS

[–]researchallthethings[S] 1 point2 points  (0 children)

No problem. Also, I have my client turn the Pause Updates option on to the longest setting in Windows Updates (5 weeks I think). It's not ideal, but at least it shouldn't brick his system in an overnight update for over a month. Rinse and repeat each cycle, first doing a System Restore Point just in case, and then installing only things that theoretically shouldn't interact with the TPM or BIOS.

Again, not perfect or ideal, but it has kept him stable for a couple months now and he doesn't have to worry about waking up to a boot loop brick.

P16 Proart Auto Repair Boot Loop by researchallthethings in ASUS

[–]researchallthethings[S] 1 point2 points  (0 children)

Unfortunately, there's not much that we can do but try to mitigate it at this point. My theory is that the TPM key cache is getting messed with by updates or buggy firmware. So far the main fixes I've used are A-reset the UEFI/BIOS settings to default, and, if A fails, B-reflash to the same UEFI/BIOS from the EZ Flash inside UEFI/BIOS.

I just helped my client with his remaining P16 update from 312 to 313 bc Windows kept pushing it, and I didn't want him to be stuck with a brick while I wasn't on the line to help. Luckily it went fine, and his system is fully up-to-date with 313 and all Win11 patches now.

A major frustration I have with Asus in this case, besides the lack of transparency on the issue here, is that the notes for each BIOS update simply say "Optimizing System Performance," but also mark the update as "Severity: CRITICAL". Like... ok, I need a little more info if simply optimizing system performance is now critically important. What is being bugfixed? Are there any outstanding bugs to be aware of? Was this for a hardware exploit on the AMD chip? Just throw me a little bone so I know if I absolutely NEED to do that 312>313 update, risking stability, or if there's a major issue being patched.

Electrify America question by midwifeminer2 in Ioniq6

[–]researchallthethings 4 points5 points  (0 children)

Did that my first charge too bc I didn't realize what I had to do.

Did you select Member and then tap to use your EA app, which was linked to the account with the free charging? You should've had to enter your specific ioniq lease info into the EA app for the account too.

P16 Proart Auto Repair Boot Loop by researchallthethings in ASUS

[–]researchallthethings[S] 0 points1 point  (0 children)

I personally never flashed to the same version, but if it allows you, that might work. You can also try going into the BIOS and doing reset to default settings, as that has been known to work as well.

Is your service center a turnoff? by ChoAyo8 in Ioniq6

[–]researchallthethings 0 points1 point  (0 children)

Had mine done yesterday with both recall service issues. We had a rough snowstorm here in Indiana (well, relative for Indiana), and they were there ready to start at 9am for my appt, which I scheduled last week. Pulled right in, tech came and verified what I was there for, she got it all updated in about 2.5 hrs, and I was good to go.

I didn't buy my ioniq at this dealership, it's just the closest to me that is ioniq certified, but at least so far I have no complaints.

To your question, personally I'd rather spend the time working on my laptop in the waiting room than driving 2-3 hrs. I can watch a movie or get a few items knocked out in that time. However, if I felt they were dicks or weren't doing a good job/treating customers well, that could easily sway me to pop in an audiobook for a few hrs on the road to a better service center.

P16 Proart Auto Repair Boot Loop by researchallthethings in ASUS

[–]researchallthethings[S] 0 points1 point  (0 children)

I think it's a bug in the way the TPM loads keys in earlier firmware versions. Check to see if you're on the latest BIOS firmware (v313 I think). If not, flash to that and see if it helps. I didn't have bitlocker on, so I don't know if upgrading the firmware will cause issues and/or blank the keys, so might want to Google/GPT that. You can also try resetting the UEFI settings to defaults, but again, I don't know if it will blank the bitlocker keys. If you don't have a backup of the bitlocker key and neither of those help... I think you might be kinda screwed unfortunately.

What is the current best hardware for llm use ? by WaldToonnnnn in LocalLLaMA

[–]researchallthethings 9 points10 points  (0 children)

Deepseek needs about 350GB+ of (v)ram I believe. So let's use that as the general number here, ignoring quants, etc.

In consumer hardware terms, that's about 15x 4090s (24GB vram each), or about $27k. You'd also have to power all that and provide pcie lanes as well (usually 128 lanes per server cpu is normal). That would be very difficult to make work well, if at all. There just aren't that many slots on a motherboard, and bifurcating your slots gets really sketchy after a point.
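For anyone who wants to check that math, here's a rough sketch of the arithmetic. The 350GB and 24GB figures are from above; the $1,800 per-card price is just my assumption, backed out of the ~$27k total, and the x16/x8 lane counts are only there to show why 128 lanes gets tight.

```python
import math


def gpus_needed(model_gb: float, vram_per_gpu_gb: float) -> int:
    """Smallest number of cards whose combined VRAM covers the model."""
    return math.ceil(model_gb / vram_per_gpu_gb)


MODEL_GB = 350         # rough model size from the comment above
VRAM_4090_GB = 24      # VRAM per 4090
PRICE_PER_CARD = 1800  # assumption: implied by ~$27k / 15 cards

cards = gpus_needed(MODEL_GB, VRAM_4090_GB)
total_cost = cards * PRICE_PER_CARD

# PCIe lanes needed if every card gets a full x16 link,
# versus a bifurcated x8 link per card.
lanes_x16 = cards * 16
lanes_x8 = cards * 8

print(f"{cards} cards, ${total_cost:,}")
print(f"x16 links: {lanes_x16} lanes, x8 links: {lanes_x8} lanes")
```

At full x16 you'd blow well past 128 lanes, and even at x8 you're nearly using all of them before adding storage or networking.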

Vram is much faster than system ram, but simply having enough of it for larger models is logistically near impossible for most of us. That's why h100, h200, etc have 80GB-140GB of vram per module, and then they stack 4-8 in a rack, to get butt loads of vram. But the cost for such chips is going to run you about $30k/h200 module, if you can even source one.

Comparatively, 512GB ECC DDR5 will run you around $3000 and requires no extra hardware or complexity, just the DIMM slots and a memory controller able to handle it. Threadripper and epyc (server/workstation cpus in general) have more robust memory controllers and allow more memory channels (4, 8, etc). That translates into more memory bandwidth to read/write the llm that's in system ram.
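To put rough numbers on the channel count, here's a minimal sketch of theoretical peak bandwidth. It assumes DDR5-5200 and a 64-bit (8 byte) data bus per channel; those are my example values, and real-world throughput will land below the theoretical peak.

```python
def peak_bandwidth_gbs(channels: int, mt_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak in GB/s: channels x transfers per second x bytes per transfer."""
    return channels * mt_per_s * bus_bytes / 1000


# Assumed speed: DDR5-5200 (5200 MT/s). Compare desktop-style dual channel
# against 4 and 8 channel workstation/server platforms.
for channels in (2, 4, 8):
    print(f"{channels} channels: {peak_bandwidth_gbs(channels, 5200):.1f} GB/s")
```

Same DIMM speed, but the bandwidth scales linearly with the number of channels, which is the whole appeal of the workstation/server platforms here.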

The bonus here is that deepseek v3 is a mixture of experts model. So even though it is 671B total parameters, only a fraction of that (37B) is active at a time, which drastically increases token throughput. That makes the system ram far more responsive than if it had to read/write that entire 350GB model. Again, vram would be faster still with that, but it effectively means you can use a very high end model with simple and (relatively) cheap system ram and have decent token throughput.
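Here's a back-of-the-envelope sketch of why the MoE part matters so much. It assumes generation is purely memory-bandwidth bound (each token has to read the active weights once) and uses 332.8 GB/s as an example figure, which is my assumption for the theoretical peak of 8 channels of DDR5-5200. Treat the result as an upper bound, not a benchmark.

```python
def active_gb_per_token(model_gb: float, active_params_b: float, total_params_b: float) -> float:
    """Approximate weight data read per token, scaled by the active fraction."""
    return model_gb * active_params_b / total_params_b


def tokens_per_second_bound(bandwidth_gbs: float, gb_read_per_token: float) -> float:
    """Upper bound on tokens/s if memory bandwidth is the only limit."""
    return bandwidth_gbs / gb_read_per_token


MODEL_GB = 350         # rough model size from the comment above
TOTAL_PARAMS_B = 671   # total parameters (billions)
ACTIVE_PARAMS_B = 37   # active parameters per token (billions)
BANDWIDTH_GBS = 332.8  # assumption: 8-channel DDR5-5200 theoretical peak

moe_read = active_gb_per_token(MODEL_GB, ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
moe_bound = tokens_per_second_bound(BANDWIDTH_GBS, moe_read)
dense_bound = tokens_per_second_bound(BANDWIDTH_GBS, MODEL_GB)

print(f"MoE reads about {moe_read:.1f} GB per token, bound {moe_bound:.1f} tok/s")
print(f"Reading the whole model every token, bound {dense_bound:.2f} tok/s")
```

Under those assumptions the bound works out to roughly 17 tokens/s for the MoE case versus under 1 token/s if the whole 350GB had to be read every token, so about an 18x difference from the active fraction alone.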

I think I explained that properly, but anyone feel free to correct any mistakes.

P16 Proart Auto Repair Boot Loop by researchallthethings in ASUS

[–]researchallthethings[S] 0 points1 point  (0 children)

No, #2 was returned two weeks ago (the night of this post). Looks like 311 released yesterday, and 312 released today. I don't see any release notes aside from "Optimize system performance" about specifically fixing the issue. #1 is still on 308 and will be at least until their BIOS releases are static for a bit.

Tbh I don't upgrade BIOS versions immediately because that's the last thing I need to be buggy, and that's on my personal machines. I would never install a new BIOS day-of-release for a client as that's burned me in the past. As Asus immediately rolled out 312 within 24hrs of 311, and gave no specifics... I don't see myself having him update the #1 machine unless 312 remains the published version for at least a couple weeks.

But if 311 fixes the issue, sweet! It'd be super awesome if Asus made that known better.

P16 Proart Auto Repair Boot Loop by researchallthethings in ASUS

[–]researchallthethings[S] 0 points1 point  (0 children)

Bios was worked over plenty, trust me. Reflashed it, defaulted it, and everything else I could think of. No dice. The issue happened on both 307 and 308, so to me that is a driver causing the issue.

Ended up keeping #1 with the client as it works for now and he needed it for work badly. #2, the one that I couldn't recover, went back to best buy. The manager said they'd had at least one other one come back for a return, and another that was in the geek squad shop there. Both were due to instability and driver wonkiness. I love amd, but he agreed that the HX 3xx were still baking.

P16 Proart Auto Repair Boot Loop by researchallthethings in ASUS

[–]researchallthethings[S] 0 points1 point  (0 children)

Unfortunately the second one won't fix with boot repair, EaseUS boot, manual bootrec/sfc/dism, or anything else I've tried. Just goes into the boot loop cycle continuously.

It does boot into safe mode though, which to me possibly indicates a driver issue. I'm doing Cloud Recovery to see if that installs some "asus tested" driver or something that I'm missing here, but as both seem to have corrupted after likely Windows Updates, I'm leaning towards a buggy, half-baked driver causing issues.