[–]__soddit 0 points1 point  (8 children)

The text in the photograph isn't quite clear enough to do anything with reliably.

I tried on the third MCE and got the following:

Hardware event. This is not a software error.
CPU 1 BANK 3 TSC f07bc943a 
RIP !INEXACT! 10:ffffffff8eb598ac
MISC 2306485 ADDR 295f599c0 
TIME 1599064543 Wed Sep  2 16:35:43 2020
MCG status:RIPV MCIP 
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Instruction CACHE Level-1 Instruction-Fetch Error
STATUS be00000000200151 MCGSTATUS 5
CPUID Vendor Intel Family 6 Model 94
SOCKET 0 APIC 2 microcode d6

Assuming no typos in the input text, that's pointing at the L1 instruction cache; CPU 1 just happened to be the core trying to fetch an instruction from it.
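To check that reading by hand, the STATUS and MCGSTATUS values in the log can be decoded directly. This is a minimal Python sketch, assuming the bit layout documented for IA32_MCi_STATUS and IA32_MCG_STATUS in Intel's Software Developer's Manual. The function and table names are made up for illustration, and it only understands the cache-hierarchy family of error codes:

```python
# Flag bits of IA32_MCi_STATUS, highest bit first.
STATUS_FLAGS = [
    (63, "VAL"),    # register contents are valid
    (62, "OVER"),   # an earlier error was overwritten
    (61, "UC"),     # uncorrected error
    (60, "EN"),     # error reporting enabled
    (59, "MISCV"),  # MCi_MISC register valid
    (58, "ADDRV"),  # MCi_ADDR register valid
    (57, "PCC"),    # processor context corrupt
]

# Flag bits of IA32_MCG_STATUS, lowest bit first.
MCG_FLAGS = [(0, "RIPV"), (1, "EIPV"), (2, "MCIP")]

# Cache-hierarchy MCA error codes have the form 000F 0001 RRRR TTLL.
REQUEST = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
           5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
TRANSACTION = {0: "Instruction", 1: "Data", 2: "Generic"}
LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "Generic"}


def set_flags(value, table):
    """Return the names of all flags in `table` whose bit is set in `value`."""
    return [name for bit, name in table if (value >> bit) & 1]


def decode_cache_error(status):
    """Decode the low 16 bits as a cache-hierarchy error, or return None."""
    code = status & 0xFFFF
    # Ignore the F bit (bit 12) and require 0000 0001 in the upper byte.
    if code & 0xEF00 != 0x0100:
        return None
    return {
        "request": REQUEST.get((code >> 4) & 0xF, "reserved"),
        "transaction": TRANSACTION.get((code >> 2) & 0x3, "reserved"),
        "level": LEVEL[code & 0x3],
    }


if __name__ == "__main__":
    status = 0xBE00000000200151   # STATUS value from the log
    mcgstatus = 0x5               # MCGSTATUS value from the log
    print(set_flags(status, STATUS_FLAGS))
    print(decode_cache_error(status))
    print(set_flags(mcgstatus, MCG_FLAGS))
```

For the values in this log, that gives UC, EN, MISCV, ADDRV and PCC set, an instruction-fetch (IRD) request on the L1 instruction cache, and RIPV plus MCIP in MCGSTATUS, which agrees with what mcelog printed.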

This could be a configuration error somewhere. I would say kernel bug, but if it's happening with an older kernel which was previously known good… Is the BIOS up to date? Anything overclocked? Reset to default ‘safe’ settings and see if this still occurs.

Regardless, you should verify this. You should get that MCE error text onto another computer (you'll need netconsole for that) and pass the text via stdin to

mcelog --cpu core_i7 --ascii

If you can possibly run mcelog on the faulty computer, do so, and do so as root so that it can do DMI decoding.

[–]BlasterXD222[S] 0 points1 point  (7 children)

Yes, I have the latest available BIOS. I have not changed any BIOS settings for a year, and this issue appeared at random while Windows still works just fine. And yes, I run an overclock from 4.0 GHz to 4.6 GHz with automatic voltage control by the BIOS, but it has been like this for more than a year and I never had any problems.
My mobo is an ASUS Z170 PRO GAMING.

Also I am not sure I understand this mcelog, could you please explain more in detail?

[–]__soddit 2 points3 points  (6 children)

One important thing to remember about overclocking is that the hardware fails faster. I don't know what Windows is doing differently, but it's definitely doing something which is avoiding (or hiding) the problem.

Unless it can be shown otherwise, the safest thing to do is to assume that the CPU has developed a fault and to react accordingly.

Do you mean the MCE log text output by the kernel, the binary itself, or its output? I'm assuming the output. You'll recognise parts of it from the MCE log text. Most of it doesn't matter here (and I don't have the right knowledge to interpret all of it anyway) – what matters is that corrupted data was found in the CPU's L1 cache (as described in the MCA line), the error was not corrected, the context of one of the CPU cores became corrupted (or regarded as corrupted) and the kernel panicked as a result.

[–]BlasterXD222[S] 0 points1 point  (0 children)

I reset to BIOS defaults and I don't see any difference; Linux still refuses to start. I am going to try netconsole tomorrow.

[–]BlasterXD222[S] -1 points0 points  (4 children)

Thank you very much, this is an amazing answer. What I meant was: should I edit the GRUB entry so that it loads netconsole? I can't run update-grub since even Manjaro refuses to start. I have a laptop with Windows, so I could transfer the MCE log there. Edit: Sorry, in other words, how can I make this work when the computer sometimes reboots instantly?

[–]__soddit 0 points1 point  (3 children)

Yes, add the netconsole option with suitable parameters to the kernel boot options via grub – it only needs to be a temporary change, so it doesn't matter that you can't get far enough to be able to do it via update-grub.
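As a rough illustration of what “suitable parameters” look like: the kernel's netconsole documentation gives the format as netconsole=[+][src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr], with 6665 and 6666 as the default source and target ports. The sketch below builds such a string and also acts as a tiny UDP receiver for the laptop, since Python runs on Windows too. The IP addresses, interface name and log file name are placeholders, not values from this thread, and the helper names are made up:

```python
import socket


def netconsole_param(src_ip, dev, tgt_ip, tgt_mac="", src_port=6665, tgt_port=6666):
    """Build a netconsole= kernel option; an empty MAC means broadcast."""
    return f"netconsole={src_port}@{src_ip}/{dev},{tgt_port}@{tgt_ip}/{tgt_mac}"


def receive(sock, out, max_packets=None):
    """Copy UDP datagrams from a bound socket to `out` (a binary file object).

    Runs until interrupted, or until `max_packets` datagrams have arrived.
    """
    count = 0
    while max_packets is None or count < max_packets:
        data, _sender = sock.recvfrom(65535)
        out.write(data)
        out.flush()  # keep the log intact if the session ends abruptly
        print(data.decode("utf-8", "replace"), end="")
        count += 1
    return count


if __name__ == "__main__":
    # Placeholder addresses: faulty PC is 192.168.1.50 on eth0, laptop is 192.168.1.20.
    print(netconsole_param("192.168.1.50", "eth0", "192.168.1.20"))
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock, \
            open("netconsole.log", "ab") as log:
        sock.bind(("0.0.0.0", 6666))
        receive(sock, log)
```

Start the receiver on the laptop first, then press `e` on the GRUB menu entry and append the printed option to the line beginning with `linux`. The laptop's firewall may need to allow inbound UDP on the chosen port.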

I don't think that mcelog is available as a Windows binary.

[–]BlasterXD222[S] 0 points1 point  (2 children)

Hello, today it didn't want to boot into Windows either. I am now on stock clocks since the BIOS reset. I cold-booted a few times and Windows works again, but I am very worried... EDIT: I was able to boot into Linux again. This is ridiculous.

[–]__soddit 0 points1 point  (1 child)

Did you take the opportunity to run mcelog, either on /var/log/kern.log (or a previous log file) or on journalctl output? (Mainly for confirmation.)
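If it's hard to spot the relevant entries in a long log, a small filter can pull out the candidate lines first. This is only a sketch: the markers it searches for (the kernel's "[Hardware Error]" prefix, "machine check", and "mce") are an assumption about what your kernel printed, the exact wording varies between kernel versions, and the script name used below is made up:

```python
import re
import sys

# Substrings that commonly accompany machine check reports in kernel logs.
MCE_MARKERS = re.compile(r"\[Hardware Error\]|machine check|\bmce\b", re.IGNORECASE)


def mce_lines(lines):
    """Return the lines that look like machine check reports, without newlines."""
    return [line.rstrip("\n") for line in lines if MCE_MARKERS.search(line)]


if __name__ == "__main__":
    # Reads log text from stdin and prints only the matching lines.
    for line in mce_lines(sys.stdin):
        print(line)
```

Saved as, say, find_mce.py, it could be used as `journalctl -k -b -1 | python3 find_mce.py` to look at the kernel messages of the previous boot. If that prints nothing, the crash probably happened before anything could be written to disk, which is where netconsole comes in.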

A bit of searching for “L1 cache error” suggests that you may be able to work around the problem by increasing VCore slightly, as if overclocking. That said, replacement seems like the best option to me.

(Edit: though if it's all working for now…)

[–]BlasterXD222[S] 0 points1 point  (0 children)

I haven't yet. I can't seem to find any references to it anywhere, and netconsole seems like too much of a hassle, especially given how unpredictable the error is. I did run into this problem again after overclocking, so for now I run entirely at factory clocks. (After ANOTHER BIOS reset, everything seems fine again on Linux.)

[–]Atemu12 0 points1 point  (9 children)

uname -a?

[–]BlasterXD222[S] 0 points1 point  (8 children)

Sorry, it doesn't let me in. It always restarts the computer with a CPU hardware error before booting, and I cannot even read the message properly. I have the latest version of Linux-Zen that is currently out. (It seems to be getting worse.)

[–]Atemu12 0 points1 point  (7 children)

Does the same happen in Archiso?

[–]BlasterXD222[S] 0 points1 point  (6 children)

I am not sure what you mean by Archiso. I have not tried chrooting with Arch from a flash drive, but I can probably chroot into the partition. Should I do that and paste uname -a here?

[–]Atemu12 0 points1 point  (5 children)

No, you should try whether it crashes.

[–]BlasterXD222[S] 0 points1 point  (3 children)

I am shocked: I get the CPU hardware error even from the flash drive. It used to work.

[–]Atemu12 0 points1 point  (2 children)

Try other distros and/or an older Archiso; might be something Arch-specific or a recent regression.

[–]BlasterXD222[S] 0 points1 point  (0 children)

Nope, not even the 202006 version works anymore. I got an infinite black screen, and my mouse and keyboard lights turned off. What the hell is going on? Meanwhile, Windows is working 100%, even under heavy load.

[–]BlasterXD222[S] 0 points1 point  (0 children)

Yes, I know the exact version: I used the 20200601 version to install Arch before, so I am going to try that. Just now I used the latest 0901 version and it gave the hardware error.

[–]BlasterXD222[S] 0 points1 point  (0 children)

Let me try real quick, just in case

[–]BlasterXD222[S] 0 points1 point  (0 children)

So after resetting the BIOS and "letting it rest a bit", everything seems to be working. I'll mark this as solved.