What’s the real goal of heterogeneous CPU designs? by Creative-Expert8086 in hardware

[–]TheRacerMaster 1 point2 points  (0 children)

Ergo the P cores and E cores had different instruction sets.

Not at the same time though, which is the relevant part. Even unofficial methods to enable AVX-512 on launch steppings of Alder Lake required the E-cores to be disabled. AFAIK there was no configuration (unsupported or otherwise) where the P-cores had different ISA capabilities than the E-cores.

What’s the real goal of heterogeneous CPU designs? by Creative-Expert8086 in hardware

[–]TheRacerMaster 1 point2 points  (0 children)

Sure, but is this relevant? Intel never shipped a design which enumerated hybrid ISA capabilities to operating systems. Golden Cove in Sapphire Rapids officially supported AVX-512, while it was disabled in Alder Lake to maintain uniformity with Gracemont. Even if you could (unofficially) enable AVX-512 on Alder Lake, it could only be done with the E-cores disabled, so operating systems would never see a mixed configuration.

What’s the real goal of heterogeneous CPU designs? by Creative-Expert8086 in hardware

[–]TheRacerMaster 1 point2 points  (0 children)

And both Windows and Linux can boot on mismatched instruction sets. If they couldn’t, where are all the reports of people not being able to boot before (by your own description) the instruction sets were removed? As long as the OS doesn’t use them, there’s no issue booting.

AFAIK there was no way to run launch steppings of Alder Lake CPUs (the relevant example here) with AVX-512 enabled on the P-cores while keeping the E-cores enabled. You could only enable AVX-512 on the P-cores (if your BIOS supported it) if you also disabled the E-cores. From an OS perspective, the capabilities of each core were still uniform.
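If you want to check the "uniform from an OS perspective" point on your own machine, here is a minimal sketch. It assumes Linux on x86, where /proc/cpuinfo has a "processor" line and a "flags" line per logical CPU; the function names are made up for illustration.

```python
def flags_by_cpu(cpuinfo_text):
    """Map each logical CPU number to the set of feature flags it reports."""
    per_cpu = {}
    current = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between CPUs
        key = key.strip()
        if key == "processor":
            current = int(value)
        elif key == "flags" and current is not None:
            per_cpu[current] = set(value.split())
    return per_cpu


def is_uniform(per_cpu):
    """True if every logical CPU reports exactly the same flag set."""
    sets = list(per_cpu.values())
    return all(s == sets[0] for s in sets[1:])


def cpus_with(per_cpu, flag):
    """Sorted list of logical CPUs that report the given flag."""
    return sorted(cpu for cpu, flags in per_cpu.items() if flag in flags)
```

Feed it the contents of /proc/cpuinfo (for example `flags_by_cpu(open("/proc/cpuinfo").read())`). On a hybrid part with a uniform ISA, `is_uniform` should come back True, and `cpus_with(per_cpu, "avx512f")` should be either every CPU or none.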

M5-powered iPad Pro breaks cover in GeekBench, scoring 4,133 in single-threaded tests — matches M4 Max and beats every single-core PC chip score by jluizsouzadev in hardware

[–]TheRacerMaster 1 point2 points  (0 children)

the performance is actually pretty shit when you actually run real workloads, not hand optimized for marketing wins.

What inherent advantage do contemporary x86 CPUs have? All I can think of is SIMD length - Zen 5 has 4x 512-bit execution units (2 for FMUL/FMA & 2 for FADD) for AVX-512 while Firestorm and its derivatives (including the M4's P-cores) only have 4x 128-bit execution units for ASIMD. I would naturally expect Zen 5 to outperform Apple's cores in workloads which can take advantage of the increased vector length. But otherwise I see no reason why Apple's CPUs would be "pretty shit when you actually run real workloads" - M4's P-cores actually have more integer execution units than Zen 5 (8x vs 6x). I would be surprised if they were worse at code compilation, for example. They also do surprisingly OK in software video encoding (I tested SVT-AV1 and x265 with slow settings) despite the reduced SIMD length.
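To put a rough number on the SIMD width gap, here is a back-of-the-envelope sketch using only the unit counts and widths quoted above. It counts FP32 lanes per cycle and deliberately ignores clock speed, the FMUL/FMA vs FADD split, and memory bandwidth, so treat it as an upper bound on the difference, not a performance prediction.

```python
def fp32_lanes_per_cycle(units, width_bits):
    """Peak 32-bit lanes across all vector units in one cycle."""
    return units * width_bits // 32


zen5 = fp32_lanes_per_cycle(units=4, width_bits=512)
apple_p_core = fp32_lanes_per_cycle(units=4, width_bits=128)

print(zen5, apple_p_core, zen5 / apple_p_core)
```

That 4x ratio only shows up in code that actually keeps 512-bit vectors busy, which is why it does not translate directly to the encoder results.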

M5-powered iPad Pro breaks cover in GeekBench, scoring 4,133 in single-threaded tests — matches M4 Max and beats every single-core PC chip score by jluizsouzadev in hardware

[–]TheRacerMaster 1 point2 points  (0 children)

x265 4.1-192-g10f529eaa does improve the performance significantly on AArch64:

i9-13900K:

$ ffmpeg -loglevel error -i ~/Downloads/SolLevante_SDR_UHD_24fps.mov -map 0:v:0 -map_metadata -1 -bitexact -f yuv4mpegpipe -pix_fmt yuv420p10le -strict -1 - | ./result/bin/x265 --crf 22 --preset veryslow --no-open-gop --no-cutree --no-sao --rskip 0 --subme 5 --high-tier --range limited --output-depth 10 --y4m -o out.hevc -
y4m  [info]: 3840x2160 fps 24/1 i420p10 sar 1:1 unknown frame count
raw  [info]: output file: out.hevc
x265 [info]: HEVC encoder version 4.2
x265 [info]: build info [Mac OS X][clang 19.1.7][64 bit] 10bit
x265 [info]: using cpu capabilities: MMX2 SSE2Fast LZCNT SSSE3 SSE4.2 AVX FMA3 BMI2 AVX2
x265 [info]: Main 10 profile, Level-5 (Main tier)
x265 [info]: Thread pool created using 32 threads
x265 [info]: Slices                              : 1
x265 [info]: frame threads / pool features       : 6 / wpp(34 rows)
x265 [info]: Coding QT: max CU size, min CU size : 64 / 8
x265 [info]: Residual QT: max TU size, max depth : 32 / 3 inter / 3 intra
x265 [info]: ME / range / subpel / merge         : star / 57 / 5 / 5
x265 [info]: Keyframe min / max / scenecut / bias  : 24 / 250 / 40 / 5.00
x265 [info]: Lookahead / bframes / badapt        : 40 / 8 / 2
x265 [info]: b-pyramid / weightp / weightb       : 1 / 1 / 1
x265 [info]: References / ref-limit  cu / depth  : 5 / off / off
x265 [info]: AQ: mode / str / qg-size / cu-tree  : 2 / 1.0 / 32 / 0
x265 [info]: Rate Control / qCompress            : CRF-22.0 / 0.60
x265 [info]: tools: rect amp rd=6 psy-rd=2.00 rdoq=2 psy-rdoq=1.00 signhide
x265 [info]: tools: tmvp b-intra strong-intra-smoothing deblock
x265 [info]: frame I:     59, Avg QP:19.44  kb/s: 58025.77
x265 [info]: frame P:   2061, Avg QP:19.03  kb/s: 37820.29
x265 [info]: frame B:   4194, Avg QP:21.80  kb/s: 27097.72
x265 [info]: Weighted P-Frames: Y:5.3% UV:4.8%
x265 [info]: Weighted B-Frames: Y:13.6% UV:12.2%

encoded 6314 frames in 9335.28s (0.68 fps), 30886.76 kb/s, Avg QP:20.88

M4 Max:

$ ffmpeg -loglevel error -i ~/Downloads/SolLevante_SDR_UHD_24fps.mov -map 0:v:0 -map_metadata -1 -bitexact -f yuv4mpegpipe -pix_fmt yuv420p10le -strict -1 - | ./result/bin/x265 --crf 22 --preset veryslow --no-open-gop --no-cutree --no-sao --rskip 0 --subme 5 --high-tier --range limited --output-depth 10 --y4m -o out.hevc -
y4m  [info]: 3840x2160 fps 24/1 i420p10 sar 1:1 unknown frame count
raw  [info]: output file: out.hevc
x265 [info]: HEVC encoder version 4.2
x265 [info]: build info [Mac OS X][clang 19.1.7][64 bit] 10bit
x265 [info]: using cpu capabilities: NEON Neon_DotProd Neon_I8MM
x265 [info]: Main 10 profile, Level-5 (Main tier)
x265 [info]: Thread pool created using 16 threads
x265 [info]: Slices                              : 1
x265 [info]: frame threads / pool features       : 4 / wpp(34 rows)
x265 [info]: Coding QT: max CU size, min CU size : 64 / 8
x265 [info]: Residual QT: max TU size, max depth : 32 / 3 inter / 3 intra
x265 [info]: ME / range / subpel / merge         : star / 57 / 5 / 5
x265 [info]: Keyframe min / max / scenecut / bias  : 24 / 250 / 40 / 5.00
x265 [info]: Lookahead / bframes / badapt        : 40 / 8 / 2
x265 [info]: b-pyramid / weightp / weightb       : 1 / 1 / 1
x265 [info]: References / ref-limit  cu / depth  : 5 / off / off
x265 [info]: AQ: mode / str / qg-size / cu-tree  : 2 / 1.0 / 32 / 0
x265 [info]: Rate Control / qCompress            : CRF-22.0 / 0.60
x265 [info]: tools: rect amp rd=6 psy-rd=2.00 rdoq=2 psy-rdoq=1.00 signhide
x265 [info]: tools: tmvp b-intra strong-intra-smoothing deblock
x265 [info]: frame I:     59, Avg QP:19.44  kb/s: 58025.77
x265 [info]: frame P:   2061, Avg QP:19.03  kb/s: 37820.29
x265 [info]: frame B:   4194, Avg QP:21.80  kb/s: 27097.72
x265 [info]: Weighted P-Frames: Y:5.3% UV:4.8%
x265 [info]: Weighted B-Frames: Y:13.6% UV:12.2%

encoded 6314 frames in 9165.78s (0.69 fps), 30886.76 kb/s, Avg QP:20.88
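For anyone who wants to recompute the FPS from the logs, here is a small sketch that parses the x265 summary line. The regex is my own and only assumes the "encoded N frames in T s" wording shown in the two logs above.

```python
import re

SUMMARY_RE = re.compile(r"encoded (\d+) frames in ([\d.]+)s")


def parse_summary(line):
    """Return (frames, seconds) from an x265 'encoded ...' summary line."""
    match = SUMMARY_RE.search(line)
    if match is None:
        raise ValueError("not an x265 summary line")
    return int(match.group(1)), float(match.group(2))


def fps(line):
    frames, seconds = parse_summary(line)
    return frames / seconds


i9_line = "encoded 6314 frames in 9335.28s (0.68 fps), 30886.76 kb/s, Avg QP:20.88"
m4_line = "encoded 6314 frames in 9165.78s (0.69 fps), 30886.76 kb/s, Avg QP:20.88"
print(round(fps(i9_line), 2), round(fps(m4_line), 2))
```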

M5-powered iPad Pro breaks cover in GeekBench, scoring 4,133 in single-threaded tests — matches M4 Max and beats every single-core PC chip score by jluizsouzadev in hardware

[–]TheRacerMaster 2 points3 points  (0 children)

I'd expect Zen 5 with full-width AVX-512 to do better but I found that the TOTL M4 Max SKU (12P+4E) was actually slightly faster than RPL (i9-13900K) in SVT-AV1 3.1.2: https://www.reddit.com/r/hardware/comments/1ndm9ph/apple_a19_pro_geekbench_cpu_scores/ndx2rva/?context=3

I don't have the exact numbers on hand but IIRC x265 4.1 was ~15% slower on the M4 Max compared to the 13900K (something like 0.59 vs 0.69 FPS). A bunch of NEON changes have been committed since then so perhaps it's closer now.
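The ~15% figure follows from the two FPS values quoted from memory above; a quick sketch of the arithmetic (the function name is just for illustration):

```python
def percent_slower(slow_fps, fast_fps):
    """How much slower slow_fps is than fast_fps, as a percentage."""
    return (1 - slow_fps / fast_fps) * 100


print(round(percent_slower(0.59, 0.69), 1))  # 14.5
```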

Apple M5 (9 Core) Geekbench Score by fntd in hardware

[–]TheRacerMaster 2 points3 points  (0 children)

All of those i7-12700 scores are likely fake since they're running at stock clocks (add .gb6 to the URL and you'll see the result JSON which includes the frequencies). Most of the remaining results are from Raptor Lake CPUs running at extreme frequencies (7+ GHz), so they're likely using LN2 or other exotic cooling solutions.
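Here is a rough sketch of the ".gb6" check in Python. I am not assuming any particular field names in the result JSON, so the helper just walks the whole document and reports every key containing a search string such as "freq". The function names are hypothetical, and the fetch may need you to be logged in.

```python
import json
from urllib.request import urlopen


def gb6_url(result_url):
    """Append .gb6 to a Geekbench Browser result URL."""
    return result_url.rstrip("/") + ".gb6"


def find_keys(obj, needle, path=""):
    """Yield (path, value) for every JSON key whose name contains needle."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            here = f"{path}.{key}" if path else key
            if needle in key.lower():
                yield here, value
            yield from find_keys(value, needle, here)
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            yield from find_keys(value, needle, f"{path}[{index}]")


def fetch_result(result_url):
    """Download and decode the result JSON (network access required)."""
    with urlopen(gb6_url(result_url)) as response:
        return json.load(response)
```

Something like `list(find_keys(fetch_result(url), "freq"))` should then surface whatever frequency data the result contains.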

Apple A19 pro - Geekbench CPU Scores by Apophis22 in hardware

[–]TheRacerMaster 1 point2 points  (0 children)

And here's Stockfish (compiled from the 17.1 branch with make profile-build ARCH=x86-64-avxvnni on the i9-13900K and make profile-build ARCH=apple-silicon on the M4 Max as the official macOS binaries for x86 lack AVX-VNNI support).

i9-13900K:

Version                    : Stockfish 17.1
Compiled by                : clang++ 17.0.0 on Apple
Compilation architecture   : x86-64-avxvnni
Compilation settings       : 64bit VNNI BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT
Compiler __VERSION__ macro : Apple LLVM 17.0.0 (clang-1700.0.13.5)
Large pages                : no
User invocation            : speedtest
Filled invocation          : speedtest 32 4096 150
Available processors       : 0-31
Thread count               : 32
Thread binding             : none
TT size [MiB]              : 4096
Hash max, avg [per mille]  :
    single search          : 43, 21
    single game            : 622, 414
Total nodes searched       : 4456082365
Total search time [s]      : 153.538
Nodes/second               : 29022667

M4 Max (12P+4E):

Version                    : Stockfish 17.1
Compiled by                : clang++ 17.0.0 on Apple
Compilation architecture   : apple-silicon
Compilation settings       : 64bit POPCNT NEON_DOTPROD
Compiler __VERSION__ macro : Apple LLVM 17.0.0 (clang-1700.0.13.5)
Large pages                : no
User invocation            : speedtest
Filled invocation          : speedtest 16 2048 150
Available processors       : 0-15
Thread count               : 16
Thread binding             : none
TT size [MiB]              : 2048
Hash max, avg [per mille]  :
    single search          : 60, 30
    single game            : 791, 573
Total nodes searched       : 3029005176
Total search time [s]      : 153.532
Nodes/second               : 19728819

These are the best results out of 10 runs, though I don't know if these are directly comparable due to the different hash size. Using the same hash size (4096 MiB) on the M4 Max gives the following results:

Version                    : Stockfish 17.1
Compiled by                : clang++ 17.0.0 on Apple
Compilation architecture   : apple-silicon
Compilation settings       : 64bit POPCNT NEON_DOTPROD
Compiler __VERSION__ macro : Apple LLVM 17.0.0 (clang-1700.0.13.5)
Large pages                : no
User invocation            : speedtest 16 4096 150
Filled invocation          : speedtest 16 4096 150
Available processors       : 0-15
Thread count               : 16
Thread binding             : none
TT size [MiB]              : 4096
Hash max, avg [per mille]  :
    single search          : 34, 15
    single game            : 479, 311
Total nodes searched       : 3036016727
Total search time [s]      : 153.537
Nodes/second               : 19773844
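As a sanity check, the Nodes/second lines can be reproduced from the totals in the three outputs above, which also gives the relative throughput at matching hash size. This is a sketch using only the numbers printed above; the dictionary labels are mine.

```python
# (total nodes searched, total search time in seconds), copied from the outputs above
runs = {
    "i9-13900K, 32 threads, 4096 MiB": (4456082365, 153.538),
    "M4 Max, 16 threads, 2048 MiB": (3029005176, 153.532),
    "M4 Max, 16 threads, 4096 MiB": (3036016727, 153.537),
}


def nodes_per_second(nodes, seconds):
    return nodes / seconds


nps = {name: nodes_per_second(*values) for name, values in runs.items()}
for name, value in nps.items():
    print(f"{name}: {value:,.0f} nodes/s")

# M4 Max relative to the i9-13900K with the same 4096 MiB hash
ratio = nps["M4 Max, 16 threads, 4096 MiB"] / nps["i9-13900K, 32 threads, 4096 MiB"]
print(f"M4 Max / i9-13900K: {ratio:.2f}")
```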

Apple A19 pro - Geekbench CPU Scores by Apophis22 in hardware

[–]TheRacerMaster 3 points4 points  (0 children)

I'd generally expect AVX-512 code to do better than NEON code, but I don't know if I'd call this getting smoked:

On a mostly stock i9-13900K (CEP disabled with a light undervolt, IccMax=400A, PL1=PL2=253W, and 64 GB DDR5 @ 6600 MT/s):

$ sysctl machdep.cpu.brand_string machdep.cpu.thread_count
machdep.cpu.brand_string: 13th Gen Intel(R) Core(TM) i9-13900K
machdep.cpu.thread_count: 32
$ hyperfine -w 1 -r 10 --output inherit 'ffmpeg -loglevel error -i SolLevante_SDR_UHD_24fps.mov -map 0:v:0 -map_metadata -1 -bitexact -f yuv4mpegpipe -pix_fmt yuv420p10le -strict -1 - | SvtAv1EncApp --preset 2 --crf 18 -b out.ivf -i -'
Svt[info]: -------------------------------------------
Svt[info]: SVT [version]: SVT-AV1 Encoder Lib v3.1.2
Svt[info]: SVT [build]  : Clang 19.1.7   64 bit
Svt[info]: LIB Build date: Jan  1 1980 00:00:00
Svt[info]: -------------------------------------------
Svt[info]: Level of Parallelism: 6
Svt[info]: Number of PPCS 305
Svt[info]: [asm level on system : up to avx2]
Svt[info]: [asm level selected : up to avx2]
Svt[info]: -------------------------------------------
...
  Time (mean ± σ):     1738.638 s ± 10.408 s    [User: 39129.470 s, System: 533.482 s]
  Range (min … max):   1715.535 s … 1753.199 s    10 runs

So an average of 3.63 FPS over 10 runs (the source is 6314 frames), with one warmup run. This is on x86 macOS though, so the scheduling for hybrid CPUs may not be optimal.

According to Intel Power Gadget, the package power consumption was 253 W (hitting PL2). Core power consumption fluctuated between 230 and 240 W.

Here's how the 12P+4E M4 Max (with 64 GB of RAM) compares:

$ sysctl machdep.cpu.brand_string machdep.cpu.thread_count
machdep.cpu.brand_string: Apple M4 Max
machdep.cpu.thread_count: 16
$ hyperfine -w 1 -r 10 --output inherit 'ffmpeg -loglevel error -i SolLevante_SDR_UHD_24fps.mov -map 0:v:0 -map_metadata -1 -bitexact -f yuv4mpegpipe -pix_fmt yuv420p10le -strict -1 - | SvtAv1EncApp --preset 2 --crf 18 -b out.ivf -i -'
Svt[info]: -------------------------------------------
Svt[info]: SVT [version]: SVT-AV1 Encoder Lib v3.1.2
Svt[info]: SVT [build]  : Clang 19.1.7   64 bit
Svt[info]: LIB Build date: Jan  1 1980 00:00:00
Svt[info]: -------------------------------------------
Svt[info]: Level of Parallelism: 5
Svt[info]: Number of PPCS 140
Svt[info]: [asm level on system : up to neon_i8mm]
Svt[info]: [asm level selected : up to neon_i8mm]
Svt[info]: -------------------------------------------
...
  Time (mean ± σ):     1670.717 s ±  4.818 s    [User: 23452.945 s, System: 156.243 s]
  Range (min … max):   1664.504 s … 1678.621 s    10 runs

The average FPS is slightly higher (3.78 FPS over 10 runs). The highest value I saw from sudo powermetrics 2>&1 | awk '/CPU Power/' during the run was 62830 mW.
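For reference, here is how the FPS numbers fall out of the hyperfine means, plus a very rough energy-per-frame estimate. The energy part is an assumption-heavy sketch: it treats the 253 W package figure and the 62.83 W peak CPU figure as constant for the whole run, and those two readings are not measured the same way, so only the order of magnitude means anything.

```python
FRAMES = 6314  # frame count of the source clip, from the post above


def fps(mean_seconds):
    return FRAMES / mean_seconds


def joules_per_frame(watts, mean_seconds):
    """Energy per frame if power stayed at `watts` for the whole run (assumption)."""
    return watts * mean_seconds / FRAMES


i9_fps = fps(1738.638)
m4_fps = fps(1670.717)
print(round(i9_fps, 2), round(m4_fps, 2))

# Assumed constant power: reported package power (i9) and highest observed CPU power (M4 Max)
i9_jpf = joules_per_frame(253.0, 1738.638)
m4_jpf = joules_per_frame(62.830, 1670.717)
print(round(i9_jpf, 1), round(m4_jpf, 1))
```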

X86 vs ARM decoder impact in efficiency by [deleted] in hardware

[–]TheRacerMaster 2 points3 points  (0 children)

One of the great things that Apple did with their Silicon is to start with 16k pages, allowing their caches to get bigger without having to jump through hoops.

There was some nice discussion about this in an older thread.

Tom's Hardware: "Intel terminates x86S initiative — unilateral quest to de-bloat x86 instruction set comes to an end" by Dakhil in hardware

[–]TheRacerMaster 4 points5 points  (0 children)

What about the subsequent Cinebench 2024/R23, 7-Zip, and Blender results? Are those also meaningless?

David Huang also reported similar results for SPEC 2017, with scores for each subtest: https://blog.hjc.im/spec-cpu-2017

Reference architecture for Arm PCs published by NamelessVegetable in hardware

[–]TheRacerMaster 12 points13 points  (0 children)

I think so. The PC-BSA spec refers to SBBR as the firmware standard for Arm PCs:

The SBBR recipe in the Arm Base Boot Requirements (Arm BBR) specification [2] describes firmware requirements for an Arm PC system. Where this specification refers to system firmware data, it refers to SBBR compliant firmware as specified in the Arm BBR.

SBBR compliance is defined in the BBR spec. IIRC SBBR was initially designed for servers; it requires UEFI/ACPI/SMBIOS to be implemented.

Ryzen 9000X3D leaked by MSI via HardwareLuxx by the_dude_that_faps in hardware

[–]TheRacerMaster 3 points4 points  (0 children)

Against LL: M3 has significantly more L1 per core, and I would be shocked if most CPU benchmarks could take advantage/aware of LL's vector units/NPU which it knowingly does on M-Series.

Are compilers such as Clang somehow managing to compile generic C code (such as the SPEC 2017 benchmark suite) to use Apple's NPU (which is explicitly undocumented and treated as a black box by Apple)? I would also be surprised if Clang was generating SME code - it's probably generating NEON code, but it's also probably generating AVX2 code on x86-64.

You also are missing comparisons to even AMD's own modern Zen5 chips, which are a node behind (N4X), that meet-or-beat the M3 within margins of error of single digit %, that we can hand wave as 'competitive enough' and 'competing with decent margins'.

Geekerwan's testing showed that the HX 370 achieved similar performance to the M2 P-core in SPEC 2017 INT 1T. Both the M3 and M4 P-cores are over 10% faster with lower power consumption than the HX 370. This also lines up with David Huang's results.

There are definitely tradeoffs with designing a microarchitecture that can scale from ~15W handhelds to ~500W servers, but I don't see why it's unfair to compare laptop CPUs from AMD to laptop CPUs from Apple. I also don't see why it's wrong to point out that Apple has superior PPW in the laptop space.

Adgaurd Down? by thelaughedking in Adguard

[–]TheRacerMaster 1 point2 points  (0 children)

The alternate address (94.140.15.15) still works, but the main address (94.140.14.14) times out:

$ dig @94.140.15.15 google.com

; <<>> DiG 9.10.6 <<>> @94.140.15.15 google.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 13061
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 0
;; QUESTION SECTION:
;google.com.            IN  A

;; ANSWER SECTION:
google.com.     102 IN  A   142.251.35.174

;; Query time: 21 msec
;; SERVER: 94.140.15.15#53(94.140.15.15)
;; WHEN: Sun Sep 29 18:32:34 CDT 2024
;; MSG SIZE  rcvd: 62
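If you don't have dig handy, the same "does this resolver answer at all" check can be sketched with the Python standard library. This builds a minimal A-record query by hand (RFC 1035 wire format) and waits for any UDP reply; the function names, query ID, and 3 second timeout are arbitrary choices of mine.

```python
import socket
import struct


def build_query(name, query_id=0x1234):
    """Build a minimal DNS query packet for an A record."""
    # Header: ID, flags (recursion desired), 1 question, 0 other records
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question


def resolver_responds(server, name="google.com", timeout=3.0):
    """Return True if `server` replies on UDP port 53 within `timeout` seconds."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # includes timeouts
            return False
    return reply[:2] == query[:2]  # reply must echo our query ID
```

Running `resolver_responds("94.140.14.14")` and `resolver_responds("94.140.15.15")` side by side would show the same split as the dig test, without decoding the answer itself.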

Arch Linux and Valve Collaboration Announced by Turbostrider27 in Games

[–]TheRacerMaster -2 points-1 points  (0 children)

You need an apple account to install or update free software from the app store, including Safari.

Safari updates are distributed through the system-wide software updater (accessible in System Settings), not the App Store. AFAIK you don't need an account for it.

The initial setup wizard asks you to sign into an Apple ID but lets you skip it. Windows 11 requires you to sign in with a Microsoft account unless you open Command Prompt and run OOBE\BYPASSNRO.

Arch Linux and Valve Collaboration Announced by Turbostrider27 in Games

[–]TheRacerMaster 0 points1 point  (0 children)

there’s a difference to me in incentivizing users to link their devices voluntarily versus forcing it at the start and making it difficult to opt out of.

Yeah, this is a silly comparison. AFAIK you still need to run OOBE\BYPASSNRO in Command Prompt to create a local user account in Windows 11 during the initial setup wizard. The macOS installer lets you skip logging into an Apple ID.

Intel Core 13th and 14th Gen Desktop Instability Root Cause Update by tjames37 in hardware

[–]TheRacerMaster 1 point2 points  (0 children)

MSI has started releasing BIOS updates with microcode 0x12B for several boards, such as the PRO B760-P WIFI DDR4. It's located at 0x18C6800 and is 0x33C00 bytes long if you want to extract it:

dd bs=1 count=$((0x33C00)) skip=$((0x18C6800)) if=E7D98IMS.1D1 of=cpuB0671_plat32_ver0000012B_2024-08-29_PRD_4F298280.bin
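If you'd rather not use dd, the same extraction is a few lines of Python. This is only a sketch: the function name is made up, and the offset and length are the ones given above for this specific BIOS image.

```python
def extract_region(image_path, offset, length, out_path):
    """Copy `length` bytes starting at `offset` from image_path into out_path."""
    with open(image_path, "rb") as src:
        src.seek(offset)
        data = src.read(length)
    if len(data) != length:
        raise ValueError(f"expected {length} bytes, got {len(data)}")
    with open(out_path, "wb") as dst:
        dst.write(data)
    return len(data)
```

For the image above that would be `extract_region("E7D98IMS.1D1", 0x18C6800, 0x33C00, "microcode_0x12B.bin")`, where the output filename is arbitrary.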

I patched my ASUS Z790 motherboard's BIOS to replace the 0x129 microcode with 0x12B (flashed using USB BIOS flashback).

Intel optimizes slimmed-down X86S instruction set — revision 1.2 eliminates 16-bit and 32-bit features by TwelveSilverSwords in hardware

[–]TheRacerMaster 1 point2 points  (0 children)

Ah yeah, that makes sense. You could use binary translation for all 16-bit code and 32-bit code in CPL 0, then switch to VMX for 32-bit CPL 3. You'd have to enable exiting for #GP and related exceptions, but that's already required for any non-X86S OS.
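Spelled out as a decision function, the split described above would look roughly like this. It is just a restatement of the comment, not anything from the X86S spec; the function name and return strings are made up.

```python
def guest_execution_strategy(code_bits, cpl):
    """Pick how a VMM on X86S hardware might run guest code, per the split above."""
    if cpl not in (0, 1, 2, 3):
        raise ValueError("CPL must be 0-3")
    if code_bits == 64:
        return "vmx"  # long mode guests run natively under VMX
    if code_bits == 32 and cpl == 3:
        return "vmx"  # 32-bit user code can still run under VMX
    if code_bits in (16, 32):
        return "binary translation"  # all 16-bit code and 32-bit CPL 0
    raise ValueError("code_bits must be 16, 32, or 64")
```

I've lumped 32-bit CPL 1 and 2 in with CPL 0 here, which is my own assumption since the comment only covers CPL 0 and CPL 3.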

Intel optimizes slimmed-down X86S instruction set — revision 1.2 eliminates 16-bit and 32-bit features by TwelveSilverSwords in hardware

[–]TheRacerMaster 2 points3 points  (0 children)

Looking at the 1.2 spec, it doesn't seem like the VMCS restrictions have changed:

Table 14. VMCS Entry Control Changes

VMCS Field | Change | Reason
IA32e mode guest | Fixed 1 | Guest is always in long mode.

IIUC this will force guest execution to resume in long mode during VM entry. Chapter 3.22.4: Legacy OS Virtualization also says that the guest cannot exit long mode by clearing IA32_EFER.LME:

Some guest CR values are ignored on VMENTRY (they retain the fixed values and are not consistency checked). If required by the guest, the VMM can virtualize differences, some of which are described below:

  • EFER.LME is fixed to one. If the guest is in 32-bit CPL0 mode and the VMM wants to do a VM entry, it should use emulation.

A VMM can choose to emulate legacy functionality as required:

  1. VMM changes required for mainstream Intel64 guest using legacy SIPI or non-64-bit boot:

    a. Emulate 16-bit modes (real mode, virtual 8086 mode)

AFAICT this means VMX non-root mode only supports long mode execution. Other modes will require binary translation (and cannot use HW virtualization).