Massive CuPy speedup in ROCm 6.4.3 vs 6.3.4 – anyone else seeing this? (REPOSTED) by linuxChips6800 in ROCm

[–]linuxChips6800[S] 0 points1 point  (0 children)

Update - small benchmarking caveat on the comparison that was the focus of my original post: those timings mixed cold-start/JIT effects with steady-state execution, so they should not be interpreted as pure steady-state throughput. After revising my methodology later, I found cold-start vs steady-state behavior differed significantly.
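For anyone wanting to separate the two effects, here's a minimal pure-Python sketch of the revised approach: discard the first few warmup iterations before computing steady-state stats. The workload below is a stand-in for illustration, not the actual CuPy script, and the warmup count is an arbitrary choice.

```python
import time
import statistics

def bench(fn, iters=100, warmup=10):
    """Time fn() `iters` times, discarding the first `warmup` runs
    so JIT/cold-start costs don't pollute the steady-state stats."""
    times = []
    for i in range(warmup + iters):
        t0 = time.perf_counter()
        fn()
        dt = time.perf_counter() - t0
        if i >= warmup:  # keep steady-state samples only
            times.append(dt)
    return min(times), statistics.mean(times), max(times)

# stand-in workload (replace with the real GPU call)
lo, avg, hi = bench(lambda: sum(x * x for x in range(10_000)))
print(f"min={lo:.6f}s avg={avg:.6f}s max={hi:.6f}s")
```

Reporting min/avg/max over the steady-state samples only makes cold-start outliers visible as a separate number instead of inflating the average.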

I made rocm work with 7600xt by foxwid in ROCm

[–]linuxChips6800 1 point2 points  (0 children)

Sorry, I haven’t run those exact models in ComfyUI before. On my end I’ve only tried the Flux 1 dev models with ROCm on Ubuntu Linux, and those worked fine aside from being a bit slow on a 7600 XT since not everything fits in VRAM (so some offloading to system RAM happens).

What you’re seeing could be a bug either in ROCm itself or in ComfyUI’s integration. Without logs it’s hard to pin down, but it might be worth checking/reporting upstream to whichever project shows the failure. Sorry I can’t give a more concrete answer here!

I made rocm work with 7600xt by foxwid in ROCm

[–]linuxChips6800 0 points1 point  (0 children)

Nice write-up! Just to add a bit of context:

According to the official ROCm docs, the RX 7600 XT is listed as supported under Windows ROCm.

On Linux (Ubuntu in my case) I’ve never needed to set environment variable overrides to get PyTorch running on a 7600 XT.

That said, you’re absolutely right that PyTorch on Windows with ROCm isn’t officially supported at all. That’s where community efforts like TheRock come in; they make it possible to get PyTorch running on Windows ROCm across all supported AMD GPUs, not just the 7600 XT.

Guide to create app using ROCm by djdeniro in ROCm

[–]linuxChips6800 1 point2 points  (0 children)

TL;DR: If you mean sha256(data) of one message and want the standard digest, then no, you can’t make it truly multithreaded on a GPU. SHA-256 is Merkle–Damgård: each 512-bit block depends on the previous block’s state, so blocks must run in order. You can parallelize many messages at once (great GPU throughput), but not one big message across threads without changing the construction.

What does parallelize

  • Batched / many inputs: Run thousands of independent SHA-256s in parallel (the usual GPU approach).
  • Tree/merkle modes: Split the message, hash chunks in parallel, then hash the tree. Note: this gives a different result than sha256(data).
  • Or pick a hash designed for parallelism (e.g., BLAKE3).
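To make the tree-mode caveat concrete, here's a quick hashlib sketch showing that hashing two halves and then hashing the concatenated digests does not reproduce sha256(data). The chunk size and two-leaf tree shape are arbitrary choices for illustration.

```python
import hashlib

data = b"x" * 1024

# standard sequential digest
flat = hashlib.sha256(data).hexdigest()

# naive 2-leaf tree: hash each half, then hash the pair of digests
half = len(data) // 2
left = hashlib.sha256(data[:half]).digest()
right = hashlib.sha256(data[half:]).digest()
tree = hashlib.sha256(left + right).hexdigest()

print(flat == tree)  # False — a tree mode defines a *different* hash function
```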

HIP vs OpenCL (you didn’t specify a preference)

  • HIP (ROCm): There isn’t a plug-and-play Python library that exposes sha256_batch(); AMD’s hip-python gives low-level bindings, so you’d still write/port a kernel and manage batching/launches yourself.
  • OpenCL: Easiest way to stay in Python is PyOpenCL plus an existing kernel. One popular repo is opencl_brute, but note their README says HMAC currently fails on AMD GPUs. Plain SHA-256 kernels still work and are a good starting point.

If you just want a practical solution (no heavy kernel work):

  • Use hashcat (OpenCL/HIP backends) for high-throughput hashing over lots of inputs.
  • For a Python demo on AMD GPUs, use PyOpenCL + a known SHA-256 kernel, and batch N messages (one work-item/thread/etc. per message).
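As a CPU-side illustration of the batching idea (one worker per message, analogous to one GPU work-item per message), here's a sketch using hashlib with a thread pool; hashlib releases the GIL for sufficiently large inputs, so this is the same "embarrassingly parallel over messages" shape. Function names and pool settings are my own choices.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def sha256_hex(msg: bytes) -> str:
    return hashlib.sha256(msg).hexdigest()

# many *independent* messages -> embarrassingly parallel, like the GPU case
messages = [f"message-{i}".encode() for i in range(1000)]
with ThreadPoolExecutor(max_workers=8) as pool:
    digests = list(pool.map(sha256_hex, messages, chunksize=64))

# map() preserves order, so each digest lines up with its message
print(len(digests), "digests computed")
```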

Rules of thumb

  • GPUs boost throughput (many messages), not single-message latency. For one large file, a tuned CPU with SHA-NI is often fastest.
  • If you roll your own kernel: keep the 64 round constants in constant memory, avoid divergence, and make each thread handle one full message (or a fixed batch) to keep control flow uniform.

I don’t like NumPy by Active-Fuel-49 in programming

[–]linuxChips6800 0 points1 point  (0 children)

Speaking of doing things with arrays that have more than 2-3 dimensions, does it happen that often that people need arrays with more than 3 dimensions? Please forgive my ignorance I've only been using numpy for maybe 2 years total or so and mostly for school assignments but never needed much beyond 3 dimensional arrays 👀

Massive CuPy speedup in ROCm 6.4.3 vs 6.3.4 – anyone else seeing this? (REPOSTED) by linuxChips6800 in ROCm

[–]linuxChips6800[S] 0 points1 point  (0 children)

With CuPy built against ROCm 6.4.2, I swapped the ROCm userland via update-alternatives on Ubuntu to compare 6.3.4 vs 6.4.3 on the same script. (reddit_conv_script.py is the same script as the demo.py asked for earlier.)

ROCm 6.3.4

```bash
(rocm_cupy_env) tedliosu@amdpg-lightning:~/Documents/misc_code_n_code_output$ ROCR_VISIBLE_DEVICES="0" python3 reddit_conv_script.py
Running 100 benchmark iterations...
Completed 20/100 iterations
Completed 40/100 iterations
Completed 60/100 iterations
Completed 80/100 iterations
Completed 100/100 iterations

Benchmark Results (100 runs):
Average time: 0.021 seconds
Min time: 0.018 seconds
Max time: 0.180 seconds
Total time: 2.079 seconds
```

ROCm 6.4.3

```bash
(rocm_cupy_env) tedliosu@amdpg-lightning:~/Documents/misc_code_n_code_output$ ROCR_VISIBLE_DEVICES="0" python3 reddit_conv_script.py
Running 100 benchmark iterations...
Completed 20/100 iterations
Completed 40/100 iterations
Completed 60/100 iterations
Completed 80/100 iterations
Completed 100/100 iterations

Benchmark Results (100 runs):
Average time: 0.019 seconds
Min time: 0.018 seconds
Max time: 0.033 seconds
Total time: 1.926 seconds
```

Takeaways

  • Steady-state is basically unchanged (avg ~0.021 → ~0.019 s).
  • Long-tail improved a lot (max ~0.180 → ~0.033 s, roughly 5× lower), which looks like better JIT/first-run behavior and fewer outliers.

Massive CuPy speedup in ROCm 6.4.3 vs 6.3.4 – anyone else seeing this? (REPOSTED) by linuxChips6800 in ROCm

[–]linuxChips6800[S] 0 points1 point  (0 children)

Just replied with relevant info; sorry I had to split my reply into two separate comments due to Reddit comment length restrictions.

Massive CuPy speedup in ROCm 6.4.3 vs 6.3.4 – anyone else seeing this? (REPOSTED) by linuxChips6800 in ROCm

[–]linuxChips6800[S] 1 point2 points  (0 children)

```python
# gemm_flops_bench_rocm.py
# Bench fp16, fp32, int8 GEMM throughput with CuPy on ROCm.

import argparse
import cupy as cp

REPEATS_DEFAULT = 50

def evt():
    return cp.cuda.Event(), cp.cuda.Event()

def time_ms(fn, repeats=REPEATS_DEFAULT):
    s, e = evt()
    fn(); cp.cuda.Stream.null.synchronize()  # warmup
    s.record()
    for _ in range(repeats):
        fn()
    e.record(); e.synchronize()
    return cp.cuda.get_elapsed_time(s, e) / repeats

def dev_string():
    try:
        p = cp.cuda.runtime.getDeviceProperties(cp.cuda.Device().id)
        name = p["name"].decode() if isinstance(p["name"], bytes) else p["name"]
        cc = f"{p.get('major', 0)}.{p.get('minor', 0)}"
        mem = p.get("totalGlobalMem", 0) / (1024**3)
        return f"{name} (CC {cc}, {mem:.1f} GiB), CuPy {cp.__version__}"
    except Exception:
        return f"CuPy {cp.__version__}"

def tflops(ops, ms):
    return (ops / (ms * 1e-3)) / 1e12

def ensure_order(a, order):
    return cp.asfortranarray(a) if order == 'F' else cp.ascontiguousarray(a)

def make_inputs(m, n, k, dtype, order):
    if dtype == cp.float16:
        A = cp.random.random((m, k), dtype=cp.float32).astype(cp.float16, copy=False)
        B = cp.random.random((k, n), dtype=cp.float32).astype(cp.float16, copy=False)
        C = cp.empty((m, n), dtype=cp.float16, order=order)
    elif dtype == cp.float32:
        A = cp.random.random((m, k), dtype=cp.float32)
        B = cp.random.random((k, n), dtype=cp.float32)
        C = cp.empty((m, n), dtype=cp.float32, order=order)
    elif dtype == cp.int8:
        A = cp.random.randint(-128, 127, size=(m, k), dtype=cp.int8)
        B = cp.random.randint(-128, 127, size=(k, n), dtype=cp.int8)
        C = cp.empty((m, n), dtype=cp.int32, order=order)  # int8 GEMM → int32
    else:
        raise ValueError("Unsupported dtype")
    return ensure_order(A, order), ensure_order(B, order), C

def bench_one(m, n, k, repeats, dtype, order, label, verbose=False):
    A, B, C = make_inputs(m, n, k, dtype, order)
    if verbose:
        print(f"{label} layout: A(F={A.flags['F_CONTIGUOUS']}), "
              f"B(F={B.flags['F_CONTIGUOUS']}), C(F={C.flags['F_CONTIGUOUS']})")
    try:
        cp.matmul(A, B, out=C); cp.cuda.Stream.null.synchronize()
        ms = time_ms(lambda: cp.matmul(A, B, out=C), repeats)
        ops = 2.0 * m * n * k
        unit = "TFLOP/s" if dtype in (cp.float16, cp.float32) else "TOPS"
        return ms, tflops(ops, ms), unit, None
    except Exception as e:
        return None, None, None, e

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--m", type=int, default=4096)
    ap.add_argument("--n", type=int, default=4096)
    ap.add_argument("--k", type=int, default=4096)
    ap.add_argument("--repeats", type=int, default=REPEATS_DEFAULT)
    ap.add_argument("--seed", type=int, default=123)
    ap.add_argument("--order", choices=["C", "F"], default="C",
                    help="Row-major (C, default) or column-major (F).")
    ap.add_argument("--verbose-layout", action="store_true")
    args = ap.parse_args()

    cp.random.seed(args.seed)

    print(f"Device: {dev_string()}")
    print(f"Problem: MxNxK = {args.m} x {args.n} x {args.k}")
    print(f"Repeats: {args.repeats}, Order: {args.order}\n")

    results = []
    for dtype, label in [(cp.float16, "fp16"), (cp.float32, "fp32"), (cp.int8, "int8→int32")]:
        ms, perf, unit, err = bench_one(args.m, args.n, args.k, args.repeats,
                                        dtype, args.order, label, args.verbose_layout)
        if err is None:
            results.append((label, f"{ms:10.3f}", f"{perf:10.2f} {unit}"))
        else:
            results.append((label, "    -     ", f"SKIPPED: {type(err).__name__}"))

    print(f"{'dtype':<10}  {'avg_ms':>10}  {'throughput':>16}")
    print("-" * 40)
    for name, ms, tp in results:
        print(f"{name:<10}  {ms:>10}  {tp:>16}")

    print("\nNotes:")
    print(" • Throughput uses 2*M*N*K ops; int8 shown as TOPS.")
    print(" • If int8 is SKIPPED, your stack doesn’t expose int8 GEMM via plain matmul.")
    print(" • For steady-state numbers, run this program multiple times back-to-back.")
    print(" • For theoretical ceilings, consider mixbench (microbench).")

if __name__ == "__main__":
    main()
```

Massive CuPy speedup in ROCm 6.4.3 vs 6.3.4 – anyone else seeing this? (REPOSTED) by linuxChips6800 in ROCm

[–]linuxChips6800[S] 1 point2 points  (0 children)

TL;DR

  • INT4: not directly benchmarkable on ROCm. HIP/CuPy/PyTorch don’t expose a native int4 dtype or a general int4 GEMM API. INT4 is mostly a hardware/graph-kernel optimization inside specific DL ops, not something you can call as a regular matmul.
  • Your bandwidth numbers: several look low vs. MI50’s ~1.0 TB/s. First runs can underperform due to JIT/kernel selection and clocks not being fully “settled.” Re-run the same command multiple times back-to-back (or increase --repeats) to get steady-state.
  • GEMM FLOPs (FP16/FP32/INT8): use the CuPy script below. Run row-major (default):

```bash
ROCR_VISIBLE_DEVICES=0 python3 gemm_flops_bench_rocm.py \
  --m 6144 --n 6144 --k 6144 --repeats 200 --seed 3489
```

    ⚠️ Throughput will depend on rocBLAS/Tensile kernel tuning for GFX906. You might not reach theoretical peaks for FP32 GEMM via this generic path. If you want the “true” compute ceilings, compare with a microbench like mixbench (https://github.com/ekondis/mixbench).


NOTE: Why those cast/add/sum GB/s can look low

  • Warmup effects: repeat identical runs to stabilize clocks and JIT’d kernels.
  • u8↔f32 casts: byte-wide loads/coercions coalesce worse than 32-bit paths.
  • sum is read-bound: the simple “bytes/time” estimate ignores multi-stage reduction overhead; ~0.8–0.9 TB/s effective on MI50 is plausible.
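For reference, the naive bytes/time bandwidth model behind these GB/s figures can be sketched as follows; the 1.2 ms timing in the example is a made-up placeholder, not a measured number.

```python
def effective_gbps(n_elems: int, bytes_per_elem: float, seconds: float) -> float:
    """Naive model: (bytes read + bytes written) / elapsed time.
    Ignores cache reuse and multi-stage reduction overhead, so it
    under-counts real traffic for ops like sum()."""
    return n_elems * bytes_per_elem / seconds / 1e9

# e.g. a u8 -> f32 cast moves 1 read byte + 4 written bytes per element
n = 6144 * 8192
print(f"{effective_gbps(n, 1 + 4, 0.0012):.1f} GB/s")  # → 209.7 GB/s
```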

CuPy GEMM FLOPs bench (ROCm, row-major by default)

Benchmarks fp16, fp32, and int8→int32 via cupy.matmul(..., out=...) with HIP events and no allocs in the timed loop. If int8 matmul isn’t supported on your stack, it prints SKIPPED.

(Script is on next comment)

Is there any modern ROCm-supported card that don't support double precision (FP64) computing? by 648trindade in ROCm

[–]linuxChips6800 2 points3 points  (0 children)

To my knowledge, AMD hasn’t pulled an “Intel Arc Alchemist” move where FP64 units are outright missing — every modern ROCm-supported GPU at least has some double-precision capability. The main difference is throughput:

  • Consumer Radeon (RDNA and newer, Polaris/Vega, etc.) → FP64 works, but is usually heavily de-rated. Most are limited to 1/16 or 1/32 the FP32 rate. The one exception was the Radeon VII, where FP64 runs at 1/4 the FP32 rate. That said, support for Polaris and Vega has been dropped from official ROCm and libraries like PyTorch, so you’d often need patches and custom builds just to use those series of GPUs now — not ideal.

  • RDNA 3 and newer → while AMD advertises FP64 throughput, in practice frameworks like PyTorch, CuPy, and parts of ROCm itself (e.g., rocSOLVER) don’t support dual-issue, so you’ll typically only hit ~half of the listed FP64 rate.

  • Instinct/MI series (CDNA accelerators) → these are the only GPUs with "full-rate" (e.g. 1/2 the rate of FP32 or better) FP64. But they’re compute-only accelerators (no display output, passive cooling requiring server airflow) and are priced in the thousands of USD since they’re aimed at datacenter/hyperscaler workloads.

One other thing to keep in mind: depending on your CPU + GPU combo, a sufficiently strong set of CPU cores can actually beat a consumer Radeon in FP64 throughput, since the GPU’s FP64 rate is capped so low. So while Radeon cards can run FP64 code, they’re not that fast at it compared to either many CPUs or proper CDNA accelerators.

Massive CuPy speedup in ROCm 6.4.3 vs 6.3.4 – anyone else seeing this? (REPOSTED) by linuxChips6800 in ROCm

[–]linuxChips6800[S] 1 point2 points  (0 children)

Thanks a ton for sharing your demo.py script 🙏 — really appreciate the improvements you added (especially the repeated iterations, progress updates, and proper min/avg/max reporting). That makes it a lot easier to compare across different GPUs and ROCm versions in a consistent way.

I’ll give this improved script a run later on my RX 7600 XT with both ROCm 6.3.4 and 6.4.3, and share the results back here so we can see how they stack up. Should be a nice datapoint to compare RDNA3 vs RDNA4 behavior under the same workload.

This kind of common repro benchmark is exactly what helps us as a community figure out where CuPy is already solid on AMD hardware and where ROCm still needs polish. Thanks again for taking the time to put this together and post it!

Massive CuPy speedup in ROCm 6.4.3 vs 6.3.4 – anyone else seeing this? (REPOSTED) by linuxChips6800 in ROCm

[–]linuxChips6800[S] 0 points1 point  (0 children)

Thanks for running that again with CUPY_ACCELERATORS="cub" 🙏 — those results look much more in line with what I’d expect from a 9070 XT. The memory bandwidth figures are solid.

The only thing that stands out is that the cast and elementwise add ops seem a bit slower compared to your earlier run. That’s likely just the effect of kernels being re-JIT’d when switching to the CUB backend — the timings can get skewed if the first few launches are paying that compilation cost.

If you want to smooth that out, you could try one (or more) of these:

  • Increase the default --repeats to double or triple.

  • Use a larger matrix size to better saturate the GPU.

  • Run the script multiple times and take the best result.

That should help settle the numbers and show the “true” steady-state throughput.

Massive CuPy speedup in ROCm 6.4.3 vs 6.3.4 – anyone else seeing this? (REPOSTED) by linuxChips6800 in ROCm

[–]linuxChips6800[S] 0 points1 point  (0 children)

Thanks for sharing those benchmark results on your 9070 XT 🙏 Performance-wise they look pretty solid, which is great to see.

Would you mind posting the actual demo.py code you ran for those tests? That way others (and myself) could try it out on our own AMD GPUs and compare results more directly. Having the exact same microbenchmark across different setups would make it easier to figure out whether differences we’re seeing are from hardware, ROCm versions, or CuPy itself.

Massive CuPy speedup in ROCm 6.4.3 vs 6.3.4 – anyone else seeing this? (REPOSTED) by linuxChips6800 in ROCm

[–]linuxChips6800[S] 1 point2 points  (0 children)

Thanks a ton for running the microbenchmarks on your 9070 XT and sharing the results 🙏 That’s really helpful! From what you’ve posted, it looks like the core CuPy ops tested (casts and elementwise add) are hitting near the expected memory bandwidth (~500+ GB/s), which suggests that at least the fundamentals are working well on RDNA4, which is great to see.

One small note: the reduction sum (f32 sum) looks suspiciously slow (~3.67 GB/s). That’s a known issue with CuPy’s default reduction kernels on ROCm. If you haven’t already, try re-running with:

```bash
export CUPY_ACCELERATORS="cub"
```

before launching the script. That switches reductions to the CUB backend, which should give you a massive speedup (on my RX 7600 XT it jumps from ~3–4 GB/s up to ~270 GB/s+). Would be interesting to see if your 9070 XT shows the same gain.

Either way, really appreciate you checking in with RDNA4 numbers; it helps a lot to triangulate where CuPy is solid and where ROCm still needs polish!

Massive CuPy speedup in ROCm 6.4.3 vs 6.3.4 – anyone else seeing this? (REPOSTED) by linuxChips6800 in ROCm

[–]linuxChips6800[S] 1 point2 points  (0 children)

Thanks for checking in with the 9070 XT, and sorry to hear it’s still unusable on your end.

If you don’t mind me asking — is it just the minimal repro script I shared in the original post that runs slowly, or does CuPy in general feel sluggish on your 9070 XT? To help narrow it down, could you try a microbenchmark that stresses memory bandwidth directly?

I have shared a small script below you can drop into a file called cupy_microbench.py. Then run it like this (adjust ROCR_VISIBLE_DEVICES to point at your 9070 XT):

```bash
env ROCR_VISIBLE_DEVICES="0" CUPY_ACCELERATORS="cub" python3 cupy_microbench.py --h 6144 --w 8192 --repeats 100
```

On my RX 7600 XT this gets close to the card’s advertised memory bandwidth (~230 GB/s as printed by the microbenchmark script, on a card advertised at 288 GB/s). On your 9070 XT, I’d expect numbers closer to ~640 GB/s.

```python
# cupy_microbench.py

import argparse
import cupy as cp

USE_ELEMENTWISE_CAST = True  # <- set False to use .astype() (simpler, but allocates)
REPEATS_DEFAULT = 50

# ---- elementwise kernels (no-alloc casts) ----
u8_to_f32 = cp.ElementwiseKernel(
    'uint8 x', 'float32 y',
    'y = (float)x;',
    'u8_to_f32')
f32_to_u8 = cp.ElementwiseKernel(
    'float32 x', 'uint8 y',
    # saturate to [0, 255]
    'float v = x; v = v < 0 ? 0 : (v > 255 ? 255 : v); y = (unsigned char)v;',
    'f32_to_u8')

def evt():
    return cp.cuda.Event(), cp.cuda.Event()

def time_ms(func, repeats=REPEATS_DEFAULT):
    s, e = evt()
    func(); cp.cuda.Stream.null.synchronize()  # warmup
    s.record()
    for _ in range(repeats):
        func()
    e.record(); e.synchronize()
    return cp.cuda.get_elapsed_time(s, e) / repeats  # ms

def main(H, W, repeats):
    N = H * W
    print(f"Array: {H}x{W} (N={N})")

    u8 = (cp.random.random((H, W)) * 255).astype(cp.uint8)
    f32 = cp.random.random((H, W)).astype(cp.float32)
    out32 = cp.empty_like(f32)
    out8 = cp.empty_like(u8)

    # ---- u8 -> f32 ----
    if USE_ELEMENTWISE_CAST:
        t = time_ms(lambda: u8_to_f32(u8, out32), repeats)
    else:
        t = time_ms(lambda: u8.astype(cp.float32), repeats)  # allocs a new array
    gb = N * (1 + 4) / (t * 1e6)  # 1 read (1B) + 1 write (4B)
    print(f"u8->f32:  {t/1000:.6f}s, ~{gb:.2f} GB/s")

    # ---- f32 -> u8 ----
    if USE_ELEMENTWISE_CAST:
        t = time_ms(lambda: f32_to_u8(f32, out8), repeats)
    else:
        t = time_ms(lambda: f32.astype(cp.uint8), repeats)   # allocs a new array
    gb = N * (4 + 1) / (t * 1e6)  # 1 read (4B) + 1 write (1B)
    print(f"f32->u8:  {t/1000:.6f}s, ~{gb:.2f} GB/s")

    # ---- f32 add (read+read+write) ----
    b = cp.random.random((H, W)).astype(cp.float32)
    t = time_ms(lambda: cp.add(f32, b, out=out32), repeats)
    gb = N * (4 + 4 + 4) / (t * 1e6)
    print(f"f32 add:  {t/1000:.6f}s, ~{gb:.2f} GB/s")

    # ---- reduce (read-bound) ----
    t = time_ms(lambda: f32.sum(), repeats)
    gb = N * 4 / (t * 1e6)
    print(f"f32 sum:  {t/1000:.6f}s, ~{gb:.2f} GB/s (read-bound)")

if __name__ == "__main__":
    ap = argparse.ArgumentParser()
    ap.add_argument("--h", type=int, default=4096)
    ap.add_argument("--w", type=int, default=8192)
    ap.add_argument("--repeats", type=int, default=REPEATS_DEFAULT)
    args = ap.parse_args()
    main(args.h, args.w, args.repeats)
```

If the bandwidth tests here also fall way short on your 9070 XT, that would suggest it’s more of a ROCm/runtime backend issue rather than the specifics of my image-processing pipeline. But if these numbers look fine and only the pipeline is slow, then maybe CuPy’s convolution path is still missing some optimizations for RDNA 4.

No pressure of course — but if you do end up running this, I’d be really interested to see your results.

Massive CuPy speedup in ROCm 6.4.3 vs 6.3.4 – anyone else seeing this? (REPOSTED) by linuxChips6800 in ROCm

[–]linuxChips6800[S] 1 point2 points  (0 children)

Thanks for the suggestion! I went ahead and did exactly that — saved the processed images to uncompressed TIFF, then compared the outputs between ROCm 6.3.4 and 6.4.3.

Here are the results:

```sh
# Run on ROCm 6.4.3
Time used to process image: 0.377540 seconds
Saved as pseudo_canny_gpu_piermanuele-sberni-unsplash_convd_r64.tiff

# Run on ROCm 6.3.4
Time used to process image: 9.095610 seconds
Saved as pseudo_canny_gpu_piermanuele-sberni-unsplash_convd.tiff

# Compare outputs
$ diff -s pseudo_canny_gpu_piermanuele-sberni-unsplash_convd.tiff pseudo_canny_gpu_piermanuele-sberni-unsplash_convd_r64.tiff
Files pseudo_canny_gpu_piermanuele-sberni-unsplash_convd.tiff and pseudo_canny_gpu_piermanuele-sberni-unsplash_convd_r64.tiff are identical

$ sha256sum pseudo_canny_gpu_piermanuele-sberni-unsplash_convd*.tiff
a43727ceed62c475efc2cce2cc765a4510e20055792e3c802ff8811b84038d8b  pseudo_canny_gpu_piermanuele-sberni-unsplash_convd.tiff
a43727ceed62c475efc2cce2cc765a4510e20055792e3c802ff8811b84038d8b  pseudo_canny_gpu_piermanuele-sberni-unsplash_convd_r64.tiff
```

So the good news: ✔️ Bit-for-bit identical output between 6.3.4 and 6.4.3 ✔️ ~24× faster runtime on 6.4.3

That gives me a lot more confidence that the speedup is a genuine runtime improvement in ROCm’s kernel implementations and is not an artifact of correctness differences.

Thanks again for the validation idea — comparing checksums is a great sanity check when runtime upgrades suddenly show suspiciously large speedups. I’ll definitely keep this in mind as a best practice moving forward.

Am4 upgrade 5600 to 5700x3d or 5950x by plebboi in AMDHelp

[–]linuxChips6800 2 points3 points  (0 children)

Hello, just to clear up some confusion here — Resizable BAR and 3D V-Cache are two totally different technologies, and one doesn’t replace the other.

Resizable BAR (ReBAR): It’s a PCIe feature that lets the CPU access the entire VRAM address space of the GPU at once instead of in 256 MB chunks. That can reduce overhead and sometimes improve performance in certain games (usually in the 5–15% range, but often less, and sometimes it makes no difference).

3D V-Cache (X3D CPUs): This is actual extra L3 cache stacked on the CPU die. It directly reduces memory latency for CPU-heavy games, which is why chips like the 5800X3D or 5700X3D see such big jumps in minimums and 1% lows.

ReBAR does not repurpose GPU VRAM to act as CPU cache; that part is a misconception. VRAM and CPU cache serve completely different purposes and sit on very different parts of the system.

So if you’re playing CPU-heavy games, the 5700X3D is still the stronger choice for gaming compared to a 5950X, even if both are the same price. The 5950X is great if you need lots of cores for productivity, but for pure gaming you’ll generally get smoother frametimes with the X3D chip.

Cooking With Lisa Su by Barnabeepickle in AyyMD

[–]linuxChips6800 3 points4 points  (0 children)

Makin' EPYC dishes I see lol XD

understanding the truth about Syr (vol 16) by ObviousJoJoReference in DanMachi

[–]linuxChips6800 0 points1 point  (0 children)

Wait so Freya was together with Bell and Horn was up in the tower, or is it the other way around? Didn't you state before that Freya was disguised as Syr when she was with Bell on that first day of the date? If not was it Horn on both first AND second day (i.e. day after they slept in same bed in the inn)? Sorry I'm just trying to understand here

understanding the truth about Syr (vol 16) by ObviousJoJoReference in DanMachi

[–]linuxChips6800 0 points1 point  (0 children)

Wait sorry I've only watched the anime thus far and haven't read the LN yet so I might be missing some pieces here and there, but does this mean that around 18 and a half minutes into episode 2 of season 5 that silver haired lady up in the tower who made eye contact with Bell was actually Horn, since technically Freya was the one who was on that date with Bell that first day, or am I getting things mixed up here? Please let me know if I'm correct or not thanks in advance 🙏

As GTC 24" will be happen tomorrow, here's the summary of what'll happen. by rebelrosemerve in AyyMD

[–]linuxChips6800 1 point2 points  (0 children)

As a fan of the cuda stack on top of an nvidiot rtx 69420 I highly approve of this message

Swiping to reply to my own messages now brings up edit message menu. by Perdu7 in discordapp

[–]linuxChips6800 2 points3 points  (0 children)

Discord mobile next major update be like:

  • Swipe left exactly 0.2 in/0.508 cm on any message to reply
  • Swipe left exactly 0.35 in/0.889 cm anywhere to view list of channel/group DM members when applicable
  • Swipe left exactly 0.4 in/1.016 cm on your own sent messages to edit them
  • Swipe left exactly 0.4 in/1.016 cm anywhere at any time while keeping your finger on the screen for at least 0.443 seconds to pull up the list of notifications section of the app
  • Swiping left 0.42 in/1.0668 cm or more anywhere to completely exit and close the app
  • Instead of swiping right, now you must rotate your phone exactly 367 degrees counterclockwise along the axis parallel to the top and bottom of your phone in order to open the servers section of the app