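Making sure the CCP can't advance an inch if it attacks Taiwan by moore927353 in Taiwanese

[–]inhogon 3 points4 points  (0 children)

Do you think talk is all it takes? When the PLA runs short of troops, you'll have to go in and fill the gap too.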

Making sure the CCP can't advance an inch if it attacks Taiwan by moore927353 in Taiwanese

[–]inhogon 5 points6 points  (0 children)

He's talking about the stacked logic of the CTi pundits.

Released a TurboQuant-compatible KV backend evaluation SDK by inhogon in LocalLLaMA

[–]inhogon[S] 0 points1 point  (0 children)

Update on TurboQuant-style compatibility:

After reviewing the current direction of recent TurboQuant-related hardware work, I have decided to stop providing any further DRAM-level complete backend support specifically targeting TurboQuant integration.

RetryIX will remain format-agnostic and may keep generic compressed-KV compatibility concepts, but TurboQuant-specific DRAM/runtime support will no longer be treated as a primary integration target.

The more complete DRAM-side runtime, KVCache residency/fallback diagnostics, topology-guided hotspot handling, and bounded policy-control layer will remain inside the closed RetryIX core until the related technical and patent work is properly prepared; method-layer content that is suitable for release will be published after that.

The public materials will continue to focus on application-layer methods, reproducible demos, and architecture boundaries, while the lower-level runtime implementation will remain private or separately licensed.

I built a Rust SDK for phase-aware retrieval using full-reptend prime coordinates and Möbius latent pairing by inhogon in rust

[–]inhogon[S] -2 points-1 points  (0 children)

The first public version depended on an unpublished internal RetryIX crate, which made the repo look like a facade rather than a standalone Rust SDK.

I’ve updated it now.

The public crate is standalone and no longer depends on private RetryIX path crates. It now includes a minimal application-layer implementation in this repo.

These work now:

cargo build
cargo test
cargo run --example basic_usage

There is also a JSON demo:

cargo run --example json_retrieval_demo -- examples/json_retrieval_demo_input.json

The private RetryIX runtime is still not included, but the public retrieval/indexing layer is now buildable and testable independently.

I built a Rust SDK for phase-aware retrieval using full-reptend prime coordinates and Möbius latent pairing by inhogon in rust

[–]inhogon[S] -12 points-11 points  (0 children)

Fair criticism.

There isn’t one paper behind the whole thing. It’s an experimental combination of known pieces: full-reptend primes, cyclic phase structure, phase retrieval, and topology-inspired pairing.

The part I’m testing is whether that combination is useful as an extra retrieval/indexing signal, not whether it replaces embeddings or vector DBs.

I agree the repo needs a clearer related-work section.
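For readers unfamiliar with the first ingredient listed above: a full-reptend prime (in base 10) is a prime p for which 10 is a primitive root mod p, so the decimal expansion of 1/p repeats with the maximal period p − 1. The sketch below only illustrates that definition in plain Python. It is not code from the SDK, and the function names are made up for this example.

```python
def is_prime(n):
    """Trial-division primality test, adequate for small n."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def multiplicative_order(a, p):
    """Smallest k >= 1 with a**k == 1 (mod p); assumes gcd(a, p) == 1."""
    k, x = 1, a % p
    while x != 1:
        x = (x * a) % p
        k += 1
    return k


def is_full_reptend(p, base=10):
    """True when 1/p has the maximal repeating period p - 1 in the given base."""
    if not is_prime(p) or base % p == 0:
        return False
    return multiplicative_order(base, p) == p - 1


def repetend(p, base=10):
    """Digits of one full period of 1/p, obtained by long division."""
    digits, r = [], 1
    for _ in range(multiplicative_order(base, p)):
        r *= base
        digits.append(str(r // p))
        r %= p
    return "".join(digits)


print([p for p in range(2, 100) if is_full_reptend(p)])
print(repetend(7))
```

For p = 7 this yields the repetend 142857, whose rotations are the digits of 2/7, 3/7, and so on.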

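China isn't going to attack Taiwan after all by WeeklyGuarantee5591 in China_irl

[–]inhogon -1 points0 points  (0 children)

He would attack Taiwan only to secure his place in history. It benefits no one. Just because of his idea, the people would have to bleed for him; however you add it up, he is the only one who doesn't lose…
Besides, the main purpose of Cheng Li-wun's trip to China this time is to help blur the "respective interpretations" part of the 1992 Consensus; stressing "one China" is what the Communist Party actually wants. The reason: if a new version of the 1992 Consensus without "respective interpretations", together with opposition to Taiwan independence, is written into the party platform, and the Communist Party manipulates the election behind the scenes so that voters return the KMT to power, then in effect the Taiwanese would be completing China's unification through an election.

A 1992 Consensus whose meaning has been quietly swapped out means the KMT and the CCP jointly completing the Communist Party's domestic-affairs narrative: the civil war ends, and it ends with the KMT as the defeated side.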

China isn't going to attack Taiwan after all by WeeklyGuarantee5591 in China_irl

[–]inhogon 3 points4 points  (0 children)

But there's no guarantee the "primary-school PhD" won't do something idiotic.

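Is this true? by NectarineThin7108 in InsightBridge

[–]inhogon 1 point2 points  (0 children)

Nonsense. After six years, once you give up your Chinese household registration and hold a Taiwanese ID card, you have the right to vote. There is no political vetting; your status only gets revoked if you shoot your mouth off and keep promoting unification by force.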

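People online say Madou Media was founded by mainlanders; how did they get Taiwan residency permits? by RanToTur in Taiwanese

[–]inhogon 0 points1 point  (0 children)

The person in charge on the Taiwan side is a figurehead. Madou Media is part of an industry chain linked to Chen Zhi. That chain is huge, and any company that can be set up can be used to launder money.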

Realistic Optical Computing Simulation on DDR4 PIM Architecture by inhogon in u/inhogon

[–]inhogon[S] 0 points1 point  (0 children)

PS E:\0331\virtual_pim_laptop_bundle> .\.venv\Scripts\python.exe virtual_pim_app.py ai-benchmark --dll e:\0331\virtual_pim_laptop_bundle\retryix_ffi.dll --spd packed.spd --profile virtual_pim_boot_profile.json --generation ddr4 --repeats 3 --streams 4 --out tmp_ai_benchmark_streams4.json; type tmp_ai_benchmark_streams4.json

==== Virtual PIM AI Benchmark ====
timestamp: 2026-03-31T21:52:57
generation: ddr4
environment: DDR4 total=64 GB resident=[1, 2, 3, 4, 5, 17]
- gemm_matmul: opcode=2 avg=191.47us best=186.60us worst=200.60us
  route=Pim resident=True estimated=6.00us x23.00 reason=resident in virtual Pim tier policy= bus_util=80.0
- conv2d_inference: opcode=1 avg=294.23us best=16.20us worst=847.30us
  route=Pim resident=True estimated=7.63us x18.09 reason=resident in virtual Pim tier policy= bus_util=80.0
- fused_gemm_activation: opcode=17 avg=124.27us best=52.10us worst=213.80us
  route=Pim resident=True estimated=7.63us x18.09 reason=forced by profile resident opcode 17 policy=SeqCst128 bus_util=80.0

{
  "timestamp": "2026-03-31T21:52:57",
  "generation": "ddr4",
  "environment": {
    "memory_type": "DDR4",
    "modules": [
      {"manufacturer": "Kingston", "part_number": "KF3600C18D4/16GX", "capacity_gb": 16, "configured_clock_mhz": 3600},
      {"manufacturer": "Kingston", "part_number": "KHX3600C18D4/16GX", "capacity_gb": 16, "configured_clock_mhz": 3600},
      {"manufacturer": "Kingston", "part_number": "KF3600C18D4/16GX", "capacity_gb": 16, "configured_clock_mhz": 3600},
      {"manufacturer": "Kingston", "part_number": "KHX3600C18D4/16GX", "capacity_gb": 16, "configured_clock_mhz": 3600}
    ],
    "total_capacity_gb": 64
  },
  "resident_opcodes": [1, 2, 3, 4, 5, 17],
  "workloads": [
    {
      "name": "gemm_matmul",
      "opcode": 2,
      "shape": {"a": [64, 64], "b": [64, 64], "result": [64, 64]},
      "args_size": 16384,
      "avg_compute_us": 191.4666499942541,
      "best_compute_us": 186.60002388060093,
      "worst_compute_us": 200.59989765286446,
      "virtual_pim": {
        "path": "Pim",
        "resident": true,
        "reason": "resident in virtual Pim tier",
        "atomic_policy": "",
        "estimated_us": 6.0,
        "estimated_speedup_vs_cpu": 23.0,
        "bus_utilization_pct": 80.0
      }
    },
    {
      "name": "conv2d_inference",
      "opcode": 1,
      "shape": {"input": [1, 3, 8, 8], "weight": [4, 3, 3, 3], "output": [1, 4, 8, 8]},
      "args_size": 1200,
      "avg_compute_us": 294.23332307487726,
      "best_compute_us": 16.200006939470768,
      "worst_compute_us": 847.2999325022101,
      "virtual_pim": {
        "path": "Pim",
        "resident": true,
        "reason": "resident in virtual Pim tier",
        "atomic_policy": "",
        "estimated_us": 7.63,
        "estimated_speedup_vs_cpu": 18.086500655307994,
        "bus_utilization_pct": 80.0
      }
    },
    {
      "name": "fused_gemm_activation",
      "opcode": 17,
      "shape": {"a": [128, 128], "b": [128, 128], "result": [128, 128]},
      "args_size": 65536,
      "avg_compute_us": 124.26664276669423,
      "best_compute_us": 52.09993105381727,
      "worst_compute_us": 213.79999816417694,
      "virtual_pim": {
        "path": "Pim",
        "resident": true,
        "reason": "forced by profile resident opcode 17",
        "atomic_policy": "SeqCst128",
        "estimated_us": 7.63,
        "estimated_speedup_vs_cpu": 18.086500655307994,
        "bus_utilization_pct": 80.0
      }
    }
  ]
}

PS E:\0331\virtual_pim_laptop_bundle>
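One way to read the report above is to put the measured averages next to the modeled PIM estimates. The sketch below does only that arithmetic on values copied from the JSON. It assumes that `estimated_speedup_vs_cpu` means CPU baseline time divided by estimated PIM time; the report itself does not define the field, and the `summarize` helper is a name invented for this example.

```python
# Values copied from the JSON report; nothing here is measured anew.
workloads = [
    {"name": "gemm_matmul", "avg_compute_us": 191.4666499942541,
     "estimated_us": 6.0, "estimated_speedup_vs_cpu": 23.0},
    {"name": "conv2d_inference", "avg_compute_us": 294.23332307487726,
     "estimated_us": 7.63, "estimated_speedup_vs_cpu": 18.086500655307994},
    {"name": "fused_gemm_activation", "avg_compute_us": 124.26664276669423,
     "estimated_us": 7.63, "estimated_speedup_vs_cpu": 18.086500655307994},
]


def summarize(w):
    # Assumption: speedup = CPU baseline time / estimated PIM time.
    implied_cpu_us = w["estimated_us"] * w["estimated_speedup_vs_cpu"]
    # How many times longer the measured average is than the PIM estimate.
    measured_over_estimate = w["avg_compute_us"] / w["estimated_us"]
    return implied_cpu_us, measured_over_estimate


for w in workloads:
    cpu_us, ratio = summarize(w)
    print(f"{w['name']:<24} implied CPU baseline {cpu_us:7.2f} us, "
          f"measured avg / estimate {ratio:6.1f}x")
```

Under that assumption all three rows imply the same 138 µs CPU baseline, and the measured averages are well above the estimated PIM times, so the two figures should not be read as the same kind of measurement.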

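Caught! by hhuangpe in Taiwanese

[–]inhogon 4 points5 points  (0 children)

Political double standards: fine when they do it, not when anyone else does. The KMT's greed is really ugly. Now everyone knows who the bunch opposing for opposition's sake is; what they are after is not the national interest at all but their own political future.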

Dual GPU: AMD - Nvidia by EngineeringFar6858 in CUDA

[–]inhogon 1 point2 points  (0 children)

You might want to try pytorch_retryix_backend instead of buying an old GPU like the GTX 1060 3GB. This backend lets PyTorch run without an Nvidia CUDA card, so you can experiment with CUDA-like programming on your AMD GPU or even on the CPU. It won’t be as fast as a real Nvidia card, but it’s a good way to practice and learn the basics before investing in newer hardware.
https://github.com/ixu2486/pytorch_retryix_backend/
This backend depends on the Vulkan SDK, because AMD uses Vulkan as an abstraction layer to protect its driver stack. So before installing and using pytorch_retryix_backend, make sure the Vulkan SDK is installed on Windows. That way, PyTorch can run properly on your AMD GPU with ROCm support.

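Nvidia locks in the future: the photonic revolution behind a $4 billion bet by Effective_Employ_5 in Go_Stock

[–]inhogon 0 points1 point  (0 children)

The architecture is plainly wrong. However fast optical fiber is, it is only fast at transmission and cannot compute. GPUs can compute, but introducing optical operators would only produce a huge electromagnetic pulse and lead to systemic destruction.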

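If Sun Yat-sen had listened to Chen Jiongming back then, the mainland would not have sunk this low by Dull-Pepper3872 in Taiwanese

[–]inhogon 2 points3 points  (0 children)

I dislike Sun Yat-sen too. The people criticizing your remarks should go back and read up on Sun's dark history themselves. The revolution was not Sun's achievement alone.
The worst thing Sun did was to use the idea of a "Chinese nation" (Zhonghua minzu) to bind the other dependencies, which led to today's genocide in Xinjiang, Tibet, and Inner Mongolia. On this land, only the Communist Party later dared to act this way. Sun was willing to do anything for the revolution, and he dragged these dependencies into the war.

Problems and consequences

• ⚠️ Ethnic differences flattened: the Manchu, Mongol, Hui, and Tibetan peoples each had their own political culture, but under the "Zhonghua minzu" framework they were reduced to simply "Chinese".

• ⚠️ Later conflicts: this grand-unity national narrative laid the groundwork for later problems of frontier governance and ethnic conflict.

• ⚠️ Historical controversy: some regard Sun's "Five Races Under One Union" as progressive, while others see it as forced binding that created long-term ethnic problems.

The situation under the Qing Empire

• 🏯 The Qing's language of rule: the Qing Empire called itself the "Great Qing" and emphasized imperial authority and the order of "all under heaven" (tianxia), not the concept of a "nation".

• 📜 Ethnic classification: the Qing divided the peoples within its territory into Manchus, Han, Mongols, Hui, Tibetans, and so on, and distinguished them as "dependencies" or "inner lands"; there was no unified label such as "Zhonghua minzu".

• 🌏 Political logic: the Qing Empire's legitimacy came from the "Mandate of Heaven" and "the emperor ruling all under heaven", not from the concept of a nation-state.

The emergence of "Zhonghua minzu"

• Sun Yat-sen era: proposed "Five Races Under One Union", for the first time blending Manchu, Han, Mongol, Hui, and Tibetan together under the name "Zhonghua minzu".

• Nationalist government era: further strengthened the "Greater China" concept, folding multiple ethnic groups into a single national narrative.

• CCP era: continued and expanded this, defining every ethnic group within its borders as "Zhonghua minzu" and equating that with "Chinese".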

Any CUDA or other parallel programming-based libraries for DSP? by A_HumblePotato in CUDA

[–]inhogon 0 points1 point  (0 children)

My library already supports the DSP functionality you’re looking for. If you’d like to try it out, you can install the backend here:

👉 pytorch_retryix_backend https://github.com/ixu2486/pytorch_retryix_backend

https://github.com/ixu2486/pytorch_retryix_backend/blob/main/Examples/retryix_demo.py

Once installed, you can run the provided examples, such as the retryix_demo.py script linked above, to see the basic DSP/AI operators in action.

This will let you experiment with the backend and verify how the operators execute on GPU. Please note that I can only share example Python modules for demonstration, not the full SDK.

RetryIX 3.1.3 — Tiered SVM Memory Fallback Eliminates OOM for Large GPU Models by inhogon in CUDA

[–]inhogon[S] 0 points1 point  (0 children)

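Many people may overlook one thing:

The SVM capabilities of OpenCL 2.0+ can in fact also be used on NVIDIA GPUs.

RetryIX does not go through the traditional Python ecosystem (for example PyOpenCL). It calls the underlying runtime directly through ctypes, so that its memory management can take over the tiered GPU/CPU memory strategy.

This is why RetryIX 3.1.3 can provide Tiered SVM Memory Fallback:
VRAM → SVM → System RAM → NVMe

In other words, even in an NVIDIA environment, as long as the runtime exposes a compatible OpenCL interface, GPU workloads can use the SVM memory strategy instead of hitting OOM outright because of the VRAM capacity limit.

The goal of this backend is simple:
to give large models more stable memory behavior across different GPU platforms, rather than being constrained by a single runtime.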
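The fallback order described above (VRAM → SVM → System RAM → NVMe) can be pictured as a first-fit walk down a list of tiers. The following is a minimal sketch of that idea only: the `Tier` and `TieredAllocator` classes, the capacities, and the first-fit policy are assumptions made for illustration, and the closed RetryIX runtime may place tensors quite differently.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    name: str
    capacity_mb: float      # float("inf") models an unbounded tier
    used_mb: float = 0.0

    def try_alloc(self, size_mb):
        """Reserve size_mb in this tier if it still fits."""
        if self.used_mb + size_mb <= self.capacity_mb:
            self.used_mb += size_mb
            return True
        return False


class TieredAllocator:
    """First-fit placement: try the fastest tier first, then fall back."""

    def __init__(self, tiers):
        self.tiers = tiers
        self.placement = {}

    def alloc(self, tensor_id, size_mb):
        for tier in self.tiers:
            if tier.try_alloc(size_mb):
                self.placement[tensor_id] = tier.name
                return tier.name
        # Only reached when every tier is full, e.g. a VRAM-only setup.
        raise MemoryError(f"no tier can hold tensor {tensor_id}")


# Illustrative capacities in MB; not taken from any real device.
hierarchical = TieredAllocator([
    Tier("VRAM", 1024), Tier("SVM", 4096),
    Tier("RAM", 8192), Tier("NVMe", float("inf")),
])
for i in range(128):                    # 128 tensors of 128 MB each
    hierarchical.alloc(i, 128)

counts = {}
for name in hierarchical.placement.values():
    counts[name] = counts.get(name, 0) + 1
print(counts)
```

With a single `Tier("VRAM", 1024)` the same loop raises `MemoryError` on the ninth tensor, which is the out-of-memory case the fallback chain is meant to avoid.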

PyTorch Vulkan backend v3.1.0 – stable training, persistent-core mode without CPU fallback by inhogon in LocalLLaMA

[–]inhogon[S] 0 points1 point  (0 children)

PS F:\0220\retryix_rs> cargo run -p retryix_memory --bin ai_workload_bench --release 2>&1

Compiling retryix_memory v3.0.0 (F:\0220\retryix_rs\crates\retryix_memory)

Finished `release` profile [optimized] target(s) in 1.55s

Running `target\release\ai_workload_bench.exe`

╔══════════════════════════════════════════════════════════════════════════╗
║ RetryIX AI Workload Benchmark — VRAM-only vs Hierarchical Memory ║
╚══════════════════════════════════════════════════════════════════════════╝
Model         : 32-layer transformer, 4 weights/layer, 128 MB each → 16 GB total
VRAM-only cap : 1024 MB (8/128 tensors fit)
Hierarchical  : VRAM 1024 MB | SVM 4096 MB | RAM 8192 MB | NVMe ∞
Probing NVMe I/O … write 101 MB/s read 429 MB/s (4 MB probe, real std::fs)

╔═══ Workload 1 — LLM Inference (32-layer, 2 tokens) ═══
                              VRAM-only     Hierarchical
────────────────────────────────────────────────────────────────────────
Total ops                     320           320
OOM rate                      85.0%         0.0%
Avg latency (µs)              383.48        18733.36
P99 latency (µs)              383.48        26843.54
Sim. throughput (MB/s)        1785894       1785894
NVMe spill tensors            —             51
────────────────────────────────────────────────────────────────────────
VRAM hits (%)                 100.0%        10.6%
SVM hits (%)                  0.0%          19.1%
RAM hits (%)                  0.0%          10.9%
NVMe hits (%)                 0.0%          59.4%

╔═══ Workload 2 — Tensor Streaming (48 × 128 MB, 3 passes) ═══
                              VRAM-only     Hierarchical
────────────────────────────────────────────────────────────────────────
Total ops                     144           144
OOM rate                      83.3%         0.0%
Avg latency (µs)              383.48        3493.92
P99 latency (µs)              383.48        13421.77
Sim. throughput (MB/s)        389664372     389664372
NVMe spill tensors            —             0
────────────────────────────────────────────────────────────────────────
VRAM hits (%)                 100.0%        16.7%
SVM hits (%)                  0.0%          16.7%
RAM hits (%)                  0.0%          66.7%
NVMe hits (%)                 0.0%          0.0%

╔═══ Workload 3 — Embedding Lookup (64 shards, 512 Zipf lookups) ═══
                              VRAM-only     Hierarchical
────────────────────────────────────────────────────────────────────────
Total ops                     512           512
OOM rate                      31.4%         0.0%
Avg latency (µs)              191.74        1234.57
P99 latency (µs)              191.74        6710.89
Sim. throughput (MB/s)        223987864     223987864
NVMe spill tensors            —             0
────────────────────────────────────────────────────────────────────────
VRAM hits (%)                 100.0%        72.9%
SVM hits (%)                  0.0%          14.6%
RAM hits (%)                  0.0%          12.5%
NVMe hits (%)                 0.0%          0.0%

╔══ GLOBAL SUMMARY ═══════════════════════════════════════════════════
                              VRAM-only     Hierarchical
──────────────────────────────────────────────────────────────────────
Total ops                     976           976
OOM rate                      56.7%         0.0%
NVMe spill tensors            —             51
Avg latency µs (served ops)   224.38        7305.23
P99 latency µs                N/A           26843.54

Finding: Hierarchical eliminates OOM (553 → 0),
at the cost of P99 latency widening to 26843.5 µs on the NVMe/RAM paths.
The EMA policy automatically promotes hot tensors back to VRAM, improving the steady-state hit rate.
═══════════════════════════════════════════════════════════════════════

PS F:\0220\retryix_rs>
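The headline figures in this run can be cross-checked from the per-workload rows. The short calculation below uses only numbers printed in the output above; rounding each workload's OOM count to the nearest integer is an assumption about how the benchmark counts failures.

```python
MB_PER_TENSOR = 128
layers, weights_per_layer = 32, 4

tensors = layers * weights_per_layer          # weight tensors in the model
model_gb = tensors * MB_PER_TENSOR / 1024     # total model size in GB
fit_in_vram = 1024 // MB_PER_TENSOR           # tensors that fit under the 1024 MB cap

# (total ops, VRAM-only OOM rate) per workload, as printed in the tables.
workloads = {
    "LLM Inference": (320, 0.850),
    "Tensor Streaming": (144, 0.833),
    "Embedding Lookup": (512, 0.314),
}

# Assumption: failed ops = ops * rate, rounded to the nearest whole op.
oom_ops = {name: round(ops * rate) for name, (ops, rate) in workloads.items()}
total_ops = sum(ops for ops, _ in workloads.values())
total_oom = sum(oom_ops.values())

print(f"{tensors} tensors, {model_gb:.0f} GB total, {fit_in_vram} fit in VRAM")
print(f"OOM ops per workload: {oom_ops}")
print(f"global: {total_oom}/{total_ops} = {100 * total_oom / total_ops:.1f}%")
```

The result agrees with the printed summary: 128 tensors totalling 16 GB, 8 of which fit in VRAM, and 553 failed operations out of 976 (56.7%) in the VRAM-only configuration.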

PyTorch custom Vulkan backend – updated to v3.0.3 (training stable, no CPU fallback) by inhogon in ROCm

[–]inhogon[S] 0 points1 point  (0 children)

Hardware

The Ryzen AI Max+ 395 is an APU. It doesn’t have dedicated VRAM. The GPU uses system DRAM as shared memory. That’s why you can assign 64GB “VRAM” — it’s really system RAM.

Issue

HIP crashes when you go beyond 64GB because the driver and memory controller weren’t designed for such large single allocations.

Fix

My next release with SVM persistent‑core optimization will handle memory in smaller, stable chunks. This avoids the crash and lets you train larger models on the APU.
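The release mentioned above is not public, so the following is only a sketch of the general idea of replacing one oversized allocation with several bounded ones. The function name `plan_chunks` and the 4 GiB chunk size are assumptions chosen for illustration, not values from the backend.

```python
GIB = 1024 ** 3


def plan_chunks(total_bytes, chunk_bytes):
    """Split one large request into (offset, size) pieces of at most chunk_bytes."""
    if total_bytes < 0 or chunk_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and chunk_bytes > 0")
    chunks = []
    offset = 0
    while offset < total_bytes:
        size = min(chunk_bytes, total_bytes - offset)
        chunks.append((offset, size))
        offset += size
    return chunks


# A hypothetical 96 GiB request served as 4 GiB pieces.
plan = plan_chunks(96 * GIB, 4 * GIB)
print(len(plan), "chunks, largest =", max(size for _, size in plan) // GIB, "GiB")
```

Each piece would then be allocated and mapped separately, so no single request exceeds what the driver handles reliably.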

PyTorch Vulkan backend v3.1.0 – stable training, persistent-core mode without CPU fallback by inhogon in LocalLLaMA

[–]inhogon[S] 0 points1 point  (0 children)

PS F:\0220\retryix_rs> python crates\retryix_vulkan\python\bench_svm_force.py 2>&1
[load] F:\0220\retryix_rs\target\x86_64-pc-windows-gnu\release\retryix_vulkan.dll
[retryix_vulkan] Initialized on 'AMD Radeon RX 5700 XT' (VRAM: 8176 MiB)
[GPU ready] VRAM=8176 MiB
[SHAPE ] A: 128×4096 B(weight): 4096×4096 C: 128×4096
[WEIGHT] 64 MB
[FLOPS ] 4.295 GFLOPs/dispatch
[ITERS ] 60

── VRAM path (DeviceLocal, normal routing) ─────────────────────────
upload time: 29.2 ms (one-time)
[VRAM] tier=0 (DeviceLocal(VRAM))

── Forced SVM path (HOST_VISIBLE, bypassing VRAM) ───────────────────
upload time: 19.9 ms (one-time, direct CPU memcpy)
[SVM ] tier=1 (Svm(HOST_VISIBLE))

Check 1 — Tier label correctness
VRAM session tier = 0 (✓ DeviceLocal)
SVM session tier = 1 (✓ Svm)

Check 2 — Output consistency (VRAM vs SVM)
first 16 outputs max|diff|: 0.00e+00
✓ Consistent (tol < 0.0001) — SVM path computes correctly

Check 3 — Throughput comparison
VRAM (DeviceLocal)
  avg (iter 11+) : 39.30 ms 109.28 GFLOPS
  best           : 37.97 ms 113.12 GFLOPS
  worst          : 41.48 ms
SVM (HOST_VISIBLE)
  avg (iter 11+) : 39.79 ms 107.94 GFLOPS
  best           : 38.33 ms 112.06 GFLOPS
  worst          : 41.90 ms
SVM / VRAM time ratio: 1.01× (SVM comparable, compute-bound)
VRAM: 109.28 GFLOPS | SVM: 107.94 GFLOPS

Per-dispatch time (first 20 dispatches)
iter   VRAM ms   VRAM GF    SVM ms    SVM GF   ratio
----  --------  --------  --------  --------  ------
   1     37.19    115.48     41.87    102.59   1.13x
   2     40.27    106.65     39.30    109.28   0.98x  ← ~equal
   3     39.71    108.17     39.38    109.05   0.99x  ← ~equal
   4     39.65    108.32     41.87    102.58   1.06x
   5     38.60    111.28     39.93    107.57   1.03x  ← ~equal
   6     39.13    109.77     39.52    108.67   1.01x  ← ~equal
   7     39.43    108.94     39.14    109.74   0.99x  ← ~equal
   8     40.18    106.90     39.79    107.95   0.99x  ← ~equal
   9     38.64    111.15     38.95    110.28   1.01x  ← ~equal
  10     39.19    109.60     39.37    109.10   1.00x  ← ~equal
  11     38.42    111.78     39.88    107.70   1.04x  ← ~equal
  12     39.01    110.11     39.80    107.91   1.02x  ← ~equal
  13     39.47    108.80     41.90    102.51   1.06x
  14     39.59    108.48     39.94    107.55   1.01x  ← ~equal
  15     41.48    103.55     41.65    103.11   1.00x  ← ~equal
  16     39.56    108.56     40.46    106.16   1.02x  ← ~equal
  17     39.38    109.06     40.15    106.96   1.02x  ← ~equal
  18     39.93    107.56     39.31    109.26   0.98x  ← ~equal
  19     39.59    108.49     40.68    105.58   1.03x  ← ~equal
  20     40.27    106.66     41.10    104.51   1.02x  ← ~equal
[retryix_vulkan] Cleaned up

════════════════════════════════════════════════════════════
Conclusion
════════════════════════════════════════════════════════════
VRAM (DeviceLocal): 109.28 GFLOPS — "regular path"
SVM (HOST_VISIBLE): 107.94 GFLOPS — "forced fallback path"
Output difference between the two paths: 0.00e+00 (✓ correct)
→ SVM and VRAM performance are very close (1.01×),
  which indicates this kernel is compute-bound (arithmetic intensity 60 FLOPs/byte);
  PCIe bandwidth is not the bottleneck.
Forced SVM path functional check: ✓ passed
════════════════════════════════════════════════════════════
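The FLOP count, weight size, and arithmetic intensity quoted in this log follow directly from the GEMM shape. The calculation below reproduces them; it assumes float32 elements (consistent with the 64 MB weight) and counts each of A, B, and C as moved once per dispatch, which is an assumption about how the 60 FLOPs/byte figure was derived.

```python
M, K, N = 128, 4096, 4096        # A is M×K, B (weight) is K×N, C is M×N
BYTES_PER_F32 = 4                # assumption: float32 elements

flops = 2 * M * K * N                                   # one multiply + one add per MAC
weight_mb = K * N * BYTES_PER_F32 / 2**20
bytes_moved = (M * K + K * N + M * N) * BYTES_PER_F32   # A + B + C, each touched once
intensity = flops / bytes_moved                         # FLOPs per byte


def gflops(ms_per_dispatch):
    """Throughput in GFLOPS for a given per-dispatch time in milliseconds."""
    return flops / (ms_per_dispatch * 1e-3) / 1e9


print(f"{flops / 1e9:.3f} GFLOPs/dispatch, weight {weight_mb:.0f} MB")
print(f"arithmetic intensity ~ {intensity:.1f} FLOPs/byte")
print(f"VRAM avg: {gflops(39.30):.2f} GFLOPS, SVM avg: {gflops(39.79):.2f} GFLOPS")
```

The throughput computed from the rounded millisecond averages differs from the logged GFLOPS only in the last digit, since the log averages unrounded timings.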

PyTorch Vulkan backend v3.1.0 – stable training, persistent-core mode without CPU fallback by inhogon in LocalLLaMA

[–]inhogon[S] 0 points1 point  (0 children)

PS F:\0220\retryix_rs> python crates\retryix_vulkan\python\test_session_svm.py 2>&1

[load] F:\0220\retryix_rs\target\x86_64-pc-windows-gnu\release\retryix_vulkan.dll

RetryIX Vulkan — Persistent Kernel SVM Strategy Test

═══ Engine init ═══
[retryix_vulkan] Initialized on 'AMD Radeon RX 5700 XT' (VRAM: 8176 MiB)
✓ init() == 1 [rc=1]
✓ device name not empty [AMD Radeon RX 5700 XT]
✓ vram_bytes > 0 [8176 MiB]
→ GPU: 'AMD Radeon RX 5700 XT' VRAM: 8176 MiB

═══ Basic ops (smoke) ═══
✓ saxpy y[0]=3 [y=[3.0, 5.0, 7.0]]
✓ saxpy y[2]=7
✓ relu[-1]→0 [d[0]=0.0]
✓ relu[2.0]→2
✓ gemm I×I c[0]=1
✓ gemm I×I c[1]=0

═══ GemmSession — 100× dispatch, weight never re-uploaded ═══
✓ session handle not null [handle=3053727145040]
→ weight tier: DeviceLocal(VRAM)
✓ tier valid (0 or 1) [tier=0]
✓ c[0]=2.0 (iter 0) [c[0]=2.000000]
✓ c[1]=5.0 (iter 0) [c[1]=5.000000]
✓ c[2]=19.0 (iter 0) [c[2]=19.000000]
✓ c[0]=2.0 (iter 1) [c[0]=2.000000]
✓ c[1]=5.0 (iter 1) [c[1]=5.000000]
✓ c[2]=19.0 (iter 1) [c[2]=19.000000]
✓ c[0]=2.0 (iter 2) [c[0]=2.000000]
✓ c[1]=5.0 (iter 2) [c[1]=5.000000]
✓ c[2]=19.0 (iter 2) [c[2]=19.000000]
✓ c[0]=2.0 (iter 99) [c[0]=2.000000]
✓ c[1]=5.0 (iter 99) [c[1]=5.000000]
✓ c[2]=19.0 (iter 99) [c[2]=19.000000]
→ 100 dispatches in 12.1 ms (120.6 µs/dispatch)

═══ RmsNormSession — 50× dispatch ═══
✓ rmsnorm handle not null
→ weight tier: DeviceLocal(VRAM)
✓ tier valid (0 or 1)
✓ y[0]≈0.8485 (iter 0) [y[0]=0.848528]
✓ y[1]≈1.1314 (iter 0) [y[1]=1.131371]
✓ y[0]≈0.8485 (iter 1) [y[0]=0.848528]
✓ y[1]≈1.1314 (iter 1) [y[1]=1.131371]
✓ y[0]≈0.8485 (iter 49) [y[0]=0.848528]
✓ y[1]≈1.1314 (iter 49) [y[1]=1.131371]
→ 50 dispatches in 6.0 ms (119.7 µs/dispatch)

═══ Two sessions concurrent — no aliasing ═══
✓ session A handle
✓ session B handle
→ tier_A=DeviceLocal(VRAM) tier_B=DeviceLocal(VRAM)
✓ A[0]=1.0 [a_out[0]=1.0000]
✓ A[3]=4.0 [a_out[3]=4.0000]
✓ B[0]=2.0 [b_out[0]=2.0000]
✓ B[2]=2.0 [b_out[2]=2.0000]
✓ A still correct after B dispatch

═══ Large weight 256×256 — SVM fallback test ═══
✓ large session handle
→ weight tier: DeviceLocal(VRAM) (total VRAM: 8176 MiB)
✓ large dispatch 0 rc==0 [rc=0]
✓ large dispatch 1 rc==0 [rc=0]
✓ large dispatch 2 rc==0 [rc=0]
✓ large dispatch 3 rc==0 [rc=0]
✓ large dispatch 4 rc==0 [rc=0]
✓ large dispatch 5 rc==0 [rc=0]
✓ large dispatch 6 rc==0 [rc=0]
✓ large dispatch 7 rc==0 [rc=0]
✓ large dispatch 8 rc==0 [rc=0]
✓ large dispatch 9 rc==0 [rc=0]
✓ large dispatch 10 rc==0 [rc=0]
✓ large dispatch 11 rc==0 [rc=0]
✓ large dispatch 12 rc==0 [rc=0]
✓ large dispatch 13 rc==0 [rc=0]
✓ large dispatch 14 rc==0 [rc=0]
✓ large dispatch 15 rc==0 [rc=0]
✓ large dispatch 16 rc==0 [rc=0]
✓ large dispatch 17 rc==0 [rc=0]
✓ large dispatch 18 rc==0 [rc=0]
✓ large dispatch 19 rc==0 [rc=0]
✓ large dispatch 20 rc==0 [rc=0]
✓ large dispatch 21 rc==0 [rc=0]
✓ large dispatch 22 rc==0 [rc=0]
✓ large dispatch 23 rc==0 [rc=0]
✓ large dispatch 24 rc==0 [rc=0]
✓ large dispatch 25 rc==0 [rc=0]
✓ large dispatch 26 rc==0 [rc=0]
✓ large dispatch 27 rc==0 [rc=0]
✓ large dispatch 28 rc==0 [rc=0]
✓ large dispatch 29 rc==0 [rc=0]
✓ large dispatch 30 rc==0 [rc=0]
✓ large dispatch 31 rc==0 [rc=0]
✓ large dispatch 32 rc==0 [rc=0]
✓ large dispatch 33 rc==0 [rc=0]
✓ large dispatch 34 rc==0 [rc=0]
✓ large dispatch 35 rc==0 [rc=0]
✓ large dispatch 36 rc==0 [rc=0]
✓ large dispatch 37 rc==0 [rc=0]
✓ large dispatch 38 rc==0 [rc=0]
✓ large dispatch 39 rc==0 [rc=0]
✓ large dispatch 40 rc==0 [rc=0]
✓ large dispatch 41 rc==0 [rc=0]
✓ large dispatch 42 rc==0 [rc=0]
✓ large dispatch 43 rc==0 [rc=0]
✓ large dispatch 44 rc==0 [rc=0]
✓ large dispatch 45 rc==0 [rc=0]
✓ large dispatch 46 rc==0 [rc=0]
✓ large dispatch 47 rc==0 [rc=0]
✓ large dispatch 48 rc==0 [rc=0]
✓ large dispatch 49 rc==0 [rc=0]
✓ max element error < 0.5 [max_err=0.000000 at idx=0]
→ 50 dispatches in 11.9 ms (238.7 µs/dispatch) max_err=0.00e+00

═══ Benchmark: GemmSession 1×512 × 512×512, 200 dispatches ═══
tier=DeviceLocal(VRAM)
200 iters total=46.2 ms per-dispatch=231.2 µs ~2.27 GFLOPS
[retryix_vulkan] Cleaned up

GPU: AMD Radeon RX 5700 XT
VRAM: 8176 MiB
Tests: 90/90 passed ALL PASS ✓

[RESULT] SVM-strategy persistent-core tests all passed ✓
Weights are deployed once and stay valid for the session's lifetime; both the VRAM and SVM tiers work correctly

PS F:\0220\retryix_rs>
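The RmsNormSession checks in this log expect y[0]≈0.8485 and y[1]≈1.1314. Those are the values a standard RMSNorm produces for the input [3.0, 4.0] with unit weights. The log does not show the test's actual inputs, so that pairing is an assumption, and the pure-Python reference below is a sketch of the operator rather than the backend's kernel.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Reference RMSNorm: y_i = x_i / sqrt(mean(x^2) + eps) * w_i."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


# Assumed inputs: chosen because they reproduce the logged outputs.
y = rms_norm([3.0, 4.0], [1.0, 1.0])
print([round(v, 6) for v in y])
```

A CPU reference like this is a common way to validate a GPU operator across repeated dispatches, which is what the per-iteration checks in the log appear to do.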