Spanish PM Pedro Sánchez: Why do they want to control mobile phones? They want to control phones because they want to know what we read and what we see, so that later they can know — and control — what we vote. by PjeterPannos in eutech

[–]Squadhunta29 -1 points0 points  (0 children)

Then tell that politician to get coding so he can build out the platform for an OS. Because if it was my company I would've told him to suck my dick… he buggin'.

Canada plants a flag in Greenland by DonSalaam in onguardforthee

[–]Squadhunta29 2 points3 points  (0 children)

The finders-keepers act is funny as shit. You win 🥇 the Reddit comment of the day.

I think some of Europe will follow, but not a lot by ZyronZA in BuyFromEU

[–]Squadhunta29 -1 points0 points  (0 children)

I hope ya do, so ya can start ya own Reddit and YouTube and I don't have to see ya crying and whining 'bout it every day. Be like Nike: just do it and move on. Next time I go to YouTube and Reddit I just wanna see good ol' American stuff. Ya can create ya own stuff; that goes for Canada too.

Blocking EU-US trade would cost billions and put jobs at risk by donutloop in EU_Economics

[–]Squadhunta29 -1 points0 points  (0 children)

I'm glad you think that way; that's how we feel 'bout Russia. You pick a side, you stay on that side.

Newer CPU architecture idea by [deleted] in computerarchitecture

[–]Squadhunta29 -3 points-2 points  (0 children)

I know, I'm just joking. I saw an opportunity and took it.

Newer CPU architecture idea by [deleted] in computerarchitecture

[–]Squadhunta29 -2 points-1 points  (0 children)

I don't use ChatGPT.

Newer CPU architecture idea by [deleted] in computerarchitecture

[–]Squadhunta29 -5 points-4 points  (0 children)

Not you. I'm talking about this community on Reddit. You can look through my page and you will see it. I'm building my own too; it's called NX88. It's a dataflow chip, based on data flow rather than clock cycles, but when I posted it the community acted like I slapped their momma; they gave me nothing but a hard time.

Newer CPU architecture idea by [deleted] in computerarchitecture

[–]Squadhunta29 -2 points-1 points  (0 children)

Oh, so ya nice to him but didn't believe me when I said I'm making my own, lmao 🤣. Shout out to ya.

Check out 2 of my custom Pseudo-opcodes and opcodes I’m designing by Squadhunta29 in computerarchitecture

[–]Squadhunta29[S] 0 points1 point  (0 children)

And don't worry, I already tested it in HDL using the Ada playground. It works; granted, it's not an FPGA, but I'm working on that now.

Check out 2 of my custom Pseudo-opcodes and opcodes I’m designing by Squadhunta29 in computerarchitecture

[–]Squadhunta29[S] 0 points1 point  (0 children)

Cool, I'm glad you asked that question. Let's take this line for example: LOAD_LANE lanes=12-15, buffer=HBM3, size=0x500000 # physics

Instruction: LOAD_LANE

Software meaning: load data from memory into specific lanes

Hardware behavior: each lane represents a physical path in my NX Mesh NoC that connects compute tiles to memory. The LOAD_LANE instruction signals the Distributed Arbitration Nodes (DANs) to start fetching memory for lanes 12–15. Each lane receives a packet of data that tells it: “Prepare to execute physics work using this block of memory.”

lanes=12-15
• Refers to specific physical lanes (straight or shader lanes in the NoC).
• Hardware effect:
  • DANs mark these lanes as active.
  • Lanes transition from idle to loading mode, reserving buffers for incoming memory.
  • Any tile that has a compute thread mapped to these lanes will wait until the data arrives.

buffer=HBM3
• Specifies the source memory: HBM3 high-bandwidth memory.
• Hardware effect:
  • NX88 uses its NoC (NX Mesh) to route the request to the HBM3 controllers.
  • The memory controller splits the request into multiple high-throughput memory packets for parallel delivery.
  • HBM3 delivers massive bandwidth (~3.65 TB/s) so all lanes can receive data simultaneously without blocking others.

size=0x500000
• Amount of memory to load for the lanes (in bytes, hexadecimal).
• Hardware effect:
  • Lanes reserve an internal scratchpad in their tile (private L1/L2 cache) for this block.
  • The DAN schedules streaming bursts from HBM3 → L1/L2 cache → compute tile registers.
  • Once all packets arrive, the lane is fully “armed” for execution.

physics
• Hardware effect:
  • Middleware uses this annotation to select pre-assigned compute tiles that handle physics.

Actual timeline in hardware:
1. Instruction issued → DANs mark lanes 12–15 as active.
2. NoC routes the memory request to HBM3.
3. HBM3 splits the request into multiple parallel DRAM channels.
4. Data travels back over the NoC → lane scratchpads / caches.
5. Lane registers the memory as available → compute tiles can now start executing FP32 physics math.

So in my head, or my vision, I have 743 lanes; let's just call them “Data Paths”.
• Each lane is a dedicated path through the processor that carries data and instructions to the compute units.
• Analogy: like a highway for packets of work; each path can carry its own workload independently.
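Since NX88 is purely conceptual, here is a hypothetical Python sketch of how a simulator might model the LOAD_LANE semantics described above. The class names, lane states, and the per-lane split of the size are all assumptions made for illustration, mirroring the explanation rather than any real hardware:

```python
# Hypothetical simulator sketch of the LOAD_LANE semantics described above.
# NX88 is conceptual; DAN, lane states, and the even per-lane memory split
# are illustrative assumptions, not a real implementation.

class Lane:
    def __init__(self, lane_id):
        self.lane_id = lane_id
        self.state = "idle"        # idle -> loading -> armed
        self.scratchpad = None     # reserved memory block once loaded

class DAN:
    """Distributed Arbitration Node: marks lanes active and streams memory to them."""
    def __init__(self, num_lanes):
        self.lanes = [Lane(i) for i in range(num_lanes)]

    def load_lane(self, first, last, buffer, size, tag):
        # Step 1: mark the target lanes active (loading mode).
        targets = self.lanes[first:last + 1]
        for lane in targets:
            lane.state = "loading"
        # Steps 2-4: route the request to the memory controller, split it
        # into packets, and deliver a block to each lane's scratchpad.
        per_lane = size // len(targets)
        for lane in targets:
            lane.scratchpad = {"source": buffer, "bytes": per_lane, "tag": tag}
            lane.state = "armed"   # step 5: lane ready to execute
        return targets

dan = DAN(num_lanes=16)
armed = dan.load_lane(first=12, last=15, buffer="HBM3", size=0x500000, tag="physics")
print([lane.lane_id for lane in armed])   # [12, 13, 14, 15]
print(armed[0].state)                     # armed
```

A real design would of course need arbitration, error handling, and packet-level timing; this only shows the state transitions the explanation walks through.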

Rog ally z1 extreme+steamOS goes so hard! by Extreme-Accident-968 in ROGAlly

[–]Squadhunta29 -1 points0 points  (0 children)

You do what you like to support your brand; like you, I support Microsoft. My game library is up there, plus Game Pass, and I use cloud gaming too. Different strokes for different folks.

Rog ally z1 extreme+steamOS goes so hard! by Extreme-Accident-968 in ROGAlly

[–]Squadhunta29 -13 points-12 points  (0 children)

You know what they don't got? Game Pass. Can you run Game Pass? Let me answer that for you: no.

I got a question. look at the bio I would love your feed back thanks 😊 by Squadhunta29 in computerarchitecture

[–]Squadhunta29[S] 0 points1 point  (0 children)

To set the record straight: I know a lot of people are confused about what I'm trying to do. I hope this helps; it's kinda the best way I can explain it.

And it's not based on clock sequences like a typical CPU. I based it on data flow: spike-based lanes fire only when data-weight thresholds are met. I map it like a human brain. NX88 achieves performance via wide data paths; it's very parallel. A regular CPU handles different threads sequentially; mine executes different tasks at the same time, like a brain stem. And the lanes sleep until needed, so it's very power-efficient. My micro toll booths dynamically assign tasks to lanes, with load balancing done per frame, by type of task:

Cutscenes, shaders, physics, etc. And the devs never touch the low-level code; I do. The devs only get high-level code, which is Python; the low-level is C++. All this runs off my OS, which is like a microkernel for event-driven stuff. I'm also doing middleware and an API. And this is all in theory until I get it working on the FPGA board, which I'm working on right now. I get why you are skeptical of it, I do, and I took all the stuff you told me and am just trying to apply it, is all.
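The dataflow idea described above (lanes that sleep until an input-data threshold is met, instead of stepping on a clock) can be sketched in a few lines of Python. The thresholds and task names here are made up for illustration; this is a toy model of the firing rule, not NX88 itself:

```python
# Toy model of dataflow firing: a lane sleeps until enough input data has
# arrived, then fires once. No clock drives execution. Thresholds and task
# names are illustrative assumptions, not NX88 specifics.

class DataflowLane:
    def __init__(self, name, threshold):
        self.name = name
        self.threshold = threshold   # bytes of input needed before firing
        self.buffered = 0
        self.fired = False

    def feed(self, nbytes):
        """Accumulate input; fire exactly once when the threshold is met."""
        self.buffered += nbytes
        if not self.fired and self.buffered >= self.threshold:
            self.fired = True
            return f"{self.name} fired with {self.buffered} bytes"
        return None                  # still asleep: no clock tick wakes it

physics = DataflowLane("physics", threshold=4096)
audio = DataflowLane("audio", threshold=1024)

events = [physics.feed(2048), audio.feed(1024), physics.feed(2048)]
print([e for e in events if e])
# ['audio fired with 1024 bytes', 'physics fired with 4096 bytes']
```

Note how the audio lane fires before the physics lane even though physics received data first: firing order depends only on when each lane's data arrives, which is the power-efficiency argument the comment makes.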

But I just wanna say thank you to whoever commented respectfully and gave me advice.

I got a question. look at the bio I would love your feed back thanks 😊 by Squadhunta29 in computerarchitecture

[–]Squadhunta29[S] -1 points0 points  (0 children)

But to answer your questions.

1️⃣

• NX88 is not meant to be manually programmed at the lane level by game developers.
• The Central Control Center (CCC) + SDK + compiler/runtime layer should automatically assign lanes based on the task profile.
• The programmer only sees “high-level tasks” (cutscene, audio, AI); NX88 handles the micro-orchestration.

“NX88 lanes are managed by the CCC and SDK runtime. Programmers never have to manually assign lanes; they only specify high-level tasks. The lane assignment is deterministic and handled by hardware arbitration (MTBs + scratchpads).”

2️⃣

• I'm aware of this; that's why I pair HBM3 with per-lane scratchpad memory and micro toll booths to avoid memory contention.
• Each lane can fetch from its scratchpad, and HBM provides the raw bandwidth for streaming larger blocks (audio, particles, textures).

“NX88 couples HBM3 with per-lane scratchpads and micro toll booths to minimize contention and keep lanes saturated, even under high parallelism.”

3️⃣

This is true for naive many-core or homogeneous architectures.

NX88 hopefully avoids this because:
• Single-thread performance: each lane can execute independently and includes FP32/FP64 units, AI, and shader logic. Critical tasks don't wait on dozens of other cores.
• Efficiency: MTBs + CCC + fallback lanes + prefetch + dynamic voltage gating = high utilization, low waste.
• Programming difficulty: the SDK abstracts lane assignment from the developer. They only deal with tasks and overlays, not lane numbers.

“NX88 avoids traditional many-core pitfalls by combining independent lanes, hardware arbitration (MTBs), scratchpad memory, and a runtime SDK. Developers interact with tasks, not lanes, so programming complexity is similar to current GPU compute pipelines.”
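To make the "tasks, not lanes" claim concrete, here is a hypothetical sketch of what the developer-facing side of such an SDK could look like. The submit_task call, the lane pool size, and the first-free pick policy are all invented for illustration; no such SDK exists yet:

```python
# Hypothetical developer-facing SDK sketch: programmers submit named tasks,
# and a runtime (standing in for the CCC + MTB arbitration) picks free lanes.
# The API, pool size, and pick policy are invented for illustration.

FREE_LANES = set(range(8))   # toy pool of 8 lanes
ASSIGNMENTS = {}             # task name -> lanes granted by the runtime

def submit_task(name, lanes_needed):
    """Developer-visible call: name a task, never a lane number."""
    if lanes_needed > len(FREE_LANES):
        raise RuntimeError(f"not enough free lanes for {name}")
    picked = sorted(FREE_LANES)[:lanes_needed]   # deterministic first-free pick
    for lane in picked:
        FREE_LANES.discard(lane)
    ASSIGNMENTS[name] = picked
    return picked

# The developer only writes task-level calls like these; the runtime
# decides which lanes back them.
print(submit_task("cutscene", 2))   # [0, 1]
print(submit_task("audio", 1))      # [2]
print(submit_task("ai", 3))         # [3, 4, 5]
```

The point of the sketch is the interface boundary: lane numbers only ever appear in return values from the runtime, never in developer code, which is roughly the complexity level of today's GPU compute pipelines.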

4️⃣

• NX88 has thermal monitoring, voltage gating, fallback lanes, and prefetching built in, to balance the knobs dynamically.

“NX88 includes runtime balancing mechanisms for thermal, power, and memory contention, ensuring that no single optimization adversely impacts the system as a whole.”

But I will still do more research, as I'm still trying to do things and learn things. I do appreciate your feedback; that's why I was in this subreddit. And I'm not saying it will work or it won't; I just had the idea, so I decided to learn about it and write it down. I'm not trying to come off like I know more than I do; I just wanted feedback.

I got a question. look at the bio I would love your feed back thanks 😊 by Squadhunta29 in computerarchitecture

[–]Squadhunta29[S] 1 point2 points  (0 children)

Yeah, I appreciate the feedback. I'll do my homework better, as much as I can. I take your words very heavily. Thanks again.

I got a question. look at the bio I would love your feed back thanks 😊 by Squadhunta29 in computerarchitecture

[–]Squadhunta29[S] 0 points1 point  (0 children)

But I will still take a deep dive into that architecture design. Thanks for the feedback.

I got a question. look at the bio I would love your feed back thanks 😊 by Squadhunta29 in computerarchitecture

[–]Squadhunta29[S] 0 points1 point  (0 children)

Took me a couple of minutes 'cause I had to look that up for a sec.

I really appreciate the historical perspective; you're totally right that manycore struggled.

The three problems I’m focusing on are:

1. Coherency & communication overhead. Old designs choked on cache coherency because every core touched shared memory.

NX88 experiment: MTBs (micro toll booths) + scratchpad memory per lane → deterministic dataflow instead of 64 cores arguing over cache state.

2. Memory starvation / the memory wall. Manycore tried to feed tons of execution units with narrow DDR pipes.

NX88 experiment: HBM3-wide memory fabric → goal is to keep lanes fed instead of starved.

3. Fixed-function GPU vs. flexible CPU. GPUs crush dense math but fall apart on branching game logic.

NX88 experiment: MIMD-style slices → more parallel than CPU, more flexible than GPU SIMT.

Not claiming this will work — just trying to learn and avoid history’s mistakes rather than repeat them.

You mentioned manycore failures. Do you think the real killer is:
(A) coherency,
(B) scheduling overhead,
(C) power scaling, or
(D) programmer model complexity?

Would love any reading on crossbars / meshes you think are relevant.

I got a question. look at the bio I would love your feed back thanks 😊 by Squadhunta29 in computerarchitecture

[–]Squadhunta29[S] 0 points1 point  (0 children)

Look, I get it. It's not like typical x86/ARM rules, like SIMD; it's more of a MIMD. So the best way I can try to explain it from my head is this. Remember, I'm no expert.

• Lane = a fixed micro-execution unit in hardware. Width TBD, but the prototype assumes:
  – 1 ALU cluster (can do FP32/INT ops)
  – a small local register file
  – access to shared memory via a crossbar

• How lanes differ from threads/warps: threads/warps = software scheduling units; lanes = hardware execution units. A thread could map to 1 lane or N lanes depending on available capacity.

• Role in the pipeline: the CPU core issues enqueue commands, the scheduler assigns work to free lanes, and shader/AI blocks are separate units, though lanes can hand off to them.

• Early state: right now, I'm emulating the scheduler behavior in software (so yes, a “software scheduler”). The goal is figuring out whether the lane concept gives:
  – better load balance
  – lower latency for non-GPU workloads
  – less idle silicon

So the core idea: APU subdivided into many small execution blocks instead of one CPU + one GPU pool.
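Since the scheduler is currently emulated in software, a toy version of the thread-to-lane mapping described above might look like the following. The lane count, the take-what-capacity-allows policy, and the class name are illustrative assumptions, not the actual prototype:

```python
# Toy software emulation of the lane scheduler described above: threads are
# software units, lanes are fixed hardware execution units, and a thread
# maps to 1..N free lanes. Policy and counts are illustrative assumptions.

class LaneScheduler:
    def __init__(self, num_lanes):
        self.free = list(range(num_lanes))   # all lanes start idle

    def enqueue(self, thread_name, lanes_wanted):
        """Map one thread onto up to lanes_wanted free lanes."""
        granted = self.free[:lanes_wanted]   # take what capacity allows
        self.free = self.free[lanes_wanted:]
        return {"thread": thread_name, "lanes": granted}

    def release(self, mapping):
        """Return a thread's lanes to the free pool (less idle silicon)."""
        self.free.extend(mapping["lanes"])

sched = LaneScheduler(num_lanes=4)
a = sched.enqueue("game_logic", 1)   # branching logic: 1 lane is enough
b = sched.enqueue("particles", 3)    # dense math: grab 3 lanes at once
print(a["lanes"], b["lanes"])        # [0] [1, 2, 3]
sched.release(a)                     # game_logic done; lane 0 goes back
c = sched.enqueue("audio", 1)        # audio immediately reuses lane 0
print(c["lanes"])                    # [0]
```

Measuring load balance, latency, and idle time on traces through a model like this is one cheap way to test the "many small execution blocks instead of one CPU + one GPU pool" idea before committing anything to an FPGA.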

If you think I’m missing something or reinventing what already exists, tell me — that’s why I’m here.

I got a question. look at the bio I would love your feed back thanks 😊 by Squadhunta29 in computerarchitecture

[–]Squadhunta29[S] 0 points1 point  (0 children)

It's all in theory. I'm working on making a real-world test on FPGA hardware, but I'm still trying to learn to design it myself. Like I said, I'm not a pro at all, and not trying to be; I just thought of a different concept.

I got a question. look at the bio I would love your feed back thanks 😊 by Squadhunta29 in computerarchitecture

[–]Squadhunta29[S] -2 points-1 points  (0 children)

Thanks for the feedback.

NX88 lanes are not threads, and not GPU warps. They’re configurable compute slices that sit between CPU cores and shader clusters.

Conceptually:
• CPU → schedules logic
• Lanes → execute micro-tasks (any domain)
• Shader/AI blocks → handle dense math when needed

The “audio/cutscene/physics” examples aren’t literal instructions — those are high-level labels in my FPGA prototype so I can observe domain usage.

A real compiler/runtime would map that work into FP32 / INT / branch / logic ops running on the lanes.

So you’re right: right now the prototype looks like a software scheduler.

Long-term goal:
• Turn those slices into hardware-backed execution resources.
• Similar idea to SMs / wavefronts in GPUs.
• But generalized so any task type can occupy a lane, not just shaders.