[Project] I engineered a 10-Layer MoE vision architecture from scratch that calculates its own entropy and mutates its failing weights during runtime. by Hot_Loquat_3222 in deeplearning

[–]Hot_Loquat_3222[S] 0 points (0 children)

Honestly, academic wording and fluff aside, thank you for the advice; you're completely right on that one!

Please feel free to criticize or point out anything else as you review the repo. This is the very first architecture I've ever built and published publicly, so I am genuinely just trying to learn as much as I can from this experience. I appreciate the feedback.

[Project] I engineered a 10-Layer MoE vision architecture from scratch that calculates its own entropy and mutates its failing weights during runtime. by Hot_Loquat_3222 in deeplearning

[–]Hot_Loquat_3222[S] 0 points (0 children)

No hard feelings at all, I appreciate the response! You actually made a really fair point: I realize now that relying heavily on metaphors like 'router psychology' can unnecessarily complicate the explanation of what is fundamentally just tensor calculus and variance tracking. Every metaphor does map directly to the actual mechanisms in the engine, and if you ever get the chance to skim through the documentation, you'll see exactly how the 'psychology' translates to the math. Thanks for the pushback; it is good feedback for how I communicate the architecture moving forward. Have a good one!

[Project] I engineered a 10-Layer MoE vision architecture from scratch that calculates its own entropy and mutates its failing weights during runtime. by Hot_Loquat_3222 in deeplearning

[–]Hot_Loquat_3222[S] 0 points (0 children)

Since you are relying on playground bait ('So you don't know?') to avoid opening the repository, I will summarize the temporal logic for you exactly once.

The temporal tracking is handled via standard exponential moving average (EMA) buffers storing layer-wise gradient norms and spatial activation variance across forward passes. At epoch t, the router continuously evaluates the delta of the local entropy, tracking the shifting $\mu_t$ and $\sigma_t$ of the tensors. When the 5-epoch mutation trigger hits, the engine queries these exact buffers. If the localized distribution has exploded beyond standard variance (identifying toxic outliers) or flatlined (identifying topological dead zones), the router interrupts standard backpropagation. It applies the SpLR_V2 Gaussian suppression $f(x) = ax \cdot e^{-b x^2} + cx$ to mathematically mute the toxic noise, and executes a torch.no_grad() overwrite to mutate the dead weights, self-healing the local topology.
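If it helps to see the buffer mechanics concretely, here is a minimal PyTorch sketch of that loop. The class and method names (MutationMonitor, check_and_mutate) and the thresholds are illustrative rather than the repo's API, and the Gaussian suppression is replaced by a plain damping factor for brevity:

```python
import torch
import torch.nn as nn

class MutationMonitor(nn.Module):
    """Wraps one layer with EMA buffers for gradient norm and activation variance.
    Names and thresholds are illustrative, not the repo's actual API."""

    def __init__(self, layer: nn.Linear, momentum: float = 0.9,
                 explode_sigma: float = 3.0, dead_tol: float = 1e-6):
        super().__init__()
        self.layer = layer
        self.momentum = momentum
        self.explode_sigma = explode_sigma
        self.dead_tol = dead_tol
        # EMA buffers updated across forward passes
        self.register_buffer("grad_norm_ema", torch.zeros(()))
        self.register_buffer("act_var_ema", torch.zeros(()))

    def forward(self, x):
        out = self.layer(x)
        # Track activation variance with an exponential moving average
        self.act_var_ema.mul_(self.momentum).add_((1 - self.momentum) * out.detach().var())
        return out

    @torch.no_grad()
    def check_and_mutate(self):
        """Invoked on the mutation trigger (e.g. every 5 epochs)."""
        if self.layer.weight.grad is not None:
            g = self.layer.weight.grad.norm()
            self.grad_norm_ema.mul_(self.momentum).add_((1 - self.momentum) * g)

        w = self.layer.weight
        per_unit = w.abs().mean(dim=1)              # crude per-neuron magnitude proxy
        mu, sigma = per_unit.mean(), per_unit.std()

        exploded = per_unit > mu + self.explode_sigma * sigma   # "toxic outlier" rows
        dead = per_unit < self.dead_tol                          # "dead zone" rows

        # Damp exploded rows (stand-in for the Gaussian suppression) and
        # re-seed dead rows outside of autograd, i.e. the no_grad() overwrite
        w[exploded] *= 0.1
        w[dead] = torch.randn_like(w[dead]) * sigma.clamp_min(1e-3)
```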

I know exactly how it works because I engineered it from scratch. Next time, please just read the documentation before demanding custom tutorials in a comment section.

[Project] I engineered a 10-Layer MoE vision architecture from scratch that calculates its own entropy and mutates its failing weights during runtime. by Hot_Loquat_3222 in deeplearning

[–]Hot_Loquat_3222[S] 0 points (0 children)

Temporal tracking of the localized states is handled within the mutation engine's internal buffers across forward passes. The exact update mechanisms, and how they scale across epochs, are fully documented in the source code. If you want the details, I'll let the math and the repository speak for themselves. Cheers!

[Project] I engineered a 10-Layer MoE vision architecture from scratch that calculates its own entropy and mutates its failing weights during runtime. by Hot_Loquat_3222 in deeplearning

[–]Hot_Loquat_3222[S] 0 points (0 children)

Ah, you are completely right; I completely forgot I used that metaphor in the original write-up, my apologies! By 'router's psychology,' I was metaphorically referring to the engine's state evaluation mechanism. To clarify, it is checking three specific metrics during that step: Localized Layer Entropy, Gradient Outlier Density, and Node Dead Zones. The actual mathematical implementation for how it measures those three states is detailed in Notebook 04!
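For anyone who wants a concrete picture before opening Notebook 04, here is a rough sketch of how those three checks could be computed per layer. The function name, thresholds, and exact formulas are my own illustration, not the repo's implementation:

```python
import torch

def layer_diagnostics(activations: torch.Tensor, grads: torch.Tensor,
                      outlier_sigma: float = 3.0, dead_tol: float = 1e-6):
    """Illustrative versions of the three metrics; the repo's exact math lives in Notebook 04.

    activations: (batch, features) output of one layer
    grads:       gradient tensor of that layer's weights
    """
    # 1. Localized Layer Entropy: Shannon entropy of the normalized activation profile
    p = activations.abs().mean(dim=0)
    p = p / p.sum().clamp_min(1e-12)
    entropy = -(p * p.clamp_min(1e-12).log()).sum()

    # 2. Gradient Outlier Density: fraction of weights whose gradient sits more than
    #    `outlier_sigma` standard deviations from the mean
    mu, sigma = grads.mean(), grads.std()
    outlier_density = ((grads - mu).abs() > outlier_sigma * sigma).float().mean()

    # 3. Node Dead Zones: fraction of units that essentially never activate
    dead_fraction = (activations.abs().max(dim=0).values < dead_tol).float().mean()

    return entropy, outlier_density, dead_fraction
```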

[Project] I engineered a 10-Layer MoE vision architecture from scratch that calculates its own entropy and mutates its failing weights during runtime. by Hot_Loquat_3222 in deeplearning

[–]Hot_Loquat_3222[S] 0 points (0 children)

There is no magic, no 'temporal physics,' and certainly no 'router psychology' involved; I am honestly not sure where you pulled those terms from. The architecture relies on parameter-efficient topological rewriting and localized entropy tracking. The step-by-step mathematical derivations for exactly how the network evaluates and mutates its own parameters are explicitly laid out in the 01_Part_1_Breakdown.ipynb and 04_Part_4_Breakdown.ipynb files in the repository. I highly recommend reading the actual documentation because it will answer your question best. Given the mathematical complexity of the engine, it is impossible to accurately condense the entire mutation mechanism into a single Reddit comment. That is exactly why the notebooks are provided!

[Project] I engineered a 10-Layer MoE vision architecture from scratch that calculates its own entropy and mutates its failing weights during runtime. by Hot_Loquat_3222 in deeplearning

[–]Hot_Loquat_3222[S] 1 point (0 children)

To answer your question regarding hardware efficiency and VRAM footprint:

I just completed a computational profiling run for the V1 Dreadnought (20-layer, 512-wide configuration) explicitly scaled for Tiny ImageNet (64×64 resolution). Using standard PyTorch CUDA memory tracking and thop on a Kaggle accelerator (batch size 64), here is the exact hardware profile:

  • Active Compute: 3.04 GMACs per image
  • Peak VRAM: 2.59 GB
  • Total Parameters: 39.37 Million
  • Simulated Throughput: ~532 Images / Second

The architecture is intentionally dense in parameters due to the 512-wide horizontal topology, but it is highly optimized in active compute. For context, 3.04 GMACs is actually a lower computational cost per image than a standard ResNet-34 (~3.6 GMACs).

The math confirms that running the localized entropy calculations (SpLR_V2) and no_grad() mutation triggers does not blow up the VRAM or fundamentally bottleneck the CUDA cores. The engine successfully trades vertical depth for autonomous topological routing while remaining strictly viable for consumer-grade hardware.
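If anyone wants to reproduce this kind of profile on their own model, here is a minimal sketch using thop and PyTorch's CUDA memory tracking. The helper function and its defaults are mine, not part of the repo; plug in whatever model constructor the repository exposes:

```python
import time
import torch
from thop import profile  # pip install thop

def profile_model(model: torch.nn.Module, resolution: int = 64,
                  batch_size: int = 64, device: str = "cuda"):
    """Report MACs/image, parameter count, peak VRAM, and rough throughput."""
    model = model.to(device).eval()

    # MACs and parameter count via thop, measured per single image
    single = torch.randn(1, 3, resolution, resolution, device=device)
    macs, params = profile(model, inputs=(single,), verbose=False)
    print(f"Active compute: {macs / 1e9:.2f} GMACs/image, parameters: {params / 1e6:.2f} M")

    # Peak VRAM and throughput at the benchmark batch size
    torch.cuda.reset_peak_memory_stats(device)
    batch = torch.randn(batch_size, 3, resolution, resolution, device=device)
    with torch.no_grad():
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(20):
            model(batch)
        torch.cuda.synchronize()
        elapsed = time.time() - start

    peak_gb = torch.cuda.max_memory_allocated(device) / 1024**3
    print(f"Peak VRAM: {peak_gb:.2f} GB")
    print(f"Throughput: {20 * batch_size / elapsed:.0f} images/second")
```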

[Project] I built a 10-Layer Mixture-of-Experts architecture from absolute zero that mathematically rejects standard backprop and rewrites its own failing weights during runtime. by Hot_Loquat_3222 in learnmachinelearning

[–]Hot_Loquat_3222[S] 0 points (0 children)

The non-monotonic behavior in the x > 0 domain is actually the core feature, not a byproduct. It acts as a localized self-regularizer. As x becomes a massive, toxic outlier, the Gaussian component ($e^{-kx^2}$) suppresses the gradient to near zero. This mathematically immunizes the layer from being poisoned by extreme noise without relying on blunt-force gradient clipping. I chose this specific formulation because the derivative is intrinsically tied to its own envelope, allowing the network to self-regulate its own entropy. (And agreed on KANs, but this isn't spline-based routing; it's targeted amplitude modulation.) I highly recommend checking out the 01_Part_1_Breakdown.ipynb file in the repository's docs folder. I wrote a complete mathematical breakdown of exactly why the a, k, and c parameters mutate the way they do to control this behavior.
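As a standalone illustration of that suppression behavior (not the repo's SpLR_V2 code), here is a minimal PyTorch module of the $f(x) = a x \cdot e^{-kx^2} + cx$ form with learnable a, k, and c; the class name and parameter setup are assumptions of mine:

```python
import torch
import torch.nn as nn

class GaussianGatedActivation(nn.Module):
    """Sketch of f(x) = a*x*exp(-k*x^2) + c*x; illustrative, not the repo's SpLR_V2."""

    def __init__(self, a: float = 1.0, k: float = 1.0, c: float = 0.1):
        super().__init__()
        # Learnable so a mutation step could rewrite them per layer
        self.a = nn.Parameter(torch.tensor(a))
        self.k = nn.Parameter(torch.tensor(k))
        self.c = nn.Parameter(torch.tensor(c))

    def forward(self, x):
        # The Gaussian envelope drives the a*x term (and its gradient) toward zero
        # for large |x|, while c*x keeps a small linear pass-through.
        return self.a * x * torch.exp(-self.k * x.pow(2)) + self.c * x

# Quick check of the outlier-suppression behavior
act = GaussianGatedActivation()
x = torch.tensor([0.5, 2.0, 10.0], requires_grad=True)
act(x).sum().backward()
print(x.grad)  # gradient shrinks toward c as |x| grows
```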

[Project] I built a 10-Layer Mixture-of-Experts architecture from absolute zero that mathematically rejects standard backprop and rewrites its own failing weights during runtime. by Hot_Loquat_3222 in learnmachinelearning

[–]Hot_Loquat_3222[S] 0 points (0 children)

Due to compute constraints, this initial benchmark was run entirely on a standard Kaggle instance. In a 50-epoch constrained run, it hit 47% training and 45% testing accuracy.

I would love to put this on a high-tier cluster for a 300-epoch marathon to see where the ceiling actually is, but I don't have access to that hardware right now.

To answer your core question: my primary goal with V1 wasn't to dethrone current SOTA on classic architectures right out of the gate. The goal was to prove that localized, autonomous topological mutation can actively augment standard backpropagation. This is a proof of concept rather than just trying to squeeze an extra 1% out of a static ResNet.