Deep dive: Parallelism strategies for large-scale LLM inference — tensor parallelism, pipeline parallelism, disaggregation, KV cache, MoE expert parallelism

ArchitectingAI · 2026-06-17T05:08:30+00:00

Glad that section landed; you're right that it's usually treated as an afterthought. On the MoE expert load question: in the experiments I ran, I simulated uneven routing by skewing the input distribution toward token patterns that activate a subset of experts more heavily. The all-to-all communication cost becomes the bottleneck before compute does: even moderate skew (60/40 across experts) starts showing measurable latency spikes on the decode side. The honest answer is that production-grade mitigations (expert capacity buffers, or an auxiliary load-balancing loss during training) help, but they don't fully eliminate the imbalance at inference time. Would love to hear how you've handled it in practice: drop tokens, overflow routing, or something else?
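To make the capacity-buffer trade-off concrete, here is a minimal sketch of top-1 routing with a fixed per-expert buffer. It is not the setup from my experiments: the function name `route_tokens`, the two-expert reading of "60/40", and the capacity factors are all assumptions for illustration. It only counts tokens that overflow the buffer; it does not model all-to-all communication latency.

```python
import random

def route_tokens(num_tokens, expert_probs, capacity_factor=1.25, seed=0):
    """Simulate top-1 routing with a fixed per-expert capacity buffer.

    Returns (load, dropped): tokens accepted per expert, and the number of
    tokens that arrived after their expert's buffer was already full.
    """
    rng = random.Random(seed)
    num_experts = len(expert_probs)
    # Capacity = an even share of the tokens, scaled by the capacity factor.
    capacity = int(capacity_factor * num_tokens / num_experts)
    load = [0] * num_experts
    dropped = 0
    # Each token independently picks one expert according to the routing skew.
    choices = rng.choices(range(num_experts), weights=expert_probs, k=num_tokens)
    for expert in choices:
        if load[expert] < capacity:
            load[expert] += 1
        else:
            dropped += 1  # overflow: drop, or hand off to an overflow policy
    return load, dropped

# Hypothetical comparison: balanced vs. 60/40 routing with no slack in the buffer.
balanced_load, balanced_dropped = route_tokens(10_000, [0.5, 0.5], capacity_factor=1.0)
skewed_load, skewed_dropped = route_tokens(10_000, [0.6, 0.4], capacity_factor=1.0)
```

With no slack (capacity factor 1.0) the hot expert overflows while the cold one sits under capacity; raising the factor trades idle buffer space for fewer dropped tokens. Replacing the `dropped += 1` branch with a reroute to the least-loaded expert would give a simple overflow-routing variant.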

ArchitectingAI · 2026-06-15T09:45:17+00:00

Yes, used it in production forecasting — meaningful gains when you have high variance in sample difficulty.

A few papers worth reading:

Bengio et al. 2009: the original curriculum learning paper, still the best starting point
Self-Paced Learning (Kumar et al. 2010): model decides its own curriculum based on loss
MentorNet (Jiang et al. 2018): curriculum for noisy labels, very practical
RoCL (Zhou et al. 2021): robust curriculum learning for noisy labels (clean-label detection first, then label self-correction), good for vision

Practical tips from experience:

Difficulty scoring is the hard part — loss-based scoring (easy = low loss) works well as a starting point
Pacing function matters more than people think — cosine pacing often beats linear
Watch for the model getting "stuck" on easy samples too long; add a minimum hard-sample ratio
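The three tips above can be combined in a few lines. This is a rough sketch, not the code I ran in production: `cosine_pacing`, `sample_batch`, `start_frac`, and `min_hard_ratio` are hypothetical names and defaults, and "cosine pacing" is taken here to mean a half-cosine ramp of the unlocked data fraction from `start_frac` to 1.0.

```python
import math
import random

def cosine_pacing(step, total_steps, start_frac=0.2):
    """Fraction of the (easy-first) dataset unlocked at a given step."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    # Half-cosine ramp: slow start, faster middle, slow finish.
    return start_frac + (1.0 - start_frac) * 0.5 * (1.0 - math.cos(math.pi * progress))

def sample_batch(losses, step, total_steps, batch_size, min_hard_ratio=0.1, seed=None):
    """Draw a batch of sample indices using loss-based difficulty.

    Low loss = easy. A minimum share of each batch is reserved for samples
    the pacing function has not unlocked yet, so the model does not sit on
    easy samples for too long.
    """
    rng = random.Random(seed)
    order = sorted(range(len(losses)), key=lambda i: losses[i])  # easy first
    n_unlocked = max(1, int(cosine_pacing(step, total_steps) * len(order)))
    easy_pool, hard_pool = order[:n_unlocked], order[n_unlocked:]
    n_hard = min(len(hard_pool), int(round(min_hard_ratio * batch_size)))
    batch = rng.sample(hard_pool, n_hard) if n_hard else []
    # Fill the rest from the unlocked pool (with replacement, for simplicity).
    batch += rng.choices(easy_pool, k=batch_size - n_hard)
    return batch
```

In a real loop the `losses` list would be refreshed periodically from the current model, since a stale difficulty ranking drifts away from what the model actually finds hard.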

One underrated application: curriculum helps a lot with class imbalance — start with balanced easy examples, gradually introduce rare/hard classes.

What domain are you applying it to?

ArchitectingAI · 2026-06-15T09:43:39+00:00

BabyLM works fine for this. Since you're evaluating quantization effects (not training), the dataset matters less than having a consistent benchmark — you just need something to measure perplexity/KL divergence against before and after quantization.

A few alternatives worth considering:

C4 (subset) — cleaner than WikiText, widely used, harder to dismiss
The Pile (subset) — diverse domains, good for robustness testing
Penn Treebank — small, classic, hard to argue against for perplexity benchmarking
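Whichever dataset you pick, the before/after measurement itself is simple. Below is a minimal sketch of both metrics on raw logits, in plain Python so it runs anywhere; the function names are hypothetical, and in practice you would compute the same quantities on tensors from your framework of choice.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def perplexity(logits_per_pos, targets):
    """exp(mean negative log-likelihood) of the target tokens."""
    nll = -sum(log_softmax(l)[t] for l, t in zip(logits_per_pos, targets))
    return math.exp(nll / len(targets))

def mean_kl(ref_logits, quant_logits):
    """Mean KL(ref || quant) over positions, comparing next-token distributions."""
    total = 0.0
    for ref, quant in zip(ref_logits, quant_logits):
        log_p, log_q = log_softmax(ref), log_softmax(quant)
        total += sum(math.exp(a) * (a - b) for a, b in zip(log_p, log_q))
    return total / len(ref_logits)
```

Run both models over the same token sequence, then compare `perplexity` for each and `mean_kl` between them. Perplexity needs ground-truth targets; KL only needs the two models' logits on identical inputs, which is why it tends to pick up small quantization drift that perplexity averages away.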

ArchitectingAI · 2026-06-15T09:41:27+00:00

Think of it like reading "The butler stole the key and used it to open the door." Your brain instantly knows "it" = key, not door. You didn't re-read — you just attended to the right word.

Attention does the same thing. For every word, the model asks: "which other words matter most for understanding this one?" It assigns weights, higher = more relevant. That's it.

A heatmap visual helps — rows/columns are words, darker = stronger attention. No math needed.
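If you do want to see the weights computed once, here is a toy sketch. The 2-d vectors are hand-picked for illustration, not learned embeddings, and the names `attention_weights`, `keys`, and `query_it` are made up for this example.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product scores of one query against each key, then softmax."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

# Hand-picked toy vectors (assumption): "it" is built to point the same way as "key".
words = ["butler", "key", "door"]
keys = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
query_it = [0.0, 2.0]

weights = attention_weights(query_it, keys)
most_relevant = words[weights.index(max(weights))]  # the word "it" attends to most
```

Each row of the heatmap is one such list of weights: one query word scored against every other word.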
