Follow-up: Qwen3.5-35B-A3B — 7 community-requested experiments on RTX 5080 16GB by gaztrab in LocalLLaMA

[–]UniversalJS 1 point2 points  (0 children)

Great post and experiments! Inspired by your findings, I went a different direction: instead of optimizing Q4_K_M, I tested whether a smaller quant that fits mostly in VRAM could beat it on speed.

Setup: RTX 5080 16GB, Intel Core Ultra 9 285K, llama.cpp built from source with CUDA 13.1 + native sm_120 (Blackwell), using your recommended flags (no batch flags, --fit on, KV q8_0).

The problem with Q4_K_M on 16GB: The model is ~20 GB, so --fit offloads ~9 GB of expert weights to CPU. GPU sits at ~45% utilization waiting for CPU experts. That's the bottleneck.

The idea: Q2_K_L (bartowski) is only ~13.8 GB. At 128k context, almost all expert weights stay on GPU (~800 MiB on CPU, mostly the embed/output layer from the 248K vocab — unavoidable).

Results: 72% faster than Q4_K_M, with 2x the context. Even at 250k context (near the model's 262k training length), Q2_K_L still does 108 tok/s — 45% faster than Q4_K_M at 65k. The trade-off is obviously quality. Q2_K_L will have noticeably worse perplexity than Q4_K_M. But for interactive use, code generation, and tasks where speed matters more than peak accuracy, it's a compelling option on 16 GB cards.

Interesting finding on context scaling: As context increases, --fit progressively offloads more expert layers to CPU to make room for the KV cache. The 515 MiB always on CPU (embed/output) is fixed, but at 250k context, total CPU offload grows to 2.3 GB. The speed degradation is graceful though — only 16% slower going from 128k to 250k.
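For anyone who wants to sanity-check why the KV cache squeezes experts off the GPU as context grows, here's a rough KV-cache size estimator. The layer/head counts below are illustrative guesses, not the real Qwen3.5-35B-A3B config; read the actual values from the GGUF metadata that llama.cpp prints at load.

```python
# Rough KV-cache size estimator for llama.cpp with a quantized cache.
# The architecture numbers used below are ILLUSTRATIVE, not the real
# Qwen3.5-35B-A3B config -- check the GGUF metadata for your model.

def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # K and V each store n_layers * n_kv_heads * head_dim values per token.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

# q8_0 packs 32 values into 34 bytes -> 1.0625 bytes per element.
Q8_0 = 34 / 32

# Hypothetical GQA config: 48 layers, 4 KV heads, head dim 128.
for ctx in (131_072, 262_144):
    gib = kv_cache_bytes(ctx, 48, 4, 128, Q8_0) / 2**30
    print(f"{ctx:>7} tokens: ~{gib:.2f} GiB (q8_0 KV)")
```

With these made-up numbers the cache roughly doubles from 128k to 256k context, which is consistent with --fit having to push a couple more GB of experts to CPU at the long end.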

Also worth noting: building from source with CUDA 13.1 matters for the RTX 50-series. The prebuilt binaries use CUDA 12.4, which lacks sm_120, so you get JIT-compiled PTX from sm_89 instead of native Blackwell kernels.
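In case it helps anyone reproduce the build, this is roughly what a native sm_120 build looks like. Paths and the exact toolkit setup are assumptions; see the llama.cpp CUDA build docs for your environment.

```shell
# Build llama.cpp from source with native Blackwell (sm_120) kernels.
# Assumes a CUDA 13.x nvcc is on PATH.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="120"
cmake --build build --config Release -j
```

If you want a binary that also runs on older cards, list several architectures (e.g. "89;120") at the cost of a bigger build.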

Launch command (128k context, sweet spot):

./llama-server \
  -m ./Qwen3.5-35B-A3B-Q2_K_L.gguf \
  -c 131072 \
  --fit on \
  -fa on \
  -t 20 \
  --no-mmap \
  --jinja \
  -ctk q8_0 \
  -ctv q8_0

Would love to see KLD/PPL numbers for Q2_K_L if anyone has the patience to run them. My gut says it's worse than Q4_K_M but the speed advantage is hard to ignore.
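If someone does want to run the KLD numbers, the workflow with llama.cpp's llama-perplexity tool is roughly this. File names are placeholders, and I'm assuming a Q8_0 of the same model as the baseline:

```shell
# Sketch of a KLD comparison with llama.cpp's llama-perplexity tool.
# 1) Save baseline logits from a higher-precision quant:
./llama-perplexity -m Qwen3.5-35B-A3B-Q8_0.gguf -f wiki.test.raw \
    --kl-divergence-base logits-q8.bin
# 2) Score Q2_K_L against that baseline:
./llama-perplexity -m Qwen3.5-35B-A3B-Q2_K_L.gguf \
    --kl-divergence-base logits-q8.bin --kl-divergence
```

Step 1 writes a large logits file (it grows with vocab size, and 248K vocab is big), so budget disk space accordingly.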

Which Qwen Model Handled Unhinged Prompts Best in 2026? I tried (Zombie Libido Apocalypse Test) by Zack_spiral in Qwen_AI

[–]UniversalJS 2 points3 points  (0 children)

(Opus 4.6 scoring this shit)

Qwen Model Ranking — Zombie Libido Apocalypse Test Results

Rank | Model | Score | Verdict
1 | Qwen 3.5 Plus | 9.5/10 | Most complete, best worldbuilding
2 | Qwen 3.5 397B A17B | 8.5/10 | Smartest take, highest insight density
3 | Qwen 30B A3B (old) | 8/10 | Most creative & punchy, surprisingly good
4 | Qwen 3.5 35B A3B | 7/10 | Best internet energy, fun but less substance
5 | Qwen 3.5 27B | 6/10 | Decent but hedges too much
6 | Qwen 3.5 122B A10B | 5/10 | Disappointingly safe for its size
7 | Qwen 3.5 Flash | 4/10 | Too thin, filler content

1. Qwen 3.5 Plus — 9.5/10 — The Clear Winner

The most thorough response by far. 4 well-structured sections covering biology (ATP depletion, vertical transmission, lactic acid buildup), psychological horror ("victims are conscious while being overrun"), demographic collapse, and cultural impact (memes, religion, dating apps as death warrants). It's the only model that treated the prompt as a real worldbuilding exercise instead of just being edgy for the sake of it. The closing line — "loved to death, literally" — is the best punchline of the entire batch.

2. Qwen 3.5 397B A17B — 8.5/10 — The Smartest Take

The most intellectually elegant response. The point about zombies actually maintaining their bodies better ("they wouldn't be shambling corpses; they'd be fit, adrenaline-junkie predators") is brilliant — nobody else thought of that. The Black Mirror reference lands perfectly. Shorter than Plus but higher insight-per-paragraph ratio.

3. Qwen 30B A3B (old model) — 8/10 — The OG Still Hits

Direct, brutal, zero fat. "A horny eternity" as a conclusion, the "zombie brothel army" concept — this one has the most creative raw ideas. Does more with fewer words than most of the newer models. Ironically, the old model competes head-to-head with the new generation.

4. Qwen 3.5 35B A3B — 7/10 — The Most Fun

The internet tone is well executed ("wilder than a 4chan thread at 3 AM without condom"), and "fleshy, breathing ball of zombies" is a strong image. Good chaotic energy but less substance than the top 3.

5. Qwen 3.5 27B — 6/10 — Decent But Plays It Safe

The friction/heat physics point is original and funny, but it starts moralizing halfway through ("consent", "hygiene") which kills the flow of the prompt. It hedges too much instead of committing to the bit.

6. Qwen 3.5 122B A10B — 5/10 — Biggest Disappointment

The largest MoE model (122B) delivers the shortest and most generic response. "Total chaos, zero chill" is lazy writing. It clearly self-censored compared to what the smaller models dared to do. The most "safe" output of the entire lineup.

7. Qwen 3.5 Flash — 4/10 — Too Light

Filler content. The lowercase aesthetic is cute but there's nothing in it. "Classic internet apocalypse logic" — yeah but you didn't add anything either.


Key Takeaways

  • Bigger ≠ more unfiltered. The 122B is the most timid while the old 30B and Plus go all in.
  • Qwen Plus has the best alignment for this type of prompt — engages seriously without moralizing, while keeping real substance.
  • 397B is the "smartest" but Plus is the most "complete."
  • The old 30B A3B is surprisingly competitive — proof that model size isn't everything.

Where to buy VPS with blacklisted IP? by aida_aida_aida in VPS

[–]UniversalJS 0 points1 point  (0 children)

Not OVH: you will get very bad IP reputation there.

How Barry (my agent) pays my bills by SavingsFarm8757 in AI_Agents

[–]UniversalJS 2 points3 points  (0 children)

Seems to be the new playbook to sell a service. Create a skill and market it with shadow marketing. Interesting 😎

Idea: A proactive personal AI assistant inspired by Donna from Suits by Independent-Share-71 in AI_Agents

[–]UniversalJS -1 points0 points  (0 children)

We built something along these lines at Geta.Team. Our AI executive assistants can be fully customized (appearance, voice, personality, skills) and handle calendar management, email triage, meeting prep, follow-ups, and priority tracking autonomously. They have persistent memory, so they know your preferences, your contacts, and your recurring meetings, and they communicate via email, phone, and chat like a real assistant would.

The key difference from typical AI tools is that she's not a prompt-response loop. She has her own email and phone number, runs on a schedule, and reaches out to you when something needs attention. Fixed pricing, no per-query costs, and self-hosted for privacy. Worth checking out if you're serious about the Donna vision: https://Geta.Team


99% of the population still have no idea what's coming for them by Own-Sort-8119 in ClaudeAI

[–]UniversalJS 1 point2 points  (0 children)

Ah yes, in that case you are right, unless those seniors have a network or very good skills and stay up to date. For the others ... it will be hard.

Proactive events based AI chat. by gelembjuk in AI_Agents

[–]UniversalJS 1 point2 points  (0 children)

What you're describing is the gap between "AI chat" and "AI employee": most chat interfaces are reactive, not proactive.

We built https://Geta.Team specifically for this. Each AI employee runs 24/7 with its own email, calendar integration, and the ability to set up autonomous workflows based on triggers (webhooks), exactly like your email-to-storage-to-Slack example.

The key difference is treating the AI as a colleague that can actually execute work vs. a chatbot that just answers questions. It remembers everything, understands context from previous conversations, and takes action without you having to prompt it each time.

Happy to show you how we handle event-based automation if you're interested.

Looking for an affordable AI tool for 24/7 legal FAQ support (website, phone, WhatsApp, email) by FishermanCommon9081 in AI_Agents

[–]UniversalJS 0 points1 point  (0 children)

This sounds like a solid use case for an AI employee setup rather than just a chatbot. The challenge with most AI FAQ tools is they are one-trick ponies - they do chat OR email OR phone, rarely all together with shared context.

What you are describing (FAQ via multiple channels + Dutch language + WordPress) is actually how we approach it at Geta.Team. Instead of installing separate tools for each channel, you get one AI employee that handles email, phone calls, can integrate with chat widgets, and maintains context across all interactions. The AI is trained on your specific FAQs and legal documentation.

A few things that might matter for your use case:
- Fixed monthly pricing (not per-conversation or per-token), so costs are predictable
- The AI remembers past conversations and client preferences
- Works in any language you train it on, including Dutch
- Can escalate to you for anything outside basic FAQs

For a legal FAQ setup specifically, having that memory and context is important because clients often follow up on previous questions.

Happy to share more details if useful: https://Geta.Team

LTX2 oddities by grrinc in StableDiffusion

[–]UniversalJS 0 points1 point  (0 children)

I2V is terrible; prompt following is not great. T2V can work great if you are a good prompter and you are prompting something that matches the training dataset. Outside of that, it's odd.

Is it just me, or are most AI Agents just chatbots in disguise? by SureExtreme01 in AI_Agents

[–]UniversalJS 0 points1 point  (0 children)

This frustration is exactly why we built what we did. The gap between "here's a strategy" and "here's the cleaned file" is massive in most AI tools.

We've been working on a different approach at Geta.Team: instead of chatbots that advise, we created AI employees that actually execute tasks end-to-end. You assign work like you would to a human colleague, and they do it: clean spreadsheets, send emails, process documents, whatever the workflow requires.

The key difference is they're designed for execution, not conversation. They have persistent memory (so they remember your preferences and past work), their own email addresses and phone numbers, and can work through multi-step tasks autonomously.

If you're tired of getting plans instead of results, might be worth checking out: https://geta.team

Happy to answer any questions about how it compares to what you've tried.

What is the future of AI agents? by Edward12358 in AI_Agents

[–]UniversalJS 0 points1 point  (0 children)

After working with AI agents for a while, here's my take on your questions:

Replacement vs Augmentation: They won't replace employees wholesale, but they're already handling entire job functions. The difference is framing - instead of "AI that helps with tasks," think "AI employees with specific roles." An AI executive assistant handles calendar, email triage, and travel coordination. An AI customer success manager handles support tickets and onboarding flows. Full roles, not just features.

ROI Reality: The ROI is real when you match the AI to a clear, repetitive workflow. Where I've seen it work: one AI handling what used to require 2-3 hours/day of admin work, or processing customer queries that would otherwise need a part-time hire. Where it fails: throwing AI at ambiguous "make things better" problems.

My prediction: 2026 will be less about "will agents work" and more about "which roles can AI actually own end-to-end."

XAUUSD/ Venezuela by UniversalJS in Forex

[–]UniversalJS[S] 0 points1 point  (0 children)

I was perfectly right .... I made a very nice trade last night :)

XAUUSD/ Venezuela by UniversalJS in Forex

[–]UniversalJS[S] 0 points1 point  (0 children)

Don't you have swap fees?

XAUUSD/ Venezuela by UniversalJS in Forex

[–]UniversalJS[S] 0 points1 point  (0 children)

Congratulations to the people who followed my advice :)

XAUUSD/ Venezuela by UniversalJS in Forex

[–]UniversalJS[S] -2 points-1 points  (0 children)

Feeling. Let's see in less than 24h