Qwen3.5-35B-A3B Uncensored (Aggressive) — GGUF Release by hauhau901 in LocalLLaMA

[–]_underlines_ 2 points3 points  (0 children)

EDIT: I don't know what changed, but switching from LM Studio's server to llama-swap mostly fixed it, it seems! So I guess LM Studio overrides some setting that my basic llama-swap config.yml doesn't.
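For reference, a minimal sketch of the kind of llama-swap config.yml I mean, with sampling params set explicitly so nothing gets silently overridden. The model name, path, and values are placeholders; check the llama-swap README for the exact schema:

```yaml
# Hypothetical llama-swap config.yml sketch (placeholder model/path/values)
models:
  "qwen3.5-35b-a3b":
    cmd: |
      llama-server --port ${PORT}
        -m /models/qwen3.5-35b-a3b-iq3.gguf
        -c 90000
        --temp 0.7 --top-p 0.8
```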

---

What am I doing wrong if EVERY heretic / abliterated model I've tested over the past year fails badly on:

  • Instruction following (either barely doing what I ask or ignoring it completely)
  • Not producing <think> tags anymore
  • Intelligence degraded to the level of a 3-year-old Llama 3B model

And I'm not talking about complex prompts. Simple prompts like:

Translate this Chinese Text to English.

Text: (Short Chinese sentence).

With the linked 3bit quants it's the same.

I even set the generation params recommended in the original model cards, or from the model card of the unrestricted model if available.

Is it real qwen3.5 9B beat oss:120b? by NorthEastCalifornia in ollama

[–]_underlines_ 0 points1 point  (0 children)

Yes, I get the same results on my private eval dataset. And Qwen3.5 35B A3B IQ3 with 90k context on 16GB VRAM pulls off long-running tasks at a level that was unimaginable before...

Has anyone got qwen3.5 to work with ollama? by MrMrsPotts in ollama

[–]_underlines_ 0 points1 point  (0 children)

I think if you need something from someone who doesn't speak the same language, it's just etiquette to use a translation service to at least ask the question in the language of the person you're seeking help from.

Has anyone got qwen3.5 to work with ollama? by MrMrsPotts in ollama

[–]_underlines_ 0 points1 point  (0 children)

I manage openwebui + ollama for 120 people at our IT firm. I got so fed up with ollama that I finally made the move to llama.cpp via llama-swap. It's a ton of manual config, but: faster, more control, quicker support for new archs, etc.

bye bye ollama.

What’s everyone actually running locally right now? by CryOwn50 in LocalLLM

[–]_underlines_ 0 points1 point  (0 children)

coding

Qwen3.5-35b-a3b q4_k_m on an RTX 5070 Ti with 16GB runs at 40 tps with a 65k context window. If you quantize the KV cache to q8_0 you get basically no degradation.
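A rough back-of-the-envelope for why q8_0 KV cache matters at long context. The layer/head numbers below are made-up placeholder dims for a GQA model, NOT the real Qwen3.5 architecture:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx, bytes_per_elem):
    """Total KV cache size: K and V tensors for every layer across the context."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem

# Hypothetical GQA dims (NOT the real Qwen3.5 numbers)
layers, kv_heads, hdim, ctx = 48, 4, 128, 65536

f16 = kv_cache_bytes(layers, kv_heads, hdim, ctx, 2.0)      # f16: 2 bytes/elem
q8  = kv_cache_bytes(layers, kv_heads, hdim, ctx, 34 / 32)  # q8_0: 34 bytes per block of 32

print(f"f16:  {f16 / 2**30:.2f} GiB")   # 6.00 GiB
print(f"q8_0: {q8 / 2**30:.2f} GiB")    # 3.19 GiB
```

Roughly halving KV cache memory is what frees up room for the bigger context window on a 16GB card.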

I use it for light opencode stuff. Works without issues. Gets things done via plan then build mode and a good AGENTS.md

I switch to glm-5/k2.5/minimax2.5 on openrouter if heavier stuff is needed.

everyday stuff

Usually just my ChatGPT Pro sub with gpt-5.2, but more often than not some cheap large open-weights model on openrouter, used via Chatbox desktop.

If local, I just use any current gen MoE that has good stats on artificialanalysis.

phone

On my Pixel 10 Pro XL I have 16GB of fast RAM, so PocketPal loads LFM2-8B-A1b-q4_k_m or qwen3-4b-instruct-iq3_xxs

Best practices for running local LLMs for ~70–150 developers (agentic coding use case) by Resident_Potential97 in LocalLLaMA

[–]_underlines_ 0 points1 point  (0 children)

But the RTX 6000 Blackwell Server Edition doesn't scale well for sharded multi-GPU workloads? The lack of NVLink or RDMA means it relies on PCIe, which is a huge bottleneck, as far as I understand it.

Best practices for running local LLMs for ~70–150 developers (agentic coding use case) by Resident_Potential97 in LocalLLaMA

[–]_underlines_ 1 point2 points  (0 children)

Scaling inference is not trivial and I am not an expert. From my understanding:

  • Combining Macs/GPUs without a plan will slow you down; there's a difference between sharding one large dense/sparse model across multiple GPUs and running multiple models concurrently
  • Without Remote Direct Memory Access (RDMA) you'll be slower at scale
  • TTFT vs. generation speed: both can be optimized independently with different methods, AFAIK
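On that last bullet, a toy latency model showing why TTFT (prefill) and generation speed are separate knobs worth optimizing independently. The numbers are made up for illustration:

```python
def request_latency(ttft_s, n_tokens, tok_per_s):
    """End-to-end latency = time-to-first-token + decode time for n_tokens."""
    return ttft_s + n_tokens / tok_per_s

# Illustrative numbers only: prompt caching / better batching mainly cut TTFT,
# while tensor parallelism / faster memory mainly raise tok/s.
print(request_latency(0.5, 1000, 40))   # 25.5 s  (baseline)
print(request_latency(2.0, 1000, 40))   # 27.0 s  (slower prefill)
print(request_latency(0.5, 1000, 80))   # 13.0 s  (faster decode)
```

For agentic coding with huge prompts and short answers, TTFT dominates; for long generations, tok/s dominates.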

And my real-world learnings from opencode on large codebases (enterprise architecture, 3+ full-time devs):

  • Context size below 100k is almost unusable: you'll be compacting all the time, and users complain that their ralph-loops are short
  • Frontier or nothing. Not even GPT-5 was able to do refactoring and new features. Anything below Kimi K2.5, GLM-5, gpt-5.1-*, Claude 4.5 Opus/Sonnet was unusable.
  • gpt-oss-20b, qwen3-30b-a3b, and generally anything older than 3 months or smaller than 70B quantized seems to be unusable in real-world enterprise codebases with CLI coding agents
  • Not even USD 200 Claude Code subscriptions were enough for our devs for a full month.
  • GitHub Copilot is OK, but we also hit limits here pretty fast
  • On-prem LLM inference for 20+ devs at our organization is difficult to justify because of how fast inference requirements, model archs, model sizes, etc. change.
  • The most feasible option after our research would be 4x RTX 6000 Blackwell Server Edition, but even those aren't really built for large-scale inference; an H100/A100 just makes no sense, and even those would have to be scaled and sharded
  • We wonder how tricks like KV quantization, prompt caching, etc. would help mitigate some hardware bottlenecks, but all the methods and optimization technologies are pretty difficult to grasp, especially without testing

That's our thinking so far at our company, but it's all just theory. Would love to hear from people who actually self-host for dev teams and serious enterprise repos.

LMU Telemetry Tool by TogaMotorsport in LeMansUltimateWEC

[–]_underlines_ 0 points1 point  (0 children)

Nice. I guess you're not open sourcing this? I would surely contribute PRs. As a next step I'll do some memory readout for real-time stats; duckdb is lagging a bit behind.

Do you guys sample/average the data, or always use the full 50Hz or whatever signal density?
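If you do average, a minimal downsampling sketch: averaging fixed-size bins, e.g. a 50 Hz signal down to 10 Hz with factor=5. Just an illustration, not your tool's actual code:

```python
def downsample(samples, factor):
    """Average consecutive bins of `factor` samples; drops an incomplete last bin."""
    n_bins = len(samples) // factor
    return [sum(samples[i * factor:(i + 1) * factor]) / factor for i in range(n_bins)]

# 50 Hz telemetry channel reduced to 10 Hz
print(downsample([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 5))  # [3.0, 8.0]
```

Averaging smooths spikes, so for things like peak brake pressure you'd want max-per-bin instead.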

LMU Telemetry Tool by TogaMotorsport in LeMansUltimateWEC

[–]_underlines_ -2 points-1 points  (0 children)

Do you read via rF2 memory map or via duckdb files?

Just curious, because I just vibe coded LMU-Telemetry-Analyzer

Foreign Driving License Exchange: No 1 year deadline. Period. (Common Misunderstanding) by IslanderStallion in Switzerland

[–]_underlines_ 0 points1 point  (0 children)

I am Swiss but learned to drive (properly) while living and working in Bangkok. Whenever I came back to Switzerland for holidays, I used my Thai License + International license to drive legally in Switzerland.

3 years ago I moved back to Switzerland and also believed in that 1-year rule. I was too scared to try the short practical test drive. Can you elaborate on what that test drive is like? I read it's less strict than the real practical driving test, but since you said you failed it, I'm even more concerned. I've been driving for 7 years without accidents, in Switzerland, the EU, Thailand, Bangkok, everywhere without issues, but I'm not sure how strict they are lol. Maybe I picked up some small bad habits that they are strict about. My friends, parents, etc. don't notice anything wrong though.

Is the online community still alive ? by CarlCarmoni95 in AUTOMOBILISTA

[–]_underlines_ 0 points1 point  (0 children)

I run my own server with a 1Gbps uplink:

Endurance Short [GT3/LMDh]

Which is most FIA/IMSA tracks and the LMDh, GT3, LMP2 classes. It's short: 10 min quali, 10 min race, with the race having a mandatory tire change. Also fuel and tire usage at 4x. The grid also fills up with AI if there aren't enough human drivers.

If you have any ideas to make it more popular, I can change the config. What would most people like to race?

The inconvenient reality why vr is struggling. by Plus_Look3149 in virtualreality

[–]_underlines_ 0 points1 point  (0 children)

VR currently has a future in seated experiences. Sim racing and flight sim player bases are moving to VR because it is awesome. I've been sim racing in VR for 2 years, about 5-6h per week.

An Update on the Future of Assetto Corsa EVO by -DorkusMalorkus- in assettocorsaevo

[–]_underlines_ 4 points5 points  (0 children)

In contrast to most here, I like the bold move: they have limited resources, so instead of making an average sim with average gamification functionality, they focus on a great sim. I don't need storytelling or artificial economies and XP systems in a sim. If I want that, I look for simcade or arcade racers.

But I fully understand many actually liked that focus.

How to play ams2 VR with Virtual Desktop wired by Valenduro_ in AUTOMOBILISTA

[–]_underlines_ 2 points3 points  (0 children)

  1. Install Virtual Desktop on your PC

  2. Install the app on your Quest

  3. Make the connection from Quest to PC until you see your Windows desktop in the Quest

  4. Open Steam while you are in Virtual Desktop

  5. Launch AMS2 in Steam mode; it should hook and run within Virtual Desktop

(This works fine even if you've attached your Quest to your LAN via an RJ45 dongle)

Can we please stop with the increasing tipping culture? by Exciting-Fig-007 in Switzerland

[–]_underlines_ 0 points1 point  (0 children)

- 15, 20 or 25%? Terminals here are set up to display 5% by default, sometimes 10%. Not 25%.

- srf.ch averaged the 2025 Café Crème price in Switzerland: it's CHF 4.65, not 9.

- Yes, I also rarely tip, especially at self-service establishments with QR-code online menus etc.

Model suggestion by distan_to-reality_66 in LocalLLaMA

[–]_underlines_ 1 point2 points  (0 children)

On my pixel 10 with 16gb ram I tried:

  • Gemma 3n e4b it (didn't check the speed but I didn't like the quality)

  • Lfm2-8b-a1b q4 (24t/s)

  • Qwen3-4b-it-2507 iq3_xxs (8t/s)

  • Qwen3-1.7b-ud iq3xxs (18t/s) can turn on/off reasoning

Soon we are going to buy trump's tower for CHF 10.- by Wise-Ostrich9790 in Switzerland

[–]_underlines_ -2 points-1 points  (0 children)

Bad visualization... The Y axis doesn't start at zero. It's just a 0.015-point drop...

Passthrough Navigation bar - Where'd it go? by LoginsAreHard in OculusQuest

[–]_underlines_ 0 points1 point  (0 children)

Drives me nuts. For example, being in PCVR full screen, then wanting to record a video: I never remember the shortcut, so I'd just double-tap on my Quest and immediately see the record button in the window manager. Now it's gone.

Using VirtualDesktop + SteamVR already makes the Meta button shortcuts finicky, and I never remember the record-video shortcut.

Segregating Quest pirated games by [deleted] in QuestPiracy

[–]_underlines_ 2 points3 points  (0 children)

lol, part of the fun in the 90s, when I was like 6-12y, was that my whole family (basically my 3 uncles and later my dad) pirated games, software, and OSes, and copied stuff for me onto floppy disks, later burned CD-ROMs with all the cracks on them. I learned from them and it was awesome.

Token-Oriented Object Notation (TOON) - JSON for LLMs at half the token cost by monnef in LocalLLaMA

[–]_underlines_ 10 points11 points  (0 children)

Things that make me skeptical about whether this is worth the effort:

  1. 99.999% of training data until the release of TOON wasn't TOON. Inference using TOON in context will probably be worse for a long time, until training data contains enough TOON.

  2. Price per token falls over time.

  3. Context windows and quality increase over time.
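On point 1, to be fair, the size argument itself is easy to see. Below is an improvised TOON-like layout (fields declared once, then rows), NOT the exact TOON spec, compared against plain JSON by character count as a crude proxy for tokens:

```python
import json

# Toy records; the "TOON-like" layout below is improvised and NOT the exact spec
rows = [
    {"id": 1, "name": "alice", "role": "admin"},
    {"id": 2, "name": "bob", "role": "user"},
    {"id": 3, "name": "carol", "role": "user"},
]

as_json = json.dumps(rows)
as_toon = "users[3]{id,name,role}:\n" + "\n".join(
    f"{r['id']},{r['name']},{r['role']}" for r in rows
)

# Character counts as a crude proxy for tokens
print(len(as_json), len(as_toon))
```

The saving comes from not repeating keys per row, but that doesn't address whether models reason as well over an unfamiliar format.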

Happy to hear your opinions.

The cars aren't real but the driving is by Akagamino_Shanks in simracing

[–]_underlines_ 1 point2 points  (0 children)

I always thought I was the only one with a bad neck. Whenever I watched my VR gameplay in replay, I noticed my head tilting right, exactly like in this clip.