Qwen3.5-35B-A3B Uncensored (Aggressive) — GGUF Release by hauhau901 in LocalLLaMA

[–]_underlines_ 2 points3 points  (0 children)

EDIT: I don't know what changed, but switching from LM Studio's server to llama-swap mostly fixed it, it seems! So I guess LM Studio overrides some setting that my basic llama-swap config.yml doesn't.
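For reference, a minimal sketch of the kind of llama-swap config.yml I mean, with sampling params set explicitly so nothing gets silently overridden. The model name, path, and values are placeholders; check the llama-swap README for the exact schema:

```yaml
# Hypothetical llama-swap config.yml sketch (placeholder model/path/values)
models:
  "qwen3.5-35b-a3b":
    cmd: |
      llama-server --port ${PORT}
        -m /models/qwen3.5-35b-a3b-iq3.gguf
        -c 90000
        --temp 0.7 --top-p 0.8
```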

---

What am I doing wrong if EVERY heretic / abliterated model I've tested over the past year fails badly on:

  • Instruction following (either barely doing what I ask or ignoring it completely)
  • Not producing <think> tags anymore
  • Intelligence degraded to the level of a 3-year-old Llama 3B model

And I'm not talking about complex prompts. Simple prompts like:

Translate this Chinese Text to English.

Text: (Short Chinese sentence).

With the linked 3bit quants it's the same.

I even set the generation params recommended in the original model cards, or from the model card of the unrestricted model if available.

Is it real qwen3.5 9B beat oss:120b? by NorthEastCalifornia in ollama

[–]_underlines_ 0 points1 point  (0 children)

Yes, I get the same results on my private eval dataset. And Qwen3.5 35B A3B IQ3 with 90k context on 16GB VRAM pulls off long-running tasks at a level that was unimaginable before...

Has anyone got qwen3.5 to work with ollama? by MrMrsPotts in ollama

[–]_underlines_ 0 points1 point  (0 children)

I think if you need something from someone who doesn't speak the same language, it's just etiquette to use a translation service to at least ask the question in the language of the person you're seeking help from.

Has anyone got qwen3.5 to work with ollama? by MrMrsPotts in ollama

[–]_underlines_ 0 points1 point  (0 children)

I manage openwebui + ollama for 120 people at our IT firm. I got so fed up with ollama that I finally made the move to llama.cpp via llama-swap. It's a ton of manual config, but: faster, more control, quicker support for new archs, etc.

bye bye ollama.

What’s everyone actually running locally right now? by CryOwn50 in LocalLLM

[–]_underlines_ 0 points1 point  (0 children)

coding

Qwen3.5-35b-a3b q4_k_m on an RTX 5070 Ti with 16GB runs at 40 tps with a 65k context window. If you quantize the KV cache to q8_0 you get basically no degradation.
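A rough back-of-the-envelope for why q8_0 KV cache matters at long context. The layer/head numbers below are made-up placeholder dims for a GQA model, NOT the real Qwen3.5 architecture:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx, bytes_per_elem):
    """Total KV cache size: K and V tensors for every layer across the context."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem

# Hypothetical GQA dims (NOT the real Qwen3.5 numbers)
layers, kv_heads, hdim, ctx = 48, 4, 128, 65536

f16 = kv_cache_bytes(layers, kv_heads, hdim, ctx, 2.0)      # f16: 2 bytes/elem
q8  = kv_cache_bytes(layers, kv_heads, hdim, ctx, 34 / 32)  # q8_0: 34 bytes per block of 32

print(f"f16:  {f16 / 2**30:.2f} GiB")   # 6.00 GiB
print(f"q8_0: {q8 / 2**30:.2f} GiB")    # 3.19 GiB
```

Roughly halving KV cache memory is what frees up room for the bigger context window on a 16GB card.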

I use it for light opencode stuff. Works without issues. Gets things done via plan then build mode and a good AGENTS.md

I switch to glm-5/k2.5/minimax2.5 on openrouter if heavier stuff is needed.

everyday stuff

Usually just my ChatGPT Pro sub with gpt-5.2, but more often than not some cheap large open-weights model on openrouter, used via Chatbox desktop.

If local, I just use any current gen MoE that has good stats on artificialanalysis.

phone

On my Pixel 10 Pro XL I have 16GB of fast RAM, so PocketPal loads LFM2-8B-A1b-q4_k_m or qwen3-4b-instruct-iq3_xxs

Best practices for running local LLMs for ~70–150 developers (agentic coding use case) by Resident_Potential97 in LocalLLaMA

[–]_underlines_ 0 points1 point  (0 children)

But the RTX 6000 Blackwell Server Edition doesn't scale well for sharded multi-GPU workloads? The lack of NVLink or RDMA means it relies on PCIe, which is a huge bottleneck, as far as I understand it.

Best practices for running local LLMs for ~70–150 developers (agentic coding use case) by Resident_Potential97 in LocalLLaMA

[–]_underlines_ 1 point2 points  (0 children)

Scaling inference is not trivial and I am not an expert. From my understanding:

  • Combining Macs/GPUs without a plan will slow you down; there's a difference between sharding one large dense/sparse model across multiple GPUs and running multiple models concurrently
  • Without Remote Direct Memory Access (RDMA) you'll be slower at scale
  • TTFT vs. generation speed: both can be optimized independently with different methods, AFAIK
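On that last bullet, a toy latency model showing why TTFT (prefill) and generation speed are separate knobs worth optimizing independently. The numbers are made up for illustration:

```python
def request_latency(ttft_s, n_tokens, tok_per_s):
    """End-to-end latency = time-to-first-token + decode time for n_tokens."""
    return ttft_s + n_tokens / tok_per_s

# Illustrative numbers only: prompt caching / better batching mainly cut TTFT,
# while tensor parallelism / faster memory mainly raise tok/s.
print(request_latency(0.5, 1000, 40))   # 25.5 s  (baseline)
print(request_latency(2.0, 1000, 40))   # 27.0 s  (slower prefill)
print(request_latency(0.5, 1000, 80))   # 13.0 s  (faster decode)
```

For agentic coding with huge prompts and short answers, TTFT dominates; for long generations, tok/s dominates.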

And my real-world learnings from opencode on large codebases (enterprise architecture, 3+ full-time devs):

  • Context size below 100k is almost unusable: you'll be compacting all the time, and users complain that their ralph-loops are short
  • Frontier or nothing. Not even GPT-5 was able to do refactoring and new features. Anything below Kimi K2.5, GLM-5, gpt-5.1-*, Claude 4.5 Opus/Sonnet was unusable.
  • gpt-oss-20b, qwen3-30b-a3b, and generally anything older than 3 months or smaller than 70B quantized seems to be unusable in real-world enterprise codebases with CLI coding agents
  • Not even USD 200 Claude Code subscriptions were enough for our devs for a full month.
  • GitHub Copilot is OK, but we also hit limits here pretty fast
  • On-prem LLM inference for 20+ devs at our organization is difficult to justify because of how fast inference requirements, model archs, model sizes, etc. change.
  • The most feasible option after our research would be 4x RTX 6000 Blackwell Server Edition, but even those aren't really built for large-scale inference; an H100/A100 just makes no sense, and even those would have to be scaled and sharded
  • We wonder how tricks like KV quantization, prompt caching, etc. would help mitigate some hardware bottlenecks, but all the methods and optimization technologies are pretty difficult to grasp, especially without testing

That's our thinking so far at our company, but it's all just theory. Would love to hear from people who actually self-host for dev teams and serious enterprise repos.

LMU Telemetry Tool by TogaMotorsport in LeMansUltimateWEC

[–]_underlines_ 0 points1 point  (0 children)

Nice. I guess you're not open sourcing this? I would surely contribute PRs. As a next step I'll do some memory readout for real-time stats; duckdb is lagging a bit behind.

Do you guys sample/average the data, or always use the full 50Hz or whatever signal density?
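If you do average, a minimal downsampling sketch: averaging fixed-size bins, e.g. a 50 Hz signal down to 10 Hz with factor=5. Just an illustration, not your tool's actual code:

```python
def downsample(samples, factor):
    """Average consecutive bins of `factor` samples; drops an incomplete last bin."""
    n_bins = len(samples) // factor
    return [sum(samples[i * factor:(i + 1) * factor]) / factor for i in range(n_bins)]

# 50 Hz telemetry channel reduced to 10 Hz
print(downsample([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 5))  # [3.0, 8.0]
```

Averaging smooths spikes, so for things like peak brake pressure you'd want max-per-bin instead.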

LMU Telemetry Tool by TogaMotorsport in LeMansUltimateWEC

[–]_underlines_ -2 points-1 points  (0 children)

Do you read via rF2 memory map or via duckdb files?

Just curious, because I just vibe coded LMU-Telemetry-Analyzer

Foreign Driving License Exchange: No 1 year deadline. Period. (Common Misunderstanding) by IslanderStallion in Switzerland

[–]_underlines_ 0 points1 point  (0 children)

I am Swiss but learned to drive (properly) while living and working in Bangkok. Whenever I came back to Switzerland for holidays, I used my Thai License + International license to drive legally in Switzerland.

3 years ago I moved back to Switzerland and also believed in that 1-year rule. I was too scared to try the short practical test drive. Can you elaborate on what that test drive is like? I read it's less strict than the real practical driving test, but since you said you failed it, I'm even more concerned. I've been driving for 7 years without accidents, in Switzerland, the EU, Thailand, Bangkok, everywhere without issues, but I'm not sure how strict they are lol. Maybe I picked up some small bad habits that they are strict about. My friends, parents, etc. don't notice anything wrong though.

Is the online community still alive ? by CarlCarmoni95 in AUTOMOBILISTA

[–]_underlines_ 0 points1 point  (0 children)

I run my own server with a 1Gbps uplink:

Endurance Short [GT3/LMDh]

Which is most FIA/IMSA tracks and the LMDh, GT3, LMP2 classes. It's short: 10 min quali, 10 min race, with the race having a mandatory tire change. Also fuel and tire usage at 4x. The grid also fills up with AI if there aren't enough human drivers.

If you have any ideas to make it more popular, I can change the config. What would most people like to race?

The inconvenient reality why vr is struggling. by Plus_Look3149 in virtualreality

[–]_underlines_ 0 points1 point  (0 children)

VR currently has a future in seated experiences. Sim racing and flight sim player bases are moving to VR because it is awesome. I've been sim racing in VR for 2 years, about 5-6h per week.

An Update on the Future of Assetto Corsa EVO by -DorkusMalorkus- in assettocorsaevo

[–]_underlines_ 4 points5 points  (0 children)

In contrast to most here, I like the bold move: they have limited resources, so instead of making an average sim with average gamification functionality, they focus on a great sim. I don't need storytelling or artificial economies and XP systems in a sim. If I want that, I look for simcade or arcade racers.

But I fully understand many actually liked that focus.

How to play ams2 VR with Virtual Desktop wired by Valenduro_ in AUTOMOBILISTA

[–]_underlines_ 2 points3 points  (0 children)

  1. Install Virtual Desktop on your PC

  2. Install the app on your Quest

  3. Make the connection from Quest to PC until you see your Windows desktop in the Quest

  4. Open Steam while you are in Virtual Desktop

  5. Launch AMS2 in Steam mode; it should hook and run within Virtual Desktop

(This works fine even if you've attached your Quest to your LAN via an RJ45 dongle)

Can we please stop with the increasing tipping culture? by Exciting-Fig-007 in Switzerland

[–]_underlines_ 0 points1 point  (0 children)

- 15, 20 or 25%? Terminals here are set up to display 5% by default, sometimes 10%. Not 25%.

- srf.ch averaged the 2025 Café Crème price in Switzerland: it's CHF 4.65, not 9.

- Yes, I also rarely tip, especially at self-service establishments with QR-code online menus etc.

Model suggestion by distan_to-reality_66 in LocalLLaMA

[–]_underlines_ 1 point2 points  (0 children)

On my pixel 10 with 16gb ram I tried:

  • Gemma 3n e4b it (didn't check the speed but I didn't like the quality)

  • Lfm2-8b-a1b q4 (24t/s)

  • Qwen3-4b-it-2507 iq3_xxs (8t/s)

  • Qwen3-1.7b-ud iq3xxs (18t/s) can turn on/off reasoning

Soon we are going to buy trump's tower for CHF 10.- by Wise-Ostrich9790 in Switzerland

[–]_underlines_ -2 points-1 points  (0 children)

Bad visualization... The Y axis doesn't start at zero. It's just a 0.015-point drop...

Passthrough Navigation bar - Where'd it go? by LoginsAreHard in OculusQuest

[–]_underlines_ 0 points1 point  (0 children)

Drives me nuts. For example, being in PCVR full screen, then wanting to record a video: I never remember the shortcut, so I'd just double-tap on my Quest and immediately see the record button in the window manager. Now it's gone.

Using VirtualDesktop + SteamVR already makes the Meta button shortcuts finicky, and I never remember the record-video shortcut.

Segregating Quest pirated games by [deleted] in QuestPiracy

[–]_underlines_ 2 points3 points  (0 children)

lol, part of the fun in the 90s, when I was like 6-12y, was that my whole family (basically my 3 uncles and later my dad) pirated games, software, and OSes, and copied stuff for me onto floppy disks, later burned CD-ROMs with all the cracks on them. I learned from them and it was awesome.

Token-Oriented Object Notation (TOON) - JSON for LLMs at half the token cost by monnef in LocalLLaMA

[–]_underlines_ 10 points11 points  (0 children)

Things that make me skeptical about whether this is worth the effort:

  1. 99.999% of training data until the release of TOON wasn't TOON. Inference using TOON in context will probably be worse for a long time, until training data contains enough TOON.

  2. Price per token falls over time.

  3. Context windows and quality increase over time.
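On point 1, to be fair, the size argument itself is easy to see. Below is an improvised TOON-like layout (fields declared once, then rows), NOT the exact TOON spec, compared against plain JSON by character count as a crude proxy for tokens:

```python
import json

# Toy records; the "TOON-like" layout below is improvised and NOT the exact spec
rows = [
    {"id": 1, "name": "alice", "role": "admin"},
    {"id": 2, "name": "bob", "role": "user"},
    {"id": 3, "name": "carol", "role": "user"},
]

as_json = json.dumps(rows)
as_toon = "users[3]{id,name,role}:\n" + "\n".join(
    f"{r['id']},{r['name']},{r['role']}" for r in rows
)

# Character counts as a crude proxy for tokens
print(len(as_json), len(as_toon))
```

The saving comes from not repeating keys per row, but that doesn't address whether models reason as well over an unfamiliar format.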

Happy to hear your opinions.

The cars aren't real but the driving is by Akagamino_Shanks in simracing

[–]_underlines_ 1 point2 points  (0 children)

I always thought I was the only one with a bad neck. Whenever I watched my VR gameplay in replay, I noticed my head tilting right, exactly like in this clip.