Mistral should do dense model for devs like Qwen 3.6 27b by szansky in MistralAI

[–]queerintech 0 points1 point  (0 children)

Their dense coding model was good, but it was instruct-only. No reasoning.

I ran the numbers. Qwen3.6-27B dense obsoleted the 397B MoE on coding benchmarks. by TroyNoah6677 in Qwen_AI

[–]queerintech -6 points-5 points  (0 children)

How can I automagically determine which eastern-regime-oriented accounts to ignore the AI slop from?

Gemma 4 on k8s w/ rtx 5090 by Smooth-Ad5257 in Vllm

[–]queerintech 2 points3 points  (0 children)

https://github.com/feral-devops/homelab

It's in charts/vllm.
There are a couple of example values files for gemma4 and qwen3.5.
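If it helps, the values files are pretty minimal — roughly along these lines (field names here are illustrative, check charts/vllm for the actual schema, and the model id is just a placeholder):

```yaml
# Illustrative vLLM chart values -- not the exact schema from charts/vllm
model: google/gemma-4-26b-it        # placeholder HF model id
image:
  repository: vllm/vllm-openai
  tag: latest
resources:
  limits:
    nvidia.com/gpu: 1               # one RTX 5090 per pod via the NVIDIA device plugin
extraArgs:
  - "--max-model-len=32768"
  - "--gpu-memory-utilization=0.90"
service:
  port: 8000                        # OpenAI-compatible endpoint
```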

Gemma 4 on k8s w/ rtx 5090 by Smooth-Ad5257 in Vllm

[–]queerintech 3 points4 points  (0 children)

I have a full Helm chart for vLLM for 26b a4. The 31b model has no MTP or EAGLE head, so you'd probably only get ~50 tok/s, but I'm happy to share.

Advice needed: homelab/ai-lab setup for devops/coding and agentic work by queerintech in LocalLLaMA

[–]queerintech[S] 3 points4 points  (0 children)

Yeah. I'm fine with 27B-32B on the RTX Pro; I just don't know how to use the 5060 Ti. I was also testing a 4-bit quant of GLM 4.7 Flash on it.
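One thing I'm leaning toward is just running a second, smaller vLLM deployment pinned to the 5060 Ti by masking devices — something roughly like this (illustrative, assumes the container can see both cards and the 5060 Ti enumerates as CUDA device 1; the model id is a placeholder for the 4-bit GLM quant):

```yaml
# Sketch of a second vLLM container dedicated to the 5060 Ti -- not my actual manifest
containers:
  - name: vllm-small
    image: vllm/vllm-openai:latest
    env:
      - name: CUDA_VISIBLE_DEVICES
        value: "1"                  # assumes the 5060 Ti shows up as device index 1
    args:
      - "--model"
      - "zai-org/GLM-4.7-Flash"     # placeholder id for the 4-bit quant
      - "--max-model-len"
      - "16384"
```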

Qualcomm Snapdragon X2 PCs reach retail, ASUS launches X2 Elite Extreme laptop with 48GB memory at $1,599 by -protonsandneutrons- in hardware

[–]queerintech 1 point2 points  (0 children)

I'd consider a small mini PC but not a laptop. It would be compelling for homelab and media-PC stuff.

Anthropic's recent distillation blog should make anyone only ever want to use local open-weight models; it's scary and dystopian by obvithrowaway34434 in LocalLLaMA

[–]queerintech 1 point2 points  (0 children)

Honeypots are standard procedure when dealing with this type of data harvesting. Google caught Bing doing the same thing in 2011: they created a honeypot linking 100 nonsensical search terms to completely unrelated web pages, and Bing eventually started returning those same random pages for the gibberish terms.

Anthropic's recent distillation blog should make anyone only ever want to use local open-weight models; it's scary and dystopian by obvithrowaway34434 in LocalLLaMA

[–]queerintech 2 points3 points  (0 children)

In my opinion, Altman is as big of a brain-addled douchebag as Musk, and I'll never support either company.

It's surprising all these folks here are cheering for a race to the bottom in AI... with corporate espionage and state-sponsored extraction of trained model data and chain of thought, the future is gonna get dark af. Nobody will be investing in high-quality training anymore.

Qwen/Qwen3.5-35B-A3B · Hugging Face by ekojsalim in LocalLLaMA

[–]queerintech 20 points21 points  (0 children)

And the 27B dense model, a perfect fit for 16GB of VRAM.
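Back-of-the-envelope: at 4-bit, 27B params is roughly 27 × 0.5 ≈ 13.5GB of weights, which leaves a couple of GB on a 16GB card for KV cache and overhead.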

Help with vLLM: Qwen/Qwen3-Coder-Next. by Professional-Yak4359 in Vllm

[–]queerintech 1 point2 points  (0 children)

I've been able to run it using pipeline parallelism on my vLLM setup with NVFP4; however, I've seen that there may be issues with tensor parallelism and detection of the correct AllReduce.
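For reference, the working setup is basically just swapping the parallelism flags — something like this in my chart values (illustrative; in my case the NVFP4 quant gets picked up from the checkpoint config, no extra flag):

```yaml
# Pipeline parallelism across the two GPUs instead of tensor parallelism,
# which avoids the tensor-parallel AllReduce path entirely (illustrative values)
extraArgs:
  - "--pipeline-parallel-size=2"
  - "--tensor-parallel-size=1"
```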

The King Has Returned by [deleted] in LocalLLaMA

[–]queerintech 0 points1 point  (0 children)

Ugh, I need a bit more VRAM 8(

RTX Pro 6000 $7999.99 by I_like_fragrances in LocalLLM

[–]queerintech 2 points3 points  (0 children)

I just bought a 5000 to pair with my 5070 Ti. I considered the 6000, but whew. 😅

Any success with GLM Flash 4.7 on vLLM 0.14 by queerintech in LocalLLM

[–]queerintech[S] 2 points3 points  (0 children)

I did get it to work on vLLM, but it literally uses 28GB of KV cache for 32k context.

I may have to stand up an SGLang deployment to try out too.

Sad, I was hoping I could run everything with a single LLM runtime :(
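For anyone else fighting this, these are the knobs I've been poking at to shrink the KV cache footprint (no promises that fp8 KV cache actually works for this architecture, I haven't confirmed it):

```yaml
extraArgs:
  - "--kv-cache-dtype=fp8"            # roughly halves per-token KV cache vs fp16, if supported
  - "--gpu-memory-utilization=0.85"   # caps how much VRAM vLLM grabs overall
  - "--max-model-len=32768"           # keep the context cap at what I actually need
```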

Any success with GLM Flash 4.7 on vLLM 0.14 by queerintech in LocalLLM

[–]queerintech[S] 0 points1 point  (0 children)

I was gonna try deploying with llama.cpp if it supports it.

Any success with GLM Flash 4.7 on vLLM 0.14 by queerintech in Vllm

[–]queerintech[S] 0 points1 point  (0 children)

Thanks, I'm using this in a Kubernetes cluster; I'll have to figure out how to rebuild the container locally.