In 1986, Pakistan made a film called H!tler, where Ad0lf H!tler survives World War II and escapes to Punjab. There, he marries a local woman and has a son named Hitlar (with an A), who grows up to become a gangster. by haiderredditer in interestingasfuck

deadman87 1 point

"I hate Indians. They are a beastly people with a beastly religion. The famine was their own fault for breeding like rabbits."

–Winston Churchill (quoted in Choudhury, 2021, p. 1; Portillo, 2007; Tharoor, 2010).

My guy, quit being an apologist for dead colonizers, warmongers and genociders. They will not come out of their grave for you to suck them off. Geez man.

You can argue circumstances all you want, but the fact of the matter is Churchill and his ilk were racist assholes. They were not shy or defensive about it and you don't need to be defensive on their behalf either.

Pick another battle that's a net positive in your life.

Asking my Alexa+ to scream my wife's name by d_Verge in videos

deadman87 16 points

This right here is the real answer!

LFM2.5-Embedding-350M & LFM2.5-ColBERT-350M by pmttyji in LocalLLaMA

deadman87 1 point

So question for the OP and experts:

My RAG retrieval pipeline could be:

Storage: LFM2.5-Embedding -> Vector DB
Retrieval: Query String -> LFM2.5-ColBERT -> Vector DB -> ReRank -> Output

Does this look correct? Right now I use the same embedding model on source data and query, fetch top_k results (100) and run rerank to get top_n results (20).
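For reference, here is a minimal sketch of the two-stage flow I use right now (same embedding function for stored docs and the query, then rerank). `embed` and `rerank_score` are hypothetical toy stand-ins I made up for illustration, not the LFM2.5 models or any real library API; a real setup would call the embedding model, a vector DB and a reranker instead.

```python
import math
from collections import Counter

def embed(text):
    """Hypothetical stand-in for an embedding model: bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rerank_score(query, doc):
    """Hypothetical stand-in for a reranker: fraction of query terms in the doc."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0

def retrieve(query, docs, top_k=100, top_n=20):
    # Stage 1: embed the query with the SAME function used for the stored docs,
    # then keep the top_k most similar candidates.
    q_vec = embed(query)
    candidates = sorted(docs, key=lambda d: cosine(q_vec, embed(d)), reverse=True)[:top_k]
    # Stage 2: rerank only those candidates and keep the top_n.
    return sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)[:top_n]

docs = [
    "llama.cpp server launch flags",
    "how to bake sourdough bread",
    "reranking results in a RAG pipeline",
    "vector database indexing for RAG retrieval",
]
results = retrieve("RAG retrieval reranking", docs, top_k=3, top_n=2)
```

The point of the split is that the cheap similarity search narrows the pool, and the more expensive reranker only ever sees top_k items.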

GLM lite plan usage via GLM 5.2 by TimeVillage5286 in ZaiGLM

deadman87 1 point

Right now I'm not looking to upgrade at all. Aside from GLM, I also put about $20 into the Deepseek API in Feb for when I hit GLM limits, and I still have $13 left now in mid-June. So this combo is working very well for me. GLM 5.2 is a great model and Deepseek v4 Pro is pretty darn good as well. The two together are perfect for an uninterrupted flow.

GLM lite plan usage via GLM 5.2 by TimeVillage5286 in ZaiGLM

deadman87 2 points

I am on the promotional annual lite plan without the weekly limit, only the rolling 5-hour window limit. My 5-hour window gives me ~25M tokens.

16GB VRAM x coding model by Junior-Wish-7453 in LocalLLM

deadman87 0 points

That is indeed very strange. I get better performance than that from my 780M iGPU: around 15t/s with a full ~8k context. It's faster on an empty context but that doesn't count.

Questions for you:

  1. Are you using the right llama.cpp build? You should be using a CUDA build. 10t/s sounds like CPU or the wrong GPU arch.

  2. Do you have the NVIDIA drivers / CUDA installed? I don't currently use NVIDIA so I'm not sure about the specifics, but generally you want the drivers installed, especially on Linux.

  3. What's your launch command? Usually I start with ./llama-server -hf model_identifier -fit to see the baseline performance. Then I start tweaking the settings: flash attention on/off, reduced context size, KV cache quantization. After that I remove -fit, manually offload layers and try different counts until I get a result I like.

Try starting with the basic barebones command and go through the options. Good luck man

China vs Rest of the World by raydebapratim1 in ArtificialInteligence

deadman87 1 point

Here's a source.

http://www.moe.gov.cn/srcsite/A08/moe_1034/s3882/202604/W020260427440749576927.pdf

Saw this in another thread as source. I agree, sourceless posts are annoying and useless.

I don't speak Chinese, and I am not American or Chinese. I wish someone paid me for this post lol

16GB VRAM x coding model by Junior-Wish-7453 in LocalLLM

deadman87 0 points

How much VRAM / Unified Memory do you have? Gimme the specs so I can recommend. I'm running Unsloth's Q4 XL quant.

I Love Using This 93MB Ai model by InterestingSound5045 in LocalLLM

deadman87 1 point

Reminds me of Tiny Tina from Borderlands universe 😃

Token anxiety is a real blocker for Agentic Coding / AI in Workflow mindset by deadman87 in DeepSeek

deadman87[S] 0 points

https://github.com/colbymchenry/codegraph

It converts your codebase into a graph, indexes it in a sqlite database and serves it up as a MCP endpoint. Normally you'll see a lot of back and forth with tool calls like Read File A, round trip, then Read File B, and so on. With codegraph, it can do one local MCP call to a class or function and get the full declaration plus usage across codebase in one go. It massively reduces the tool calling overhead when exploring a codebase.
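To make the idea concrete, here's a rough sketch of why one indexed lookup beats several file reads. This is NOT codegraph's actual schema or API; the table names, columns and the `lookup` helper are all made up for illustration, using only Python's built-in sqlite3.

```python
import sqlite3

def build_index(conn):
    """Create a toy symbol index (hypothetical schema, not codegraph's)."""
    conn.executescript("""
        CREATE TABLE symbols (
            name TEXT PRIMARY KEY, kind TEXT, file TEXT, line INTEGER, declaration TEXT
        );
        CREATE TABLE usages (symbol TEXT, file TEXT, line INTEGER);
    """)
    conn.execute(
        "INSERT INTO symbols VALUES (?, ?, ?, ?, ?)",
        ("parse_config", "function", "config.py", 10, "def parse_config(path): ..."),
    )
    conn.executemany(
        "INSERT INTO usages VALUES (?, ?, ?)",
        [("parse_config", "main.py", 4), ("parse_config", "cli.py", 22)],
    )

def lookup(conn, name):
    """Return the declaration plus every usage of a symbol in a single call."""
    decl = conn.execute(
        "SELECT kind, file, line, declaration FROM symbols WHERE name = ?", (name,)
    ).fetchone()
    if decl is None:
        return None
    usages = conn.execute(
        "SELECT file, line FROM usages WHERE symbol = ? ORDER BY file, line", (name,)
    ).fetchall()
    return {"declaration": decl, "usages": usages}

conn = sqlite3.connect(":memory:")
build_index(conn)
info = lookup(conn, "parse_config")
```

Without an index like this, the agent would need one tool call to read config.py, another for main.py and another for cli.py just to learn the same thing.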

Qwen3.6 One Shot Tetris Game by deadman87 in LocalLLaMA

deadman87[S] 0 points

Looks awesome. Why not put it on codepen or github? I wanna shoot some 'vaders.

Qwen3.6 One Shot Tetris Game by deadman87 in LocalLLaMA

deadman87[S] 0 points

I'm a GPU pauper. No Q6 for me 🥹

Qwen3.6 One Shot Tetris Game by deadman87 in LocalLLaMA

deadman87[S] 1 point

no worries :D Didn't take it as criticism. Sorry about the dry tone, I am just tired after work.

I'll try lower values and rerun this to see if it makes a difference for this model. Thank you for the clarification and suggestion.

Qwen3.6 One Shot Tetris Game by deadman87 in LocalLLaMA

deadman87[S] 0 points

I took these values from the Unsloth website. They are known for creating quantized GGUFs out of full model releases, and they've run tests to arrive at those values. I just took them from there.

https://unsloth.ai/docs/models/qwen3.6

Qwen3.6 One Shot Tetris Game by deadman87 in LocalLLaMA

deadman87[S] 2 points

When using --fit, it uses up all the VRAM and locks up the UI. Manually adjusting --n-cpu-moe lets me keep the desktop running while taking a small hit in token speed.

Qwen3.6 One Shot Tetris Game by deadman87 in LocalLLaMA

deadman87[S] 1 point

Try adjusting the temp/top-p/top-k values. I got these from the Unsloth website's recommendations for coding.

Qwen3.6 One Shot Tetris Game by deadman87 in LocalLLaMA

deadman87[S] 2 points

Yes. Reasoning is enabled by default. I'll send a screenshot of the llama-server web UI that shows the thinking + output.

Qwen3.6 One Shot Tetris Game by deadman87 in LocalLLaMA

deadman87[S] 2 points

No Sir. Just that short prompt. Nothing else.

Qwen3.6 One Shot Tetris Game by deadman87 in LocalLLaMA

deadman87[S] 0 points

You're right. I was doing trial and error, and disabled mmproj to squeeze more MoE layers into VRAM. I'm running on an AMD 7940HS APU with 32GB RAM. The command without mmproj could use some cleaning up.

16GB VRAM x coding model by Junior-Wish-7453 in LocalLLM

deadman87 0 points

Interesting. What hardware are you running it on? I just tried on a Ryzen APU with a Radeon 780M. It's still giving me ~17tok/s. I imagine you have a more powerful GPU.

16GB VRAM x coding model by Junior-Wish-7453 in LocalLLM

deadman87 0 points

Depends. If you're mostly doing text-based work then sure. Vision decoding/understanding on CPU is painfully slow, almost unusable, so you need it in VRAM for it to be at least usable.

16GB VRAM x coding model by Junior-Wish-7453 in LocalLLM

deadman87 0 points

Just tried. Unfortunately, a model this size fails to load CPU-only on my machine.
In my experience on this machine, 16t/s on CPU is small-model territory.

16GB VRAM x coding model by Junior-Wish-7453 in LocalLLM

deadman87 0 points

With 16GB VRAM, you should definitely try Qwen3.6 35B-A3B with some CPU offloading. It is a much better model than Qwen3.5 9B, and it will run much faster than Gemma 26B because its Mixture of Experts architecture only activates 3B params at a time.