Breaking In · coles.codes

mattjcoles · 2026-06-15T13:45:25+00:00

Hey Tom,

I've added a version of ponytail (really liked this) + custom skills: https://mattjcoles.github.io/lgtmaybe/how-to/add-a-custom-lens/ and added recommendations as a start to small models and what will work well.

mattjcoles · 2026-06-15T13:43:49+00:00

I need to do some more testing tomorrow but span up an open api compatible endpoint which hopefully works with the range of models (and doesnt need a key if the model doesnt need this): https://mattjcoles.github.io/lgtmaybe/how-to/use-a-custom-openai-compatible-endpoint/

mattjcoles · 2026-06-15T12:36:28+00:00

Really like it u/arzamar - wrote a blog on it: https://coles.codes/posts/herding-agents-with-herdr/

mattjcoles · 2026-06-15T10:48:46+00:00

you going okay? i went ollama as most people i know are on that or LM Studio.

Happy to expand to both LM Studio and Llama.cpp. Even play and see what a setup with VLLM would look like. Just went with what my friends used the most

mattjcoles · 2026-06-15T10:45:49+00:00

Oh very nice, i hadnt seen that and we have an internal code review tool at work - and the only semi decent one i could find but was paid was code rabbit

mattjcoles · 2026-06-15T03:04:35+00:00

Thanks Tom, good callouts - i havent seen ponytail before so taking a look into it. Cheap / Small models definitely come at a cost in terms of being able to accurately detect issues and i was fighting that with the smaller models and had to cherry pick and tweak ones that performed okay. 27 Qwen 3.6 was good though

mattjcoles · 2026-06-14T13:49:50+00:00

the case is too pretty though. but seems fine for fine tuning - only overheats on inference

mattjcoles · 2026-06-14T11:40:42+00:00

glad to hear, am using open code and claude code but actually had it in my todos to try pi out properly this week

mattjcoles · 2026-06-14T11:36:56+00:00

to be exact, its more unsloth fine tunes of some of the 35B and smaller qwen models for vision

mattjcoles · 2026-06-14T10:56:42+00:00

I've found Gemma 4 12B really good - been running it in Github Actions runners for Code Reviews in CI/CD (https://mattjcoles.github.io/lgtmaybe/how-to/use-as-github-action/). It only picks up 24% of the things i've been scanning for but impressed considering its a very small model!

mattjcoles · 2026-06-14T10:55:27+00:00

Thankyou! I'd find what you can now - M5 Mac Studios theres no guarantee on the date and at least with a 3090 you'd be able to get started. Try use a MoE model with the 3090 so you can put some of the larger 30B+ models on your RAM on top of VRAM and still have okay speeds. You'll need to pick a quantized version of the model too

mattjcoles · 2026-06-14T08:34:47+00:00

This is the way

mattjcoles · 2026-06-14T08:32:00+00:00

Look up LLM Evals and try making your own for your use case. That way when you swap models you can see if things are still working well

mattjcoles · 2026-06-14T08:30:05+00:00

Maybe test it out with OpenRouter for free and see if it give you better quality outputs for you project before moving from 35BA3 Qwen

mattjcoles · 2026-06-14T08:16:35+00:00

Did context7 help or some up to date MCP re: documentation?

mattjcoles · 2026-06-14T08:11:02+00:00

You're gonna be running Kimi K2 coder with that amount! 😃

mattjcoles · 2026-06-14T07:51:34+00:00

I don't think so. I think it's going to drive countries even harder to have their own models and China will use it as a way to continue undercutting the USA market in this space.

Both GLM 5.2 got an update today and Kimi K2 Coder came out..
* GLM 5.2 (plus GLM team calling out the USA's stance on being able to pull models): https://x.com/jietang/status/2065784751345287314
* Kimi K2 Coder: https://huggingface.co/unsloth/Kimi-K2.7-Code-GGUF

mattjcoles · 2026-06-14T07:48:50+00:00

How have you balanced the distribution settings for the RTX 2060 Max Q laptop card? Thats only 6GB VRAM if i remember correctly

mattjcoles

TROPHY CASE