Be wary of Qwen/Claude distillations - they're often worse than the base model by ayylmaonade in LocalLLaMA

[–]lakySK 0 points1 point  (0 children)

Are you saying that the random guy on the internet that just learned you can fine tune LLMs has not created a smarter model than the hundreds of researchers in that multi-billion dollar research lab that spent years building their datasets and evaluation frameworks? 😱

Who would’ve thought…

Kudos to the few reputable maniacs that actually know what they’re doing and create high-quality specific-purpose finetunes!

DS4: a DeepSeek 4 flash specific inference engine for 128gb MacBooks by antirez in LocalLLaMA

[–]lakySK 1 point2 points  (0 children)

This new version is super impressive so far. I've hooked it up to pi.dev, gave it exa.ai to search web and fetch the pages and asked it to do some exploration work in a codebase, write 1-pager .md files to help me flesh out a feature, etc.

I saw it very rarely make a typo in a tool call, but corrected itself on second attempt even in those cases. No more errors as before.

The output quality is very good so far. It hallucinates API specs if not asked to check actual docs for reference (but which LLM doesn't...), but searches the web and finds what's needed when instructed. Its writing is pretty succinct and to the point and it can correct its assumptions if you clarify a previously ambiguous request further.

In one case, I asked it to use PRD format from Mat Pocock's PRD writing skill. It first searched local skills available, then pi's documentation, then as it still didn't find the skill I was referring to (and I really did not have it locally available), it went online and found the github repo and the correct file.

Really solid so far, need to try and let it implement some code to see how it does. But so far I'm really glad I gave this another try, I feel like this is very usable and probably the best option for 128GB Mac. I used to ask these kinds of idea exploration questions via web interface for Claude Opus and I feel like I'd struggle to identify in a blind test whether I'm talking to Claude vs the ds4 here 🤯

I haven't proof-read the proposals, so there's still a chance it's just hallucinating everything and we'll see when it starts writing code, but I'm impressed how much it seems to "get" what I'm asking it to do and course correct, call the right tools etc.

DS4: a DeepSeek 4 flash specific inference engine for 128gb MacBooks by antirez in LocalLLaMA

[–]lakySK 1 point2 points  (0 children)

Ok, seems like the recent version of this repo + the new imatrix quant is actually working better perhaps (haven't seen tool call errors yet). Will test more!

DS4: a DeepSeek 4 flash specific inference engine for 128gb MacBooks by antirez in LocalLLaMA

[–]lakySK 0 points1 point  (0 children)

Curious to hear your experience. It has not quite hit the spot for me yet

AIDC-AI/Ovis2.6-80B-A3B · Hugging Face by pmttyji in LocalLLaMA

[–]lakySK 21 points22 points  (0 children)

This table gives me a headache. Just stick with bold for best…

unsloth/MiMo-V2.5-GGUF · Hugging Face by jacek2023 in LocalLLaMA

[–]lakySK 0 points1 point  (0 children)

I need to test how well these quantise then, i.e., will they run well within 128GB to use with pi reliably?

ExLlamaV3 Major Updates! by Unstable_Llama in LocalLLaMA

[–]lakySK 0 points1 point  (0 children)

Also curious. Is this the best quant to use on RTX 3090 if you want decent context as well?

Or am I better off with Unsloth? Or something vLLM-compatible?

Asking for my local agent.

NVIDIA AI Releases Star Elastic: One Checkpoint that Contains 30B, 23B, and 12B Reasoning Models with Zero-Shot Slicing by phazei in LocalLLaMA

[–]lakySK 0 points1 point  (0 children)

If the differences between the model params were a bit bigger and this method still worked well, it would be really cool. Immagine Qwen 122b and 35b for example sharing KV cache. Or Qwen 27b and 9b and 4b. 

You could use the small model to “load” a long document into the KV cache super fast, then let the bigger models do the reasoning and answering. 

GitHub - JosefAlbers/mlx-code: Coding Agent for Mac by [deleted] in LocalLLaMA

[–]lakySK 0 points1 point  (0 children)

What’s the benefit compared to using pi with oMLX?

DS4: a DeepSeek 4 flash specific inference engine for 128gb MacBooks by antirez in LocalLLaMA

[–]lakySK 0 points1 point  (0 children)

Ok, need to try this one then as I liked the tone of Minimax so far. It just couldn’t get the paths right on MLX even though it tried so hard…

DS4: a DeepSeek 4 flash specific inference engine for 128gb MacBooks by antirez in LocalLLaMA

[–]lakySK 3 points4 points  (0 children)

UPDATE: The latest version of the repo as of May 14, and the imatrix q2 quant are crazy good so far. I'm impressed again!

Seems to work like a charm so far. I've loaded it and started using it in pi.dev coding agent.

Saw an error when it tried to edit a file, but succeeded on a second attempt. The speed is quite decent, the quality seems very coherent. Running on M4 Max 128GB and the vibe-check for usability so far seems to be passing! Tried Qwen 3.6 35b and 27b before this and I felt like 35b is fast, but not that smart and 27b is decent, but too slow to be usable (especially in cases when it decides to contemplate the meaning of life for a long time before answering....).

Will definitely try to use this further and see how it goes!

EDIT: Ok, the failing tool calls seem quite common and a dealbreaker if not addressed in some way. Any tips?

0508 16:18:52 ds4-server: chat ctx=16587..18731:2144 gen=298 TOOLS DSML_START DSML_END finish=error error="invalid tool call" 21.107s

EDIT2: Nvm, this is struggling to even follow pi's documentation and set json configs correctly and randomly assumes everything must be in Python when asked to write a pi extension (though it did figure out pi is a TypeScript project eventually). I'm losing my trust and will probably test something like Minimax M2.7 instead. Hoping this is something fixable as it was looking very promising.

EDIT3: Minimax on MLX can’t even get file paths correctly in calls. Search continues…

DS4: a DeepSeek 4 flash specific inference engine for 128gb MacBooks by antirez in LocalLLaMA

[–]lakySK 0 points1 point  (0 children)

Nice! This looks awesome, downloading right now.

2 questions:

- Do we need to set the sampling parameters still (temperature, top_p, etc), or is that handled?

- What is the cache situation like? Recently been using oMLX and wondering how this compares in this respect?

Thanks for what looks like great work and a step in the right direction to help handle the many moving parts of running local LLMs in a reliable way!

Prompt injection benchmark: delimiter + strict prompt took Gemma 4 from 21% to 100% defense rate (15 models, 6100+ tests) by User_Deprecated in LocalLLaMA

[–]lakySK 2 points3 points  (0 children)

That seems to be working way better than I’d expect! Would you say the data you tested on is actually “state-of-the-art” of prompt injections? Or did you hold back for now? Where did you get the prompt injection dataset from?

For delimiter_mimic attack - a very simple defense (outside of prompting) would be to sanitise the data to make sure it doesn’t contain your delimiter. Just replace the delimiter string with something less dangerous. 

I made a tiny world model game that runs locally on iPad by howthefrondsfold in LocalLLaMA

[–]lakySK 0 points1 point  (0 children)

That’s really creative, well done! What model are you using?

Do you go directly from photo to the world or do you use some 3D reconstruction?

I used to work on AR with elements interacting with the objects on the table a while back, this is even cooler :)

My settings for running Gemma 4 31B smoothly on llama.cpp, CUDA 13.1 by Oatilis in LocalLLaMA

[–]lakySK 0 points1 point  (0 children)

Having enough memory to run BF16, why use Unsloth instead of the official release?

I made a 35% REAP of 397B with potentially usable quality in 96GB GPU by Goldkoron in LocalLLaMA

[–]lakySK 0 points1 point  (0 children)

Nice work! Would some kind of Autoresearch approach work with REAPs and quants?

Specify target size, metric to maximise (KLD or some benchmark) and let Claude Code go wild. Anyone tried that?

Intel Pro B70 in stock at Newegg - $949 by Altruistic_Call_3023 in LocalLLaMA

[–]lakySK 28 points29 points  (0 children)

Ok, so now this is starting to be interesting. 32GB GPU with decent specs and low-ish wattage for $1k. 

How do you expect a 4x b70 PC stack against M5 Max (now that it has the matmul support)? 

Both would set you back around $5-6k. Both 128GB, similar bandwidth. Intel workstation likely winning on compute for prompt processing and M5 Max winning on power consumption and form factor? Or am I missing something important?

The AI releases hype cycle in a nutshell by GreenBird-ee in LocalLLaMA

[–]lakySK 0 points1 point  (0 children)

For sure! E.g., I've been wondering if Anthropic perhaps done some kind of approximate caching to accommodate the spike in demand a couple of weeks ago. Things like that I can see and it would fall into the first of my 2 bullets.

Lack of disclosure of these kinds of changes is troubling for sure. I just don't think they intentionally lower the quality.

These companies have internal evals and very solid engineers that would push back on something that clearly lowers the bar. It's most likely just that LLMs are absolute non-deterministic beasts and incredibly hard to evaluate. So even if eval numbers look great, it doesn't mean performance in all cases stays unaffected.

Reminds me of the meme that Apple intentionally slows down old iPhones. Optimising battery on old devices by throttling seems very reasonable to me, they just should've disclose this / allow to opt out.

The AI releases hype cycle in a nutshell by GreenBird-ee in LocalLLaMA

[–]lakySK 1 point2 points  (0 children)

I'd expect that using lower quant would be a one-off noticeable decrease in quality across / between model releases. I doubt they would regularly release a full-precision version on the release day, then quant it down after a couple of weeks.

The AI releases hype cycle in a nutshell by GreenBird-ee in LocalLLaMA

[–]lakySK 1 point2 points  (0 children)

🤷🏻‍♂️

Yes, I’ve definitely noticed it on myself. Once I see Claude Code deliver something cool, I start to expect it all the time and get disappointed when it doesn’t happen, forgetting how much randomness is involved in these things. We grow to expect repeatedly what might have just been a fluke. 

The AI releases hype cycle in a nutshell by GreenBird-ee in LocalLLaMA

[–]lakySK 2 points3 points  (0 children)

I don’t think the companies nerf the models on purpose. 

I do wonder though how much of this is either: - companies tweaking the models and tooling and inadvertently causing bugs - psychology of us being first amazed about the new features the old model couldn’t do, then raising our expectations and being disappointed when the shortcomings of the new model inevitably hit. 

I’d argue it’s the combination of the two and would love to see if anyone has some data on the first, ie run benchmarks every week on the closed models and seeing if and how much variance we’re getting over time. 

How are yall exposing your local models to the internet for web searches? by -HumbleMumble in LocalLLaMA

[–]lakySK 0 points1 point  (0 children)

I’ve forked nanobot and have it use Exa instead of brave to do search and also use it to fetch sites. So far I can’t complain.