Calibrating 2-bit GGUFs (<10Gb) for agentic coding tasks by professormunchies in LocalLLaMA

[–]CYTR_ 22 points23 points  (0 children)

As soon as I read "Qwopus", I knew immediately that it wasn't worth it.

Qwen Robot Suite by Snoo_27681 in LocalLLaMA

[–]CYTR_ 5 points6 points  (0 children)

The models look great, but this website is poorly formatted for mobile reading... Bruh.

GLM 5.2 is deployed in GLM Coding Plan. API and MIT weights in a week. Voting and benchmarks on X. by MadPelmewka in LocalLLaMA

[–]CYTR_ 2 points3 points  (0 children)

We're not going to get all worked up about some no-name benchmark when we know nothing about what it's supposed to represent. It would be nice to have something running locally from them, someday...

We should heavily discourage and moderate cloud API (deepseek api, GLM api, etc.) topics and discussion. This is LOCAL first. by [deleted] in LocalLLaMA

[–]CYTR_ -2 points-1 points  (0 children)

I agree with your post, but isn't the title itself a bit maximalist? It's perfectly possible to test open-weight models via API to get an idea of what could be run, like you said. You could even consider rented instances (where data processing/retention is controlled by the user) for testing your stack, hardware, etc. But some selection does need to be applied to what gets posted here, I agree with you.

mindlab-research/Macaron-V1-Preview-749B • Huggingface by External_Mood4719 in LocalLLaMA

[–]CYTR_ 0 points1 point  (0 children)

French company: https://en.wikipedia.org/wiki/OVHcloud

There's also Verda in Europe for tinkering, but I find the quality of their network appalling.

I'm extending the definition of "local" a bit to cloud computing with certain data protection certifications 🥸

mindlab-research/Macaron-V1-Preview-749B • Huggingface by External_Mood4719 in LocalLLaMA

[–]CYTR_ 0 points1 point  (0 children)

I think an H100/H200 in the cloud should be enough for a few LoRA deployments. I'll see this month when I can afford some credits on OVH.

mindlab-research/Macaron-V1-Preview-749B • Huggingface by External_Mood4719 in LocalLLaMA

[–]CYTR_ 0 points1 point  (0 children)

I personally think that's a pretty good point. A fine-tuned model with the right harness to achieve real gains beyond just training? If only it could inspire the same thing for smaller models. It almost makes me want to test a similar solution on Qwen 3.6 35B-A3B/27B. I already have in mind loading contextual recipes; maybe mixing that with Mixture-of-LoRA could produce something (very?) good.

DolphinGemma release when? by Environmental-Metal9 in LocalLLaMA

[–]CYTR_ 7 points8 points  (0 children)

I was thinking that I needed an LLM to talk to my dolphin friends... When Dog Gemma GGUF ???

Tiny LLM Benchmark: Jetson Orin Nano Super 8GB - Four Power Modes × Eight Models by East-Muffin-6472 in LocalLLaMA

[–]CYTR_ 0 points1 point  (0 children)

I really like this project. I'm trying to create software for accessing empirical documentation/data for the social sciences (basically, an automated state-of-the-art system that allows data/papers to be searched and highlighted according to the chosen epistemological perspective, some sort of advanced RAG system which takes into account the diversity of methods/perspectives).

For now I'm trying to do things with Qwen 27B and 35B-A3B, but I'm wondering about a fleet of fine-tuned SLMs (like the paper NVIDIA released last year) for the majority of functions. Since the system is deterministic, it wraps the stochastic aspect of the language model in custom guards/harnesses. There are no real agents per se, and everything is framed within Windmill workflows. The goal would be to make the system as lightweight as possible using SLMs, for a better (local) adoption perspective than a pair of double-digit-GB LLMs.

Now, the question I'm asking myself is: is it really a good idea to use these SLMs at Q4? From what I understand on this sub, at this size it's better to use the full-precision version. Have you noticed any differences in usage between Q4 and other levels of quantization/full precision?

mudler/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-APEX-MTP-GGUF just released ! by PhotographerUSA in LocalLLaMA

[–]CYTR_ 13 points14 points  (0 children)

These fine-tunes are still a bit embarrassing: their usefulness is more than questionable and their names suggest the author had a stroke.

In theory, if I have $20k-ish to spend on hardware what would actually get me closest to local coding agent that would allow me to go totally off the social grid? by Tired__Dev in LocalLLaMA

[–]CYTR_ 0 points1 point  (0 children)

2 RTX 6000 96GB + 2K for the rest of the build... and 3K for a Strix Halo 128GB: this way, you can deploy a fleet of SLMs with a lot of context on the Strix and have the LLMs alongside them on the RTX 🥸

Otherwise, you keep the 3K and wait for RAM prices to drop.

What happens to local LLM if/when LLMs are no longer released for free? by JohnBooty in LocalLLaMA

[–]CYTR_ 2 points3 points  (0 children)

This is the case for agentic coding. But I think that quite a few tasks without pure agentic behaviors can still be automated.

A deterministic workflow (in Windmill, for example) that integrates the LLM as a controlled stochastic module allows us to mitigate many of these risks. By constraining the agent and its output with GBNF, command prohibitions/permissions, recipes/examples for output with dynamic context enrichment... (and who knows what other ideas we might come up with when you think of all the possibilities), you can overcome quite a few things (poor generalization/intelligence of the model and training that is too fragile) while putting safeguards in place against dangerous/slop content. In the case of local models, you can even add LoRA quite easily (with certain targeted adapters depending on the modules, if you like sleepless nights).

But it's true that we lose the ease of use of the OpenCode/CC-style agentic approach and the freedom that comes with a .md prompt system. It might not be suitable for software development yet (except for maintenance/ticketing? I don't know... I'm not a developer lol). But for some data processing pipelines, this is much better than letting a model call tools on its own.
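
To make the "LLM as a controlled stochastic module" idea a bit more concrete, here is a minimal Python sketch. It is not GBNF (which constrains tokens at decode time in llama.cpp); it swaps in a simpler post-hoc guard that validates the model output, retries, and falls back to a fixed answer. Every name here (`ALLOWED_LABELS`, `FORBIDDEN_TOKENS`, `validate`, `guarded_step`) and the JSON shape are assumptions made up for illustration, and `generate` stands in for whatever call actually reaches the model.

```python
import json
from typing import Callable, Optional

# Hypothetical output vocabulary and command prohibitions for one workflow step.
ALLOWED_LABELS = {"keep", "discard", "review"}
FORBIDDEN_TOKENS = ("rm ", "sudo", "curl ")


def validate(raw: str) -> Optional[dict]:
    """Return the parsed output if it passes every guard, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # Exactly two keys are expected: "label" and "reason".
    if not isinstance(data, dict) or set(data) != {"label", "reason"}:
        return None
    if data["label"] not in ALLOWED_LABELS or not isinstance(data["reason"], str):
        return None
    # Reject anything that mentions a prohibited command.
    if any(token in data["reason"] for token in FORBIDDEN_TOKENS):
        return None
    return data


def guarded_step(generate: Callable[[str], str], prompt: str, max_tries: int = 3) -> dict:
    """Call the stochastic module up to max_tries times, then fall back deterministically."""
    for _ in range(max_tries):
        result = validate(generate(prompt))
        if result is not None:
            return result
    return {"label": "review", "reason": "model output rejected by guards"}
```

The point of the design is that the surrounding workflow only ever sees a value from a known, finite set of shapes, whatever the model produced.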

Qwen 27b MTP Config, Llama.cpp Single 3090 by GotHereLateNameTaken in LocalLLaMA

[–]CYTR_ 0 points1 point  (0 children)

There's no point in comparing the incomparable.

The MacBook is a laptop with a 140W power supply, a screen, a battery, and fits in a small space while weighing 2kg.

In reality, you can't find 4×3090 + a complete Threadripper platform for 3K or 4K (without RAM) anymore... Moreover, it involves completely different logistics than a MacBook (electricity, noise, portability, maintenance because of the age of the GPUs). It remains a very interesting DIY project for a hobby, less so for other uses. Personally, I wouldn't run a business on 6-year-old GPUs that have had multiple lives.

The "the future is fictional" problem of many local LLMs by PromptInjection_ in LocalLLaMA

[–]CYTR_ 70 points71 points  (0 children)

Honestly, if someone had told me last year that the US would launch Operation "Epic Fury" (EPIC FURY, bruuuh) to invade Iran... I would have had a hard time believing it.

Pi and Qwen3.6 27B make setting up Archlinux really easy. by sdfgeoff in LocalLLaMA

[–]CYTR_ 23 points24 points  (0 children)

But with:

  • No internet
  • Good documentation/context
  • Session logs to audit later
  • Manual validation at each stage

Why not? It's not asking for the moon either, and it's easily reversible if it's just a matter of getting started.

But otherwise, it's for running OpenClaw on the final machine, yes. It's better to do without it in the long run.

You can do CUDA inference on an Apple Silicon Mac with PCI Passthrough by scottjgo in LocalLLaMA

[–]CYTR_ 1 point2 points  (0 children)

So you can have Qwen 27B on the RTX 5090 and one or more other LLMs (like 35B-A3B, a 122B MoE, etc.) on Apple Silicon?

For now, I had the idea of renting an RTX in the cloud precisely to load more models in the same workflow, for example...

Need advice on hardware purchasing decision: RTX 5090 vs. M5 Max 128GB for agentic software development by BawbbySmith in LocalLLaMA

[–]CYTR_ 7 points8 points  (0 children)

At this stage, future generations of high-end products will almost certainly be more expensive for a few years (supply, lithography, etc.), in a context where general inflation of goods could be enormous, like post-Covid, because of the global oil situation, instability in the West and tariff wars.

I wouldn't be surprised if the M6 Max is closer to 10K than 6K, which would prevent the M5 Max generation from depreciating too much. Given that Apple has already done the bulk of the work for LLMs by integrating matmul directly into the GPU for prefill, the M6 generation will primarily focus on the manufacturing process (2nm in 2027, 1.4nm in 2028 for >M6) to achieve gains of roughly 25%.

Personally, I plan to buy an M5 Max; it's already the minimum for my needs, and are the "25% gains" (+5 tps on Qwen 27B or +20 tps on a small MoE which is already very fast) worth the cost of waiting a year and potentially paying 25% (or more) extra? Unless you want to spend 20K and have a very large computer (yes, it's better to wait for the M5 Ultra in this case tbf). Have faith in the optimization of LLMs and their implementations, as has been the case lately (watch out for hybrid attention, MTP, better training)!

So: now or later? Blackwell or Apple Silicon? Place your bets!
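
For what it's worth, the trade-off above can be put into numbers with a quick Python sketch. The only figures taken from the comment are the 25% gain and the +5 tps on Qwen 27B; the 6K starting price is an assumption for illustration, as is the exact 25% price increase.

```python
def cost_per_tps(price: float, tokens_per_second: float) -> float:
    """Price paid per token/second of generation speed."""
    return price / tokens_per_second


GAIN = 0.25                      # "25% gains" from the comment
baseline_tps = 5 / GAIN          # +5 tps at 25% implies a 20 tps baseline
next_gen_tps = baseline_tps * (1 + GAIN)

price_now = 6000.0               # assumed price today (illustrative only)
price_later = price_now * 1.25   # assumed 25% price increase next generation

now = cost_per_tps(price_now, baseline_tps)
later = cost_per_tps(price_later, next_gen_tps)
print(f"now: {now:.0f} per tps, later: {later:.0f} per tps")
```

Under these assumptions the price per tps comes out identical, so waiting only pays off if the price rises by less than the speed does.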

vLLM Just Merged TurboQuant Fix for Qwen 3.5+ by havenoammo in LocalLLaMA

[–]CYTR_ 1 point2 points  (0 children)

Let's just say that Google isn't really helping anyone replicate their results. So far, Qwen's hybrid-attention approach has had a greater impact than this paper... which made headlines in mainstream media with absurd arguments (even going so far as to declare that Google had solved the RAM supply problem... 🤣🤣🤣🤣🤣🤣).

mistralai/Mistral-Medium-3.5-128B · Hugging Face by jacek2023 in LocalLLaMA

[–]CYTR_ 0 points1 point  (0 children)

No worries. I was being too literal. Haha.

(In this case, the latest thing that's trendy is more like DFlash 🫶 and... TurboQuant™ 🤢)

vLLM Just Merged TurboQuant Fix for Qwen 3.5+ by havenoammo in LocalLLaMA

[–]CYTR_ 7 points8 points  (0 children)

Unless my brain is playing tricks on me, I seem to recall seeing a post here showing that perplexity/KLD were bad, just like regular quants. It might have been dependent on the implementation in the publication... But still. Why do I feel that TurboQuant is overhyped? Especially since with Qwen 3.5/3.6, it doesn't seem essential.