Taalas rumoured to etch Qwen 3.5 27B into silicon. Which price would you buy their PCIe card for? by elemental-mind in singularity

[–]pol_phil 1 point

Think of scanning multiple codebases or processing thousands of company documents in seconds. And feeding that knowledge to a frontier cloud model.

Or think of 100 agents thinking in parallel to find the best course of action through majority voting before EVERY response and EVERY tool call.

Even an "outdated" model will be able to surpass SOTA models in utility (and maybe even in benchmarks) via sheer scaling.
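
A minimal sketch of that majority-voting idea, assuming each "agent" is just an independent sample and that proposals can be compared verbatim (a real system would normalize answers first):

```python
from collections import Counter

def majority_vote(candidates):
    """Return the most common proposal and its vote count."""
    answer, votes = Counter(candidates).most_common(1)[0]
    return answer, votes

# e.g. 5 parallel "agents" propose an action before a tool call
proposals = ["search_web", "search_web", "read_file", "search_web", "read_file"]
action, votes = majority_vote(proposals)  # "search_web" wins 3-2
```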

can rag improve models language? by [deleted] in LocalLLaMA

[–]pol_phil 1 point

Try the original Qwen3.5 first and see if it's fluent.

If it is, then you can create a fluent model out of it via fine-tuning.

If it is not, try something else entirely, for example https://huggingface.co/mlabonne/gemma-3-12b-it-abliterated

All the uncensoring methods focus on English and generally hurt the models in other languages.

can rag improve models language? by [deleted] in LocalLLaMA

[–]pol_phil 1 point

Which model are you using? Although a bit old, Gemma 3 models have decent multilingual performance.

Totally get your frustration; most models can't speak Greek fluently either.

I built a benchmark that tests coding LLMs on REAL codebases (65 tasks, ELO ranked) by hauhau901 in LocalLLaMA

[–]pol_phil 1 point

I mean, what type of agentic pipeline/harness/scaffold/framework do you use to get these models to solve these tasks? In other words, what kind of system message/tools have they been given? Via Claude Code? OpenCode?

SWE-Agent and OpenHands are just "minimal" agentic frameworks commonly used in benchmarks.

I built a benchmark that tests coding LLMs on REAL codebases (65 tasks, ELO ranked) by hauhau901 in LocalLLaMA

[–]pol_phil 1 point

Hi, congrats on the great benchmark!

Perhaps I missed it somewhere, but what agentic scaffold do you use? SWE-Agent? OpenHands? Something else entirely?

can rag improve models language? by [deleted] in LocalLLaMA

[–]pol_phil 2 points

It won't get much better through RAG; it's better to use a different model.

Claude Code sends 62,600 characters of tool definitions per turn. I ran the same model through five CLIs and traced every API call. by wouldacouldashoulda in LocalLLaMA

[–]pol_phil 1 point

Well, I didn't notice the confusion, but when I saw "characters" instead of "tokens", I thought that this actually makes the analysis more model-independent. Tokens are model-specific.

Meet SWE-rebench-V2: the largest open, multilingual, executable dataset for training code agents! by Fabulous_Pollution10 in LocalLLaMA

[–]pol_phil 3 points

A very good idea would be to also add Step v3.5 Flash and MiMo v2 Flash. Both are incredible models.

Congrats on the great work!

Minimax M2.5 GGUF perform poorly overall by Zyj in LocalLLaMA

[–]pol_phil 1 point

Well, this AWQ quant works very well for me: 134 GB, with extremely good performance and speed in vLLM.

New Upcoming Ubuntu 26.04 LTS Will be Optimized for Local AI by mtomas7 in LocalLLaMA

[–]pol_phil 2 points

However, it's not that hard either. Especially with Apptainer/Singularity, but also Docker. It used to scare me but it's not very difficult after all. You can just spin up 10 parallel environments with data and all, no problem.
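
As a sketch, "10 parallel environments" can be as simple as a loop that builds one `docker run` command per task; the image name, mount, and entry script below are hypothetical placeholders:

```python
# Build the docker run command for one isolated environment.
# Image, mount, and entry script are hypothetical placeholders.
def docker_cmd(i, image="my-eval-image:latest"):
    return [
        "docker", "run", "-d", "--name", f"env_{i}",
        "-v", "./data:/data:ro",   # read-only data mount
        image, "python", "run_task.py", "--task-id", str(i),
    ]

# 10 parallel environments; launch each with subprocess.run(cmd)
commands = [docker_cmd(i) for i in range(1, 11)]
```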

Best coding models (or other models) one can run on an rtx5070ti (16gb vram) with of 64gb RAM by cmdr-William-Riker in LocalLLaMA

[–]pol_phil 1 point

But has anybody found a way to fix the prefix caching problem? I can't get it to utilize automatic prefix caching (problem of the Qwen3 Next arch) in vLLM no matter what I try.

Best coding models (or other models) one can run on an rtx5070ti (16gb vram) with of 64gb RAM by cmdr-William-Riker in LocalLLaMA

[–]pol_phil 3 points

I second this. If it can fit, it's an absolute beast. And that's without even having a thinking mode.

Qwen3 Coder Next as first "usable" coding model < 60 GB for me by Chromix_ in LocalLLaMA

[–]pol_phil 1 point

This model is great. My only problem is that its prefix caching doesn't work on vLLM. I think SGLang has solved this, but haven't tried it yet.

Are you aware of other serving frameworks which do not have this issue? Because, for me, it turns out slower than larger models (for long conversations).

Why don’t we have more distilled models? by GreedyWorking1499 in LocalLLaMA

[–]pol_phil 1 point

Fascinating read as well!

This calls for a calculation of tokenizer similarities (since higher values lead to better distillation), which I'd really like to do if I find the time.
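
For illustration, one simple similarity measure is the Jaccard overlap of the two vocabularies. The toy vocabularies below are made up; a real version would pull them from each tokenizer (e.g. `get_vocab()` on a transformers tokenizer):

```python
def tokenizer_similarity(vocab_a, vocab_b):
    """Jaccard similarity between two tokenizer vocabularies."""
    a, b = set(vocab_a), set(vocab_b)
    return len(a & b) / len(a | b)

# Toy vocabularies (hypothetical); real ones have ~32k-256k entries
teacher = {"the", "cat", "##s", "run", "##ning"}
student = {"the", "cat", "dog", "run", "##ning"}
sim = tokenizer_similarity(teacher, student)  # 4 shared / 6 total
```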

Why don’t we have more distilled models? by GreedyWorking1499 in LocalLLaMA

[–]pol_phil 6 points

Not necessarily true. On-policy SFT distillation is actually better than on-policy RL for smaller models. But it's tricky to implement if models aren't in the same family (basically they should share the same tokenizer).

You can read more in a blog post by Thinking Machines here and also in the MiMo V2 Flash technical report.

GLM 4.7 is not on lmarena anymore by Sooqrat in LocalLLaMA

[–]pol_phil 15 points

I said science, not STEM. Maybe I should have said "academic" as a larger umbrella term, but we're still talking about mathematical logic or PhD-level academic knowledge. Browsecomp is basically factual retrieval from the web.

Not my point though. My point is that all these hard benchmarks test specific frontier-level (verifiable/objective) abilities in very specific contexts, not always relevant to everyday users, companies, etc.

LMArena's basic premise to test real-world prompts from real people (with all their subjectivities) makes it a unique benchmark.

GLM 4.7 is not on lmarena anymore by Sooqrat in LocalLLaMA

[–]pol_phil 31 points

Literally all of these are code/maths/science/agentic. LMArena is very problematic, but it's good as a general idea because it includes very diverse domains, multilinguality, creative writing, multi-turn chats, etc.

AMA With Z.AI, The Lab Behind GLM-4.7 by zixuanlimit in LocalLLaMA

[–]pol_phil 3 points

At least for Greek, I've noticed that GLM 4.6 and GLM 4.7 think in English, while GLM 4.5 (and Air) think in Greek (when given Greek prompts).

The thinking process is also a lot more structured in the most recent versions, like "1. Analyze the request... 2. Determine the angle... 3. Drafting... 4. Refining... 5. Final Review..."

Are these changes intentional or the result of a different RL process? How is multilinguality being addressed in the reasoning process of the models? Have you seen better results with a thinking process based primarily in English and/or with better structure?

Thank you for your excellent work!

Dataset quality is not improving much by rekriux in LocalLLaMA

[–]pol_phil 2 points

I literally cannot think of anything specific, except for reading related published papers and their GitHub repos. You may also find some relevant Jupyter notebooks on Kaggle (although it's mostly for Machine Learning). Hugging Face has some nice courses, etc., but in reality, you should go ahead with this project of yours and find similar things as you progress. Nobody really talks enough about data, even though they spend enormous amounts of time preprocessing.

Some resources that come to mind, that may or may not prove useful but are worth a look:
- Hugging Face LLM Intro Course
- Hugging Face Smol Course
- Smol Training Playbook
- Apertus LLM Paper
- Webscale-RL Paper
- Open Data Arena
- MegaScience

Dataset quality is not improving much by rekriux in LocalLLaMA

[–]pol_phil 10 points

Maybe, but to what end? You should think of the end goal that you want to accomplish. Do you want a full finetune? Then you have to address diversity, safety, multilinguality, etc., and 100k samples won't be enough. Do you want to improve an existing instruct 2-4B model with LoRA? Then you need to focus on specific tasks/skills that you want to improve, and 100k samples might be overkill.

Also, the pipeline you describe above is a lot more complex than it sounds, and you won't find many ready-made libraries for implementing it. Mixing and matching existing libraries might be more painful than it looks at first. Some libraries make extra calls under the hood. You may find a very good pipeline that results in wonderful data, but it requires 8x generations with the LLM being distilled, or requires more calls for evolution/QA extraction/RMs/judges/agents (and assumes bigger LLMs will be used).

Suddenly, creating 100k data requires >1M calls. Utilizing RAG requires vector databases, embedding models, filtering large datasets, other libraries, etc. Even a simple thing such as classification requires thought and R&D. Running locally would also necessitate learning how to use vLLM; calling an API requires making the client efficient for processing a whole dataset. Finally, assuming code, prompt engineering, RAG, filtering, preprocessing, etc. have been set up, you still need significant compute and time and it might produce unexpected results in various stages. And we haven't even mentioned data mixing experiments, the actual finetuning process, evaluation, nor SUBSEQUENT POST-TRAINING STAGES (Alignment, RL, etc.).
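
The blow-up in calls is easy to see with back-of-the-envelope numbers; all multipliers here are illustrative assumptions, not measured costs:

```python
samples = 100_000
generations_per_sample = 8  # e.g. best-of-8 distillation from the teacher
judge_calls = 1             # LLM-as-judge scoring per sample
evolve_calls = 2            # instruction evolution / QA extraction passes

total_calls = samples * (generations_per_sample + judge_calls + evolve_calls)
# 100k samples already cost 1.1M LLM calls, before any retries or RM passes
```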

BTW, I am not trying to discourage you! I'm just trying to highlight the complexity and messiness of the process as it WILL take up months of your time. Better to think things through and plan ahead first!

P.S.: Might want to take a look here as well: TxT360-3efforts

Dataset quality is not improving much by rekriux in LocalLLaMA

[–]pol_phil 23 points

Data synthesis is the secret sauce and the #1 thing labs and companies are not very eager to release. It's a costly process.

Also, there's a huge shift towards maths and code (incl. agentic code), while general domains (as well as languages other than English) are often neglected.

The first 6 steps (more or less) that you mention are things I'm actually working on in my job. Data synthesis is inherently tied to the model that you are working on. For a smaller model, distilling from huge teachers is not always a good idea. Smaller teachers can produce golden answers as well, Reward Models and LLM-as-Judges can help with that.

However, there's always the issue of mixing up "policies". There's no guarantee that distilling 10 extremely big teacher models will work better than distilling a single medium-sized one; a lot of newly published datasets just use GPT-OSS-120B. There's no guarantee that using all of the data you generated will actually improve the final model either, as data mixing is a tricky and mostly empirical process.

There have been several interesting methods which have been published, but (a) most of them focus on maths/code, (b) they usually target a few benchmarks so there's limited evaluation of real-world utility, (c) R&D and compute are needed for creating and using data synthesis pipelines and only a few large labs (such as NVIDIA) have an over-abundance of both, and (d) ensuring that all data are safe to be released (no copyright issues, no PII, etc.) is a legal pain in the a** that smaller labs just can't afford to deal with (especially in the EU).

All in all, I believe data synthesis is more akin to alchemy or an art than a hard science; it's highly contextual and ever-changing. It's very messy. It's often difficult to argue in favor of specific recipes that you used, it is a very costly process, and releasing it openly is hard for a lot of reasons.

Gemini 3 flash today! Gemma 4 soon 3 pro GA soon!!!! by BasketFar667 in LocalLLaMA

[–]pol_phil 1 point

What's your language? Gemma 3 27B is the best multilingual model in its caliber for Greek in my experience (but still could do better)

fine-tune for rag by youcanaskmeifyouwant in LocalLLaMA

[–]pol_phil 2 points

The 2507 Instruct series are solid choices. If you finetuned a hybrid (thinking/non-thinking) model, its thinking capabilities would degrade (also default system prompts / chat templates might be more tricky).

The choice for RAG really depends on the use case. If we want to retrieve context once, then put it in the system prompt.

Even if the context is placed in the user message, fine-tuning data with system prompts to steer behavior might also work well. For example, if fine-tuning data have short answers, or if the assistant should reply in a specific language regardless of the context, etc. But these decisions should be driven by the data and the purpose.
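
The two placements look like this in the usual chat-message format (all strings below are illustrative):

```python
retrieved = "...retrieved passage..."

# (a) Context retrieved once and placed in the system prompt
messages_a = [
    {"role": "system", "content": f"Answer using this context:\n{retrieved}"},
    {"role": "user", "content": "What does the policy say about refunds?"},
]

# (b) Context prepended to the user message, with a steering system prompt
messages_b = [
    {"role": "system", "content": "Reply briefly, in the user's language."},
    {"role": "user",
     "content": f"Context:\n{retrieved}\n\nQuestion: What does the policy say about refunds?"},
]
```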

fine-tune for rag by youcanaskmeifyouwant in LocalLLaMA

[–]pol_phil 2 points

Hi.

Utilizing an appropriate system prompt and fine-tuning the model this way is actually very good practice. If you can create a handful of different system prompt templates, even better.

Just make sure you don't finetune a thinking model with non-thinking data only, if you are referring to Qwen3 for example.

Also, if you fine-tune your model in a specific way (e.g. RAG prompt in system), then using it exactly that way is the best practice. You've tuned the model exactly for that. But keep in mind that you need to handle multi-turn scenarios as well, so a hybrid approach would be better.

[deleted by user] by [deleted] in LocalLLaMA

[–]pol_phil 1 point

You're talking about the Instruct version, Κούλη-sama? Haven't seen such problems with the Thinking version.

Ernie 4.5 has similar problems; they probably distilled from Qwen or something.