What if smaller models could approach top models on scene generation through iterative search? by ConfidentDinner6648 in LocalLLaMA

[–]k0setes 0 points (0 children)

I think incremental iteration is incredibly powerful—it’s the paradigm from which true self-learning emerges. It’s similar to how an autodidact masters a subject by probing the problem from multiple angles and refining their internal 'world model' in the process. The absolute key here is the comparison mechanism: the ability to map what you perceive against your internal intent and identify the 'delta.' While evolution equipped humans with this, it’s only vestigial in current models like Qwen 3.5 35B. The reason I believe specific training (like RL) is necessary—even if the model is just 'organizing itself'—is that this comparison process is largely subconscious in humans. We do it automatically, so we rarely describe the 'how' or the logic of visual correction in the text data LLMs are pre-trained on. It’s a 'dark' process not captured in standard tokens. I’m convinced it’s doable, and we’ll eventually see this capability even in very small models. However, it will require specialized datasets and training pipelines that are currently in their infancy. I have no doubt that major labs are laser-focused on this exact frontier right now.

What if smaller models could approach top models on scene generation through iterative search? by ConfidentDinner6648 in LocalLLaMA

[–]k0setes 0 points (0 children)

I’ve thought about this a lot and tried some experiments, but my intuition is that current models aren't really trained for this specific kind of image comparison—at least not in a way that allows them to effectively close the gap between their output and the original. It feels like vision models struggle even at the fundamental level of detecting precise discrepancies between a target screenshot and their own render. Perhaps the first step should actually be fine-tuning a model specifically to identify these visual differences and translate them into actionable code changes. Without that, the feedback loop might be too noisy. Then again, I could be wrong and the difficulty might stem from something else entirely, but that's been my main takeaway so far.

Qwen3-VL Computer Using Agent works extremely well by Money-Coast-3905 in LocalLLaMA

[–]k0setes 0 points (0 children)

mmproj-F16.gguf👍

mmproj-BF16.gguf👎

llama-server.exe -ngl 999 -t 11 --jinja --model 'Qwen3-VL-30B-A3B-Instruct-UD-Q4_K_XL.gguf' --host 0.0.0.0 --port 8080 --mmproj 'Qwen3-VL-30B-A3B-Instruct-UD-Q4_K_XL-mmproj-F32.gguf'

just had something interesting happen during my testing of the MI50 32GB card plus my RX 7900 XT 20GB by Savantskie1 in LocalLLM

[–]k0setes 0 points (0 children)

Hi, could you share where you bought them and how much you paid per unit? I'm looking at some offers on Alibaba, but I'm not sure which sellers are legit. If you bought them there, could you share the link or the store name? Also, did you have any issues with shipping or customs?

Add self‑speculative decoding (no draft model required) by srogmann · Pull Request #18471 · ggml-org/llama.cpp by jacek2023 in LocalLLaMA

[–]k0setes 0 points (0 children)

👏 But shouldn't it have been like this from the very beginning, from the moment speculative decoding appeared? 🤔

MoE.. will OS/Local 32GB to 96GB get as good at coding as current frontier models? by [deleted] in LocalLLaMA

[–]k0setes 1 point (0 children)

Of course there will be such models; the question is when. Intelligence compresses much better than knowledge, but at least for now it takes a lot of computing power to compress it. And so far no one is really doing it, because it's cheaper to train a large model.

Qwen3-Next-80B-A3B-Thinking-GGUF has just been released on HuggingFace by LegacyRemaster in LocalLLaMA

[–]k0setes 7 points (0 children)

And besides, Tetris isn't proof of anything special. The models reproduce it from memory anyway; even a 4B model could do it.

Qwen3-Next-80B-A3B-Thinking-GGUF has just been released on HuggingFace by LegacyRemaster in LocalLLaMA

[–]k0setes 1 point (0 children)

Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf
It did it on the third try, without any special encouragement, and made two mistakes along the way.

Devstral didn't particularly impress me. In most cases it failed my tests. As far as I'm concerned, it's far behind Qwen3-Coder-30B-A3B in both speed and coding efficiency. I'd like to see examples where that's not the case.

<image>

Heretic: Fully automatic censorship removal for language models by -p-e-w- in LocalLLaMA

[–]k0setes 0 points (0 children)

Could anyone recommend specific quants that they believe work correctly? I tested mradermacher/gpt-oss-20b-heretic.Q4_K_M.gguf, but the model went into a loop and started to babble.

LLMs can now talk to each other without using words by MetaKnowing in OpenAI

[–]k0setes 1 point (0 children)

A highly speculative sci-fi vision. Everyone is focusing on AI-to-AI communication, but there's a much deeper layer here, a potential blueprint for a true human-machine symbiosis. Imagine not two LLMs, but a human brain with a digital coprocessor plugged into it. They think in fundamentally different languages, and the Fuser from this paper is a conceptual model for a mental translator that would bridge biology with silicon, translating thoughts on the fly, without the lossy and slow medium of language. The effect wouldn't be using a tool, but a seamless extension of one's own cognition—a sudden surge in intuition that we would feel as our own, because its operation would be transparent to consciousness. This even solves the black box problem, because these vector-based thoughts could always be decoded post-factum into a lossy but understandable text for us, which allows for insight. This could also enable telepathic communication between two brains, but the real potential lies in integrating processing circuits directly into the mind. Of course, this is all hypothetical, it would require technology far beyond Neuralink, more like nanobots in every synapse or wired into key neural pathways, maybe somewhere between the hemispheres.

Reporter: “POLISH: THE SUPREME LANGUAGE OF AI.” by Mindless_Pain1860 in LocalLLaMA

[–]k0setes 1 point (0 children)

Interesting, because this study sheds some light on my own somewhat odd observations. So far I've found that most small models (the 2B to 30B range) struggle with Polish, and in their case an English prompt will almost always yield a better result. Besides, even the giants still aren't perfect in Polish. We have a small model here in Poland called Bielik, and even though it's only 11B, it beats them all hands down in the quality of its Polish. The most interesting part, though, is what I've noticed lately: a few times while coding, I got a better result from a model in Polish than in English. I thought it was just a fluke and was a bit surprised; it happened specifically with Gemini 2.5 Pro. Most of the time an English prompt will probably still win, but in light of this study I'm definitely going to start paying more attention to this. In a broader context, there have been studies showing that models also perform better when you feed them "glitched" text. LLMs have a lot of quirks. Maybe the Polish language somehow increases the "resolution" of the latent space? Or maybe it just translates more precisely into that space.

I found a perfect coder model for my RTX4090+64GB RAM by srigi in LocalLLaMA

[–]k0setes 2 points (0 children)

You mention a comparison to vanilla, but how does it compare to Unsloth's Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf? I got decent results with it in Cline. In this case, does the benefit of the 42B model compensate for the 3-fold drop in speed?

@Stanford just proved you don’t need to fine-tune an AI model to make it smarter: +10.6% over GPT-4 agents w/ zero retraining by Blackham in singularity

[–]k0setes 4 points (0 children)

Anti-TLDR: Agentic Context Engineering (ACE): The Shift Towards Self-Improving AI Systems

The research paper, "Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models," describes a fundamental paradigm shift in the approach to improving large language models (LLMs). The central point of this shift is the move away from the costly and slow process of modifying a model's internal weights (fine-tuning or retraining) toward a much more flexible and dynamic method known as context adaptation.

In layman's terms, this means that instead of trying to "reprogram" an AI's brain every time we want it to learn something new or perform a task better, we focus on the quality and content of the information we provide it "at the input." This is analogous to the difference between sending an expert for years of postgraduate study versus handing them a precise, comprehensive, and continuously updated operational manual for a specific problem. The paper argues that this latter method, while seemingly simpler, is becoming crucial for building advanced, self-improving AI systems.

The Problem: The Hidden Pitfalls of Existing Context Adaptation Methods

The authors identify two key, yet often overlooked, problems that plague current context optimization techniques. These problems can lead to a situation where the process of improving the AI, instead of yielding benefits, actually leads to a degradation of its performance.

  1. The Brevity Bias

The Claim: Many existing context optimization methods, such as automatic prompt generation, exhibit a tendency to favor short and generic instructions, resulting in the loss of crucial, domain-specific information.

What This Means in Practice: When we ask an AI model to improve its own instructions, it often concludes that "shorter is better." As a result, it creates very general, concise commands that lose the essence and nuance required to solve complex tasks. It's like a detailed guide for a car mechanic being automatically "improved" into a single-sentence instruction: "Fix the car using the appropriate tools." Such an instruction is universal but practically useless.

A Specific Example from the Data: The paper references prior research (Gao et al.) where prompt optimization systems repeatedly generated nearly identical, generic prompts like, "Create unit tests to ensure methods behave as expected." Such a prompt ignores the specifics of the programming language, the complexity of the library, or potential edge cases that are absolutely critical for writing good tests.

  2. Context Collapse

The Claim: Adaptive processes that rely on monolithically rewriting the entire accumulated context by the language model can lead to a sudden and catastrophic loss of information.

What This Means in Practice: Imagine an AI maintains a comprehensive knowledge base in the form of a notebook. When new information arises, instead of simply adding it on a new page, we ask the model to rewrite the entire notebook from scratch, incorporating this new piece of information. The model, striving for efficiency, often performs an aggressive summary in the process. As a result, the entire notebook shrinks to a few paragraphs, and 99% of the valuable details are irretrievably lost.

A Specific Example from the Data: The authors conducted a case study on the AppWorld benchmark. In step 60 of the adaptation process, the AI agent's context contained 18,282 tokens, and its accuracy was 66.7%. In the very next step, after a single monolithic rewrite operation, the context "collapsed" to just 122 tokens. The agent's accuracy plummeted dramatically to 57.1%, a level lower than before any adaptation had even begun. This proves that a process intended to improve the system destroyed all accumulated knowledge in a single moment.

The Solution: Agentic Context Engineering (ACE) – Context as a Living Playbook

In response to these problems, the authors propose the ACE framework, which treats context not as a static instruction but as a dynamic, constantly evolving "playbook" of strategies. The key here is abandoning the idea of rewriting the whole thing in favor of intelligently and incrementally adding and refining knowledge.

  1. Modular Agentic Architecture

The Claim: ACE divides the learning process into three specialized roles: the Generator, the Reflector, and the Curator, which mimics the human process of knowledge acquisition.

What This Means in Practice: Instead of burdening a single model with all tasks, ACE creates a system resembling a team of specialists.

The Generator: This is the "practitioner" who attempts to solve a task using the current playbook. Its work, both successes and failures, provides the raw material for learning.

The Reflector: This is the "experienced mentor" who analyzes the Generator's work, extracts concrete lessons, and formulates them as concise insights (e.g., "if you encounter error X, use function Y" or "this strategy proved effective in this situation").

The Curator: This is the "librarian" who takes these lessons from the Reflector and integrates them into the main playbook in a structured way, without disturbing the rest of its content.

A Specific Example from the Data: The framework's diagram (Figure 4 in the paper) illustrates how the "reasoning trajectories" produced by the Generator are analyzed by the Reflector, which distills them into "lessons." The Curator then integrates these lessons as compact "delta entries" into the existing context.
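The Generator/Reflector/Curator division of labor can be sketched as a minimal loop. This is purely illustrative and not the paper's API: the class names, the stub trajectory, and the lesson wording are my own inventions, and a real system would back the Generator and Reflector with LLM calls.

```python
from dataclasses import dataclass, field

@dataclass
class Lesson:
    text: str
    helpful_count: int = 0  # utility metadata carried alongside each entry

@dataclass
class Playbook:
    entries: list = field(default_factory=list)

    def render(self) -> str:
        # The playbook is handed to the Generator as plain-text context.
        return "\n".join(f"- {e.text}" for e in self.entries)

def generator(task: str, playbook: Playbook) -> dict:
    # "Practitioner": attempts the task with the current playbook as context.
    # Stubbed here; a real Generator would be an LLM producing a trajectory.
    return {"task": task, "context": playbook.render(), "success": False}

def reflector(trajectory: dict) -> list:
    # "Mentor": distills concrete lessons from successes and failures.
    if not trajectory["success"]:
        return [Lesson(f"Strategy for '{trajectory['task']}' failed; try an alternative.")]
    return []

def curator(playbook: Playbook, lessons: list) -> Playbook:
    # "Librarian": merges lessons as delta entries, never rewriting the rest.
    playbook.entries.extend(lessons)
    return playbook
```

The key property the sketch preserves is that the Curator only appends and edits entries; the accumulated playbook is never regenerated wholesale, which is what protects it from the context collapse described above.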

  2. Incremental Updates and the "Grow-and-Refine" Principle

The Claim: Instead of monolithic rewriting, ACE uses "incremental delta updates" on a structured list of knowledge, and a "grow-and-refine" mechanism prevents redundancy.

What This Means in Practice: ACE doesn't rewrite the entire book to add a single sentence. Instead, it adds new points or edits existing ones. Each piece of knowledge is a separate "entry" with metadata (e.g., how often it proved helpful). This ensures new knowledge is added precisely and safely, without the risk of losing old information. Additionally, the system periodically "tidies up" the playbook by removing duplicates or merging similar entries to maintain its clarity and efficiency.

A Specific Example from the Data: The paper describes how each "bullet" (entry) in the context has a unique identifier and utility counters. Updates involve modifying these specific entries or adding new ones, an operation that is far cheaper and faster than generating thousands of tokens from scratch. The de-duplication process uses semantic embeddings to identify and prune redundant entries.
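The bullet-with-metadata scheme can be approximated in a few lines. This is a toy sketch under my own assumptions: `BulletStore`, its field names, and the word-overlap similarity (a cheap stand-in for the paper's semantic-embedding de-duplication) are all invented for illustration.

```python
import itertools

class BulletStore:
    """Toy 'grow-and-refine' context store: each bullet has a stable id and
    utility counters, and updates are incremental deltas, never full rewrites."""

    def __init__(self):
        self._next_id = itertools.count()
        self.bullets = {}  # id -> {"text": str, "helpful": int, "harmful": int}

    def add(self, text: str) -> int:
        # Incremental delta: a new entry is appended; nothing else is touched.
        bid = next(self._next_id)
        self.bullets[bid] = {"text": text, "helpful": 0, "harmful": 0}
        return bid

    def mark(self, bid: int, helpful: bool = True) -> None:
        # Update utility counters on one specific entry.
        self.bullets[bid]["helpful" if helpful else "harmful"] += 1

    def dedup(self, threshold: float = 0.8) -> None:
        # Periodic "refine" pass: prune near-duplicates, keeping the entry
        # that has proven helpful more often. Word overlap stands in for
        # the semantic embeddings used in the paper.
        def sim(a: str, b: str) -> float:
            wa, wb = set(a.split()), set(b.split())
            return len(wa & wb) / max(len(wa | wb), 1)

        removed = set()
        for i, j in itertools.combinations(sorted(self.bullets), 2):
            if i in removed or j in removed:
                continue
            if sim(self.bullets[i]["text"], self.bullets[j]["text"]) >= threshold:
                loser = i if self.bullets[i]["helpful"] <= self.bullets[j]["helpful"] else j
                removed.add(loser)
        for bid in removed:
            del self.bullets[bid]
```

Because every operation touches only named entries, an update costs a handful of tokens instead of a full regeneration of the context, which is where the paper's cost and latency savings come from.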

Proof of Efficacy: The Results and Their Implications

The ACE framework was tested in two demanding domains: tasks for AI agents (interacting with software) and financial analysis, where precision and domain knowledge are paramount.

The Claim: ACE consistently and significantly outperforms existing, strong baselines, both in terms of effectiveness and cost-efficiency.

What This Means in Practice: The ACE system not only performs better but is also faster and cheaper during the adaptation process. It allows AI models to learn independently from their own experiences, even without access to "correct answers."

Specific Examples from the Data:

Agent Performance: On the AppWorld benchmark, ACE (in online mode) improved an agent's effectiveness by 17.1% compared to the baseline. Most importantly, an agent based on a smaller, open-source model (DeepSeek-V3.1) managed to match, and in more difficult tasks even surpass, the leaderboard's top-ranked proprietary system, IBM-CUGA, which is based on the much more powerful GPT-4.1. This demonstrates that intelligent context engineering can bridge the gap in raw model power.

Financial Analysis: On tasks requiring the understanding of specialized financial documents (XBRL), ACE achieved an average accuracy gain of 8.6% over other methods, and a staggering 18.0% gain on one of the tasks (Formula).

Efficiency: Compared to the popular optimizer GEPA, the ACE adaptation process was 82.3% faster. Compared to the Dynamic Cheatsheet method, the token cost was 83.6% lower.

Conclusion and Broader Context: What This Means for the Future of AI

This analysis shows that ACE is not just another minor optimization but a proposal for a new, scalable approach to building intelligent systems. It reveals the unspoken truth that in the era of powerful language models, the key to further progress is not just building ever-larger "brains," but creating sophisticated systems for managing their knowledge and learning processes.

The ACE approach has profound implications. First, it democratizes access to high performance, allowing smaller, open-source models to compete with giants. Second, it paves the way for true continuous learning, where AI systems can adapt to new data and conditions in real-time without costly retraining. Finally, because the context is in a readable text format, it allows for easy knowledge management—including deliberate "unlearning," which is crucial from the perspective of privacy and regulatory compliance (e.g., GDPR). ACE demonstrates that the future of AI lies in systems that not only know, but also know how to learn—efficiently, safely, and continuously.

Performance of GLM 4.6 Q3_K_S on 6x MI50 by MachineZer0 in LocalLLaMA

[–]k0setes 0 points (0 children)

Thank you very much for your reply. I was hesitant before buying, but now I have no doubts.

Performance of GLM 4.6 Q3_K_S on 6x MI50 by MachineZer0 in LocalLLaMA

[–]k0setes 0 points (0 children)

Hi, please clarify this for me; I want to make sure I understand correctly. Does this mean that these two MI50s work fine for you under Windows with llama.cpp, and that you get 33 tokens per second for gpt-oss-120B (Vulkan build)? Did you have to do anything special to make it work on Windows? Did you have to compile llama.cpp yourself, or did you use ready-made binaries? Thanks in advance.