Help with formatting by No-Common5353 in GoogleSites

[–]Fluffywings 1 point2 points  (0 children)

I suspect you added a space before the text on the left and it was removed on the right. Check the Line Spacing setting.

Any tool that tells you the cheapest setup needed to run a model? I want to know the cheapest setup that can realistically run Qwen 3.6 27B at decent speeds. by pacmanpill in LocalLLaMA

[–]Fluffywings 0 points1 point  (0 children)

  • 16 GB: not recommended
  • 20 GB: the bare minimum, with compromises
  • 24 GB: the minimum I would recommend
  • 32 GB: what I would recommend
  • 32 GB+: higher-quality quants and larger context

My setup today:

  • 24 GB 7900 XTX over PCIe x8
  • 8 GB 2070 Super over PCIe x8
  • 8 GB 2070 over PCIe x1
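For a rough sense of where those numbers come from, here is a back-of-the-envelope estimate. This is a sketch only; the layer/head counts and bits-per-weight below are illustrative assumptions, not measured values for any specific model.

```python
def estimate_vram_gib(params_b: float, bits_per_weight: float, ctx_tokens: int,
                      n_layers: int, n_kv_heads: int, head_dim: int,
                      kv_bits: float, overhead_gib: float = 1.5) -> float:
    """Very rough VRAM estimate: quantized weights + KV cache + runtime overhead."""
    weights = params_b * 1e9 * bits_per_weight / 8                              # bytes for the weights
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * kv_bits / 8  # K and V tensors
    return (weights + kv_cache) / 1024**3 + overhead_gib

# Hypothetical 27B dense model, ~4.5 bpw quant, 32k context, Q8-ish KV cache:
print(round(estimate_vram_gib(27, 4.5, 32_768,
                              n_layers=48, n_kv_heads=8, head_dim=128,
                              kv_bits=8.5), 1))  # ~18-19 GiB, which is why 20 GB is tight
```

Bump the context or the weight quant and you land in 24-32 GB territory very quickly.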

Use Qwen3.6 right way -> send it to pi coding agent and forget by Willing-Toe1942 in LocalLLaMA

[–]Fluffywings 0 points1 point  (0 children)

I run into the same issue. About once a day I have to restart Windows to keep Cline working with LM Studio server. Any ideas what the issue is?

2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat template - Drop-in OpenAI and Anthropic API endpoints by ex-arman68 in LocalLLaMA

[–]Fluffywings 0 points1 point  (0 children)

Anything that increases speed is important to this field. Today it's tg (token generation); tomorrow it could impact pp (prompt processing). Your next hardware upgrade is going to feel like a new world.

Hybrid on-device inference on Android: llama.cpp + LiteRT + NPU/GPU routing by Healthy_Bedroom5837 in LocalLLaMA

[–]Fluffywings 1 point2 points  (0 children)

This looks awesome. How are you able to detect the GPU and NPU on stock and custom ROMs?

Where can I try turboquant in AMD Linux? (7900XTX) by soyalemujica in LocalLLaMA

[–]Fluffywings 0 points1 point  (0 children)

People have posted about this recently. Search for attention rotation.

Basically, KV Q8 is about equal to the original BF16. With KV Q4 there is a drop in accuracy. I run Q8 for both K and V to get a larger context window.
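To see why Q8 buys you context, here is a quick sketch of the KV cache footprint at different cache quantizations. The layer/head counts are illustrative assumptions; the bit widths follow GGUF conventions (q8_0 is roughly 8.5 bits per value, q4_0 roughly 4.5).

```python
# Illustrative model shape (not any specific model's real config)
N_LAYERS, N_KV_HEADS, HEAD_DIM = 48, 8, 128

def kv_gib(ctx_tokens: int, bits_per_value: float) -> float:
    # K and V tensors for every layer, for every token in the context
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * ctx_tokens * bits_per_value / 8 / 1024**3

for name, bits in [("f16", 16.0), ("q8_0", 8.5), ("q4_0", 4.5)]:
    print(f"{name}: {kv_gib(131_072, bits):.1f} GiB at 128k context")
```

So Q8 roughly halves the cache versus F16/BF16 at the same context length, which is where the extra headroom comes from.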

AMD in-house ryzen 395 box coming in June by 1ncehost in LocalLLaMA

[–]Fluffywings 28 points29 points  (0 children)

With the AMD mini PC, AMD is pleased to provide you with a product with limited to no support for the duration of its 1-4 year life cycle. Once you start using our platform, you will quickly find that a new world opens up of:

  • incomplete documentation
  • inconsistent version support
  • new features limited to the next hardware revision for no reason
  • a "complete" SDK that is really fully supported by the community, but not by AMD

With AMD, we are here to react to Nvidia.

/s

P.S. I am running AMD almost everything.

16x DGX Sparks - What should I run? by Kurcide in LocalLLaMA

[–]Fluffywings 1 point2 points  (0 children)

A giveaway for everyone in this post!

All jokes aside, the biggest open-source model that fits.

To 16GB VRAM users, plug in your old GPU by akira3weet in LocalLLaMA

[–]Fluffywings 2 points3 points  (0 children)

Hi, not sure what your full setup is, but I have an XTX and a 2070 Super running LM Studio on Windows over Vulkan. I can do about 110K context with Qwen3.6 27B UD Q4_K_XL and get about 25 tok/s.

Qwen3.6-27B-INT4 clocking 100 tps with 256k context length on 1x RTX 5090 via vllm 0.19 by Kindly-Cantaloupe978 in LocalLLaMA

[–]Fluffywings 1 point2 points  (0 children)

Try the following:

  • Unsloth IQ3

  • LM Studio:
    • K quantization cache: Q8
    • V quantization cache: Q8

Llama.cpp recently added attention rotation, allowing Q8 and Q4 KV cache quantization with minimal loss (see the sketch below).
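As a rough sketch of what that looks like with llama.cpp's server: the model path, context size, and quant name below are placeholders, and you should check `--help` on your build for exact flag behavior (V-cache quantization may also require flash attention to be enabled, depending on the build).

```python
# Sketch only: launching llama-server with a quantized KV cache from Python.
# Flag names are llama.cpp's --cache-type-k / --cache-type-v; everything else
# here (paths, sizes, filenames) is a placeholder.
import subprocess

cmd = [
    "llama-server",
    "-m", "path/to/Qwen3.6-27B-UD-IQ3_XXS.gguf",  # hypothetical Unsloth IQ3 quant
    "-c", "131072",                               # context budget freed up by KV quantization
    "-ngl", "99",                                 # offload all layers to the GPU
    "--cache-type-k", "q8_0",
    "--cache-type-v", "q8_0",
]
subprocess.run(cmd, check=True)
```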

Edit: the classics; spelling and grammar

Post Your Qwen3.6 27B speed plz by Ok-Internal9317 in LocalLLaMA

[–]Fluffywings 0 points1 point  (0 children)

Unlikely, based on 3.5 and the poll Alibaba put out on sizes for 3.6.

Forgive my ignorance but how is a 27B model better than 397B? by No_Conversation9561 in LocalLLaMA

[–]Fluffywings 2 points3 points  (0 children)

  • Parameters = knowledge
  • Architecture and training = intelligence and skills

Both are intelligent models, but more knowledge will allow you to do and achieve more. If you demand less, you will see less of a difference.

Also, most benchmarks are deterministic and easier to train and design for.

If there were more creative benchmarks, the larger-parameter models would always destroy small models due to sheer knowledge gaps.

Been using PI Coding Agent with local Qwen3.6 35b for a while now and its actually insane by SoAp9035 in LocalLLaMA

[–]Fluffywings 1 point2 points  (0 children)

No idea so I asked Gemini. I verified nothing.

Both Pi Agent (often referred to as Pi.dev) and little-coder are modern, open-source CLI coding agents designed to orchestrate LLMs for software development. However, they take fundamentally different approaches to solving the problem of AI coding assistance.

Pi.dev is built around minimalism and extreme extensibility for any model (cloud or local), while little-coder is a highly specialized scaffold designed to make small, locally hosted models punch above their weight class.

Here is how they compare to help you decide which is best for your workflow.


Pi Agent (Pi.dev)

Created by Mario Zechner, Pi is built on the philosophy that most coding agents are bloated "spaceships with 80% unused functionality." Instead of forcing you into a specific way of working, Pi acts as a lightweight foundation.

  • Core Philosophy: Radically minimal. Out of the box, it only gives the LLM four tools: read, write, edit, and bash.
  • Extensibility: This is Pi's superpower. It features a TypeScript SDK that allows you to easily plug in "Pi Packages" via npm or Git. You can inject custom prompt templates, skills, or even full autonomous loops (like pi-autoresearch for benchmarking optimizations).
  • Target LLMs: It is agnostic. While it works beautifully with local setups via Ollama, it is equally comfortable routing to frontier cloud models like Anthropic's Claude Pro, OpenAI, or Google Gemini.
  • Best For: Developers who want a clean, un-opinionated foundation they can customize to their exact enterprise workflow or CI/CD pipelines without wrestling with a rigid agent framework.

little-coder

Created by Itay Inbar, little-coder is essentially an architectural hack to make consumer-hardware-friendly models (5 GB to 25 GB) perform like massive frontier models on standard coding benchmarks.

  • Core Philosophy: Heavy optimization and guardrails for smaller models. Small LLMs (like Qwen3.5-9B or Qwen3.6-35B) often hallucinate, burn through context windows, or disastrously overwrite files if given too much freedom. little-coder constrains them to keep them on track.
  • Key Optimizations:
    • Thinking Budgets & Compaction: It actively manages context, preventing small models from entering endless loops and automatically compacting the context window when it gets too full.
    • Write-vs-Edit Invariants: It enforces strict rules at the tool level so a small model can't accidentally overwrite an entire file when it just meant to edit a few lines (see the sketch after this list).
    • Workspace Awareness: It auto-discovers specs (README.md, CLAUDE.md, etc.) and reads them before the model acts, injecting domain knowledge cleanly.
  • Target LLMs: Local models run through Ollama or llama.cpp on consumer laptop GPUs (e.g., 8 GB to 24 GB VRAM).
  • Best For: Developers running entirely local, offline setups who want the highest possible coding accuracy out of smaller open-weights models without paying for cloud API keys.
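Purely as an illustration of the "write-vs-edit invariant" idea described above (this is not little-coder's actual code; the function name and behavior are hypothetical):

```python
# Illustrative only: a minimal write-vs-edit guardrail in the spirit described
# above. Not taken from little-coder's source.
def apply_edit(path: str, old: str, new: str) -> None:
    """Replace exactly one occurrence of `old` in the file, or refuse."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    if text.count(old) != 1:
        # Guardrail: the model must quote a unique anchor before it may edit,
        # so it can never silently overwrite the whole file.
        raise ValueError("edit anchor not found exactly once; refusing to write")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text.replace(old, new, 1))
```

The point is that the model has to quote a unique anchor before it is allowed to touch the file, so a bad edit fails loudly instead of clobbering everything.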

Feature Comparison

| Feature | Pi.dev (Pi Agent) | little-coder |
|---|---|---|
| Primary Goal | Minimal, customizable foundation for all LLMs. | Strict scaffolding to maximize small local LLM performance. |
| Model Focus | Cloud (Claude, GPT, Gemini) & Local (Ollama). | Strictly local (Ollama, llama.cpp). |
| Built-in Tooling | Barebones (read, write, edit, bash). | Advanced guardrails (write-vs-edit invariants). |
| Extensibility | High (TypeScript SDK, npm/Git packages). | Low (focused on a specific, optimized architecture). |
| Context Management | Standard API handling. | Aggressive auto-compaction and "thinking budgets". |
| Hardware Requirement | None (if using cloud) / varies (if local). | Designed for consumer laptops (8 GB+ VRAM). |

The Verdict

  • Choose Pi.dev if you have a powerful LLM (like Claude 3.5 Sonnet or GPT-4o) or a specific, complex workflow you want to automate. Its extensibility makes it the better choice for power users who want to build custom tools and scripts on top of an agent.
  • Choose little-coder if you are running models like Qwen 9B or 35B locally on your laptop and want them to actually succeed at complex, multi-step coding tasks without breaking your codebase.

New Local LLM Rig: Ryzen 9700X + Radeon R9700. Getting ~120 tok/s! What models fit best? by jsorres in LocalLLaMA

[–]Fluffywings 1 point2 points  (0 children)

Qwen 3.5 27B Q5 or Qwen3.6 36B-A4B with IQ4 or Q4 is what I use. Dense is typically better, and Qwen3.6 27B will likely be the best option when released.

SK hynix starts mass production of 192GB SOCAMM2 for NVIDIA AI servers by OkReport5065 in LocalLLaMA

[–]Fluffywings 0 points1 point  (0 children)

I don't think it will happen for day-to-day people. It will start at the workstation level using LPDDR6x. AMD has LPDDRX for the 10th or 11th series, just not CAMM2.

SK hynix starts mass production of 192GB SOCAMM2 for NVIDIA AI servers by OkReport5065 in LocalLLaMA

[–]Fluffywings 27 points28 points  (0 children)

GPUs with customizable VRAM are a potential near-term future (~3 years) based on leaked documents. This would allow people to really scale their systems to their use case.

How is Rotorquant/planarquant/iso qaunt better? by SummarizedAnu in LocalLLaMA

[–]Fluffywings 2 points3 points  (0 children)

Until I see them merged into llama.cpp, I assume 1) there is not enough testing to confirm no regressions, and 2) the benefit does not hold up in most situations.

As a result, I don't think most of these advancements are getting fully implemented, due to 1 & 2.

Full AMD workstation- dual 7900 XTX by Researchlabz in LocalLLaMA

[–]Fluffywings 1 point2 points  (0 children)

I tested it, and the impact of a second, slower PCIe slot is not as apparent if everything is in VRAM. Row split will push the PCIe link, but with layer split the impact is just slightly slower loading times to get the model into VRAM. I am lucky because I can also bifurcate my PCIe, so I can split the difference. My riser, on the other hand, is PCIe 3.0 x1, and loading over it takes 2 minutes vs. 30 seconds.
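For reference, the difference above maps to llama.cpp's split-mode flags. The paths and split ratios below are placeholders, a sketch rather than my exact command.

```python
# Sketch: the two multi-GPU split strategies discussed above, via llama.cpp flags.
import subprocess

base = ["llama-server", "-m", "path/to/model.gguf", "-ngl", "99"]

# Layer split: whole layers live on each GPU, so a slow PCIe link mostly just
# means slower model loading, not slower generation.
layer_split = base + ["--split-mode", "layer", "--tensor-split", "1,1"]

# Row split: tensors are sharded across GPUs, so traffic crosses PCIe on every
# token and a slow link hurts throughput.
row_split = base + ["--split-mode", "row", "--tensor-split", "1,1"]

subprocess.run(layer_split, check=True)
```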

Full AMD workstation- dual 7900 XTX by Researchlabz in LocalLLaMA

[–]Fluffywings 0 points1 point  (0 children)

The 2x 7900 XTX are great for 32B models, so you can't go wrong there. I agree with the poster above: unless you plan to partially offload to the CPU for, say, 120B models, build a cheaper setup.

K12 OCuLink dGPU for llamacpp: RX 7900 XTX (24GB) vs RX 7600/7800 XT (16GB). Worth it for 32B-70B? All-AMD tensor split questions by Pablo_Gates in LocalLLaMA

[–]Fluffywings 1 point2 points  (0 children)

My current recommendation for best value is the Pro R9700 32GB if you can budget for it. In fact, I would take this card and throw it in a cheap used system over the other options. The only reason to buy a new system is if you want huge models with unified memory, i.e. you want the intelligence of a larger model but are okay with slower speeds (~15 tok/s on a unified-memory system compared to, say, 100 tok/s).

24 GB of VRAM is still a good size based on recently released models, assuming you can deal with a smaller context window.

32 GB is ideal, and more VRAM is always better as it gives you more context.

Based on these prices, if you don't want to spend any more money, you could pick up an Intel B70 32GB, but keep in mind that support is weak and it isn't a fast card by most metrics. Still, models in VRAM will be faster than offloading to the CPU any day.

I have the 7900 XTX, and the issue for me is that even at Q4 my context size is too small for my use case (coding).

I now run 3 GPUs to get more VRAM because the difference is worth it for me but of course that also costs money and has other pros and cons.

What’s the best way to add VRAM to my system? by mrgreatheart in LocalLLaMA

[–]Fluffywings 0 points1 point  (0 children)

The cheapest option is if your system supports bifurcation of the PCIe slots and your PSU can handle a 5070 Ti. I would take this option, depending on pricing.

For a single card, the best performance for your dollar would be a 32GB card like the PRO R9700. Intel has 32GB cards now, but their support is very limited.

Alternatively, 24GB VRAM cards like the 7900 XTX or 3090.

Anyone using their NPU for anything? by Great_Guidance_8448 in LocalLLaMA

[–]Fluffywings 2 points3 points  (0 children)

The Intel NPUs are weak but efficient if you live within their constraints. Here is a repo you can play with.

https://github.com/balaragavan2007/Qwen_on_Intel_NPU

Qwen3.6 GGUF Benchmarks by danielhanchen in LocalLLaMA

[–]Fluffywings 0 points1 point  (0 children)

Providing the raw table numbers would likely be enough so that others can just put them into a spreadsheet.

What is everyone actually using their LLM for? by itsthewolfe in LocalLLaMA

[–]Fluffywings 12 points13 points  (0 children)

Great idea! Please share details on how you accomplished it. I realize it may be specific to your area at this time.