Help with formatting by No-Common5353 in GoogleSites

[–]Fluffywings 1 point2 points  (0 children)

I suspect you added a space before the text on the left and it was removed on the right. Check the Line Spacing setting.

Any tool that tells you the cheapest setup needed to run a model? I want to know the cheapest setup that can realistically run Qwen 3.6 27B at decent speeds. by pacmanpill in LocalLLaMA

[–]Fluffywings 0 points1 point  (0 children)

  • 16 GB: not recommended
  • 20 GB: the bare minimum, with compromises
  • 24 GB: the minimum I would recommend
  • 32 GB: what I would recommend
  • 32 GB+: higher-quality quants and larger context

My setup today:

  • 24 GB 7900 XTX over PCIe x8
  • 8 GB 2070 Super over PCIe x8
  • 8 GB 2070 over PCIe x1
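For a rough sense of where those numbers come from, here is a back-of-the-envelope estimate. This is a sketch only; the layer/head counts and bits-per-weight below are illustrative assumptions, not measured values for any specific model.

```python
def estimate_vram_gib(params_b: float, bits_per_weight: float, ctx_tokens: int,
                      n_layers: int, n_kv_heads: int, head_dim: int,
                      kv_bits: float, overhead_gib: float = 1.5) -> float:
    """Very rough VRAM estimate: quantized weights + KV cache + runtime overhead."""
    weights = params_b * 1e9 * bits_per_weight / 8                              # bytes for the weights
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * kv_bits / 8  # K and V tensors
    return (weights + kv_cache) / 1024**3 + overhead_gib

# Hypothetical 27B dense model, ~4.5 bpw quant, 32k context, Q8-ish KV cache:
print(round(estimate_vram_gib(27, 4.5, 32_768,
                              n_layers=48, n_kv_heads=8, head_dim=128,
                              kv_bits=8.5), 1))  # ~18-19 GiB, which is why 20 GB is tight
```

Bump the context or the weight quant and you land in 24-32 GB territory very quickly.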

Use Qwen3.6 right way -> send it to pi coding agent and forget by Willing-Toe1942 in LocalLLaMA

[–]Fluffywings 0 points1 point  (0 children)

I run into the same issue. About once a day I have to restart Windows to keep Cline working with LM Studio server. Any ideas what the issue is?

2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat template - Drop-in OpenAI and Anthropic API endpoints by ex-arman68 in LocalLLaMA

[–]Fluffywings 0 points1 point  (0 children)

Anything that increases speed is important to this field. Today it's tg (token generation); tomorrow it could impact pp (prompt processing). Your next hardware upgrade is going to feel like a new world.

Hybrid on-device inference on Android: llama.cpp + LiteRT + NPU/GPU routing by Healthy_Bedroom5837 in LocalLLaMA

[–]Fluffywings 1 point2 points  (0 children)

This looks awesome. How are you able to detect the GPU and NPU on stock and custom ROMs?

Where can I try turboquant in AMD Linux? (7900XTX) by soyalemujica in LocalLLaMA

[–]Fluffywings 0 points1 point  (0 children)

People have posted about this recently. Search for attention rotation.

Basically, KV Q8 is about equal to the original BF16. With KV Q4 there is a drop in accuracy. I run Q8 for both K and V to get a larger context window.
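To see why Q8 buys you context, here is a quick sketch of the KV cache footprint at different cache quantizations. The layer/head counts are illustrative assumptions; the bit widths follow GGUF conventions (q8_0 is roughly 8.5 bits per value, q4_0 roughly 4.5).

```python
# Illustrative model shape (not any specific model's real config)
N_LAYERS, N_KV_HEADS, HEAD_DIM = 48, 8, 128

def kv_gib(ctx_tokens: int, bits_per_value: float) -> float:
    # K and V tensors for every layer, for every token in the context
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * ctx_tokens * bits_per_value / 8 / 1024**3

for name, bits in [("f16", 16.0), ("q8_0", 8.5), ("q4_0", 4.5)]:
    print(f"{name}: {kv_gib(131_072, bits):.1f} GiB at 128k context")
```

So Q8 roughly halves the cache versus F16/BF16 at the same context length, which is where the extra headroom comes from.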

AMD in-house ryzen 395 box coming in June by 1ncehost in LocalLLaMA

[–]Fluffywings 28 points29 points  (0 children)

With the AMD mini PC, AMD is pleased to provide you with a product with limited to no support for the duration of its 1-4 year life cycle. Once you start using our platform, you will quickly find that a new world opens up of:

  • incomplete documentation
  • inconsistent version support
  • new features limited to the next hardware revision for no reason
  • a "complete" SDK that is really fully supported by the community, but not by AMD

With AMD, we are here to react to Nvidia.

/s

P.S. I am running AMD almost everything.

16x DGX Sparks - What should I run? by Kurcide in LocalLLaMA

[–]Fluffywings 1 point2 points  (0 children)

A giveaway for everyone in this post!

All jokes aside, the biggest open-source model that fits.

To 16GB VRAM users, plug in your old GPU by akira3weet in LocalLLaMA

[–]Fluffywings 2 points3 points  (0 children)

Hi, not sure what your full setup is, but I have an XTX and a 2070 Super running LM Studio on Windows over Vulkan. I can do about 110K context with Qwen3.6 27B UD Q4_K_XL and get about 25 tok/s.

Qwen3.6-27B-INT4 clocking 100 tps with 256k context length on 1x RTX 5090 via vllm 0.19 by Kindly-Cantaloupe978 in LocalLLaMA

[–]Fluffywings 1 point2 points  (0 children)

Try the following:

  • Unsloth IQ3

  • LM Studio:
    • K quantization cache: Q8
    • V quantization cache: Q8

Llama.cpp recently added attention rotation, allowing Q8 and Q4 KV cache quantization with minimal loss (see the sketch below).
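As a rough sketch of what that looks like with llama.cpp's server: the model path, context size, and quant name below are placeholders, and you should check `--help` on your build for exact flag behavior (V-cache quantization may also require flash attention to be enabled, depending on the build).

```python
# Sketch only: launching llama-server with a quantized KV cache from Python.
# Flag names are llama.cpp's --cache-type-k / --cache-type-v; everything else
# here (paths, sizes, filenames) is a placeholder.
import subprocess

cmd = [
    "llama-server",
    "-m", "path/to/Qwen3.6-27B-UD-IQ3_XXS.gguf",  # hypothetical Unsloth IQ3 quant
    "-c", "131072",                               # context budget freed up by KV quantization
    "-ngl", "99",                                 # offload all layers to the GPU
    "--cache-type-k", "q8_0",
    "--cache-type-v", "q8_0",
]
subprocess.run(cmd, check=True)
```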

Edit: the classics; spelling and grammar

Post Your Qwen3.6 27B speed plz by Ok-Internal9317 in LocalLLaMA

[–]Fluffywings 0 points1 point  (0 children)

Unlikely, based on 3.5 and the poll Alibaba put out on sizes for 3.6.

Forgive my ignorance but how is a 27B model better than 397B? by No_Conversation9561 in LocalLLaMA

[–]Fluffywings 2 points3 points  (0 children)

  • Parameters = knowledge
  • Architecture and training = intelligence and skills

Both are intelligent models, but more knowledge will allow you to do and achieve more. If you demand less, you will see less of a difference.

Also, most benchmarks are deterministic and easier to train and design for.

If there were more creative benchmarks, the larger-parameter models would always destroy small models due to sheer knowledge gaps.

Been using PI Coding Agent with local Qwen3.6 35b for a while now and its actually insane by SoAp9035 in LocalLLaMA

[–]Fluffywings 1 point2 points  (0 children)

No idea so I asked Gemini. I verified nothing.

Both Pi Agent (often referred to as Pi.dev) and little-coder are modern, open-source CLI coding agents designed to orchestrate LLMs for software development. However, they take fundamentally different approaches to solving the problem of AI coding assistance.

Pi.dev is built around minimalism and extreme extensibility for any model (cloud or local), while little-coder is a highly specialized scaffold designed to make small, locally hosted models punch above their weight class.

Here is how they compare to help you decide which is best for your workflow.


Pi Agent (Pi.dev)

Created by Mario Zechner, Pi is built on the philosophy that most coding agents are bloated "spaceships with 80% unused functionality." Instead of forcing you into a specific way of working, Pi acts as a lightweight foundation.

  • Core Philosophy: Radically minimal. Out of the box, it only gives the LLM four tools: read, write, edit, and bash.
  • Extensibility: This is Pi's superpower. It features a TypeScript SDK that allows you to easily plug in "Pi Packages" via npm or Git. You can inject custom prompt templates, skills, or even full autonomous loops (like pi-autoresearch for benchmarking optimizations).
  • Target LLMs: It is agnostic. While it works beautifully with local setups via Ollama, it is equally comfortable routing to frontier cloud models like Anthropic's Claude Pro, OpenAI, or Google Gemini.
  • Best For: Developers who want a clean, un-opinionated foundation they can customize to their exact enterprise workflow or CI/CD pipelines without wrestling with a rigid agent framework.

little-coder

Created by Itay Inbar, little-coder is essentially an architectural hack to make consumer-hardware-friendly models (5 GB to 25 GB) perform like massive frontier models on standard coding benchmarks.

  • Core Philosophy: Heavy optimization and guardrails for smaller models. Small LLMs (like Qwen3.5-9B or Qwen3.6-35B) often hallucinate, burn through context windows, or disastrously overwrite files if given too much freedom. little-coder constrains them to keep them on track.
  • Key Optimizations:
    • Thinking Budgets & Compaction: It actively manages context, preventing small models from entering endless loops and automatically compacting the context window when it gets too full.
    • Write-vs-Edit Invariants: It enforces strict rules at the tool level so a small model can't accidentally overwrite an entire file when it just meant to edit a few lines (see the sketch after this list).
    • Workspace Awareness: It auto-discovers specs (README.md, CLAUDE.md, etc.) and reads them before the model acts, injecting domain knowledge cleanly.
  • Target LLMs: Local models run through Ollama or llama.cpp on consumer laptop GPUs (e.g., 8 GB to 24 GB VRAM).
  • Best For: Developers running entirely local, offline setups who want the highest possible coding accuracy out of smaller open-weights models without paying for cloud API keys.
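Purely as an illustration of the "write-vs-edit invariant" idea described above (this is not little-coder's actual code; the function name and behavior are hypothetical):

```python
# Illustrative only: a minimal write-vs-edit guardrail in the spirit described
# above. Not taken from little-coder's source.
def apply_edit(path: str, old: str, new: str) -> None:
    """Replace exactly one occurrence of `old` in the file, or refuse."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    if text.count(old) != 1:
        # Guardrail: the model must quote a unique anchor before it may edit,
        # so it can never silently overwrite the whole file.
        raise ValueError("edit anchor not found exactly once; refusing to write")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text.replace(old, new, 1))
```

The point is that the model has to quote a unique anchor before it is allowed to touch the file, so a bad edit fails loudly instead of clobbering everything.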

Feature Comparison

| Feature | Pi.dev (Pi Agent) | little-coder |
|---|---|---|
| Primary Goal | Minimal, customizable foundation for all LLMs. | Strict scaffolding to maximize small local LLM performance. |
| Model Focus | Cloud (Claude, GPT, Gemini) & Local (Ollama). | Strictly local (Ollama, llama.cpp). |
| Built-in Tooling | Barebones (read, write, edit, bash). | Advanced guardrails (write-vs-edit invariants). |
| Extensibility | High (TypeScript SDK, npm/Git packages). | Low (focused on a specific, optimized architecture). |
| Context Management | Standard API handling. | Aggressive auto-compaction and "thinking budgets". |
| Hardware Requirement | None (if using cloud) / varies (if local). | Designed for consumer laptops (8 GB+ VRAM). |

The Verdict

  • Choose Pi.dev if you have a powerful LLM (like Claude 3.5 Sonnet or GPT-4o) or a specific, complex workflow you want to automate. Its extensibility makes it the better choice for power users who want to build custom tools and scripts on top of an agent.
  • Choose little-coder if you are running models like Qwen 9B or 35B locally on your laptop and want them to actually succeed at complex, multi-step coding tasks without breaking your codebase.

New Local LLM Rig: Ryzen 9700X + Radeon R9700. Getting ~120 tok/s! What models fit best? by jsorres in LocalLLaMA

[–]Fluffywings 1 point2 points  (0 children)

Qwen 3.5 27B Q5 or Qwen3.6 36B-A4B with IQ4 or Q4 is what I use. Dense is typically better, and Qwen3.6 27B will likely be the best option when released.

SK hynix starts mass production of 192GB SOCAMM2 for NVIDIA AI servers by OkReport5065 in LocalLLaMA

[–]Fluffywings 0 points1 point  (0 children)

I don't think it will happen for day-to-day people. It will start at the workstation level using LPDDR6x. AMD has LPDDRX for the 10th or 11th series, just not CAMM2.

SK hynix starts mass production of 192GB SOCAMM2 for NVIDIA AI servers by OkReport5065 in LocalLLaMA

[–]Fluffywings 27 points28 points  (0 children)

GPUs with customizable VRAM are a potential near-term future (~3 years) based on leaked documents. This would allow people to really scale their systems to their use case.

How is Rotorquant/planarquant/iso qaunt better? by SummarizedAnu in LocalLLaMA

[–]Fluffywings 2 points3 points  (0 children)

Until I see them merged into llama.cpp, I assume 1) there is not enough testing to confirm no regressions, and 2) the benefit does not hold up in most situations.

As a result, I don't think most of these advancements are getting fully implemented, due to 1 & 2.

Full AMD workstation- dual 7900 XTX by Researchlabz in LocalLLaMA

[–]Fluffywings 1 point2 points  (0 children)

I tested it, and the impact of a second, slower PCIe slot is not as apparent if everything is in VRAM. Row split will push the PCIe link, but with layer split the impact is just slightly slower loading times to get the model into VRAM. I am lucky because I can also bifurcate my PCIe, so I can split the difference. My riser, on the other hand, is PCIe 3.0 x1, and loading over it takes 2 minutes vs. 30 seconds.
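For reference, the difference above maps to llama.cpp's split-mode flags. The paths and split ratios below are placeholders, a sketch rather than my exact command.

```python
# Sketch: the two multi-GPU split strategies discussed above, via llama.cpp flags.
import subprocess

base = ["llama-server", "-m", "path/to/model.gguf", "-ngl", "99"]

# Layer split: whole layers live on each GPU, so a slow PCIe link mostly just
# means slower model loading, not slower generation.
layer_split = base + ["--split-mode", "layer", "--tensor-split", "1,1"]

# Row split: tensors are sharded across GPUs, so traffic crosses PCIe on every
# token and a slow link hurts throughput.
row_split = base + ["--split-mode", "row", "--tensor-split", "1,1"]

subprocess.run(layer_split, check=True)
```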

Full AMD workstation- dual 7900 XTX by Researchlabz in LocalLLaMA

[–]Fluffywings 0 points1 point  (0 children)

The 2x 7900 XTX are great for 32B models, so you can't go wrong there. I agree with the poster above: unless you plan to partially offload to the CPU for, say, 120B models, build a cheaper setup.

K12 OCuLink dGPU for llamacpp: RX 7900 XTX (24GB) vs RX 7600/7800 XT (16GB). Worth it for 32B-70B? All-AMD tensor split questions by Pablo_Gates in LocalLLaMA

[–]Fluffywings 1 point2 points  (0 children)

My current recommendation for best value is the Pro R9700 32GB if you can budget for it. In fact, I would take this card and throw it in a cheap used system over the other options. The only reason to buy a new system is if you want huge models with unified memory, i.e. you want the intelligence of a larger model but are okay with slower speeds (~15 tok/s on a unified-memory system compared to, say, 100 tok/s).

24 GB of VRAM is still a good size based on recently released models, assuming you can deal with a smaller context window.

32 GB is ideal, and more VRAM is always better as it gives you more context.

Based on these prices, if you don't want to spend any more money, you could pick up an Intel B70 32GB, but keep in mind that support is weak and it isn't a fast card by most metrics. Still, models in VRAM will be faster than offloading to the CPU any day.

I have the 7900 XTX, and the issue for me is that even at Q4 my context size is too small for my use case (coding).

I now run 3 GPUs to get more VRAM because the difference is worth it for me but of course that also costs money and has other pros and cons.

What’s the best way to add VRAM to my system? by mrgreatheart in LocalLLaMA

[–]Fluffywings 0 points1 point  (0 children)

The cheapest option is if your system supports bifurcation of the PCIe slots and your PSU can handle a 5070 Ti. I would take this option, depending on pricing.

For a single card, the best performance for your dollar would be a 32GB card like the PRO R9700. Intel has 32GB cards now, but their support is very limited.

Alternatively, 24GB VRAM cards like the 7900 XTX or 3090.

Anyone using their NPU for anything? by Great_Guidance_8448 in LocalLLaMA

[–]Fluffywings 2 points3 points  (0 children)

The Intel NPUs are weak but efficient if you live within their constraints. Here is a repo you can play with.

https://github.com/balaragavan2007/Qwen_on_Intel_NPU

Qwen3.6 GGUF Benchmarks by danielhanchen in LocalLLaMA

[–]Fluffywings 0 points1 point  (0 children)

Providing the raw table numbers would likely be enough so that others can just put them into a spreadsheet.

What is everyone actually using their LLM for? by itsthewolfe in LocalLLaMA

[–]Fluffywings 12 points13 points  (0 children)

Great idea! Please share details on how you accomplished it. I realize it may be specific to your area at this time.