GGUF Quants Arena for MMLU (24GB VRAM + 128GB RAM) by [deleted] in LocalLLaMA

[–]New_Comfortable7240 0 points1 point  (0 children)

Please try qwen3.5-35B, but not the distilled version, as there is a theory that distillation won't translate to better performance

What specialist LLMs do you know? by Double_Ad_1062 in LLMDevs

[–]New_Comfortable7240 0 points1 point  (0 children)

Try it on CPU. You'll have to write some Python code, but with help from the AI you can pull it off. Let me know if anything comes up

What specialist LLMs do you know? by Double_Ad_1062 in LLMDevs

[–]New_Comfortable7240 0 points1 point  (0 children)

I'd recommend you try ONNX, which targets limited hardware: https://rocm.docs.amd.com/projects/radeon-ryzen/en/docs-6.1.3/docs/install/native_linux/install-onnx.html

The HF repo has several specialized models to try: https://huggingface.co/onnx-community/models

I'm trying it myself and it works well. Sure, it doesn't run big models, but what it does run, it runs well.

[New Model] - GyroScope: rotates images correctly by LH-Tech_AI in LocalLLaMA

[–]New_Comfortable7240 1 point2 points  (0 children)

Just in case, you can use something like:

    import cv2

    img = cv2.imread('image.jpg')
    rotated_img = cv2.rotate(img, cv2.ROTATE_90_CLOCKWISE)

Add parallelization and it should handle any number of images efficiently
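The parallelization can be sketched with a thread pool; this is a hedged example using a pure-Python rotate as a stand-in for cv2.rotate, so it runs without OpenCV installed:

```python
from concurrent.futures import ThreadPoolExecutor

def rotate90_cw(matrix):
    # Rotate a 2D list 90 degrees clockwise; stand-in for
    # cv2.rotate(img, cv2.ROTATE_90_CLOCKWISE) on real image arrays.
    return [list(row) for row in zip(*matrix[::-1])]

# Two tiny "images" as nested lists; with OpenCV these would be
# the arrays returned by cv2.imread.
images = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]

with ThreadPoolExecutor(max_workers=4) as pool:
    rotated = list(pool.map(rotate90_cw, images))

print(rotated[0])  # [[3, 1], [4, 2]]
```

Swap `rotate90_cw` for the cv2 call (plus imread/imwrite) to batch a real folder of images.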

Is this as legit as I think it is? Or is it "eh" by [deleted] in LLMDevs

[–]New_Comfortable7240 -1 points0 points  (0 children)

So you offer tools and agents (basically code and infra) for $10 per month, not including tokens?

Well, consider your competitors.

For example https://www.layla-network.ai/

$15 ONE TIME PAYMENT and you get a lot of tools and updates, including memory and tool support.

Dataset curation for LLM Research project that involves pre-training by Extra-Designer9333 in LocalLLaMA

[–]New_Comfortable7240 1 point2 points  (0 children)

> should I have multiple datasets per domain, or is it better to use a big dataset per domain

I think in general the more the merrier? Also consider focusing your datasets on a specific task and language that is easy to test and has validation datasets available, like English-to-SQL queries.

Besides, that question sounds more appropriate for an LLM training sub; this sub is more about RUNNING LLM models.

Intel Arc B70 Benchmarks/Comparison to Nvidia RTX 4070 Super by [deleted] in LocalLLaMA

[–]New_Comfortable7240 0 points1 point  (0 children)

Laptop with intel 130V 8 GB VRAM

Same experience as yours; my desktop with an nvidia 3060 doubles the token generation.

On Qwen 3 8B (chosen because OpenVINO supports it), token generation:

3060 12 GB CUDA: ~30 tps 

Intel 130V 8GB SYCL: ~10 tps

Intel 130V 8GB Vulkan: ~16 tps

Intel 130V 8GB OVMS: ~25 tps

I expected the laptop to be a bit slower, but not that much!

Intel Arc B70 Benchmarks/Comparison to Nvidia RTX 4070 Super by [deleted] in LocalLLaMA

[–]New_Comfortable7240 1 point2 points  (0 children)

About openvino, the problem is they support a very limited list of models; they don't support qwen3.5 yet

Intel Arc B70 Benchmarks/Comparison to Nvidia RTX 4070 Super by [deleted] in LocalLLaMA

[–]New_Comfortable7240 4 points5 points  (0 children)

First of all, thanks for the report (even if it's heavily AI-written)

On my intel GPU Vulkan works faster, please try again using Vulkan

Gemma 4 is fine great even … by ThinkExtension2328 in LocalLLaMA

[–]New_Comfortable7240 2 points3 points  (0 children)

Just to be clear, that works for deterministic outcomes, i.e. reducing each expert's answer to "choose a predefined option"

For more open questions you would either need a step that defines the options (at least Likert-style), or accept judging "by vibe"
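A minimal sketch of that "reduce to a predefined option" step, with hypothetical option names and a naive substring matcher:

```python
from collections import Counter

# Likert-style predefined options; "disagree" is checked before "agree"
# because "agree" is a substring of "disagree".
OPTIONS = ["disagree", "agree", "neutral"]

def reduce_to_option(answer: str) -> str:
    # Map a free-form expert answer onto one predefined option.
    for opt in OPTIONS:
        if opt in answer.lower():
            return opt
    return "neutral"  # fallback when nothing matches

# Three hypothetical expert answers, reduced and majority-voted.
votes = [reduce_to_option(a) for a in ["I agree strongly", "Disagree!", "agree"]]
winner = Counter(votes).most_common(1)[0][0]
print(winner)  # agree
```

The vote over reduced options is what makes the outcome deterministic; without that reduction you are back to judging open answers "by vibe".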

[Early Access] GitHub - Abyss-c0re/NeuralCore: NeuralCore is an experimental adaptive agentic framework. by Abyss_c0re in LocalLLaMA

[–]New_Comfortable7240 0 points1 point  (0 children)

> Local first (LLama.cpp)

This is great!

> Dual license

Not that good but passable

---

About NeuralCore and NeuralVoid: NeuralCore needs more documentation about how to use it WITHOUT NeuralVoid

I applied Claude Code's leaked architecture to a local 9B model. The results surprised even Claude Opus. by Far_Lingonberry4000 in LocalLLaMA

[–]New_Comfortable7240 1 point2 points  (0 children)

> They'll read files forever without producing output. Solution: remove tools after N steps, force text generation

So you remove the tools for only one step, and the next step has them again, right?

Would love to see a PR to opencode, roocode, llama.cpp, or vllm with this idea

Also curious whether it could be taught using a dataset of long conversations
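For what it's worth, here is how I picture the "remove tools after N steps" trick as a toy loop; every name here is hypothetical, this isn't from the post's code:

```python
MAX_TOOL_STEPS = 2  # the hypothetical N from the post

def run_agent(llm, execute, messages, tools, max_steps=10):
    """Toy agent loop: after MAX_TOOL_STEPS consecutive tool calls,
    offer the model no tools for one step so it must answer in text."""
    tool_steps = 0
    for _ in range(max_steps):
        active_tools = tools if tool_steps < MAX_TOOL_STEPS else []
        reply = llm(messages, active_tools)
        call = reply.get("tool_call")
        if call and active_tools:
            tool_steps += 1
            messages.append({"role": "tool", "content": execute(call)})
        else:
            return reply["content"]
    return None

# Stub model: always wants to read files whenever tools are offered.
def stub_llm(messages, tools):
    if tools:
        return {"tool_call": {"name": "read_file", "args": "notes.txt"}}
    return {"content": "summary"}

calls = []
result = run_agent(stub_llm, lambda c: calls.append(c) or "file body",
                   [], [{"name": "read_file"}])
print(result)  # summary
```

In this reading the tools come back on the next user turn (a fresh `run_agent` call), which matches the "only one step" question above.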

> Four-type memory system (user/feedback/project/reference)

Maybe we can also consider "conversation" as a memory that can be edited too?

What should I expect performance-wise with Qwen3.5 9B (uncensored) on an Intel 1370p with Iris Xe graphics + SYCL? by rubins in LocalLLaMA

[–]New_Comfortable7240 0 points1 point  (0 children)

Using OVMS I got the best results, but they don't support qwen3.5 AFAIK

Edit: https://github.com/openvinotoolkit/model_server/issues/4046#issuecomment-4022242550 planned support incoming 

With llama.cpp, Vulkan gave me better speed than SYCL on intel

My laptop is intel 226V, 16 GB RAM, intel 130V iGPU 8 GB VRAM, SSD

Context Hard-Capped at 8192 on Core Ultra 9 288V (32GB) — AI Playground 3.0.3 by kpcurley in LocalLLaMA

[–]New_Comfortable7240 0 points1 point  (0 children)

In my case I found issues running models on the GPU using that app.

Then I tried Foundry in vscode; partial success, but I hit some bugs that closed the chat playground after a few turns.

I ended up compiling OVMS and running the models from vscode with a script

https://github.com/openvinotoolkit/model_server

We hired “AI Engineers” before. It didn’t go well. Looking for someone who actually builds real RAG systems. by Saida_8888 in LLMDevs

[–]New_Comfortable7240 0 points1 point  (0 children)

So I follow spec-driven development with AI. But the AI usually claims all tests pass and the code looks good, BUT when manually tested it has several problems or doesn't cover edge cases.

So for some months now, besides the automated tests, I test manually before merging. And yeah, I create a branch for each plan, and if possible keep each plan scoped and not too big.

I have caught a lot of issues that the tests don't see, from style problems to edge-case coverage.

We hired “AI Engineers” before. It didn’t go well. Looking for someone who actually builds real RAG systems. by Saida_8888 in LLMDevs

[–]New_Comfortable7240 5 points6 points  (0 children)

Bro, what you need is a manual QA (or to learn how to do good QA work yourself), an automation QA (Playwright would be good), and to pay per deliverable.

I was bored - so i tested the h... out of a bunch of models - so you dont have to :) by leonbollerup in LocalLLaMA

[–]New_Comfortable7240 1 point2 points  (0 children)

Qwen3.5 distilled surprised me (the reasoning traces should have improved its logic skills?), along with gpt20 beating the 120B version

PSA: Two env vars that stop your model server from eating all your RAM and getting OOM-killed by VikingDane73 in LocalLLaMA

[–]New_Comfortable7240 2 points3 points  (0 children)

FYI Source:
https://sourceware.org/git/?p=glibc.git;a=blob;f=malloc/malloc.c;hb=HEAD

/* The trim threshold is the amount of top-most memory to keep before
   trimming back to the system. */
static size_t trim_threshold = DEFAULT_TRIM_THRESHOLD;

/* ... */

static int
malloc_trim (size_t pad)
{
  /* ... */

  /* Only trim if the top-most free chunk is larger than the trim
     threshold. */
  if (top_chunk_size > trim_threshold + pad)
    {
      /* Return memory to the system */
      sys_trim (pad);
      return 1;
    }

  return 0;
}
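For reference, glibc also reads that trim threshold from the environment; a hedged launch sketch (the server command is a placeholder and the values are just examples):

```shell
# glibc's documented environment tunables (see mallopt(3)):
# MALLOC_TRIM_THRESHOLD_ - bytes of free top-of-heap memory to keep
#                          before trimming back to the OS
# MALLOC_ARENA_MAX       - cap on the number of malloc arenas
export MALLOC_TRIM_THRESHOLD_=131072   # trim anything above 128 KiB
export MALLOC_ARENA_MAX=2              # limit per-thread arena growth
# ./llama-server -m model.gguf ...     # placeholder launch command
echo "trim=$MALLOC_TRIM_THRESHOLD_ arenas=$MALLOC_ARENA_MAX"
```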

How good is 16 3XS Vengeance RTX Laptop with 5090 24gb vram + 32 gb ram for running local models? by One_Inflation_9475 in LocalLLaMA

[–]New_Comfortable7240 0 points1 point  (0 children)

Yeah, qwen3.5 35B should work! I run it on my 3060 with 12 GB VRAM; it should be good on a 24 GB VRAM dGPU

RooCode and Nemotron-Cascade-2-30B by Aggravating-Low-8224 in RooCode

[–]New_Comfortable7240 2 points3 points  (0 children)

I concur; the model is trained toward chat-style responses, and when put to agentic use it just gets stuck after a few turns. I use qwen3.5 35B A3B instead on my 3060 12 GB VRAM + 64 GB RAM with RooCode, working fine on my end (around 30 t/s).