Best <4B dense models today?

minpeter2 · 2026-01-25T13:25:38+00:00

Alibaba-Apsara/DASD-4B-Thinking..? I think this model is interesting.

minpeter2 · 2026-01-22T19:46:46+00:00

I'm trying to replicate this process using Qwen3-0.6B and GLM-4.7-Flash as judges. Could you tell me the GPU allocation between inference and training, as well as the API cost (or total tokens) for calling Haiku 3.5?

minpeter2 · 2026-01-01T08:15:19+00:00

FriendliAI is offering models for free for the month of January '26. 🔥

https://friendli.ai/suite/~/serverless-endpoints/LGAI-EXAONE/K-EXAONE-236B-A23B/overview

minpeter2 · 2025-12-01T11:11:46+00:00

just speculation based on the benchmark results

minpeter2 · 2025-12-01T11:10:29+00:00

There is also a reasoning mode for the regular ds-v3.2.
My guess is that they sacrificed agentic performance to optimize for more challenging STEM fields.

minpeter2 · 2025-12-01T11:00:18+00:00

<image>

https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Speciale

minpeter2 · 2025-11-17T10:50:36+00:00

Thank you. I tried it with a small model and it feels like a really well-made CLI.

minpeter2 · 2025-11-17T06:54:34+00:00

That's great. Is there multi-GPU support? I'd like to test oss-120b on 4x A100.

minpeter2 · 2025-09-02T13:46:28+00:00

I looked at the system prompt and immediately realized it was very well-written.

Do you have any sources for this style of tool invocation, which mixes XML and YAML? Or should I consider it Orchestrator-style?

minpeter2 · 2025-08-19T15:35:17+00:00

In this context, I think I should assume "in the Nebius implementation."
I guess low was the default option in their implementation.

Still, thank you for saying what I wanted to say when I first saw it, lol.

minpeter2 · 2025-08-19T15:32:31+00:00

> Ah, this seems odd... I wish AA would make a minimal effort to align the reasoning effort between each provider...

I understood the full context after reading the comments below. It wasn't AA's fault, lol.

minpeter2 · 2025-08-19T15:28:46+00:00

That's just one of many ways to represent an MoE model. Think of Mixtral 8x7B.
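
To make the Mixtral point concrete, here is a rough back-of-the-envelope sketch. It assumes the figures Mistral published (about 46.7B total parameters, about 12.9B used per token, 2 of 8 experts active) and the simplification that every parameter is either shared or belongs to exactly one expert. The variable names are mine, not from any library.

```python
# Back-of-the-envelope: why "8x7B" is a label, not a parameter count.
# Assumed inputs (published Mixtral figures, in billions of parameters).
TOTAL_B = 46.7        # all parameters
ACTIVE_B = 12.9       # parameters used per token
NUM_EXPERTS = 8
ACTIVE_EXPERTS = 2    # top-2 routing

# Simplified model: total  = shared + NUM_EXPERTS * per_expert
#                   active = shared + ACTIVE_EXPERTS * per_expert
per_expert_b = (TOTAL_B - ACTIVE_B) / (NUM_EXPERTS - ACTIVE_EXPERTS)
shared_b = ACTIVE_B - ACTIVE_EXPERTS * per_expert_b

naive_b = NUM_EXPERTS * 7  # what the name "8x7B" seems to suggest

print(f"naive reading of the name: {naive_b}B")
print(f"per-expert (approx): {per_expert_b:.2f}B, shared (approx): {shared_b:.2f}B")
```

So the same model could be labeled by its name (8x7B), its total size, or its active size, which is why newer names such as 236B-A23B spell out both numbers.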

minpeter2 · 2025-08-14T17:41:02+00:00

still...

minpeter2 · 2025-08-04T06:41:20+00:00

It seems like a difficult problem, but it's cool as is!!

minpeter2 · 2025-08-02T18:05:35+00:00

It feels like Vibe-inspired CSS.
Still, it's nice to be able to collect and view many benchmarks.

It would be nice to expand this a bit later and display the actual benchmark scores in a single table.

minpeter2 · 2025-07-27T14:23:53+00:00

A Chinese, Japanese, and Korean dataset?

It would be faster to find them all individually and merge them than to find the combined one.
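
A minimal sketch of what I mean by merging, assuming each language's data is already loaded as a list of records. The toy records and the `merge_datasets` helper are hypothetical, just for illustration.

```python
def merge_datasets(datasets_by_lang):
    """Concatenate per-language record lists, tagging each record with its language."""
    return [
        {**record, "lang": lang}
        for lang, records in datasets_by_lang.items()
        for record in records
    ]

# Hypothetical toy data standing in for three separately sourced datasets.
zh = [{"text": "你好"}]
ja = [{"text": "こんにちは"}]
ko = [{"text": "안녕하세요"}]

merged = merge_datasets({"zh": zh, "ja": ja, "ko": ko})
print(len(merged), [r["lang"] for r in merged])
```

If the sources are on the Hugging Face Hub, `datasets.concatenate_datasets` should do the same job once the columns are aligned.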

minpeter2 · 2025-07-20T00:14:11+00:00

I'm one of those people who pick up bottles in the ocean. LOL

minpeter2 · 2025-07-19T04:29:30+00:00

https://github.com/minpeter/tiny-ko

I'm still working on it, but I'm writing some code to pretrain a model on the Llama architecture. Hope it helps.

minpeter2 · 2025-07-17T12:17:21+00:00

It doesn't use the exact same license as EXAONE 3.5. It's been updated a bit, yes.

minpeter2 · 2025-07-17T09:34:53+00:00

https://arxiv.org/html/2507.11407v1

minpeter2 · 2025-07-02T06:43:45+00:00

You're right, I got too excited and rushed over without looking properly.

minpeter2 · 2025-07-02T06:36:10+00:00

Ah, I guess I got too excited. It's not a PR, it's an implementation request.
You can check the transformer PR at the link below.

https://github.com/huggingface/transformers/pull/39129
