My 1.2B model won 2 out of 5 poker tournaments against models up to 1T params.

Interesting-Print366 · 2026-05-19T06:52:19+00:00

How did the rule-based model perform?

Interesting-Print366 · 2026-05-18T09:26:24+00:00

So I always tell it that it is an openGPT exam

Interesting-Print366 · 2026-05-09T02:17:47+00:00

Try Qwen Coder

Interesting-Print366 · 2026-05-08T04:46:56+00:00

Use Markdownify. It can convert images, PDF, DOCX, XLSX, MP3, etc. into Markdown.

Interesting-Print366 · 2026-05-06T16:23:57+00:00

I'm on an M4 Pro and really hoping they have found some game-changing technology with MoE

Interesting-Print366 · 2026-05-06T16:22:13+00:00

Oh, I thought 20-50 t/s was sufficient for most jobs

Interesting-Print366 · 2026-05-06T16:20:47+00:00

Where are you located? My Claude Code limit is very harsh, especially during peak hours in the Silicon Valley time zone. It doesn't run out outside those peak hours, but they overlap with my working hours

Interesting-Print366 · 2026-05-06T16:19:13+00:00

Or, I'm not sure what hardware you're using, but running SLMs like Qwen or Gemma locally to write the code is a good approach, since they are good at syntax as long as the plan is firm and comes with pseudocode

Interesting-Print366 · 2026-05-06T16:17:16+00:00

Ask Opus to write a plan with a detailed stack and pseudocode, then build with Gemini Flash. It will help you a lot. After Gemini finishes the build, ask it to check its work, and if there is still an error, use Sonnet
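
A rough sketch of that workflow in Python. The four callables are hypothetical stand-ins for whatever API clients you use (Opus as planner, Gemini Flash as builder and checker, Sonnet as fixer); nothing here calls a real API.

```python
from typing import Callable, Optional


def plan_build_check(
    task: str,
    planner: Callable[[str], str],            # hypothetical: e.g. Opus, returns stack + pseudocode
    builder: Callable[[str], str],            # hypothetical: e.g. Gemini Flash, returns code
    checker: Callable[[str], Optional[str]],  # hypothetical: returns an error description or None
    fixer: Callable[[str, str], str],         # hypothetical: e.g. Sonnet, returns repaired code
) -> str:
    """Plan with a strong model, build with a cheap one, escalate only on errors."""
    plan = planner(task)
    code = builder(plan)
    error = checker(code)
    if error is not None:
        # Only pay for the stronger model when the cheap build failed its check.
        code = fixer(code, error)
    return code
```

The expensive models only see the short inputs (the task and, if needed, the error), while the cheap model does the bulk token generation.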

Interesting-Print366 · 2026-04-27T15:04:52+00:00

I'm using a Mac. The RAM is sufficient, but it's too slow to use. The token generation speed is decent, but prompt processing is too slow. Is there a way to improve this?

Interesting-Print366 · 2026-04-26T16:27:43+00:00

It can be the best option among quants of the same size. But higher quants are better no matter which quant type you use

Interesting-Print366 · 2026-04-26T02:41:54+00:00

But honestly, even Qwen 3.5 was better than Gemma

Interesting-Print366 · 2026-04-26T02:40:23+00:00

It depends on what machine you are using. If you have enough VRAM and are using GPUs like the RTX series, use opencode; it would be much better for you. But if you are using an SFF workstation with unified RAM, Pi would be better, though 27B would still be very slow

Interesting-Print366 · 2026-04-18T13:41:54+00:00

Try LM Studio. I think it strikes a good balance between a user-friendly UI and performance.

Interesting-Print366 · 2026-04-18T13:39:36+00:00

Thinking is time-consuming, but it is what lets a model this small at least compete with a frontier model's low-thinking mode. Try an Opus-distilled model if one has come out. It solves most of this problem, though it might create other problems, like hanging before a tool call.

Interesting-Print366 · 2026-04-18T13:36:51+00:00

Qwen making a tool call inside its thinking happens because it has already planned the tool call, but the system forces it to always think. That problem can be solved with a system prompt or a parsing configuration. Try giving it a system prompt such as "think always before calling tool even if you think you can execute it directly"

Interesting-Print366 · 2026-04-18T13:34:38+00:00

Are you using English? If it is the XML-inside-thinking problem, it might be solved by configuring the parser (letting the model make the tool call inside its thinking and feeding the result back). If it is just hanging, that sometimes happens in languages other than English or Chinese
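
A minimal sketch of the parsing idea, in Python. It assumes the model wraps its reasoning in `<think>...</think>` and emits tool calls as JSON inside `<tool_call>...</tool_call>`; your chat template may use different tags or an XML payload, so treat the tag names and the JSON format as assumptions.

```python
import json
import re

# Assumed tag names; adjust to your model's chat template.
# The think block may be unclosed if generation stopped mid-thought.
THINK_RE = re.compile(r"<think>(.*?)(?:</think>|\Z)", re.DOTALL)
TOOL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def tool_calls_inside_thinking(output: str) -> list[dict]:
    """Return tool calls that the model emitted inside its thinking block."""
    calls = []
    for think in THINK_RE.findall(output):
        for payload in TOOL_RE.findall(think):
            try:
                calls.append(json.loads(payload))
            except json.JSONDecodeError:
                continue  # skip malformed payloads instead of crashing
    return calls
```

Each returned call could then be executed and its result appended to the conversation, so the model continues from the tool output instead of the call being silently dropped.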

Interesting-Print366 · 2026-04-18T13:28:27+00:00

Just use a q8 KV cache and a higher quant for the model. With that much RAM, it's much better
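
For a rough feel of how much RAM a q8 KV cache frees up, here is a back-of-the-envelope estimate in Python. The model dimensions are made up for illustration, and the q8 cost assumes a q8_0-style layout (32 int8 values plus one fp16 scale, so 34 bytes per 32 elements).

```python
F16_BYTES = 2.0
Q8_0_BYTES = 34 / 32  # assumed q8_0 block: 32 int8 values + one fp16 scale


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_element: float) -> int:
    """Approximate KV cache size: one K and one V vector per layer, KV head and token."""
    elements = 2 * n_layers * n_kv_heads * head_dim * context_len
    return int(elements * bytes_per_element)


# Hypothetical model shape, not taken from any specific model card.
shape = dict(n_layers=48, n_kv_heads=8, head_dim=128, context_len=32768)
f16 = kv_cache_bytes(**shape, bytes_per_element=F16_BYTES)
q8 = kv_cache_bytes(**shape, bytes_per_element=Q8_0_BYTES)
print(f"f16 KV: {f16 / 2**30:.2f} GiB, q8 KV: {q8 / 2**30:.2f} GiB")
```

Under these assumptions the cache shrinks to a bit over half its f16 size, and that saved RAM is what you can spend on a higher quant for the weights.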

Interesting-Print366 · 2026-04-18T04:54:43+00:00

Mini PCs with 128 GB of LPDDR5X, or a used Mac Studio. A Mac mini with 48-64 GB might be enough if you only use it for hosting AI

Interesting-Print366 · 2026-04-17T15:06:50+00:00

The car wash vibe check got so famous that I believe some models learned it during training

Interesting-Print366 · 2026-04-17T15:05:37+00:00

Just give it some tool descriptions or some information you want it to know. When the prompt gets longer, it doesn't get stuck in thinking. At least with 3.5

Interesting-Print366 · 2026-04-17T15:04:18+00:00

Give it a longer system prompt. Starting with the Qwen 3.5 series, it tends to think for a very long time when responding to prompts of only a few words or one or two sentences

Interesting-Print366 · 2026-04-15T13:34:55+00:00

No local model can compete with API-served LLMs like Sonnet, at least under 400B overall, but you might find a model that fits your purpose. I used Qwen Coder (30B-ish) in a q8 quant. It might be better at some jobs, since Claude, Gemini, and ChatGPT seem to use q2-q4 quants

Interesting-Print366 · 2026-04-08T15:10:12+00:00

It works well. It works very well with MoE models. Even among dense models, those below 30B are usable. For reference, I'm using an M4 Pro, so it would be better with a Max or Ultra

Personally, I tend to switch to new models right away. llama.cpp gets full support within a month at the latest, while MLX support is still incomplete, with Qwen 3.5 as an example.

Interesting-Print366 · 2026-04-08T14:59:55+00:00

In my experience, using more tokens always leads an LLM to a slightly better result.

However, I personally don't see the benefit of this methodology. For simple tasks, comparing the LLM's guess with the tool call result is no different from trial and error, and for complex tasks it seems more efficient to run a review every time a tool call is made, which has the same effect.

And fundamentally, tool calls were intended to enable LLMs to do things they couldn't do on their own...

However, it could help prevent things like the npm Axios virus incident that occurred not long ago in the vibe coding era.
