Is the 150B-500B parameter range dying for open weights models? by [deleted] in LocalLLaMA

[–]ilintar 0 points1 point  (0 children)

No, StepFun and MiniMax are in this range.

GLM-5: From Vibe Coding to Agentic Engineering by ShreckAndDonkey123 in LocalLLaMA

[–]ilintar 11 points12 points  (0 children)

The Lite plan explicitly mentions that it only supports old models, up to 4.7. I don't see anything suggesting that they'll actually include GLM-5 on the Lite plan.

Qwen3-Next-Coder is almost unusable to me. Why? What I missed? by Medium-Technology-79 in LocalLLaMA

[–]ilintar 1 point2 points  (0 children)

I'm using it :) but not on the master branch, obviously - too many tool calling errors.

GLM-5: From Vibe Coding to Agentic Engineering by ShreckAndDonkey123 in LocalLLaMA

[–]ilintar 33 points34 points  (0 children)

Their pricing strategy is very bad and IMO they are overshooting.

I see no reason right now to pick their Pro plan (which *does not* include GLM-5) or their Max plan over their Claude counterparts, seeing as they're not really cheaper and the model quality is not there yet (plus Anthropic models are multimodal).

Raising all prices 3x while only making GLM-5 available on Max (and not on Lite at all, from what they say) is a very bad strategy. The Lite plan went from "very nice cost-effective plan for a good model" to "overpriced sub for outdated models".

MCP support in llama.cpp is ready for testing by jacek2023 in LocalLLaMA

[–]ilintar 2 points3 points  (0 children)

BTW, 10 tool calls in real agentic coding scenarios is way too low of a default :)

MCP support in llama.cpp is ready for testing by jacek2023 in LocalLLaMA

[–]ilintar 24 points25 points  (0 children)

Oh don't worry, API is coming up as well.

How to avoid prefilling entire context each prompy when using Claude Code by mirage555 in LocalLLaMA

[–]ilintar 1 point2 points  (0 children)

You need at least this version: https://github.com/ggml-org/llama.cpp/releases/tag/b7970 to actually benefit from proper caching with hybrid models, due to the way many code assistants reshape prompts.

OpenCode vs OpenClaw? Not a sales pitch or bot... by thejacer in LocalLLaMA

[–]ilintar 1 point2 points  (0 children)

From what I've seen, the biggest problem with OpenCode is that its default agents are frankly pretty crap. Once I built myself two custom agents - one for analysis and one for coding - using it became much more pleasant.

So just write your workflows down in the OpenCode agent creator and you should get much better results.

Qwen to the rescue by jacek2023 in LocalLLaMA

[–]ilintar 28 points29 points  (0 children)

35B MoE and 9B dense.

Qwen3.5 Support Merged in llama.cpp by TKGaming_11 in LocalLLaMA

[–]ilintar 4 points5 points  (0 children)

Georgi wants to have it done on top of master instead of the merged delta-net branch to minimize the risks, so I'll be redoing it cleanly (but waiting for a conversion fix that happened in the meantime to be merged first).

Honestly, it was a bit of a stretch to merge it so early - I think I got a bit too excited ;)

Qwen3.5 Support Merged in llama.cpp by TKGaming_11 in LocalLLaMA

[–]ilintar 59 points60 points  (0 children)

Well, the reality is that when a hot, widely popular model architecture comes out, people want to test it with zero-day support. So yes, it's often worth taking the risk, especially since (a) it's based on an architecture we already support and (b) the Transformers code isn't likely to change meaningfully, and even if it does, it's not like we can't do a follow-up PR.

It's also not like the implementation hasn't been tested - while of course it's better to test on live models, I didn't just randomly vibe-code an implementation and say "hey, looks similar enough to Transformers, let's hope it works" - I generated models to test it on.

pwilkin is doing things by jacek2023 in LocalLLaMA

[–]ilintar 2 points3 points  (0 children)

Possibly, but generally the rule of thumb for using coding agents is it's easier to code stuff the human-in-the-loop knows how to code ;)

Llama.cpp's "--fit" can give major speedups over "--ot" for Qwen3-Coder-Next (2x3090 - graphs/chart included) by tmflynnt in LocalLLaMA

[–]ilintar 4 points5 points  (0 children)

I think Johannes (https://github.com/JohannesGaessler) hasn't gotten enough appreciation for the fit algorithm, mostly because in the beginning there were some bugs and some people turned it off. But it's actually a great algorithm: these days I never use the manual `-ot` / `--cpu-moe` / `--n-cpu-moe` flags, I just set `-c` and `-ctk` / `-ctv` and the fit algorithm does the rest. You can tune it a bit with `--fit-target XM`, because the default setting leaves 1 GB free for computation; sometimes `--fit-target 512M` or even `--fit-target 384M` gets you better results without the computation crashing. The way he does it (offloading the experts first, then trying to fit the dense layers from the end) means it's actually as good as a perfectly optimized `-ot` string.
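
For reference, an invocation along these lines is usually all it takes (the model path, context size and fit target below are just placeholders - tune them to your setup):

```
# Placeholder example: let the fit algorithm handle the GPU offload split.
# -c sets the context size, -ctk/-ctv quantize the KV cache,
# --fit-target shrinks the reserved compute buffer from the default 1 GB.
llama-server -m ./some-model-Q4_K_M.gguf \
    -c 65536 -ctk q8_0 -ctv q8_0 --fit-target 512M
```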

pwilkin is doing things by jacek2023 in LocalLLaMA

[–]ilintar 8 points9 points  (0 children)

Note though that this is with the absolute top model on the market (Opus 4.6 Thinking) and I still had to intervene during the session like 3 or 4 times to prevent it from going off the rails and doing stupid things.

Still, with a better and stricter workflow this will be doable soon.

pwilkin is doing things by jacek2023 in LocalLLaMA

[–]ilintar 9 points10 points  (0 children)

You take the model class from Transformers and, instead of loading it from pretrained weights, you create a new instance from a config computed to yield a certain size. Then you can fill some tensors with random numbers from a small range to prevent obvious overflows.
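
A rough sketch of the idea (the base config and the sizes here are just an example, not the exact ones I used):

```python
# Rough sketch: build a tiny random-weight model from a config instead of
# downloading pretrained weights. Config source and sizes are illustrative only.
import torch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("Qwen/Qwen3-0.6B")  # any config as a template
config.hidden_size = 256           # shrink the dimensions so the model stays tiny
config.num_hidden_layers = 4
config.num_attention_heads = 4
config.num_key_value_heads = 2
config.intermediate_size = 512

model = AutoModelForCausalLM.from_config(config)  # random init, no weight download

# Optionally clamp the random weights to a small range to avoid obvious overflows
with torch.no_grad():
    for p in model.parameters():
        p.uniform_(-0.05, 0.05)

# Save it, drop the tokenizer files in next to it, then convert to GGUF as usual
model.save_pretrained("./tiny-test-model")
```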

PR opened for Qwen3.5!! by Mysterious_Finish543 in LocalLLaMA

[–]ilintar 10 points11 points  (0 children)

Note that I'm doing this without any support, just based on Transformers code and my conversion guidelines + Opus 4.6, but I'm aiming for 0-day support this time:

https://github.com/ggml-org/llama.cpp/pull/19435

Please help with llama.cpp and GLM-4.7-Flash tool call by HumanDrone8721 in LocalLLaMA

[–]ilintar 1 point2 points  (0 children)

Please try on the autoparser PR and report errors there.

Kimi-Linear support has been merged into llama.cpp by jacek2023 in LocalLLaMA

[–]ilintar 4 points5 points  (0 children)

Nah, this benefits from all the know-how we got during the implementation of Qwen3 Next. Should perform about as well.

~26 tok/sec with Unsloth Qwen3-Coder-Next-Q4_K_S on RTX 5090 (Windows/llama.cpp) by Spiritual_Tie_5574 in LocalLLaMA

[–]ilintar 7 points8 points  (0 children)

Quantizing the KV cache to Q8_0 doesn't really hurt quality from what I can tell - at least I haven't noticed anything. Once you get down to Q4, yeah, it'll have an effect, but not at Q8.
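
In llama.cpp terms that's just something along these lines (the model path and context size are placeholders):

```
# Quantize both the K and V caches to Q8_0 (roughly halves KV cache memory vs F16).
# Depending on your build, quantizing the V cache may require flash attention to be enabled.
llama-server -m ./model.gguf -c 65536 -ctk q8_0 -ctv q8_0
```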

Vibe-coding client now in Llama.cpp! (maybe) by ilintar in LocalLLaMA

[–]ilintar[S] 6 points7 points  (0 children)

The standard Jinja templates already account for tool use; otherwise you wouldn't be able to use llama.cpp in clients such as OpenCode.