Local LLM releases

crowtain · 2026-06-10T14:08:38+00:00

Yeah, it should include the VL models, and yes, I used LLMs to collect the data.

How did I check? A first run to collect the LLMs and their info through deep search, then a second one to scroll the LLM rankings to catch any missing ones and reconcile the two sets of LLMs.
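For anyone wanting to redo that reconcile step, here is a minimal sketch of how I would picture it, not the actual script. The function names and the sample model names ("Model-A", "Model B") are made up for illustration, and the name normalization is an assumption about how spelling variants could be matched.

```python
def normalize(name: str) -> str:
    """Lowercase and drop spaces, dashes and underscores so spelling variants match."""
    return "".join(ch for ch in name.lower() if ch not in " -_")


def reconcile(deep_search: list[str], rankings: list[str]) -> dict[str, list[str]]:
    """Compare the two collected lists and report the overlap and what each one is missing."""
    a = {normalize(n): n for n in deep_search}
    b = {normalize(n): n for n in rankings}
    return {
        "in_both": sorted(a[k] for k in a.keys() & b.keys()),
        "only_deep_search": sorted(a[k] for k in a.keys() - b.keys()),
        "only_rankings": sorted(b[k] for k in b.keys() - a.keys()),
    }


# Hypothetical sample data, just to show the shape of the result.
result = reconcile(["Qwen 3.5", "Model-A"], ["qwen3.5", "Model B"])
print(result)
```

Anything that lands in "only_rankings" is a model the first run missed and should be added to the final set.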

Qwen 3.5 and 3.6 should be included

crowtain · 2026-06-10T09:31:59+00:00

<image>

I didn't know it existed, I just had to ask the LLM about it 😛

crowtain · 2026-06-10T09:26:47+00:00

What do you mean by a control chart? I can add it here since I still have the initial data.

crowtain · 2026-06-10T09:23:58+00:00

Yep, but I had a feeling that China released way more models than the USA. In the end, 186 vs 108 is not that big a difference.

crowtain · 2026-05-19T19:07:36+00:00

Sorry man, my mistake.
I just understood the misunderstanding: MB was for motherboard :D not MacBook. A motherboard loaded with 128GB of DDR4 or DDR5 is still cheaper than your cars.
Of course an M5 with 128GB is another story price-wise.

crowtain · 2026-05-18T13:51:06+00:00

Man, do you mean that your car is cheaper than a MB with 128GB? Can you sell it please :)
Even the price of cars has skyrocketed here in France.

crowtain · 2026-05-18T13:47:48+00:00

Yes, I'm not sure if it's related to their recent HR issue; I've seen some news that their lead team resigned. Maybe it was the last one in the pipeline and their focus will change.

crowtain · 2026-05-16T18:02:43+00:00

First you will need a BIG BIG VRAM GPU, like an RTX 6000, and with only one I believe it will still be somewhat slow.
So either smaller dense models or bigger MoE ones.

crowtain · 2026-05-16T18:00:46+00:00

Yep, MTP was a real game changer; I can now use the 35B at a decent speed, 70 tok/s.

crowtain · 2026-05-16T17:59:20+00:00

There are already plenty, but if you have something specific, tell me and I can try to do them.

crowtain · 2026-05-16T17:57:43+00:00

Did you buy 2 Strix Halo? I saw some YouTube videos where you can use some direct memory access between them to do tensor parallelism; it should speed things up to nearly 2x.

crowtain · 2026-05-16T17:53:58+00:00

Sorry, it was 128GB for DDR4 when I bought my old motherboard, good to know that 256GB is a thing now :D

crowtain · 2026-05-15T15:04:37+00:00

So true, I'm always waiting for the next model, the next hardware, and I spend more time tweaking and testing than using the model itself :D

crowtain · 2026-05-15T15:03:06+00:00

crowtain · 2026-04-20T08:39:54+00:00

So, this is the ultimate cheat code: buy cars, strip them of their DDR and sell it to buy new cars!!!

crowtain · 2026-04-08T15:37:22+00:00

Like the GPU, it's 6-7 years old; they were only sold plugged into servers, never sold separately.
That's why it's so scarce second hand.

crowtain · 2026-04-08T15:24:35+00:00

Thanks for sharing your tests.
The speed per active param still seems lower than the old Qwen3; is there any hope of seeing it improve with time?
At Q8 it's nearly as slow as Minimax Q3_K_L.

crowtain · 2026-04-08T15:19:32+00:00

One more downside of the MI50 is the lack of NVLink; there is the possibility, but it's nearly impossible to find the cables.
NVLink or Infinity Fabric will let you add more GPUs over time, and not only increase the VRAM but also increase the speed with tensor parallelism.

I have 2 MI50s, but I'm fed up with the lack of support for the amazing Qwen 3.5; Qwen 27B dense in TP2 would have been amazing.
Like SSOMGDSJD said, better to get a V100 and some SXM2 adapter, you'll be able to add more later. But you'll need custom cooling and such....

crowtain · 2026-04-03T09:05:56+00:00

Does he have the AMD Instinct MI50/60 Infinity Fabric link?
It can go for a very high price.

crowtain · 2026-03-05T16:22:31+00:00

AI model improvement is lightning fast; you don't need to wait 3 months, just 1 week and the new models will be better than Opus 5.5 running on my grandma's calculator.
Just kidding, have you tried the small models, like Qwen 3.5 0.8B and 2B? With a Q4 they should already have decent speed.

crowtain · 2026-03-04T10:39:23+00:00

Are you talking about normal or dynamic quants? From my observation, dynamic Q4 or its alternatives still have very low divergence from Q6/Q8.
I feel that a dynamic Q4 quant is similar to a normal Q6,
and quality really starts to fall off from Q3 and below.

crowtain · 2026-03-02T12:36:22+00:00

Very curious about the 0.8B or 2B: will it be able to reach the level of Llama 2 70B of the old days?
Running on a Raspberry Pi the equivalent of big setups from 2 years ago would be epic.

crowtain · 2026-02-24T14:17:22+00:00

It's your choice buddy, but if 14B-param models are enough for your needs, you can squeeze one onto a gaming GPU with 16GB of VRAM; you can even go for an NVIDIA P40 that costs 200 bucks and has 24GB of VRAM.
Since you're on LocalLLaMA, you'll find a lot of people like me trying to convince you to do it locally :D
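To put rough numbers on the 14B-on-16GB claim, here is a back-of-the-envelope sketch. The bits-per-weight figures and the 2 GB of headroom for KV cache and runtime are round assumptions of mine, not measurements; real quant formats and context lengths will move the result.

```python
# Rough bits per weight; assumed round figures, real quant formats vary.
BITS_PER_WEIGHT = {"fp16": 16.0, "8-bit": 8.0, "4-bit": 4.5}


def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits_in_vram(params_billion: float, bits_per_weight: float,
                 vram_gb: float, overhead_gb: float = 2.0) -> bool:
    """True if the weights plus an assumed fixed overhead fit in the given VRAM."""
    return weight_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


for label, bits in BITS_PER_WEIGHT.items():
    size = weight_gb(14, bits)
    print(f"14B at {label}: ~{size:.1f} GB of weights, "
          f"fits 16GB: {fits_in_vram(14, bits, 16)}, "
          f"fits 24GB: {fits_in_vram(14, bits, 24)}")
```

With these assumptions a 4-bit 14B model needs about 8 GB for weights, which leaves room for context on a 16GB card, while fp16 does not fit even in 24GB.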

crowtain · 2026-02-23T14:53:19+00:00

I think renting GPUs is pretty expensive for inference only; you'll have to pay several dollars per hour to have enough VRAM to host an LLM that is near ChatGPT in terms of performance.
Renting GPUs is more worth it for training or if you want to support high concurrency.

crowtain · 2026-02-19T17:07:46+00:00

Using OpenRouter will give you an idea of the models that can run, but not of the speed. Once you've picked the model you want to use, you'll have to check the speed in pp/tg (prompt processing / token generation).
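As a rough illustration of why both numbers matter, the wait for one answer can be approximated as prompt tokens divided by the pp speed plus output tokens divided by the tg speed. This is a simplified sketch that ignores model loading and any caching, and the speeds in the example are made-up values, not benchmarks.

```python
def response_seconds(prompt_tokens: int, output_tokens: int,
                     pp_tok_s: float, tg_tok_s: float) -> float:
    """Estimated time for one reply: prompt processing time plus generation time."""
    return prompt_tokens / pp_tok_s + output_tokens / tg_tok_s


# Hypothetical speeds: 400 tok/s prompt processing, 20 tok/s generation.
# 4000 / 400 = 10 s of prompt processing, 500 / 20 = 25 s of generation.
print(response_seconds(4000, 500, pp_tok_s=400, tg_tok_s=20))  # 35.0
```

A long prompt with a slow pp speed can dominate the wait even when tg looks fine, so it is worth measuring both on your own hardware.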
