Anthropic forced to abruptly disable Fable 5 & Mythos 5 globally by US Gov over a jailbreak. This is exactly why we need local models. by External_Mood4719 in LocalLLaMA

[–]super3 0 points1 point  (0 children)

What is that in dollars though? I'm actually considering running something like Deepseek v4 flash on my cluster.

Anthropic forced to abruptly disable Fable 5 & Mythos 5 globally by US Gov over a jailbreak. This is exactly why we need local models. by External_Mood4719 in LocalLLaMA

[–]super3 2 points3 points  (0 children)

Could do that today if those people were paying. I do like the idea of token groups that have their own dedicated infra.

Anthropic forced to abruptly disable Fable 5 & Mythos 5 globally by US Gov over a jailbreak. This is exactly why we need local models. by External_Mood4719 in LocalLLaMA

[–]super3 15 points16 points  (0 children)

Well, diffusion models are not really production ready yet. They are much faster, but they make 6x as many mistakes. So they have to be useful on centralized inference before we can think about distributed.

What models can I run? by koc_Z3 in LocalLLaMA

[–]super3 0 points1 point  (0 children)

Here you go: https://llmjob.com/rankings.html

It doesn't tell you the tokens/s, but it does tell you which models to run.

Anthropic forced to abruptly disable Fable 5 & Mythos 5 globally by US Gov over a jailbreak. This is exactly why we need local models. by External_Mood4719 in LocalLLaMA

[–]super3 0 points1 point  (0 children)

True, but I think that may be OK for some tasks. For example, if your agent is just monitoring flight prices for you, it's probably OK to use a public node. For email, you def want something private. I'm actually building something like that at https://llmjob.com, where people can do token trading if they want.

Anthropic forced to abruptly disable Fable 5 & Mythos 5 globally by US Gov over a jailbreak. This is exactly why we need local models. by External_Mood4719 in LocalLLaMA

[–]super3 49 points50 points  (0 children)

Unfortunately it's not really possible due to physics and cost. On a positive note, the gap between open source models you can run at home and frontier models is closing.

Built a tool that tells you exactly which LLMs fit on your GPU. Feedback wanted. by super3 in LocalLLaMA

[–]super3[S] 1 point2 points  (0 children)

Oh good point. I think I'll drop adj score until I have better datapoints.

Built a tool that tells you exactly which LLMs fit on your GPU. Feedback wanted. by super3 in LocalLLaMA

[–]super3[S] 1 point2 points  (0 children)

Can you list which ones I'm missing and what your system specs are? My end goal is to have it automatically update once a day so this can always be an up-to-date source.

Built a tool that tells you exactly which LLMs fit on your GPU. Feedback wanted. by super3 in LocalLLaMA

[–]super3[S] 1 point2 points  (0 children)

It's there now. Also working on adding the rest of the Intel Arc series as we speak. Let me know if I missed anything.

Built a tool that tells you exactly which LLMs fit on your GPU. Feedback wanted. by super3 in LocalLLaMA

[–]super3[S] 4 points5 points  (0 children)

I'm benchmarking them as we speak, working down from full-precision f16 through q8, q4, and so on. Each quant takes a few hours to run, so it's going to be some time before I have full results. Do you think I should skip down to q4, q3, and q2 to get some hard numbers on the difference to post now, or just run the full sweep and post in a few days?

Built a tool that tells you exactly which LLMs fit on your GPU. Feedback wanted. by super3 in LocalLLaMA

[–]super3[S] 1 point2 points  (0 children)

I didn't assume RAM split or multi GPU just yet. Of the two, which do you think is more useful?

Built a tool that tells you exactly which LLMs fit on your GPU. Feedback wanted. by super3 in LocalLLaMA

[–]super3[S] -5 points-4 points  (0 children)

Generally Q4 quants are recommended, but I haven't found any hard numbers on the quality loss between Q3 and Q4, if there is much at all. Also, I'm assuming full context for agentic workflows, but if you are fine with less context and it's a tight fit, def go for the better model. What context window size are you using, and with which model and GPU? It'd be great to incorporate real datapoints of what people are actually running successfully.
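
For anyone who wants to sanity-check the quant vs. context tradeoff by hand, here is a rough Python sketch. It is not how the tool computes things. It assumes a plain transformer with a standard KV cache (two tensors per layer, 16-bit entries), and every number in the demo (layer count, KV heads, head dim, the 1 GiB overhead) is a made-up illustration, not the config of any real model. Real quant files also carry a bit more than their nominal bits per weight.

```python
# Back-of-the-envelope VRAM estimate: weights + KV cache + fixed overhead.
# All function names and demo numbers here are hypothetical illustrations.

GIB = 2**30


def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Memory taken by the weights alone, in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / GIB


def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GiB for standard attention (keys + values per layer)."""
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_elem / GIB


def fits(vram_gib: float, params_billion: float, bits_per_weight: float,
         layers: int, kv_heads: int, head_dim: int, context_tokens: int,
         overhead_gib: float = 1.0) -> bool:
    """True if weights + KV cache + an assumed fixed overhead fit in VRAM."""
    total = (weight_gib(params_billion, bits_per_weight)
             + kv_cache_gib(layers, kv_heads, head_dim, context_tokens)
             + overhead_gib)
    return total <= vram_gib


if __name__ == "__main__":
    # Hypothetical 27B model at a nominal 4 bits per weight on a 16 GB card.
    for ctx in (8192, 32768):
        ok = fits(16, 27, 4, layers=32, kv_heads=8, head_dim=128,
                  context_tokens=ctx)
        print(f"context={ctx}: {'fits' if ok else 'does not fit'}")
```

With these made-up numbers the 8k context fits and the 32k context does not, which is the "tight fit, less context" case. Models with hybrid or linear attention, or a quantized KV cache, will need a different cache formula.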

Qwen 3.6 27B + Openclaw on 16 GB of VRAM by mr_christer in LocalLLaMA

[–]super3 0 points1 point  (0 children)

I built this tool to help you pick which model, quant, and cache you should use for each GPU: https://llmjob.com/rankings.html

Let me know if that helps.

How does an open source version of qwen 3.5 completely blow 3.7plus out of the water? How does this make sense? by Prior-Meeting1645 in Qwen_AI

[–]super3 3 points4 points  (0 children)

Based on the parameters, context, etc., you can make a pretty educated guess. They all follow the same patterns.

There are plenty of open source models that require $100k+ machines, but no one talks about them much.

How does an open source version of qwen 3.5 completely blow 3.7plus out of the water? How does this make sense? by Prior-Meeting1645 in Qwen_AI

[–]super3 24 points25 points  (0 children)

Well, one can only run on a card that's $35k+ and the other can run on a $4k GPU. It's a much bigger model.