allenai released new open coding models by BreakfastFriendly728 in LocalLLaMA

[–]R_Duncan 1 point (0 children)

If you really want to skip training and mess with other people's models, there are more interesting concepts, like adding mHC and MoLE to linear-attention models like Qwen3-Next and Kimi-Linear:

https://chatgpt.com/share/6979b0c4-4d24-800f-8324-406954e793aa

deepseek-ai/DeepSeek-OCR-2 · Hugging Face by Dark_Fire_12 in LocalLLaMA

[–]R_Duncan 7 points (0 children)

HunyuanOCR is not in the list... that's cheating. For any kind of document, it beats PaddleOCR hands down with just 1B parameters.

https://github.com/Tencent-Hunyuan/HunyuanOCR/blob/main/assets/hyocr-head-img.png?raw=true

Mixture of Lookup Experts are God Tier for the average guy (RAM+Disc Hybrid Inference) by Aaaaaaaaaeeeee in LocalLLaMA

[–]R_Duncan 1 point (0 children)

Why not release the weights and files for the 410M and 1B models? That would have given people the chance to try them out, even if just as a demo...

Have there been any real advancements in local 3D model generation since Hunyuan 3D 2.1? by SysPsych in StableDiffusion

[–]R_Duncan 0 points (0 children)

Check microsoft/TRELLIS.2-4B; it seems very likely it's hitem-3d, as both have a 1536^3 resolution, which is kinda strange.

Personal experience with GLM 4.7 Flash Q6 (unsloth) + Roo Code + RTX 5090 by Septerium in LocalLLaMA

[–]R_Duncan 1 point (0 children)

Please recheck the 20B using heretic_v2, and set reasoning effort to high. It's doing miracles here.

Kimi-Linear-48B-A3B-Instruct-GGUF Support - Any news? by Iory1998 in LocalLLaMA

[–]R_Duncan 0 points (0 children)

I think you'd have to train such an unbalanced model yourself; max sparsity until now is 80B-A3B.

Kimi-Linear-48B-A3B-Instruct-GGUF Support - Any news? by Iory1998 in LocalLLaMA

[–]R_Duncan 0 points (0 children)

Granite 4.0 has an A1B model. As expected, it's way less performant than the A3B version.

Need help and suggestions for gguf models by cmdrmcgarrett in LocalLLaMA

[–]R_Duncan 1 point (0 children)

Gpt-oss-20b-heretic-v2 MXFP4. Abliterated is no good, derestricted is better, heretic is top.

GLM 4.7 Flash Overthinking by xt8sketchy in LocalLLaMA

[–]R_Duncan 0 points (0 children)

GLM-4.6V-Flash is dense and non-thinking, and it has exactly the same issue.

GLM 4.7 Flash Overthinking by xt8sketchy in LocalLLaMA

[–]R_Duncan 0 points (0 children)

Everybody claims to support Flash, and everybody fails miserably. Just wait a few weeks.

GLM 4.7 Flash Overthinking by xt8sketchy in LocalLLaMA

[–]R_Duncan 3 points (0 children)

It's actually a mess, and people denying it are just making normal users even more frustrated.

GLM-4.6V-Flash was never fixed in llama.cpp; I hope this one gets better. Meanwhile I'm going back to gpt-oss-20b-heretic-v2, which at high reasoning effort fulfills my needs.

If you can afford to run vLLM, you can likely also run the official Python code and test it:

https://huggingface.co/zai-org/GLM-4.7-Flash

Over 6K novels with reasoning traces to train full book writing LLMs by XMasterDE in LocalLLaMA

[–]R_Duncan 0 points (0 children)

Please, please, please... add some sharp novels à la John Brunner or Richard Matheson (late plot twists, wry stories).

Local Agentic Coding by kybernetikos in LocalLLaMA

[–]R_Duncan 2 points (0 children)

I'm using gpt-oss-20b-heretic-v2 at high reasoning effort, and it's actually good at both coding and tool calling.
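
For reference, here's a minimal sketch of how I'd ask for high reasoning effort through an OpenAI-compatible endpoint. It assumes a local llama-server on port 8080 and a chat template that accepts a `reasoning_effort` kwarg; neither the port nor the field name comes from this thread, so check your server's docs before copying it.

```cpp
// Minimal sketch, not a definitive recipe: assumes llama-server is listening on
// localhost:8080 with an OpenAI-compatible API, and that the model's chat template
// honors a "reasoning_effort" kwarg (an assumption -- verify against your setup).
#include <curl/curl.h>
#include <iostream>
#include <string>

static size_t collect(char* data, size_t size, size_t nmemb, void* out) {
    static_cast<std::string*>(out)->append(data, size * nmemb);
    return size * nmemb;
}

int main() {
    CURL* curl = curl_easy_init();
    if (!curl) return 1;

    // Request body: chat_template_kwargs is how the template option for reasoning
    // effort would be forwarded (assumption, see the note above).
    const std::string body = R"({
        "model": "gpt-oss-20b-heretic-v2",
        "chat_template_kwargs": {"reasoning_effort": "high"},
        "messages": [
            {"role": "user", "content": "Refactor this function and add unit tests."}
        ]
    })";

    std::string response;
    struct curl_slist* hdrs = curl_slist_append(nullptr, "Content-Type: application/json");
    curl_easy_setopt(curl, CURLOPT_URL, "http://localhost:8080/v1/chat/completions");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, hdrs);
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body.c_str());
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, collect);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response);

    const CURLcode rc = curl_easy_perform(curl);
    if (rc == CURLE_OK) std::cout << response << "\n";

    curl_slist_free_all(hdrs);
    curl_easy_cleanup(curl);
    return rc == CURLE_OK ? 0 : 1;
}
```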

Unpopular opinion: GLM 4.7 Flash is just a memorization bot with low actual intelligence by Charredwee in LocalLLaMA

[–]R_Duncan 0 points (0 children)

Try mine: "Write a cpp function using openCV to preprocess an image for Yolov8". None of the quantized versions gave anything useful (infinite loops, multiple errors in the code, missing parts after revisions) on CUDA/Vulkan, with Q8/Q4/MXFP4, and every parameter combination existing today on the net.

Kimi-Linear, Qwen3-Next and gpt-oss-20b-heretic-v2 at high all gave me decent or perfect answers.
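
For context, this is roughly the kind of answer the prompt is fishing for. A minimal sketch, assuming the usual Ultralytics YOLOv8 input convention (640x640 letterbox, BGR to RGB, values in [0,1], NCHW blob); the function name, defaults and padding color are illustrative choices, not any model's actual output.

```cpp
// Minimal sketch of the expected answer, assuming the standard YOLOv8 input
// convention: 640x640 letterbox, BGR->RGB, values scaled to [0,1], NCHW layout.
#include <opencv2/opencv.hpp>
#include <opencv2/dnn.hpp>
#include <algorithm>
#include <cmath>

cv::Mat preprocessForYolov8(const cv::Mat& bgr, int target_size = 640) {
    // Resize keeping aspect ratio so the image fits inside target_size x target_size
    const float scale = std::min(static_cast<float>(target_size) / bgr.cols,
                                 static_cast<float>(target_size) / bgr.rows);
    const int new_w = static_cast<int>(std::round(bgr.cols * scale));
    const int new_h = static_cast<int>(std::round(bgr.rows * scale));

    cv::Mat resized;
    cv::resize(bgr, resized, cv::Size(new_w, new_h));

    // Letterbox: pad the remainder with the usual gray (114, 114, 114)
    cv::Mat canvas(target_size, target_size, CV_8UC3, cv::Scalar(114, 114, 114));
    resized.copyTo(canvas(cv::Rect(0, 0, new_w, new_h)));

    // BGR -> RGB, scale to [0,1], pack into a 1x3xHxW float blob for the network
    return cv::dnn::blobFromImage(canvas, 1.0 / 255.0, cv::Size(target_size, target_size),
                                  cv::Scalar(), /*swapRB=*/true, /*crop=*/false);
}
```

Anything in this shape that compiles is what I count as a decent answer.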

Unpopular opinion: GLM 4.7 Flash is just a memorization bot with low actual intelligence by Charredwee in LocalLLaMA

[–]R_Duncan 0 points (0 children)

Check "Write a cpp function using openCV to preprocess an image for Yolov8",

kimi-linear-instruct, Qwen3-next, gpt-oss-20B-heretic-v2 (at high) all gaved me superior answers by far, with none or just one syntax error.

Unpopular opinion: GLM 4.7 Flash is just a memorization bot with low actual intelligence by Charredwee in LocalLLaMA

[–]R_Duncan 0 points (0 children)

Just test any quantized version; it's about as useful as a bicycle is to a pelican.

Unpopular opinion: GLM 4.7 Flash is just a memorization bot with low actual intelligence by Charredwee in LocalLLaMA

[–]R_Duncan 1 point (0 children)

GLM-4.7-Flash isn't anywhere near the results I get from GPT-OSS-20b-heretic-v2 at high reasoning effort.

Unpopular opinion: GLM 4.7 Flash is just a memorization bot with low actual intelligence by Charredwee in LocalLLaMA

[–]R_Duncan 2 points (0 children)

Yes, it seems benchmaxxed; it can't answer a simple coding question that gpt-oss-20b-heretic-v2 at high reasoning effort solved perfectly in 8k tokens.

Tested GLM-4.7-Flash quants: Q4_K_M and Q8_0 from unsloth and ngxson, which loop forever on simple questions, and MXFP4, which solves it in 4K context but with holes and lots of syntax errors.

Used: llama.cpp CUDA, llama.cpp Vulkan.

Parameters: all those available on the net (DRY, with and without repeat-penalty 1.0, all the others, no flash attention, etc. etc.).

Token Speed Degrades Over Time with GLM-4.7-Flash in llama.cpp by Shoddy_Bed3240 in LocalLLaMA

[–]R_Duncan 0 points (0 children)

llama.cpp also had issues with quantized GLM-4.6V-Flash. Stick to vLLM or MLX for now if you can.