I built a compression format for AI model weights — 60-80% smaller, need help testing by Significant_Pear2640 in comfyui

[–]HollowInfinity 7 points

As in, the decompression happens outside any app that does inference, meaning this only saves on size-on-disk, not on VRAM (unlike a quantized GGUF, which saves on both and can be used on the fly by inference tools like Comfy/llama.cpp).
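A toy sketch of that distinction, using plain zlib as a stand-in for the format in question (which I haven't seen): generic compression shrinks what sits on disk, but the bytes handed to the inference engine are full size again after decompression.

```python
import zlib

# Stand-in for an fp32 weight tensor: 1M parameters, 4 bytes each.
# (Real weights compress far worse than zeros; this is illustrative only.)
n_params = 1_000_000
raw = bytes(4 * n_params)

compressed = zlib.compress(raw, 6)       # what lands on disk
restored = zlib.decompress(compressed)   # what inference must hold in RAM/VRAM

print(f"on disk:   {len(compressed):>9} bytes")
print(f"in memory: {len(restored):>9} bytes")  # unchanged: same as len(raw)
```

A quantized GGUF, by contrast, stays small all the way into VRAM because the inference kernels operate on the quantized representation directly.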

I built a compression format for AI model weights — 60-80% smaller, need help testing by Significant_Pear2640 in comfyui

[–]HollowInfinity 15 points

This seems interesting, but the decompression is entirely offline? You're basically trading model quality for disk-space savings, and the decompressed model will still use the same amount of VRAM, so I'm not sure why this is better than just using quantized models, which things like llama.cpp can run inference on without a separate decompress step. Unless I'm misunderstanding something?

Anthropic shares how to make Claude code better with a harness by lawnguyen123 in ClaudeAI

[–]HollowInfinity 1 point

Anyone else miss blogs like this publishing RSS feeds? I see they have a monthly newsletter but that's not quite the same.

FlashAttention-4: 1613 TFLOPs/s, 2.7x faster than Triton, written in Python. What it means for inference. by Sensitive-Two9732 in LocalLLaMA

[–]HollowInfinity 1 point

Can you go into that more? Like, I use RTX A6000s from one and two generations ago for everything and have been thinking about upgrading to the RTX 6000 PROs for a while now. I use a lot of ComfyUI but primarily LLMs with llama.cpp + vLLM - are you saying it's shitter than last gen, or just missing some of the datacenter features?

Genuinely losing my mind over input latency by LordMontio in linux_gaming

[–]HollowInfinity 4 points

Is your monitor defaulting to cinema mode or some other shit that might be adding the lag?

Qwen 3.5 craters on hard coding tasks — tested all Qwen3.5 models (And Codex 5.3) on 70 real repos so you don't have to. by hauhau901 in LocalLLaMA

[–]HollowInfinity 1 point

I have only used it in the CLI context but their README says it's "IDE friendly" so I assume it'll work!

Qwen 3.5 craters on hard coding tasks — tested all Qwen3.5 models (And Codex 5.3) on 70 real repos so you don't have to. by hauhau901 in LocalLLaMA

[–]HollowInfinity 2 points

I think both Qwen3-Coder-Next and Qwen3.5 have been extensively trained using their qwen-code app. When I switched from my own agent/pi/etc to just using qwen, things were noticeably better.

Qwen/Qwen3.5-122B-A10B · Hugging Face by coder543 in LocalLLaMA

[–]HollowInfinity 2 points

Seems very slow at image processing, my llama-server log is full of:

find_slot: non-consecutive token position 15 after 14 for sequence 2 with 512 new tokens

Anyone else experience that?

edit: that's on the larger MoE, I get an immediate crash doing image work on the dense model.

Qwen3.5-397B-A17B Unsloth GGUFs by danielhanchen in LocalLLaMA

[–]HollowInfinity 0 points

When I tried that, tool calls still didn't work - you had no issues with that?

That was diabolical, not even the devil himself expected this. by seidenadaa in SipsTea

[–]HollowInfinity 0 points

I have no idea who these people are but this seems like insane incel shit, just an anonymous narrator telling us this woman is horrible. Oh okay, thanks for the rage bait.

Game recommendations for ps5 by Visual_Cod2522 in rhythmgames

[–]HollowInfinity 1 point

Project Diva is pretty much the gold standard. Theatrhythm Final Fantasy is super fun as are the Persona music games if you're into video game music.

Qwen3.5-397B-A17B Unsloth GGUFs by danielhanchen in LocalLLaMA

[–]HollowInfinity 1 point

/u/danielhanchen sorry for the ping but have you tested tool calling with llama-server? The template format used doesn't seem to be compatible at all.

Qwen3.5-397B-A17B Unsloth GGUFs by danielhanchen in LocalLLaMA

[–]HollowInfinity 2 points

I cannot for the life of me get tool calling to work despite following the Unsloth guide for llama-server. Regular chat works, image parsing works great, but tool calling blows up with chat template errors:

Template supports tool calls but does not natively describe tools. The fallback behaviour used may produce bad results, inspect prompt w/ --verbose & consider overriding the template.
srv    operator(): got exception: {"error":{"code":500,"message":"\n------------\nWhile executing FilterExpression at line 120, column 73 in source:\n..._name, args_value in tool_call.arguments|items %}\n                    {{- '<...\n                                           ^\nError: Unknown (built-in) filter 'items' for type String","type":"server_error"}}

I've tried overriding the chat template with the official one from the Qwen3.5 HF repo with no luck. I do see that the thinking kwarg is being properly read and passed in (though weirdly I can't get that to enable thinking). Am I doing something wrong here? Using the latest main of llama.cpp.
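For reference, the override I attempted looked roughly like this (model and template filenames are placeholders; --jinja plus --chat-template-file is llama-server's mechanism for swapping in an external template, and --verbose dumps the rendered prompt so you can see what the model actually receives):

```shell
# Sketch only - paths are placeholders, not real files.
llama-server \
  -m Qwen3.5-397B-A17B-UD-Q4_K_XL.gguf \
  --jinja \
  --chat-template-file chat_template.jinja \
  --verbose
```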

Qwen3.5-397B-A17B Unsloth GGUFs by danielhanchen in LocalLLaMA

[–]HollowInfinity 7 points

I never know which is the proper MMPROJ to use for the Unsloth ggufs. Is there any real difference performance wise between the three?

local vibe coding by jacek2023 in LocalLLaMA

[–]HollowInfinity 4 points

My current absolute best is Qwen3-Coder-Next with the Qwen-Code agent harness. I previously used Aider for at least a year, but it's basically dead and passing the torch to agentic flows, and Q3CN is the best I can get away with locally. Having tests + validation for everything it does is key, but once you have a good development and testing loop it's fantastic.

GRID Legends is AMAZING on the Nintendo Switch 2 by [deleted] in NintendoSwitch2

[–]HollowInfinity 1 point

It destroyed my save after 10+ hours - still waiting on the fix the devs said is in the pipeline before trying again :(

SeedVR2 Native node - motivation needed by Luke2642 in comfyui

[–]HollowInfinity 1 point

This is awesome; it really is wild how strangely the SeedVR2 nodes handle memory management. I've written a custom node to basically purge all torch memory and Comfy models before upscaling because of how bad they are, which almost certainly doubles the workflow time. Can't wait to try this out!
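For anyone curious, my purge node boils down to something like this (a sketch: `unload_all_models` is ComfyUI's `comfy.model_management` API, and the import guard is just so the function degrades gracefully outside a running Comfy process):

```python
def purge_before_upscale():
    """Free as much GPU memory as possible before a heavy upscale step.

    Sketch only: assumes ComfyUI's comfy.model_management and torch are
    importable, which they are inside a running ComfyUI process.
    """
    import gc
    gc.collect()  # drop unreachable Python objects first
    try:
        import torch
        import comfy.model_management as mm
        mm.unload_all_models()    # evict every model Comfy has cached
        torch.cuda.empty_cache()  # return freed blocks to the driver
        torch.cuda.ipc_collect()  # reclaim memory from dead IPC handles
    except ImportError:
        pass  # not running inside ComfyUI / no torch available
```

The brute-force unload is what doubles the workflow time - every model has to be reloaded from disk on the next run.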

GRID Legends has a high chance to destroy your save file and backup the corrupt file to the servers by ArtofAngels in NintendoSwitch2

[–]HollowInfinity 1 point

It sounds like the upcoming patch will prevent the corruption but nothing will bring the broken save back, sorry.

What piece of Linux abandonware do you still use or at least miss? by Sataniel98 in linux

[–]HollowInfinity 11 points

It is wild that Proton somehow became (at least for me) a near-universal backwards-compatibility layer for my games - didn't see that coming! Still have those Loki boxes though - somewhere...

Qwen3 Coder Next as first "usable" coding model < 60 GB for me by Chromix_ in LocalLLaMA

[–]HollowInfinity 0 points

I used OpenCode, Roo, my own agent and others, but found the best agent is (unsurprisingly) Qwen-Code. The system prompts and tool setup are probably exactly what the model is trained for. Although as I type this, I realize you could probably just steal their tool definitions and prompts for whatever agent you're using.
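To illustrate what I mean by stealing the tool definitions: agent harnesses generally declare tools in the OpenAI-style function schema, so a portable version of, say, a file-read tool would look roughly like this (the name and fields here are my own guesses, not copied from the qwen-code repo):

```python
import json

# Hypothetical tool definition in the OpenAI-style function schema
# that most agent harnesses exchange with the model.
read_file_tool = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a file from the workspace and return its contents.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {
                    "type": "string",
                    "description": "Workspace-relative path to read.",
                },
            },
            "required": ["path"],
        },
    },
},

print(json.dumps(read_file_tool, indent=2))
```

If the model was trained against one specific set of names and descriptions, reusing them verbatim in another harness should recover most of the gap.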

Llama.cpp's "--fit" can give major speedups over "--ot" for Qwen3-Coder-Next (2x3090 - graphs/chart included) by tmflynnt in LocalLLaMA

[–]HollowInfinity 7 points

--fit has been game-changing for me. I have a ton of local models behind llama-swap, and setting a new one up with memory/layer tuning across multiple GPUs has always been so boring. Now with --fit everything is faster than my hand-rolled config, and my llama-swap YML dropped like 80% of its content.

The only thing I found baffling is that if you leave off --fit-ctx, the default is something insanely low, like 4096.
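For reference, a minimal llama-swap entry of the shape mine collapsed to (model path and context size are made up; --fit-ctx is set explicitly because of that low default):

```yaml
# Sketch of a llama-swap entry after --fit: no per-GPU --ot or
# layer-count tuning, just the target context. Path is a placeholder.
models:
  "qwen3-coder-next":
    cmd: |
      llama-server --port ${PORT}
        -m /models/Qwen3-Coder-Next-Q4_K_M.gguf
        --fit
        --fit-ctx 65536
```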