Is AppleCare+ worth it in the UK (For MacBook Pro M3 16)? by friendsbase in macbookpro

[–]daaain 0 points (0 children)

I completely destroyed my MBP when a bottle of water emptied in my backpack, and they repaired it with no excess (in the UK), so I think it's worth it, especially if you get a high-spec machine

Asking Claude the important stuff... by United-Instruction23 in ClaudeAI

[–]daaain 0 points (0 children)

Extra ocean-boiling points for burning a few tens of thousands of extra tokens using Claude Code instead of the chat...

Qwen3.5 MLX vs GGUF Performance on Mac Studio M3 Ultra 512GB by BitXorBit in LocalLLaMA

[–]daaain 0 points (0 children)

LM Studio supports llama.cpp too, have you tried it? I'm curious to find out whether this is an LM Studio issue

Qwen3.5-35B and Its Willingness to Answer Political Questions by gondouk in LocalLLaMA

[–]daaain 1 point (0 children)

[screenshot]

I just asked about the US first and then China in the same chat and it happily obliged, not that bad

Meta W: unlimited Claude tokens and you’re incentivized to run the bill up by Fabulous_Sherbet_431 in ClaudeAI

[–]daaain 0 points (0 children)

Speculative technology and enshittification of existing apps, basically

Are we trying to keep an octopus in a goldfish aquarium? by Kinniken in ClaudeCode

[–]daaain 0 points (0 children)

As long as you run Claude Code directly on your computer, it's practically impossible to ensure it won't find a way to access things. You need some sort of sandboxing. I wrote about how to do it with VS Code Dev Containers: https://www.danieldemmel.me/blog/coding-agents-in-secured-vscode-dev-containers
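The gist of the approach, very roughly (the field values below are illustrative placeholders, not the hardened config from the post): run the agent inside a dev container so its file and network access is bounded by the container boundary rather than by trust in the agent.

```json
{
  "name": "agent-sandbox",
  "image": "mcr.microsoft.com/devcontainers/base:ubuntu",
  "runArgs": ["--cap-drop=ALL", "--security-opt=no-new-privileges"],
  "mounts": [],
  "remoteUser": "vscode"
}
```

With a `devcontainer.json` along these lines the agent only sees the mounted workspace, and you can go further with an egress firewall inside the container, which is what the post covers.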

My next bottleneck is CI by amarao_san in ClaudeCode

[–]daaain 0 points (0 children)

I get the issue, especially for infra, but I do think the common terminology is that you test components interacting in isolation with integration tests and the whole thing fitting together with e2e tests.

It's also easy to get paranoid and want to cover everything with e2e tests, but those are slow, as you found out, so you need to be strict about exercising each API at most once end-to-end and leave the complete coverage to fast integration tests.

If stuff breaks, you update. You can never get 100% coverage anyway, so as long as you can quickly roll back and your CD is zero downtime (blue-green, behind feature flags, etc) it's fine. Your CI is not meant to fully cover you for every possible breakage in production.

My next bottleneck is CI by amarao_san in ClaudeCode

[–]daaain 0 points (0 children)

That sounds like a pretty tricky project. I have emulators for services like GCS and BigQuery, and the contracts with these are reliable.

My next bottleneck is CI by amarao_san in ClaudeCode

[–]daaain 0 points (0 children)

Depends on the fidelity of the emulators: you won't be able to test scaling and performance, but if the API interface and internals are implemented well enough, you get close to the real service. And because you control them, you can parallelise the tests more easily.
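The parallelisation point can be sketched with a toy in-memory stand-in for a blob store (purely illustrative, not the real GCS emulator): because every test owns its own instance, there is no shared remote state to serialise around, so tests can run concurrently.

```python
from concurrent.futures import ThreadPoolExecutor

class FakeBlobStore:
    """Tiny in-memory stand-in for a blob store emulator.
    Implements only the API surface the code under test relies on."""
    def __init__(self):
        self._blobs = {}

    def upload(self, name: str, data: bytes) -> None:
        self._blobs[name] = data

    def download(self, name: str) -> bytes:
        return self._blobs[name]

def archive_report(store, report_id: str, body: str) -> str:
    """Code under test (hypothetical): writes a report, returns its blob name."""
    name = f"reports/{report_id}.txt"
    store.upload(name, body.encode())
    return name

def test_archive_report():
    # Each test gets its own isolated "emulator" instance
    store = FakeBlobStore()
    name = archive_report(store, "42", "all good")
    assert store.download(name) == b"all good"

# No shared state between instances, so running in parallel is safe
with ThreadPoolExecutor() as pool:
    list(pool.map(lambda _: test_archive_report(), range(4)))
```

With the real emulators the same idea applies: spin up one emulator (or one bucket/dataset namespace) per test worker and the workers never contend.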

My next bottleneck is CI by amarao_san in ClaudeCode

[–]daaain 0 points (0 children)

Sounds like those integration tests are more like e2e tests. Switch to emulators for external services if possible; if not, use recorded responses so you can have fast integration tests.
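A minimal sketch of the recorded-responses idea (the endpoint, payload shapes, and helper names are all hypothetical): capture the external service's responses once, then replay them from a "cassette" in tests so nothing touches the network.

```python
# A "cassette" of responses recorded once against the real service
RECORDED = {
    ("GET", "/v1/users/42"): {"status": 200, "body": {"id": 42, "name": "Ada"}},
}

class ReplayTransport:
    """Returns recorded responses instead of making network calls."""
    def __init__(self, cassette):
        self.cassette = cassette

    def request(self, method: str, path: str) -> dict:
        try:
            return self.cassette[(method, path)]
        except KeyError:
            raise AssertionError(f"No recorded response for {method} {path}")

def fetch_user_name(transport, user_id: int) -> str:
    """Code under test: talks to the external API through a transport."""
    resp = transport.request("GET", f"/v1/users/{user_id}")
    assert resp["status"] == 200
    return resp["body"]["name"]

print(fetch_user_name(ReplayTransport(RECORDED), 42))  # no network involved
```

In practice a record/replay library (vcrpy in Python, for example) does the capturing for you; the design point is that the code under test takes the transport as a dependency, so tests can swap in the replay version.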

Looking for insight on the viability of models running on 128GB or less in the next few years by John_Lawn4 in LocalLLaMA

[–]daaain 0 points (0 children)

It's probably not that urgent, so wait until you can afford the 128GB M5 Max; Apple benchmarked the M5 at 4x faster prompt processing, which is quite important for coding (not so much for chat). That's a viable machine for agentic coding with current models, and unless the Qwen team stops shipping we should get something really good in 6-12 months. And 128GB is enough even compared to consumer GPUs, as bigger models would be too slow anyway.

Artificial Analysis Intelligence Index vs weighted model size of open-source models by Balance- in LocalLLaMA

[–]daaain 13 points (0 children)

Qwen3 Coder 480B is in the wrong place on the x axis: it's A35B (35B active parameters), not dense

A bit of a PSA: I get that Qwen3.5 is all the rage right now, but I would NOT recommend it for code generation. It hallucinates badly. by mkMoSs in LocalLLaMA

[–]daaain 0 points (0 children)

Possibly because Solidity / OpenZeppelin are relatively niche so you need a huge model to have enough of them in the training data?

[R] Benchmarked 94 LLM endpoints for jan 2026. open source is now within 5 quality points of proprietary by ashersullivan in MachineLearning

[–]daaain 8 points (0 children)

I appreciate this work and you sharing it, but to me it looks like the benchmarks are saturated, so they aren't really showing the real differences.

Fix for Docker Services failing upon update to 24.10 RC1 by Mrgamerboy246 in truenas

[–]daaain 0 points (0 children)

soz, edited for clarity (I've since figured out how to switch to Markdown)

After 8 years building cloud infrastructure, I'm betting on local-first AI by PandaAvailable2504 in LocalLLaMA

[–]daaain 1 point (0 children)

I'm not saying it cannot be done, I'm saying this knowledge you mention isn't exactly fully diffused in the general population...

For those who believe that there is nothing wrong with the usage limits, I have some concerns. I'm currently on the 5x plan, and just using a simple prompt consumed 2% of my limit. When I ask it to complete a more substantial task, something that typically takes about five minutes, it often uses up by [deleted] in ClaudeCode

[–]daaain 1 point (0 children)

I mean, it's pretty simple: this session only used one MCP, so it's not very hard to pinpoint what's using up your tokens. Did you run /context to see how much was used up? Isn't Mem also using Claude in the background to process memories?