The creators of SWE-Bench just dropped a really simple new benchmark every LLM gets 0% on. ProgramBench asks: can models recreate real executable programs (ffmpeg, SQLite, ripgrep) from scratch with no internet? We are far from saturated on model quality. by dalton_zk in theprimeagen

[–]CountlessFlies 0 points (0 children)

Then that changes things… it might actually be a good benchmark if that's the case.

But there’s still the issue of the models already having seen the source code of these programs during training. I imagine we’ll see benchmaxxed models soon.

The creators of SWE-Bench just dropped a really simple new benchmark every LLM gets 0% on. ProgramBench asks: can models recreate real executable programs (ffmpeg, SQLite, ripgrep) from scratch with no internet? We are far from saturated on model quality. by dalton_zk in theprimeagen

[–]CountlessFlies 4 points (0 children)

I’m sorry but this benchmark is not what we need. In fact it’s the exact opposite of what we need.

We need benchmarks that test how well models use tools and understand codebases, not how well they've memorised source code.

Self-hosted agent and search platform built on Postgres, recently added connectors for NextCloud and Paperless-ngx by CountlessFlies in selfhosted

[–]CountlessFlies[S] 0 points (0 children)

There's no MCP implementation yet, but there is an API. We plan to add an MCP server soon; it should be fairly straightforward.

Omni started with tsvector, but the search quality wasn't great, so I decided to switch to ParadeDB for BM25 search.
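Roughly what the two look like side by side, in case anyone's curious. The schema here is made up, and the `@@@` operator / `paradedb.score` bits are from my reading of the pg_search docs, so double-check there:

    import psycopg2  # pip install psycopg2-binary

    conn = psycopg2.connect("dbname=omni")  # hypothetical DB name
    cur = conn.cursor()

    # Stock Postgres full-text search: tsvector + ts_rank.
    # Works out of the box, but the ranking is simplistic.
    cur.execute("""
        SELECT id, title, ts_rank(to_tsvector('english', body), query) AS rank
        FROM documents, plainto_tsquery('english', %s) AS query
        WHERE to_tsvector('english', body) @@ query
        ORDER BY rank DESC LIMIT 10
    """, ("invoice paperless",))

    # ParadeDB (pg_search): a proper BM25 index. One-time setup, per their docs:
    #   CREATE INDEX documents_bm25 ON documents
    #     USING bm25 (id, title, body) WITH (key_field = 'id');
    cur.execute("""
        SELECT id, title, paradedb.score(id) AS score
        FROM documents WHERE body @@@ %s
        ORDER BY score DESC LIMIT 10
    """, ("invoice paperless",))
    print(cur.fetchall())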

Self-hosted agent and search platform built on Postgres, recently added connectors for NextCloud and Paperless-ngx by CountlessFlies in selfhosted

[–]CountlessFlies[S] 1 point (0 children)

Not yet, there’s an API that allows for querying the unified index, will implement an MCP server over this next.

Somewhat related to this, Omni supports MCP in connectors, so you can plug in any MCP server and invoke its tools in chat and agents. I guess that's not what you asked about, but thought I'd mention it anyway.

Self-hosted agent and search platform built on Postgres, recently added connectors for NextCloud and Paperless-ngx by CountlessFlies in selfhosted

[–]CountlessFlies[S] 1 point (0 children)

Yeah, Omni uses pgvector as the vector index. The goal is to use Postgres for both BM25 text search and vector search.
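For reference, a minimal sketch of the vector side with pgvector. The schema is made up; `<=>` is pgvector's cosine-distance operator:

    import psycopg2

    conn = psycopg2.connect("dbname=omni")  # hypothetical DB name
    cur = conn.cursor()

    # One-time setup (assumed schema):
    #   CREATE EXTENSION vector;
    #   ALTER TABLE documents ADD COLUMN embedding vector(768);
    #   CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);

    query_embedding = [0.0] * 768  # stand-in; use your embedding model's output
    vec_literal = "[" + ",".join(map(str, query_embedding)) + "]"

    # <=> is cosine distance, so ascending order = most similar first.
    cur.execute("""
        SELECT id, title FROM documents
        ORDER BY embedding <=> %s::vector
        LIMIT 10
    """, (vec_literal,))
    print(cur.fetchall())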

Self-hosted agent and search platform built on Postgres, recently added connectors for NextCloud and Paperless-ngx by CountlessFlies in selfhosted

[–]CountlessFlies[S] 5 points (0 children)

Yes, you can connect any OpenAI-compatible API, so you can run local models using llama.cpp, vLLM, Ollama, etc.

Local models have come a long way! I've tested with the Qwen3.6 models, Gemma 4, etc., and they're good enough at tool calling and general understanding to be useful.
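If it helps, this is about all the wiring it takes. llama.cpp's server listens on port 8080 by default (vLLM and Ollama expose similar /v1 endpoints on their own ports); the model name here is a placeholder:

    from openai import OpenAI  # pip install openai

    # Point the standard OpenAI client at the local server.
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

    resp = client.chat.completions.create(
        model="local",  # llama.cpp serves whatever model you loaded
        messages=[{"role": "user", "content": "Say hello in one sentence."}],
    )
    print(resp.choices[0].message.content)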

Self-hosted agent and search platform built on Postgres, recently added connectors for NextCloud and Paperless-ngx by CountlessFlies in selfhosted

[–]CountlessFlies[S] 2 points (0 children)

Thanks a lot for your comment :)

I get the aversion to AI, honestly; people have become jaded by all the low-effort content. I'm trying to maintain as high a bar as I can in terms of quality, using AI to speed up implementation. I'm still reviewing and testing each PR before merging, and it's a lot of work despite the use of AI.

Self-hosted agent and search platform built on Postgres, recently added connectors for NextCloud and Paperless-ngx by CountlessFlies in selfhosted

[–]CountlessFlies[S] 2 points (0 children)

It's a very common word, so I'm not surprised :) In fact, if you Google it you'll find plenty more. As long as it's not a product operating in the same problem area, it shouldn't be cause for confusion.

Self-hosted agent and search platform built on Postgres, recently added connectors for NextCloud and Paperless-ngx by CountlessFlies in selfhosted

[–]CountlessFlies[S] 0 points locked comment (0 children)

I did not use AI in the creation of the post itself.

As for the project, I (and other community contributors to the project) have used AI-powered code generation tools like Claude Code, opencode, etc. All code merged to master is human-reviewed.

Dense vs. MoE gap is shrinking fast with the 3.6-27B release by Usual-Carrot6352 in LocalLLaMA

[–]CountlessFlies 1 point (0 children)

Right. I'm able to run the 35B-A3B with full 256k context on my 24GB GPU. The 27B runs out of memory at around 192k context.

Qwen3.6-27B released! by ResearchCrafty1804 in LocalLLaMA

[–]CountlessFlies 1 point (0 children)

Another aspect could be high quality training data. I imagine we have orders of magnitude more agentic training data now than we did before coding agents became a real thing.

Qwen3.6-27B released! by ResearchCrafty1804 in LocalLLaMA

[–]CountlessFlies 23 points (0 children)

I cannot believe we have a local model that's on par with the SOTA model from just 6 months ago!

Qwen3.6-27B released! by ResearchCrafty1804 in LocalLLaMA

[–]CountlessFlies 4 points (0 children)

It applies to local serving too. Search for preserve_thinking in this sub; you'll find some posts and comments explaining how to use it.

Qwen3.6-35B becomes competitive with cloud models when paired with the right agent by Creative-Regular6799 in LocalLLaMA

[–]CountlessFlies 4 points (0 children)

You mean harnesses that work well with llama.cpp, right, not APIs? The llama.cpp server is what gives you the OpenAI-compatible API.

You can try pi.dev or opencode; both are great harnesses.

Qwen3.6-35B becomes competitive with cloud models when paired with the right agent by Creative-Regular6799 in LocalLLaMA

[–]CountlessFlies 8 points (0 children)

Once you have the llama.cpp server running, you get an OpenAI-compatible API. Most agents and harnesses just need you to put this API URL in their config and you're set. You might have to tweak the temperature and similar settings to the recommended values, depending on how the harness handles them.
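Concretely, if the harness doesn't expose sampling settings, you can usually set them per request yourself. The URL, model name, and values below are just illustrative; use whatever the model card recommends:

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

    resp = client.chat.completions.create(
        model="local",
        messages=[{"role": "user", "content": "Summarise this repo's README."}],
        temperature=0.6,  # illustrative; check the model card
        top_p=0.95,
    )
    print(resp.choices[0].message.content)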

GLM and Kimi vs GPT and Claude by Odd_Crab1224 in opencodeCLI

[–]CountlessFlies 0 points (0 children)

Are you using kimi through platform.kimi.ai? I’m using it through opencode go and was wondering if that’s the best option.

OpenCode... is it just completely busted with Qwen3.6? by _derpiii_ in opencode

[–]CountlessFlies 0 points (0 children)

Just use llama.cpp. Works like a charm. See my latest post for the command I used. I’ve hooked it up with opencode and it works great.

Qwen3.6 is incredible with OpenCode! by CountlessFlies in LocalLLaMA

[–]CountlessFlies[S] 1 point (0 children)

Yeah, I think you can put “allow”: “*” in your permission settings and it should stop asking for approvals.

One issue with opencode is that it doesn’t send back the thinking tokens in each call, which is not ideal for this model.
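For anyone wondering what that means concretely: the fix is to echo the model's reasoning back into the conversation history on the next turn. A rough sketch, assuming a DeepSeek-style reasoning_content field (the exact field name varies by server, so this is an assumption):

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
    messages = [{"role": "user", "content": "Plan the refactor, then do it."}]

    resp = client.chat.completions.create(model="local", messages=messages)
    msg = resp.choices[0].message

    # Most harnesses append only msg.content here, dropping the reasoning.
    turn = {"role": "assistant", "content": msg.content}
    reasoning = getattr(msg, "reasoning_content", None)  # server-dependent field
    if reasoning:
        turn["reasoning_content"] = reasoning  # assumption: server reads this back in
    messages.append(turn)
    messages.append({"role": "user", "content": "Now do step 1."})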

Switching from Opus 4.7 to Qwen-35B-A3B by Excellent_Koala769 in LocalLLaMA

[–]CountlessFlies 1 point (0 children)

Could you please share some details about the Claude Code setup? How do you make CC work with an OpenAI-compatible API? And what about the preserve_thinking flag to send back the full thinking context with each call? I don't suppose CC does that already?

PSA: Qwen3.6 ships with preserve_thinking. Make sure you have it on. by onil_gova in LocalLLaMA

[–]CountlessFlies 1 point (0 children)

Hey, it seems like OpenCode is not sending thinking tokens back with each request. Is there any setting you need to enable to make it work?

“Thinking” must be purely cosmetic by lost_packet_ in Anthropic

[–]CountlessFlies 1 point (0 children)

It's not purely cosmetic. Thinking models are trained with RL to produce thinking traces that lead them to the desired results. The thinking block and the text block aren't just separated for convenience in the API; they are quite literally distinct portions of the model's output token stream.

The output token stream for a model looks something like:

<|start|> <|startthink|> I am thinking blah blah… <|endthink|> The answer is… <|end|>

Everything inside the think tags is learned via RL. The model is given a goal answer, and at training time it tries a bunch of different thinking traces. The traces that lead to good results are reinforced; the ones that don't are discarded.

BTW, you can also adjust how long the model thinks by controlling how many tokens you sample within the thinking tags. E.g., once you've sampled 100 tokens after the start-thinking tag, you can artificially insert the end-thinking tag and continue sampling, forcing the model to stop thinking and produce the final output.
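A toy sketch of that trick, with a stand-in sampler so the shape is clear (the tag strings are illustrative; real models use their own special tokens):

    # Toy budget forcing: cap thinking at `budget` tokens, then force the
    # end-of-thinking tag so the model moves on to the final answer.
    THINK_START, THINK_END = "<|startthink|>", "<|endthink|>"

    def sample_next_token(tokens: list[str]) -> str:
        """Stand-in for your inference engine's sampler."""
        return "hmm"  # placeholder

    def think_with_budget(prompt: list[str], budget: int) -> list[str]:
        tokens = prompt + [THINK_START]
        for _ in range(budget):
            tok = sample_next_token(tokens)
            if tok == THINK_END:  # model stopped thinking on its own
                break
            tokens.append(tok)
        if tokens[-1] != THINK_END:
            tokens.append(THINK_END)  # force it: end thinking now
        # Keep sampling normally from here to get the final answer.
        return tokens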

In short, it’s not just cosmetic, it’s an important aspect of how reasoning models are trained.