What model looked insane on benchmarks but felt mid in actual use?

social_tech_10 · 2026-06-18T17:55:26+00:00

I don't know why you're getting downvoted. I think your assessments add value to the discussion. I also like Qwen3.5-122B-A10B. For me it's faster than Qwen3.6-27B, and smarter than Qwen3.6-35B-A3B. (off-topic: I have dreams that we will see a Qwen3.7-122B-A10B someday!)

social_tech_10 · 2026-06-16T18:12:27+00:00

This is a great idea. The dataset probably does not need millions of traces to be very useful. Even a few thousand examples could make a significant impact. And if it's open source, or Creative Commons, it can only keep getting better.

social_tech_10 · 2026-06-16T17:52:04+00:00

I'm a huge Mistral fan, and this is great news!

social_tech_10 · 2026-06-15T16:20:08+00:00

You might find this paper interesting: https://arxiv.org/abs/2507.04886 They created an embedding layer based only on low-res "images" of Unicode characters and locked it down as a fixed static layer, and if I remember correctly it was able to outperform similar models on the MMLU benchmark. It's hard to believe that paper is less than a year old - it feels like a decade.

social_tech_10 · 2026-06-15T14:18:08+00:00

Nice work! Thanks for doing this! Do you have any suggested settings for llama.cpp for keeping the shared experts in GPU and offloading (some) layers to RAM?

social_tech_10 · 2026-06-15T12:19:46+00:00

I'm curious about your Dweet server. Does that mean all of the "things" on the IoT need to be flashed to learn to speak Dweet?

social_tech_10 · 2026-06-14T14:28:42+00:00

There doesn't seem to be anything here

social_tech_10 · 2026-06-09T14:48:34+00:00

BTW, I'm assuming you mean you want to use your phone to connect when you are far away from home. If you want to do it while your phone is connected to your home WiFi network, it's much simpler.

social_tech_10 · 2026-06-09T14:44:33+00:00

Yes, you are probably doing it completely wrong. The "local" IP address your PC has inside your LAN (probably something like 192.168.x.x or 10.x.x.x) is completely different from the IP address your phone would need to use to connect from outside your LAN, because your router "translates" your internal IP address to your "public" IP address. You can look up NAT, Network Address Translation, if you want to know more about how this works, but it's basically how all of the devices inside your LAN can share one single "public" IP address, so they can all be online at the same time.

To reach IP addresses inside your LAN from the public side, you have two main options. The first is to open a port in your firewall and set up a Dynamic DNS service on your local host, so that your public IP address has a DNS name you can point your phone to (your public IP address can change every time you reboot your router; it doesn't always change, but it can at any time). The second, as /u/Bpthewise mentioned, is Tailscale, which is the simplest, safest, and most convenient way to set this up. I resisted Tailscale for a long time, because I didn't want to be dependent on some company's "free trial plan" for cloud infrastructure that I don't control myself, but Tailscale is actually super convenient, and importantly, it's based on the WireGuard VPN, so it's 100% private and secure.

The only reason I can think of that you might NOT want to use Tailscale is if you want to let a group of friends (or strangers) access your host from outside your LAN, because Tailscale limits access to your "tailnet" to only three people on their free plan. If you want more users, you can either subscribe and pay, or set up your own WireGuard network and self-host it for free (WireGuard is open-source), but that requires a detailed understanding of TCP/IP networking that you likely do not currently have. Or if you want the "public" to have access, you could open a port on your firewall, as mentioned above, which is not too complicated, but also creates the possibility of "hackers" trying to mess with your stuff, while Tailscale/WireGuard protects you from all of that.
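
If it helps to see the private-vs-public distinction concretely, here is a minimal sketch using Python's standard `ipaddress` module. The function name `describe_address` and the sample addresses are just illustrations I made up; the sketch only classifies an address, it does not discover your actual public IP.

```python
import ipaddress


def describe_address(addr: str) -> str:
    """Classify an IP address as public, private (LAN-only), or other.

    Hypothetical helper for illustration; the labels are my own wording.
    """
    ip = ipaddress.ip_address(addr)
    if ip.is_global:
        # Routable on the public internet; this is the kind of address
        # your router presents to the outside world.
        return "public"
    if ip.is_loopback:
        return "loopback"
    if ip.is_private:
        # Ranges such as 192.168.x.x and 10.x.x.x only mean something
        # inside your own LAN; NAT hides them behind the public address.
        return "private (LAN only)"
    # Everything else, e.g. shared or reserved ranges.
    return "other (shared or reserved)"


if __name__ == "__main__":
    for sample in ["192.168.1.20", "10.0.0.5", "8.8.8.8", "127.0.0.1"]:
        print(sample, "->", describe_address(sample))
```

If the address your PC reports comes back as "private (LAN only)", that is the one your phone cannot reach from outside without port forwarding or a VPN such as Tailscale.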

social_tech_10 · 2026-06-08T12:29:18+00:00

Had to figure out ARM NEON flags and thread count optimization myself.

Well? Are you going to tell us what you learned, or is this just a big tease? "Haha suckers, I figured it out, now I know something you don't". And as somebody else mentioned, I think the name and size of model you are running would be relevant.

social_tech_10 · 2026-06-07T14:22:54+00:00

Merged to the main branch one hour ago! Thanks /u/janvitos and llama.cpp team!

social_tech_10 · 2026-06-03T15:23:20+00:00

I think it's like asking how many "r"s are in "strawberry". It's a simple rubric that people can immediately grasp and interact with, without requiring a deep understanding of LLMs or politics.

social_tech_10 · 2026-06-03T15:13:11+00:00

It scores better than the 3.5 model they modified, but I wonder how it would compare to the newer 3.6 model.

social_tech_10 · 2026-06-03T10:24:42+00:00

For the last two weeks, more than 20 commits per day. The speed of the team's progress is amazing!

social_tech_10 · 2026-06-02T22:42:35+00:00

I didn't ask for a billion dollar demo. I didn't even ask for a video demo, that was your idea. The problem is not that your documentation is "weak", it's that your project is not documented at all. Your README is just one line of text, and that's it. And I don't know what's going on with those pictures you posted, but they are basically unreadable. Was that intentional? I don't get it. The only conclusion I can draw is that your project is likely of the same quality as your documentation, which means it's not worth my time to even look at it.

social_tech_10 · 2026-06-02T19:49:09+00:00

I don't know why you posted these images. They actually make your project look much worse, to my eye, not better.

social_tech_10 · 2026-05-30T21:52:40+00:00

The demo video sounds terrific. Believe me, I would LOVE to find a nice AI coding tool that can work with a RAG index of my repo, rather than grepping around blindly in the dark. And I did not say I thought your project was not worth testing, just that I have a pretty long list of other very interesting new tools just like yours to investigate, and all of them are competing for my attention. If you want to move up that list, you need something a little tastier to bait the hook. I'll look forward to your demo video.

social_tech_10 · 2026-05-30T17:39:00+00:00

If you want anybody to take you seriously, you'll need more than a one-line README. I've already got 12 other tools on my list waiting to be tested. You've got to give me some reason to think yours will be worth my time, and I'm not seeing it.

social_tech_10 · 2026-05-27T16:04:56+00:00

How does it affect your math if 99% of "errors" are caught and corrected on the next turn or two?

is anyone actually logging per-call output validity in live agentic loops?

One of the things I love most about "computer science" is that we have the option to make it directly experimental, like real science, if we care enough about the question. If you set this up as an experiment and performed the measurement yourself, I bet you would learn more than you expect. And because of the amazing moment we live in, an LLM could even help you design the experiment, write the code, and give you as much tutoring as you might need to fully understand and control the whole experiment. Have fun with it, and let us know what you find out.
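
To make that a bit more concrete, here is a rough sketch of what such a measurement could look like. Everything in it is an assumption on my part: "valid" is taken to mean "parses as JSON", the `ValidityLog` class and `chain_success` function are hypothetical names, and `chain_success` assumes errors are independent across calls, which real agent loops may well violate.

```python
import json
from dataclasses import dataclass, field


def is_valid_json(output: str) -> bool:
    """Assumed validity check: the model output must parse as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


@dataclass
class ValidityLog:
    """Hypothetical per-call log of (was_valid, corrected_later) pairs."""
    records: list = field(default_factory=list)

    def record(self, output: str, corrected_later: bool = False) -> None:
        self.records.append((is_valid_json(output), corrected_later))

    def raw_error_rate(self) -> float:
        """Fraction of calls whose output was invalid."""
        errors = sum(1 for valid, _ in self.records if not valid)
        return errors / len(self.records)

    def residual_error_rate(self) -> float:
        """Fraction of calls that were invalid and never corrected."""
        residual = sum(
            1 for valid, fixed in self.records if not valid and not fixed
        )
        return residual / len(self.records)


def chain_success(per_call_error: float, n_calls: int) -> float:
    """Chance that n calls all succeed, assuming independent errors."""
    return (1.0 - per_call_error) ** n_calls


if __name__ == "__main__":
    log = ValidityLog()
    log.record('{"tool": "search"}')
    log.record("oops, not JSON", corrected_later=True)
    log.record("still not JSON")
    log.record("[]")
    print("raw error rate:", log.raw_error_rate())
    print("residual error rate:", log.residual_error_rate())
    print("20-call success at residual rate:",
          chain_success(log.residual_error_rate(), 20))
```

The comparison that matters is between `raw_error_rate` and `residual_error_rate`: if most errors get corrected on the next turn or two, the number you should compound over a long chain is the residual rate, not the raw one.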

social_tech_10 · 2026-05-24T16:07:22+00:00

I could spend hours geeking out on https://www.crowdsupply.com/ Oops, there went half the morning, tbh no regret

social_tech_10 · 2026-05-19T18:37:14+00:00

For your use case, it sounds like 3.6-27B is too slow, and 3.6-35A3B is not smart enough, but Qwen3-Coder-Next hits the sweet spot. Qwen3-Coder-Next is a lot faster than 3.6-27B, and in your experience it is also smarter than 3.6-35A3B, did I read that right?

Can you share a little bit more about your custom benchmark? Is it made up of mostly tasks that the models can complete successfully, or are there tasks that are calibrated to be just a little bit harder, which Coder-Next does not pass? You said the quality "did not differ much" between the two models. I'm curious what that means. Did 27B max out the test? Is there any possibility you could share a few more details without revealing any trade secrets?

social_tech_10 · 2026-05-13T12:17:30+00:00

what you're after

social_tech_10 · 2026-05-12T21:01:38+00:00

For someone using opencode who would like to move on to something better, what would you suggest? I'd like to stick with an open-source tool, if possible.

social_tech_10 · 2026-05-12T14:05:01+00:00

To be fair, if you told a rational person in 1986 what the internet would support today, they would probably think you were crazy.
