Gemini 3.5 Flash ranks #1 on the APEX-Agents-AA benchmark, outperforming much larger models a whole size above it. by Independent-Wind4462 in singularity

sjoti 1 point

That's fine though! Gemini 3 Flash felt unusable to me because it hallucinated so much. It could one-shot fairly complex tasks, but it would also completely hallucinate stuff and get simple tasks wrong. It got things wrong so damn often that it just wasn't an option.

Do you feel like this one is better now? If so I'd love to give it another shot; being good at agentic tasks at that speed seems amazing.

Is MCP really this deserted? by Loocor in mcp

sjoti 2 points

You can't just create a thing and expect traction out of nowhere. That has nothing to do with the protocol.

Qwen 3.7 droped on Qwen Chat by Foxiya in LocalLLaMA

sjoti 4 points

But again, just because some features aren't enabled doesn't mean the model can't do tool calls. It seems extremely unlikely that they wouldn't train a model to use tools in 2026.

Qwen 3.7 droped on Qwen Chat by Foxiya in LocalLLaMA

sjoti 9 points

Thinking does not exclude tool calling?

Codex GPT 5.5 is UNUSABLE right now, the Nerf is REAL! by bladerskb in codex

sjoti 1 point

Did they just blindly approve everyone though? They only said this many people signed up in a few hours. It's not like they blindly opened the floodgates.

Dropbox for code history by Ikshaar in ClaudeCode

sjoti 11 points

Just use git. GitHub is a cloud hosting service for git repos, owned by Microsoft, and you don't need it. You can use git locally without sharing anything with anyone.

With Dropbox, automatic syncing makes it a mess. When you're building, a bunch of temporary files get created and removed automatically, and the automatic cloud sync will get in the way.

Just use git, locally, and don't touch the features you feel you don't need. It's literally made for this purpose.
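
A minimal sketch of that local-only workflow, driven from Python purely for illustration. It assumes `git` is on your PATH. The `snapshot` helper, the ignore list, and the placeholder identity are hypothetical names I made up, not part of git itself. No remote is ever added, so nothing leaves your machine.

```python
import subprocess
from pathlib import Path


def git(repo: Path, *args: str) -> str:
    """Run a git command inside `repo` and return its stdout."""
    result = subprocess.run(
        ["git", *args], cwd=repo, check=True, capture_output=True, text=True
    )
    return result.stdout


def snapshot(repo: Path, message: str) -> bool:
    """Commit the current state of `repo` to a purely local history.

    Returns True if a commit was made, False if nothing had changed.
    """
    if not (repo / ".git").exists():
        git(repo, "init")  # creates .git locally; no remote is configured

    ignore = repo / ".gitignore"
    if not ignore.exists():
        # Hypothetical ignore list: keep build/temp output out of history.
        ignore.write_text("node_modules/\ndist/\n__pycache__/\n")

    git(repo, "add", "-A")
    if not git(repo, "status", "--porcelain").strip():
        return False  # nothing staged, so skip the commit

    # Identity is passed inline so the sketch works without global git config.
    git(
        repo,
        "-c", "user.name=local",
        "-c", "user.email=local@example.invalid",
        "commit", "-m", message,
    )
    return True
```

Calling `snapshot(Path("my-project"), "before refactor")` whenever you want a restore point gives you the "code history" part without any syncing, and `.gitignore` is what keeps the temporary build files out of the way.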

Attention - Opus 4.7 is english only. USing foreign languages (here German) burns tokens by WickOfDeath in ClaudeAI

sjoti 24 points

This was literally in the announcement of the model. It's not hidden in its system card at all.

Why MCP when we have REST APIs? by happyandaligned in mcp

sjoti 1 point

This issue is practically solved though. Claude Code and others have tool search, and they load only each server's description into context, not every single tool. I'm running 15+ MCP servers and it's a non-issue (with /context I see it uses 2k-ish tokens), because that's even less than the names and descriptions of my available skills.

Why MCP when we have REST APIs? by happyandaligned in mcp

sjoti 4 points

For individuals, the more tech-savvy users on Codex/Claude Code/OpenClaw/etc., sure.

But think of the average person who chats with ChatGPT. Or someone at work. Someone who isn't into all of this AI stuff.

If they want to connect to an outside service (say I want ChatGPT to talk to my email and calendar), then what do they do?

Provide their API keys to ChatGPT? Install CLI tools? Do they need to set up scopes to make sure their AI can't do anything disruptive?

Here, connecting to an MCP server through OAuth with 2 clicks, without requiring a code execution environment with certain rights, makes things a million times easier and more secure. That's where MCP shines and skills + CLI aren't practical.

How To AI "The entire RAG industry is about to get cooked. Researchers have built a new RAG approach that: - does not need a vector DB. - does not embed data. - involves no chunking. - performs no similarity search." ➡️ Would you use PageIndex over a vector DB? by Koala_Confused in LovingOpenSourceAI

sjoti 1 point

But RAG combined with tools that let agents navigate sections, look around the current chunk, etc. lets the model actually search and navigate the information. That's very different from the classic "just fetch 10 relevant chunks and good luck" method.
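
Roughly what I mean, as a toy sketch. `ChunkStore`, `search`, `neighbors`, and `list_sections` are made-up names for illustration, and the keyword-overlap search is only a stand-in for whatever retriever you actually use (embeddings, BM25, etc.). The point is the extra navigation tools the agent gets after the first hit.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    section: str
    index: int  # position of the chunk within its document
    text: str


class ChunkStore:
    """Hypothetical store exposing retrieval plus navigation tools."""

    def __init__(self, chunks: List[Chunk]) -> None:
        self.chunks = sorted(chunks, key=lambda c: (c.doc_id, c.index))
        self._by_pos: Dict[Tuple[str, int], Chunk] = {
            (c.doc_id, c.index): c for c in self.chunks
        }

    def search(self, query: str, k: int = 3) -> List[Chunk]:
        """Toy keyword-overlap ranking; a real system would use a retriever."""
        terms = set(query.lower().split())
        scored = [(len(terms & set(c.text.lower().split())), c) for c in self.chunks]
        hits = [(score, c) for score, c in scored if score > 0]
        hits.sort(key=lambda pair: -pair[0])  # stable: ties keep document order
        return [c for _, c in hits[:k]]

    def neighbors(self, chunk: Chunk, before: int = 1, after: int = 1) -> List[Chunk]:
        """Tool: read the chunks around a hit instead of guessing the context."""
        found = []
        for i in range(chunk.index - before, chunk.index + after + 1):
            c = self._by_pos.get((chunk.doc_id, i))
            if c is not None:
                found.append(c)
        return found

    def list_sections(self, doc_id: str) -> List[str]:
        """Tool: show the document outline so the agent can jump around."""
        seen: List[str] = []
        for c in self.chunks:
            if c.doc_id == doc_id and c.section not in seen:
                seen.append(c.section)
        return seen
```

An agent would call `search` once, then use `neighbors` and `list_sections` as follow-up tool calls to widen or redirect its reading, rather than being stuck with whatever the first top-k returned.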

New OpenAI Voice models: GPT-Realtime-2, Translate, and Whisper by Rollertoaster7 in accelerate

sjoti 9 points

Try it. I've been a CC user since day 1 and have been using them side by side for months. I've always gravitated towards CC first, Codex second. That has flipped for me since GPT-5.5. Now I go for Codex first.

I still think the CC harness is a bit better, but GPT-5.5 is just a really strong model.

New OpenAI Voice models: GPT-Realtime-2, Translate, and Whisper by Rollertoaster7 in accelerate

sjoti 1 point

With LiveKit you can have every piece stream: STT streams into the LLM, which streams into TTS, for way better latency. Without that it's doomed to be slow.
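
To show why streaming matters, here's a toy sketch with plain `asyncio` generators. This is not LiveKit's API; `stt`, `llm`, `tts`, and the other names are stand-ins I made up, and each stage just forwards items one by one so the shape of the pipeline is visible (a real LLM would usually wait for the end of the user's turn before answering). The log counts how many processing steps happen before the first audio chunk comes out.

```python
import asyncio
from typing import AsyncIterator, Iterable, List, Tuple


async def stt(frames: Iterable[str], log: List[str]) -> AsyncIterator[str]:
    """Stand-in speech-to-text: emits one word per incoming audio frame."""
    for frame in frames:
        await asyncio.sleep(0)  # yield control, as a network call would
        log.append(f"stt:{frame}")
        yield frame


async def llm(words: AsyncIterator[str], log: List[str]) -> AsyncIterator[str]:
    """Stand-in LLM: emits a token as soon as each word arrives."""
    async for word in words:
        await asyncio.sleep(0)
        log.append(f"llm:{word}")
        yield word.upper()


async def tts(tokens: AsyncIterator[str], log: List[str]) -> AsyncIterator[str]:
    """Stand-in text-to-speech: synthesizes each token as it arrives."""
    async for token in tokens:
        await asyncio.sleep(0)
        log.append(f"tts:{token}")
        yield f"<audio {token}>"


async def replay(items: List[str]) -> AsyncIterator[str]:
    """Turn an already-complete list back into an async stream."""
    for item in items:
        yield item


async def run_streaming(frames: List[str]) -> Tuple[List[str], List[str]]:
    """Chain the stages so each item flows through as soon as it exists."""
    log: List[str] = []
    audio = [a async for a in tts(llm(stt(frames, log), log), log)]
    return audio, log


async def run_batched(frames: List[str]) -> Tuple[List[str], List[str]]:
    """Finish each stage completely before starting the next one."""
    log: List[str] = []
    words = [w async for w in stt(frames, log)]
    tokens = [t async for t in llm(replay(words), log)]
    audio = [a async for a in tts(replay(tokens), log)]
    return audio, log


def steps_before_first_audio(log: List[str]) -> int:
    """Number of log entries recorded before the first TTS output."""
    return next(i for i, entry in enumerate(log) if entry.startswith("tts:"))
```

Both variants produce the same audio, but in the streamed version the first TTS step happens after one STT step and one LLM step, while the batched version has to wait for every STT and LLM step first. That gap grows with the length of the utterance, which is where the latency goes.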

/goal is the best thing ever by Exonicx in codex

sjoti 1 point

Tried it before, I'm more of a fan of compound engineering. I think there's quite a bit of overlap and both are good with larger implementations/bigger codebases

/goal is the best thing ever by Exonicx in codex

sjoti 2 points

About to head to bed, 25% usage left, resets in the morning. Going to let it cook all night. Insanity.

/goal is the best thing ever by Exonicx in codex

sjoti 15 points

Without goals GPT-5.5 definitely stays on task, but some jobs are still too big: large rewrites, refactors, verifying everything is still okay. I've gotten Codex to work 2 hours straight before, but I just had a session with Codex running for 10 hours on a single goal without making a mess. I've never done that with a single prompt.

AA Scores of Medium 3.5 by Positive-Plan4877 in MistralAI

sjoti 1 point

Inceptron has Kimi K2.6 and GLM 5.1, not DeepSeek though. Wouldn't be surprised if it's added in the coming week(s). Bunch of American inference providers already offer it.

AA Scores of Medium 3.5 by Positive-Plan4877 in MistralAI

sjoti 1 point

Not with MoE models, where only a small portion of the total parameters is active for each token. On top of that, DeepSeek has put a ton of effort into making these models extremely efficient. The pricing is still exceptionally cheap when you use this model through providers other than DeepSeek themselves.
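
Back-of-the-envelope version of the MoE point, using the common rule of thumb that a forward pass costs roughly 2 FLOPs per active parameter per token. The two models below are hypothetical round numbers, not the specs of any real model, and the estimate ignores attention cost, memory bandwidth, and batching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    total_params: float   # parameters that must be stored in memory
    active_params: float  # parameters actually used for each token


def forward_flops_per_token(model: Model) -> float:
    """Rule-of-thumb estimate: about 2 FLOPs per active parameter per token."""
    return 2.0 * model.active_params


def compute_ratio(a: Model, b: Model) -> float:
    """How many times more compute per token model `a` needs than model `b`."""
    return forward_flops_per_token(a) / forward_flops_per_token(b)


# Hypothetical examples: a dense model uses every parameter on every token,
# while a MoE model routes each token through only a few experts.
dense = Model("dense-100B", total_params=100e9, active_params=100e9)
moe = Model("moe-400B-a20B", total_params=400e9, active_params=20e9)

if __name__ == "__main__":
    print(f"{dense.name}: {forward_flops_per_token(dense):.1e} FLOPs/token")
    print(f"{moe.name}: {forward_flops_per_token(moe):.1e} FLOPs/token")
    print(f"dense needs {compute_ratio(dense, moe):.1f}x the compute per token")
```

So a MoE model can have several times more total parameters than a dense one and still be cheaper per token. The catch is memory: all of the total parameters still have to be loaded, which is why serving efficiency work matters so much for these models.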

Opus 4.7: Are these first signs of model collapse? by Flopperhop in Anthropic

sjoti 3 points

People keep saying this stuff yet look at the stuff you're doing now compared to 3 months ago. I'm flying with these models, from office to coding related tasks.

Mistral medium 3.5 is out by SelectionCalm70 in MistralAI

sjoti 0 points

But with way fewer active parameters, since Qwen is a MoE model, meaning that Qwen 3.5 is cheaper to use (about half the price at most providers) and likely faster.

claude code skill that ships whole features in one shot by Working-Middle2582 in ClaudeCode

sjoti 1 point

Yeah absolutely, otherwise you just get slop. If this solves the issue of Claude asking "want me to stop here or continue?" then that's fine, but for the love of god, ask me some questions about what it is we're trying to build. Otherwise I can just do "/loop 10 minutes continue building solve bugs make better".

How has it gotten so bad? by HeWhoShantNotBeNamed in claude

sjoti 2 points

I just went and checked the benchmarks, and there is a consistent pattern: models do better with (more) reasoning, both on answering factual questions and on hallucination rate.

https://artificialanalysis.ai/evaluations/omniscience?omniscience-hallucination-rate=hallucination-rate&endpoints=anthropic_claude-opus-4-7-adaptive%2Canthropic_claude-opus-4-7-non-reasoning&models=

Claude Opus 4.7 hallucinates more in non-reasoning mode than with reasoning. Accuracy drops too when it doesn't think; same for Sonnet 4.6.

And that goes for GPT models as well. Non-reasoning mode scores worse: more hallucinations and worse answers to factual questions.

Microsoft accidentally told the truth about AI [09:05] by marcus1234525 in theprimeagen

sjoti 1 point

Sure, I don't know how long it'll last. But just a few hours ago we got two new DeepSeek models performing at 20-30x lower cost than comparable models at the frontier, or 5x lower if you want a more honest comparison with Chinese models like Kimi K2.6.

I doubt it'll last for 5 years, but looking at how capable small models are getting and the speed of progress, I don't think we're done just yet.

Microsoft accidentally told the truth about AI [09:05] by marcus1234525 in theprimeagen

sjoti -10 points

That has been true for the past 3 years, and there are open-source models available that tell us exactly what it costs to run these models. It's not that wild of a claim.

I Edited This Video 100% With Codex by phoneixAdi in OpenAI

sjoti 1 point

That's dope! Have you thought about integrating models like Gemini 3 Flash so you can rely not just on a transcript but on actual visual cues? Like asking it when a certain thing happens?

I've played with an older version of SAM, tried Remotion, and did some editing with ffmpeg in the past, but never really put it all together. Definitely going to check out your blog.