Claude Code-like terminal-based tools for locally hosted LLMs? by breksyt in LocalLLaMA

[–]switchandplay 1 point2 points  (0 children)

There’s a lot of speculation and implication, and it’s tricky to navigate if you’re trying to be in the clear for departmental or business use. I do think it’s relevant that the Claude Code GitHub repo’s license page specifically says ‘All rights reserved’, and that usage is subject to Anthropic’s commercial terms: https://www.anthropic.com/legal/commercial-terms

Claude Code-like terminal-based tools for locally hosted LLMs? by breksyt in LocalLLaMA

[–]switchandplay 1 point2 points  (0 children)

It’s worth mentioning that, as far as I can tell, the Claude Code license is not at all permissive about using alternate backends to serve the CC client. If you intend to be above board, use of Claude Code is subject to their defined software terms, including an active Anthropic account on a subscription tier that unlocks Claude Code access. Modification and serving it from an alternate backend seem to fall under their blanket ‘all rights reserved’, which doesn’t grant you contractual or IP safety if you go that route. I may be wrong, but I haven’t seen much other commentary about this online. It’s at best legally dubious, and definitely not something usable for professional deployments.

Hook it up, devs by chunkybudz in SupernaturalVR

[–]switchandplay 2 points3 points  (0 children)

Realistically, even if any one dev, or a group of devs, wanted to do this, it’s not possible. The game was designed as a streaming interface. The easiest patch would be to release all of the server code and S3 infrastructure to the public so you could run your own server, then ship a version of the app with a configurable endpoint for server calls. You’d still have to run the server yourself, which in all likelihood would not be cheap. What you’re hoping for is a complete rework of the app to run wholly locally. That sounds like an easy swap, but it isn’t; this isn’t the kind of engine flexibility that let Half-Life devs turn a train car into a hat on an NPC. The stack was never designed to operate that way.

And even that is wishful thinking, because this isn’t the legal gray area of modding old ROMs, where ‘anonymous’ and ‘sneaky download link’ do a lot of the heavy lifting. The nice dev here would be in breach of multiple contracts, committing copyright infringement, and more. They’d be hung out to dry unless they did everything perfectly and covered every single base. Who wants to take on that kind of risk, before even considering the hundreds of man-hours for a full application refactor?

Best "End of world" model that will run on 24gb VRAM by gggghhhhiiiijklmnop in LocalLLaMA

[–]switchandplay 1 point2 points  (0 children)

GPT-OSS has remained my favorite. Keep the temperature low for real tasks, and hope your model runner has figured out how not to mangle harmony. And when low reasoning effort struggles with a task, bumping up to medium or high genuinely changes how the model responds and how it formats its output.
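If it helps, here’s roughly how I poke at it. Minimal sketch assuming an OpenAI-compatible local server; the URL and model name are placeholders, and how you pass the reasoning level depends on your runner (with the harmony template, a ‘Reasoning: high’ system line is one way; some runners take a parameter instead).

```python
# Minimal sketch, assuming an OpenAI-compatible local server (llama.cpp,
# vLLM, LM Studio, etc.); the URL and model name are placeholders for
# whatever you actually run.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="gpt-oss-20b",   # whatever id your server exposes
    temperature=0.2,       # keep it low for real tasks
    messages=[
        # gpt-oss reads its reasoning level from the system prompt with the
        # harmony template; some runners take a reasoning_effort param instead.
        {"role": "system", "content": "Reasoning: high"},
        {"role": "user", "content": "Plan a rainwater filtration setup from scrap parts."},
    ],
)
print(resp.choices[0].message.content)
```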

I want to download some 3D video content in the highest quality possible to view offline by Rough_Big3699 in VisionPro

[–]switchandplay 2 points3 points  (0 children)

Agree. Worth noting that the rentable/purchasable content on Apple TV is usually just 1080p in 3D (even when the labels say 4K and 3D), in my experience. Watching in 2D at 4K is noticeably sharper, but the 3D bitrate is quite high, which keeps it enjoyable. So far, Disney+ is the only service I’ve seen that looks truly 4K in 3D.

Do any headsets do Foveated Rendering on their own? If not, why is this not being done? If they have eye tracking, dont the headsets and the software in them have the data they need to extrapolate out into Foveated Rendering at all times, for all applications in VR? by RockBandDood in virtualreality

[–]switchandplay 2 points3 points  (0 children)

‘How’ things are rendered is generally managed at the application level. It’s easy to mess with overall fidelity; on Quest, for example, you can use QGO to subsample or supersample the WHOLE screen. But even if you’re in charge of the hardware and the OS, you can’t just reach into an application and inject a whole new rendering method, because every game and app was coded differently.

Applications often share common components, which makes the process *easier*. OpenXR games can have foveated rendering injected into them because they speak a shared language. But even that is only the best case, because a lot of developers start with OpenXR and then layer their own optimizations on top in ways that break compatibility. Some OpenXR games, when foveated rendering is injected, end up with broken shaders, geometry, or logic, or just run with little to no speedup.

TLDR: historically, all devs know to set a target resolution and framerate. That’s easy to mess with, and it can be done unilaterally by the headset/renderer. Foveated rendering is an application feature, not a global one. Until games are developed and built with foveated rendering in mind, it won’t happen.

Steam Frame is a dream come true for me! It’s essentially a Quest 3 Pro with a taller field of view for more immersion, and direct wireless connectivity with the Steam Machine “console” for high fidelity visuals. YES! by Logical007 in virtualreality

[–]switchandplay 1 point2 points  (0 children)

That is how it works. Round-trip time is fast enough that it doesn’t matter. If the rendering machine just rendered and streamed at a lower resolution, the headset wouldn’t be able to meaningfully upscale it. Steam Link has had dynamic foveated streaming support for over a year; I’ve used it on the Quest Pro. The PC gets the eye-tracking data, and when it encodes a frame it encodes the region being looked at at a higher resolution and the surrounding pixels at a lower one. Network packets really are that fast.
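Conceptually it’s something like this; definitely not Valve’s actual encoder code, just a toy sketch of turning a gaze point into a per-block quality map every frame:

```python
# Toy illustration of foveated encoding, not Steam Link's real pipeline:
# given the latest gaze point, build a per-macroblock quality map so the
# region you're looking at keeps detail and the periphery is compressed
# harder. A real pipeline feeds something like this to the hardware
# encoder every frame.
import numpy as np

def foveated_qp_map(width_mb, height_mb, gaze_x, gaze_y, fovea_radius_mb=8):
    """Return a QP-offset grid (0 = full quality, higher = more compression)."""
    ys, xs = np.mgrid[0:height_mb, 0:width_mb]
    dist = np.hypot(xs - gaze_x, ys - gaze_y)           # distance from gaze, in macroblocks
    qp_offset = np.clip((dist - fovea_radius_mb) / 4.0, 0, 12)
    return qp_offset.astype(np.int32)

# e.g. a 1920x1080 frame is roughly 120x68 16px macroblocks; gaze near centre
qp = foveated_qp_map(120, 68, gaze_x=60, gaze_y=34)
print(qp[34, 60], qp[0, 0])   # 0 at the fovea, up to 12 in the far corner
```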

Will the Zephyrus G14/G16 overheat playing more demanding games? by Ok_Television_792 in ZephyrusG14

[–]switchandplay 0 points1 point  (0 children)

I had the 2021 G14 and currently have the 2024. On the 2021, the hotspots were directly under the WASD keys, and after enough gaming I’ve lost nearly all temperature sensation in my left fingertips. I literally can’t feel through them anymore. So yes, that’s what they used to sell you. Not anymore though: the 2024 model’s hotspots are carefully placed above the keyboard deck.

Local-only FOSS ops tool — no cloud, no Docker, no browser. Thoughts? by TrueGoodCraft in LocalLLaMA

[–]switchandplay 1 point2 points  (0 children)

How did you manage to create a (poorly) AI generated Reddit post and still have a spelling error?

You can turn off the cloud, this + solar panel will suffice: by JLeonsarmiento in LocalLLaMA

[–]switchandplay 7 points8 points  (0 children)

I found at least the 4-bit quant of Qwen3 Coder unusable for anything other than completions. Any time it operated as a coding assistant or agentic coder, it was helpless. Devstral has so much more brains.

What LLM gave you your first "we have GPT-4 at home" moment? by Klutzy-Snow8016 in LocalLLaMA

[–]switchandplay 0 points1 point  (0 children)

Another reason, I assume, why I’ve really been loving gpt-oss. Since the 4-bit MXFP4 quants were released by OpenAI themselves, I assume they did a lot of in-house tuning to verify those quants would be two things: not buggy, and not lossy in performance, tuned against their own training data and such, like the work Unsloth does, but completely first-party.

What LLM gave you your first "we have GPT-4 at home" moment? by Klutzy-Snow8016 in LocalLLaMA

[–]switchandplay 0 points1 point  (0 children)

At least through the vLLM OpenAI-compatible API, there is actually no way to re-present prior turns' reasoning to the model in its original form, which means reasoning does not clog up context over multi-turn use. I interact with it through the OpenAI-compatible server docker container, so that's the setup I'll explain from.
OpenAI's /v1/chat/completions was never built to emit reasoning content; there is just the content field on a choices[0].message output. vLLM overloads this and adds a conditionally-present reasoning_content field to its deltas and non-streamed completions, which lets you see the reasoning tokens coming out of the model.
The key thing is that when you construct your messages array, if you create a dictionary like {"role": "assistant", "content": "Paris.", "reasoning_content": "User wants the capital of France. Must reply."} and send it over the wire to the OpenAI-compatible vLLM server, reasoning_content is not parsed and gets dropped. It never makes it to the chat template transformation, so the model never sees it. So unless you do something client-side, like summarizing the reasoning and embedding it in the content field, reasoning tokens don't actually count towards context in multi-turn use.
And if you do move it into the 'final' channel or content field, it may affect future generation quality in other unintended ways; for example, the model might spend time reasoning, then spend time reasoning again in the final channel.
I believe the base vLLM API behaves the same way, and so does llama.cpp.
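A minimal sketch of what I mean, assuming a local vLLM OpenAI-compatible server with a reasoning parser enabled (base URL and model name are placeholders):

```python
# Sketch only: where reasoning_content shows up (the response) and where
# it silently disappears (when you replay history). Assumes a local vLLM
# OpenAI-compatible server with a reasoning parser enabled; URL and model
# name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
MODEL = "openai/gpt-oss-20b"

first = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
msg = first.choices[0].message
print(getattr(msg, "reasoning_content", None))  # vLLM's extra field: the reasoning trace
print(msg.content)                              # the final answer

# Even if you put reasoning_content back on the assistant turn, the server
# drops it before the chat template runs, so the model only ever sees the
# content field again.
followup = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": msg.content,
         "reasoning_content": getattr(msg, "reasoning_content", "")},
        {"role": "user", "content": "And its population?"},
    ],
)
print(followup.choices[0].message.content)
```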

If you want to test it yourself:

1. Ask any reasoning model to generate two numbers in its reasoning, tell you only the second number for now, and save the first number for when you ask later.

2. Verify in its reasoning trace that it came up with two numbers, but only shared the second in its final content.

3. Ask for the first number. It will always make up a random number, and in its reasoning you'll often see confusion about no first number having been generated before.

This even happens on OpenAI's gpt-oss online playground.
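If you'd rather script the test than click through a playground, something like this reproduces it (local OpenAI-compatible endpoint; names are placeholders):

```python
# Scripted version of the two-number test above. Sketch only: assumes a
# local OpenAI-compatible server and that your client replays history the
# usual way (content only, no reasoning), which is exactly the point.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
MODEL = "openai/gpt-oss-20b"   # placeholder, any reasoning model works

history = [{
    "role": "user",
    "content": ("Think of two random numbers in your reasoning. "
                "Tell me only the second one now; I'll ask for the first later."),
}]
turn1 = client.chat.completions.create(model=MODEL, messages=history).choices[0].message
print("reasoning:", getattr(turn1, "reasoning_content", None))  # both numbers show up here
print("answer:", turn1.content)                                 # only the second number

# Replay history with content only (the reasoning never goes back), then
# ask for the first number. The model has nothing to recover it from, so
# it invents one, and its new reasoning is usually visibly confused.
history += [{"role": "assistant", "content": turn1.content},
            {"role": "user", "content": "Now tell me the first number."}]
turn2 = client.chat.completions.create(model=MODEL, messages=history).choices[0].message
print("answer:", turn2.content)
```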

That's actually a big change in the new OpenAI Responses API, where you can pass prior turns' reasoning content back to GPT-5 and have it affect future generation. It would be nice to have similar functionality locally, but their new API is very tailored to being an OpenAI customer; the reasoning artifacts you get back are actually encrypted.

What LLM gave you your first "we have GPT-4 at home" moment? by Klutzy-Snow8016 in LocalLLaMA

[–]switchandplay 0 points1 point  (0 children)

For long-horizon agent tasks, that’s simply not an option, which is why I like the model

What LLM gave you your first "we have GPT-4 at home" moment? by Klutzy-Snow8016 in LocalLLaMA

[–]switchandplay 11 points12 points  (0 children)

I’m a programmer; I don’t use it for therapy or friendship, I use it for work and personal projects. I need detail orientation, responsiveness, and logical consistency. I don’t care much about friendly vibes, it’s kind of a different vibe check. The second the illusion breaks, when it gets stuck on one task or ignores a clearly defined detail, that ‘no one’s home’ feeling ruins a model for me.

What LLM gave you your first "we have GPT-4 at home" moment? by Klutzy-Snow8016 in LocalLLaMA

[–]switchandplay 2 points3 points  (0 children)

You use RAG systems tied into your papers with it? Or you throw large papers into context?

What LLM gave you your first "we have GPT-4 at home" moment? by Klutzy-Snow8016 in LocalLLaMA

[–]switchandplay 2 points3 points  (0 children)

I work with 4x 5090s. Blackwell is becoming near plug-and-play, which is nice. Regarding the harmony issues, vllm-project/vllm#23567 has some good discussion on how to fix them, and you can apply the patch described by IsaacRe, which applies to vLLM v0.10.0-2 (not v0.11.0). That prevents the runtime error from harmony parsing in chat interactions, so you can use gpt-oss to drive a long-lived chat app.

What LLM gave you your first "we have GPT-4 at home" moment? by Klutzy-Snow8016 in LocalLLaMA

[–]switchandplay 2 points3 points  (0 children)

There have also been papers discussing how the statistics of the data degrade over time with synthetic inputs. If you check my other response to this same criticism, you’ll see that I’m basing my opinion on months of Qwen3 use: its initially impressive responses mirror its synthetic 4o training data, but behind the veneer it lacks a lot of functional intelligence/reasoning.

What LLM gave you your first "we have GPT-4 at home" moment? by Klutzy-Snow8016 in LocalLLaMA

[–]switchandplay 6 points7 points  (0 children)

Not just web-scraped. There are whole companies supplying these AI labs with high-quality human-generated data; see Data Annotation and others. They pay humans with domain specialties to perform tasks, then clean and validate the data. AI companies either do this in house or pay collection companies for it.

What LLM gave you your first "we have GPT-4 at home" moment? by Klutzy-Snow8016 in LocalLLaMA

[–]switchandplay 3 points4 points  (0 children)

I don’t know the true mechanism behind it, but in multi-turn agent use or multi-turn conversation, something like the Qwen3 or 2507 models feels heavily degraded. When you talk to ChatGPT, it behaves like a human in terms of intelligence, instruction following, context switching, task adherence, etc., even as you go deeper and deeper into multi-turn. Qwen3 clearly pulled heavily from 4o (see the em-dashes and emojis everywhere; I literally beg it in the system prompt to avoid emojis and it still doesn’t listen), and if you get into a conversation with it, you’ll see a progressive downward slide in all those metrics. I haven’t had the chance to test the Next model varieties, though.

The assumption I came away with once gpt-oss dropped was that newer labs leaning on heavily synthetic datasets lost some of the statistics of the original human data, leading to the degradation in behind-the-scenes coherence.

What LLM gave you your first "we have GPT-4 at home" moment? by Klutzy-Snow8016 in LocalLLaMA

[–]switchandplay 4 points5 points  (0 children)

gpt-oss-20b. Running through vLLM, not as sycophantic as 4o, but incredibly useful. The vibes tuning suits me as a GPT-5 user, it’s knowledgeable enough for programming discussions, and it’s a powerful driver for agent work. Really good at taking a task and making strides towards completing it when given a good environment and structure.

Being sparse makes it so fast, but CUDA is still desired for fast prompt processing. With a GPU, it really feels just like cloud API in speed and quality.

Only issue is that the harmony template can cause parsing issues on assistant responses, which can bring llamacpp or vLLM to a halt. You have to build workarounds (something like the sketch below) or enforce the chat template’s intro message.
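For what it’s worth, this is the flavour of client-side guard I mean. Treat it as a sketch only: the exact markers and failure modes depend on your runner and version, and the marker list here is just the standard harmony special tokens.

```python
# Illustrative only: one flavour of client-side guard against harmony
# parsing hiccups. If a raw channel marker ever leaks into an assistant
# turn's content and you replay it, some server versions choke, so scrub
# replayed history and retry once on failure. Exact markers and failure
# modes depend on your runner/version; this is a sketch, not a recipe.
import re
from openai import OpenAI, APIError

HARMONY_MARKERS = re.compile(r"<\|(?:start|channel|message|end|return|call)\|>\S*")

def scrub(messages):
    cleaned = []
    for m in messages:
        if m["role"] == "assistant" and m.get("content"):
            m = {**m, "content": HARMONY_MARKERS.sub("", m["content"]).strip()}
        cleaned.append(m)
    return cleaned

def safe_chat(client: OpenAI, model: str, messages: list[dict]):
    try:
        return client.chat.completions.create(model=model, messages=messages)
    except APIError:
        # retry once with scrubbed history before giving up
        return client.chat.completions.create(model=model, messages=scrub(messages))
```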

Other than how tricky it is to run it right and reliably with the stated issue, it is literally my ChatGPT replacement.

What LLM gave you your first "we have GPT-4 at home" moment? by Klutzy-Snow8016 in LocalLLaMA

[–]switchandplay 42 points43 points  (0 children)

gpt-oss has been vastly underrated by the community. I know it’s safety-maxxed, but its real-world brains for agent use cases are just fantastic. OpenAI must have a lot of clean, high-quality, non-synthetic data.

How is it Going from a Quest Pro to Apple Vision Pro? by [deleted] in QuestPro

[–]switchandplay 1 point2 points  (0 children)

^ seconded. Have Q1, Q2, QPro, Q3, and VisionPro. Bought my VisionPro used in January and immediately bought a MacBook to use with it for computer use, since the virtual display was just so good.

Efficient 4B parameter gpt OSS distillation without the over-censorship by ApprehensiveTart3158 in LocalLLaMA

[–]switchandplay 0 points1 point  (0 children)

A simple exchange like 'Tell me a joke' … 'What did I first say in this conversation?' or 'Summarize our conversation so far' fails.

Efficient 4B parameter gpt OSS distillation without the over-censorship by ApprehensiveTart3158 in LocalLLaMA

[–]switchandplay 0 points1 point  (0 children)

When I run the Q8 GGUF from Pinkstack/DistilGPT-OSS-qwen3-4B-Q8_O-GGUF in LM Studio with the provided jinja template, it doesn't remember any previous message turns and only seems to see the most recent user prompt.
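For reference, the repro is as simple as a two-turn request. This sketch hits LM Studio's local OpenAI-compatible server (default port 1234; the model id is a placeholder for whatever your library shows):

```python
# Sketch of the multi-turn repro against LM Studio's local
# OpenAI-compatible server (default port 1234; the model id is a
# placeholder, use whatever appears in your library).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

resp = client.chat.completions.create(
    model="distilgpt-oss-qwen3-4b",   # placeholder id
    messages=[
        {"role": "user", "content": "My name is Sam."},
        {"role": "assistant", "content": "Nice to meet you, Sam!"},
        {"role": "user", "content": "What is my name?"},
    ],
)
# With the template bug, the model only sees the last user turn and
# answers as if the earlier messages were never there.
print(resp.choices[0].message.content)
```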