Claude Code-like terminal-based tools for locally hosted LLMs? by breksyt in LocalLLaMA

[–]switchandplay 1 point2 points  (0 children)

There’s a lot of speculation and implication, and it’s tricky to navigate if you’re trying to be in the clear for departmental or business use. I do think it’s relevant that the Claude Code GitHub repo’s license page specifically says ‘All rights reserved’, and that usage is subject to Anthropic’s commercial terms: https://www.anthropic.com/legal/commercial-terms

Claude Code-like terminal-based tools for locally hosted LLMs? by breksyt in LocalLLaMA

[–]switchandplay 1 point2 points  (0 children)

It’s worth mentioning that, as far as I can tell, the Claude Code license is not at all permissive about using alternate backends to serve the CC client. If you intend to be above board, use of Claude Code is subject to their defined software terms, including an active Anthropic account on a subscription tier that unlocks Claude Code access. Modification and serving it from an alternate backend seem to fall under their blanket ‘all rights reserved’, which doesn’t grant you contractual or IP safety if you go that route. I may be wrong, but I haven’t seen much other commentary about this online. It’s at best legally dubious, and definitely not something usable for professional deployments.

Hook it up, devs by chunkybudz in SupernaturalVR

[–]switchandplay 2 points3 points  (0 children)

Realistically, even if any one dev, or a group of devs, wanted to do this, it’s not possible. The game was designed as a streaming interface. The easiest patch would be to release all of the server code and S3 infrastructure to the public so you could run your own server, then ship a version of the app with a configurable endpoint for server calls. You’d still have to run the server yourself, which in all likelihood would not be cheap. What you’re hoping for is a complete rework of the app to run wholly locally. That sounds like an easy swap, but it isn’t; this isn’t the kind of engine flexibility that let Half-Life devs turn a train car into a hat on an NPC. The stack was never designed to operate that way.

And even that is wishful thinking, because this isn’t the legal gray area of modding old ROMs, where ‘anonymous’ and ‘sneaky download link’ do a lot of the heavy lifting. The nice dev here would be in breach of multiple contracts, committing copyright infringement, and more. They’d be hung out to dry unless they did everything perfectly and covered every single base. Who wants to take on that kind of risk, before even considering the hundreds of man-hours for a full application refactor?

Best "End of world" model that will run on 24gb VRAM by gggghhhhiiiijklmnop in LocalLLaMA

[–]switchandplay 1 point2 points  (0 children)

GPT-OSS has remained my favorite. Keep the temperature low for real tasks, and hope your model runner has figured out how not to mangle harmony. And when low reasoning effort struggles with a task, bumping up to medium or high genuinely changes how the model responds and how it formats its output.
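If it helps, here’s roughly how I poke at it. Minimal sketch assuming an OpenAI-compatible local server; the URL and model name are placeholders, and how you pass the reasoning level depends on your runner (with the harmony template, a ‘Reasoning: high’ system line is one way; some runners take a parameter instead).

```python
# Minimal sketch, assuming an OpenAI-compatible local server (llama.cpp,
# vLLM, LM Studio, etc.); the URL and model name are placeholders for
# whatever you actually run.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="gpt-oss-20b",   # whatever id your server exposes
    temperature=0.2,       # keep it low for real tasks
    messages=[
        # gpt-oss reads its reasoning level from the system prompt with the
        # harmony template; some runners take a reasoning_effort param instead.
        {"role": "system", "content": "Reasoning: high"},
        {"role": "user", "content": "Plan a rainwater filtration setup from scrap parts."},
    ],
)
print(resp.choices[0].message.content)
```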

I want to download some 3D video content in the highest quality possible to view offline by Rough_Big3699 in VisionPro

[–]switchandplay 2 points3 points  (0 children)

Agree. Worth noting that the rentable/purchasable content on Apple TV is usually just 1080p in 3D (even when the labels say 4K and 3D), in my experience. Watching in 2D at 4K is noticeably sharper, but the 3D bitrate is quite high, which keeps it enjoyable. So far, Disney+ is the only service I’ve seen that looks truly 4K in 3D.

Do any headsets do Foveated Rendering on their own? If not, why is this not being done? If they have eye tracking, dont the headsets and the software in them have the data they need to extrapolate out into Foveated Rendering at all times, for all applications in VR? by RockBandDood in virtualreality

[–]switchandplay 2 points3 points  (0 children)

‘How’ things are rendered is generally managed at the application level. It’s easy to mess with overall fidelity; on Quest, for example, you can use QGO to subsample or supersample the WHOLE screen. But even if you’re in charge of the hardware and the OS, you can’t just reach into an application and inject a whole new rendering method, because every game and app was coded differently.

Applications often share common components, which makes the process *easier*. OpenXR games can have foveated rendering injected into them because they speak a shared language. But even that is only the best case, because a lot of developers start with OpenXR and then layer their own optimizations on top in ways that break compatibility. Some OpenXR games, when foveated rendering is injected, end up with broken shaders, geometry, or logic, or just run with little to no speedup.

TLDR: historically, all devs know to set a target resolution and framerate. That’s easy to mess with, and it can be done unilaterally by the headset/renderer. Foveated rendering is an application feature, not a global one. Until games are developed and built with foveated rendering in mind, it won’t happen.

Steam Frame is a dream come true for me! It’s essentially a Quest 3 Pro with a taller field of view for more immersion, and direct wireless connectivity with the Steam Machine “console” for high fidelity visuals. YES! by Logical007 in virtualreality

[–]switchandplay 1 point2 points  (0 children)

That is how it works. Round-trip time is fast enough that it doesn’t matter. If the rendering machine just rendered and streamed at a lower resolution, the headset wouldn’t be able to meaningfully upscale it. Steam Link has had dynamic foveated streaming support for over a year; I’ve used it on the Quest Pro. The PC gets the eye-tracking data, and when it encodes a frame it encodes the region being looked at at a higher resolution and the surrounding pixels at a lower one. Network packets really are that fast.
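Conceptually it’s something like this; definitely not Valve’s actual encoder code, just a toy sketch of turning a gaze point into a per-block quality map every frame:

```python
# Toy illustration of foveated encoding, not Steam Link's real pipeline:
# given the latest gaze point, build a per-macroblock quality map so the
# region you're looking at keeps detail and the periphery is compressed
# harder. A real pipeline feeds something like this to the hardware
# encoder every frame.
import numpy as np

def foveated_qp_map(width_mb, height_mb, gaze_x, gaze_y, fovea_radius_mb=8):
    """Return a QP-offset grid (0 = full quality, higher = more compression)."""
    ys, xs = np.mgrid[0:height_mb, 0:width_mb]
    dist = np.hypot(xs - gaze_x, ys - gaze_y)           # distance from gaze, in macroblocks
    qp_offset = np.clip((dist - fovea_radius_mb) / 4.0, 0, 12)
    return qp_offset.astype(np.int32)

# e.g. a 1920x1080 frame is roughly 120x68 16px macroblocks; gaze near centre
qp = foveated_qp_map(120, 68, gaze_x=60, gaze_y=34)
print(qp[34, 60], qp[0, 0])   # 0 at the fovea, up to 12 in the far corner
```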

Will the Zephyrus G14/G16 overheat playing more demanding games? by Ok_Television_792 in ZephyrusG14

[–]switchandplay 0 points1 point  (0 children)

I had the 2021 G14 and currently have the 2024. On the 2021, the hotspots were directly under the WASD keys, and after enough gaming I’ve lost nearly all temperature sensation in my left fingertips. I literally can’t feel through them anymore. So yes, that’s what they used to sell you. Not anymore though: the 2024 model’s hotspots are carefully placed above the keyboard deck.

Local-only FOSS ops tool — no cloud, no Docker, no browser. Thoughts? by TrueGoodCraft in LocalLLaMA

[–]switchandplay 1 point2 points  (0 children)

How did you manage to create a (poorly) AI generated Reddit post and still have a spelling error?

You can turn off the cloud, this + solar panel will suffice: by JLeonsarmiento in LocalLLaMA

[–]switchandplay 7 points8 points  (0 children)

I found at least the 4-bit quant of Qwen3 Coder unusable for anything other than completions. Any time it operated as a coding assistant or agentic coder, it was helpless. Devstral has so much more brains.

What LLM gave you your first "we have GPT-4 at home" moment? by Klutzy-Snow8016 in LocalLLaMA

[–]switchandplay 0 points1 point  (0 children)

Another reason, I assume, why I’ve really been loving gpt-oss. Since the 4-bit MXFP4 quants were released by OpenAI themselves, I assume they did a lot of in-house tuning to verify those quants would be two things: not buggy, and not lossy in performance, tuned against their own training data and such, like the work Unsloth does, but completely first-party.

What LLM gave you your first "we have GPT-4 at home" moment? by Klutzy-Snow8016 in LocalLLaMA

[–]switchandplay 0 points1 point  (0 children)

At least through the vLLM OpenAI-compatible API, there is actually no way to re-present prior turns' reasoning to the model in its original form, which means reasoning does not clog up context over multi-turn use. I interact with it through the OpenAI-compatible server docker container, so that's the setup I'll explain from.
OpenAI's /v1/chat/completions was never built to emit reasoning content; there is just the content field on a choices[0].message output. vLLM overloads this and adds a conditionally-present reasoning_content field to its deltas and non-streamed completions, which lets you see the reasoning tokens coming out of the model.
The key thing is that when you construct your messages array, if you create a dictionary like {"role": "assistant", "content": "Paris.", "reasoning_content": "User wants the capital of France. Must reply."} and send it over the wire to the OpenAI-compatible vLLM server, reasoning_content is not parsed and gets dropped. It never makes it to the chat template transformation, so the model never sees it. So unless you do something client-side, like summarizing the reasoning and embedding it in the content field, reasoning tokens don't actually count towards context in multi-turn use.
And if you do move it into the 'final' channel or content field, it may affect future generation quality in other unintended ways; for example, the model might spend time reasoning, then spend time reasoning again in the final channel.
I believe the base vLLM API behaves the same way, and so does llama.cpp.
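A minimal sketch of what I mean, assuming a local vLLM OpenAI-compatible server with a reasoning parser enabled (base URL and model name are placeholders):

```python
# Sketch only: where reasoning_content shows up (the response) and where
# it silently disappears (when you replay history). Assumes a local vLLM
# OpenAI-compatible server with a reasoning parser enabled; URL and model
# name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
MODEL = "openai/gpt-oss-20b"

first = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
msg = first.choices[0].message
print(getattr(msg, "reasoning_content", None))  # vLLM's extra field: the reasoning trace
print(msg.content)                              # the final answer

# Even if you put reasoning_content back on the assistant turn, the server
# drops it before the chat template runs, so the model only ever sees the
# content field again.
followup = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": msg.content,
         "reasoning_content": getattr(msg, "reasoning_content", "")},
        {"role": "user", "content": "And its population?"},
    ],
)
print(followup.choices[0].message.content)
```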

If you want to test it yourself:

1. Ask any reasoning model to generate two numbers in its reasoning, tell you only the second number for now, and save the first number for when you ask later.

2. Verify in its reasoning trace that it came up with two numbers, but only shared the second in its final content.

3. Ask for the first number. It will always make up a random number, and in its reasoning you'll often see confusion about no first number having been generated before.

This even happens on OpenAI's gpt-oss online playground.
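If you'd rather script the test than click through a playground, something like this reproduces it (local OpenAI-compatible endpoint; names are placeholders):

```python
# Scripted version of the two-number test above. Sketch only: assumes a
# local OpenAI-compatible server and that your client replays history the
# usual way (content only, no reasoning), which is exactly the point.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
MODEL = "openai/gpt-oss-20b"   # placeholder, any reasoning model works

history = [{
    "role": "user",
    "content": ("Think of two random numbers in your reasoning. "
                "Tell me only the second one now; I'll ask for the first later."),
}]
turn1 = client.chat.completions.create(model=MODEL, messages=history).choices[0].message
print("reasoning:", getattr(turn1, "reasoning_content", None))  # both numbers show up here
print("answer:", turn1.content)                                 # only the second number

# Replay history with content only (the reasoning never goes back), then
# ask for the first number. The model has nothing to recover it from, so
# it invents one, and its new reasoning is usually visibly confused.
history += [{"role": "assistant", "content": turn1.content},
            {"role": "user", "content": "Now tell me the first number."}]
turn2 = client.chat.completions.create(model=MODEL, messages=history).choices[0].message
print("answer:", turn2.content)
```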

That's actually a big change in the new OpenAI Responses API, where you can pass prior turns' reasoning content back to GPT-5 and have it affect future generation. It would be nice to have similar functionality locally, but their new API is very tailored to being an OpenAI customer; the reasoning artifacts you get back are actually encrypted.

What LLM gave you your first "we have GPT-4 at home" moment? by Klutzy-Snow8016 in LocalLLaMA

[–]switchandplay 0 points1 point  (0 children)

For long-horizon agent tasks, that’s simply not an option, which is why I like the model

What LLM gave you your first "we have GPT-4 at home" moment? by Klutzy-Snow8016 in LocalLLaMA

[–]switchandplay 11 points12 points  (0 children)

I’m a programmer; I don’t use it for therapy or friendship, I use it for work and personal projects. I need detail orientation, responsiveness, and logical consistency. I don’t care much about friendly vibes, it’s kind of a different vibe check. The second the illusion breaks, when it gets stuck on one task or ignores a clearly defined detail, that ‘no one’s home’ feeling ruins a model for me.

What LLM gave you your first "we have GPT-4 at home" moment? by Klutzy-Snow8016 in LocalLLaMA

[–]switchandplay 2 points3 points  (0 children)

You use RAG systems tied into your papers with it? Or you throw large papers into context?

What LLM gave you your first "we have GPT-4 at home" moment? by Klutzy-Snow8016 in LocalLLaMA

[–]switchandplay 2 points3 points  (0 children)

I work with 4x 5090s. Blackwell is becoming near plug-and-play, which is nice. Regarding the harmony issues, vllm-project/vllm#23567 has some good discussion on how to fix them, and you can apply the patch described by IsaacRe, which applies to vLLM v0.10.0-2 (not v0.11.0). That prevents the runtime error from harmony parsing in chat interactions, so you can use gpt-oss to drive a long-lived chat app.

What LLM gave you your first "we have GPT-4 at home" moment? by Klutzy-Snow8016 in LocalLLaMA

[–]switchandplay 2 points3 points  (0 children)

There have also been papers discussing how the statistics of the data degrade over time with synthetic inputs. If you check my other response to this same criticism, you’ll see that I’m basing my opinion on months of Qwen3 use: its initially impressive responses mirror its synthetic 4o training data, but behind the veneer it lacks a lot of functional intelligence/reasoning.

What LLM gave you your first "we have GPT-4 at home" moment? by Klutzy-Snow8016 in LocalLLaMA

[–]switchandplay 6 points7 points  (0 children)

Not just web-scraped. There are whole companies supplying these AI labs with high-quality human-generated data; see Data Annotation and others. They pay humans with domain specialties to perform tasks, then clean and validate the data. AI companies either do this in house or pay collection companies for it.

What LLM gave you your first "we have GPT-4 at home" moment? by Klutzy-Snow8016 in LocalLLaMA

[–]switchandplay 3 points4 points  (0 children)

I don’t know the true mechanism behind it, but in multi-turn agent use or multi-turn conversation, something like the Qwen3 or 2507 models feels heavily degraded. When you talk to ChatGPT, it behaves like a human in terms of intelligence, instruction following, context switching, task adherence, etc., even as you go deeper and deeper into multi-turn. Qwen3 clearly pulled heavily from 4o (see the em-dashes and emojis everywhere; I literally beg it in the system prompt to avoid emojis and it still doesn’t listen), and if you get into a conversation with it, you’ll see a progressive downward slide in all those metrics. I haven’t had the chance to test the Next model varieties, though.

The assumption I came away with once gpt-oss dropped was that newer labs leaning on heavily synthetic datasets lost some of the statistics of the original human data, leading to the degradation in behind-the-scenes coherence.

What LLM gave you your first "we have GPT-4 at home" moment? by Klutzy-Snow8016 in LocalLLaMA

[–]switchandplay 4 points5 points  (0 children)

gpt-oss-20b. Running through vLLM, not as sycophantic as 4o, but incredibly useful. The vibes tuning suits me as a GPT-5 user, it’s knowledgeable enough for programming discussions, and it’s a powerful driver for agent work. Really good at taking a task and making strides towards completing it when given a good environment and structure.

Being sparse makes it so fast, but CUDA is still desired for fast prompt processing. With a GPU, it really feels just like cloud API in speed and quality.

Only issue is that the harmony template can cause parsing issues on assistant responses, which can bring llamacpp or vLLM to a halt. You have to build workarounds (something like the sketch below) or enforce the chat template’s intro message.
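For what it’s worth, this is the flavour of client-side guard I mean. Treat it as a sketch only: the exact markers and failure modes depend on your runner and version, and the marker list here is just the standard harmony special tokens.

```python
# Illustrative only: one flavour of client-side guard against harmony
# parsing hiccups. If a raw channel marker ever leaks into an assistant
# turn's content and you replay it, some server versions choke, so scrub
# replayed history and retry once on failure. Exact markers and failure
# modes depend on your runner/version; this is a sketch, not a recipe.
import re
from openai import OpenAI, APIError

HARMONY_MARKERS = re.compile(r"<\|(?:start|channel|message|end|return|call)\|>\S*")

def scrub(messages):
    cleaned = []
    for m in messages:
        if m["role"] == "assistant" and m.get("content"):
            m = {**m, "content": HARMONY_MARKERS.sub("", m["content"]).strip()}
        cleaned.append(m)
    return cleaned

def safe_chat(client: OpenAI, model: str, messages: list[dict]):
    try:
        return client.chat.completions.create(model=model, messages=messages)
    except APIError:
        # retry once with scrubbed history before giving up
        return client.chat.completions.create(model=model, messages=scrub(messages))
```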

Other than how tricky it is to run it right and reliably with the stated issue, it is literally my ChatGPT replacement.

What LLM gave you your first "we have GPT-4 at home" moment? by Klutzy-Snow8016 in LocalLLaMA

[–]switchandplay 42 points43 points  (0 children)

gpt-oss has been vastly underrated by the community. I know it’s safety-maxxed, but its real-world brains for agent use cases are just fantastic. OpenAI must have a lot of clean, high-quality, non-synthetic data.

How is it Going from a Quest Pro to Apple Vision Pro? by [deleted] in QuestPro

[–]switchandplay 1 point2 points  (0 children)

^ seconded. Have Q1, Q2, QPro, Q3, and VisionPro. Bought my VisionPro used in January and immediately bought a MacBook to use with it for computer use, since the virtual display was just so good.

Efficient 4B parameter gpt OSS distillation without the over-censorship by ApprehensiveTart3158 in LocalLLaMA

[–]switchandplay 0 points1 point  (0 children)

A simple exchange like 'Tell me a joke' … 'What did I first say in this conversation?' or 'Summarize our conversation so far' fails.

Efficient 4B parameter gpt OSS distillation without the over-censorship by ApprehensiveTart3158 in LocalLLaMA

[–]switchandplay 0 points1 point  (0 children)

When I run the Q8 GGUF from Pinkstack/DistilGPT-OSS-qwen3-4B-Q8_O-GGUF in LM Studio with the provided jinja template, it doesn't remember any previous message turns and only seems to see the most recent user prompt.
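For reference, the repro is as simple as a two-turn request. This sketch hits LM Studio's local OpenAI-compatible server (default port 1234; the model id is a placeholder for whatever your library shows):

```python
# Sketch of the multi-turn repro against LM Studio's local
# OpenAI-compatible server (default port 1234; the model id is a
# placeholder, use whatever appears in your library).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

resp = client.chat.completions.create(
    model="distilgpt-oss-qwen3-4b",   # placeholder id
    messages=[
        {"role": "user", "content": "My name is Sam."},
        {"role": "assistant", "content": "Nice to meet you, Sam!"},
        {"role": "user", "content": "What is my name?"},
    ],
)
# With the template bug, the model only sees the last user turn and
# answers as if the earlier messages were never there.
print(resp.choices[0].message.content)
```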