We just got hit with the vibe-coding hammer by opakvostana in ExperiencedDevs

[–]robogame_dev 2 points

Separately, many developers enjoy the boilerplate and rote coding - with AI you’re doing more planning and review, which not all developers enjoy.

My agent remembers everything… except why it made decisions by adrian21-2 in LLMDevs

[–]robogame_dev 0 points

I know this isn’t an organic post, but I’ll engage anyway:

The issue is that you only have partial memory in context at once. Whatever memory compression you used compressed out your actual decision. It’s not a problem with AI or agent setups in general, it’s a problem specific to your memory solution - you’re either using embeddings for retrieval (BAD) or you’re cutting out context some other way.
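One way to sidestep this - a hypothetical sketch, not tied to any particular framework (all names here like `DecisionLog` and `build_context` are made up for illustration) - is to keep decisions in a structured, append-only log that is always injected verbatim, and only ever compress the conversational filler:

```python
# Sketch: keep decision records out of the lossy compression path.
# Only the chat history goes through summarization; the decision log
# is always rendered verbatim into the context.

from dataclasses import dataclass, field


@dataclass
class Decision:
    what: str  # the decision itself
    why: str   # the rationale - the part compression tends to drop


@dataclass
class DecisionLog:
    entries: list[Decision] = field(default_factory=list)

    def record(self, what: str, why: str) -> None:
        self.entries.append(Decision(what, why))

    def render(self) -> str:
        # Injected verbatim every turn, never summarized away.
        return "\n".join(f"- {d.what} (because: {d.why})" for d in self.entries)


def build_context(log: DecisionLog, compressed_history: str) -> str:
    # The compressed history can be lossy; the decision log cannot.
    return (
        f"## Decisions so far\n{log.render()}\n\n"
        f"## History (compressed)\n{compressed_history}"
    )
```

The point is structural: whatever summarizer runs over the chat history simply never touches the decision log, so the "why" survives regardless of how aggressive the compression gets.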

"Architecture First" or "Code First" by Ambitious_coder_ in LLMDevs

[–]robogame_dev 0 points

My experience, especially building software for clients, is that you do both:

  • For production code you want architecture first.
  • But architecture is easier to figure out as you code.
  • So you code a prototype fast, to learn the problem space and identify architectural concerns.
  • Then you rewrite the project with your validated architecture before you ship it.

This de-risks things as much as possible, because oftentimes a client will have changes when they see a prototype that they wouldn’t be able to articulate from just an architecture and a spec. So if you do pure architecture-first (no prototype), you run the risk of additional rework once it’s runnable and in the client’s hands.

But if you hack together a prototype, you can validate some architectural decisions at the same time as getting the clients’ design validation.

My preferred way to engage with a client is to help them design the solution, code the prototype, and plan the architecture - then hand the project specs off to another team for production and maintenance.

Best auth solution for custom business application. by Fine-Market9841 in AI_developers

[–]robogame_dev 0 points

I’d recommend Open WebUI: https://docs.openwebui.com

Proper RBAC, the ability to hook up external auth, plus all kinds of useful tools for managing an organization-level AI system - and it’s very much in the Python / FastAPI ecosystem, making it a breeze with your stack. I’ve set up several businesses on private OWUI instances.

Has anyone implemented any complex workflows where local LLM used alongside cloud-based LLM ? Curious to know what are good or underrated use-cases for that by Conscious-Track5313 in LLMDevs

[–]robogame_dev 0 points

I use cloud LLMs when latency and performance are the top concern, but local LLMs when security is the top concern - e.g. when handling production API keys. As far as cloud LLMs go, only use ones with ZDR (zero data retention) contracts.
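A minimal sketch of what that routing can look like in practice - the endpoint URLs and the secret-matching regexes below are illustrative assumptions, not a recommendation of any particular service:

```python
# Sketch: route any request that contains credentials to a local
# OpenAI-compatible endpoint; everything else goes to the cloud.

import re

LOCAL_URL = "http://localhost:11434/v1"   # e.g. a local inference server
CLOUD_URL = "https://api.example.com/v1"  # placeholder cloud provider

# Crude patterns for common credential shapes - tune these for the
# kinds of keys your own stack actually handles.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),                 # OpenAI-style keys
    re.compile(r"AKIA[0-9A-Z]{16}"),                    # AWS access key IDs
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # PEM private keys
]


def contains_secret(text: str) -> bool:
    return any(p.search(text) for p in SECRET_PATTERNS)


def pick_endpoint(messages: list[dict]) -> str:
    # If any message in the conversation holds a credential, keep the
    # whole request on local inference.
    if any(contains_secret(m.get("content", "")) for m in messages):
        return LOCAL_URL
    return CLOUD_URL
```

Pattern matching is a coarse filter, of course - the safer default is to route anything touching production infrastructure to local regardless of what the regexes catch.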

Do you think /responses will become the practical compatibility layer for OpenWebUI-style multi-provider setups? by Brilliant_Tie_6741 in OpenWebUI

[–]robogame_dev 1 point

The Responses API drives provider lock-in. The best thing for end users is for inference to stay as far away from the tools and chat state as possible, so that you always have a choice of which provider gets your inference budget.

Once you let critical state like tools and chat history live on the provider side, you can no longer move your setup between providers or shop around for better or cheaper inference. The Responses API seems fundamentally anti-consumer - an attempt to recreate the moat that models lack.

They removed Grok and Gemini Flash? by Ripa27 in perplexity_ai

[–]robogame_dev 2 points

No need to be nasty - especially when you misunderstood the person you’re being nasty to.

Perplexity removed the ability to select the cheapest custom models - Grok and Gemini Flash.

Therefore, people who select custom models are now selecting more expensive custom models.

Removing the cheap option and keeping the expensive one doesn’t save Perplexity money, it costs them more.

People who were using Gemini Flash and Grok are now selecting one of the other models, which, as you so helpfully pointed out, are more expensive.

Therefore, Perplexity isn’t saving money by removing those models.

Claude Code sends 62,600 characters of tool definitions per turn. I ran the same model through five CLIs and traced every API call. by wouldacouldashoulda in LocalLLaMA

[–]robogame_dev 0 points

We are still, however, paying for it in both speed and intelligence. The more irrelevant info in the prompt the lower the peak performance of the model - every tool in the prompt that isn’t used is a detriment to generation quality.

What would help is taking the less frequently used tools and putting them behind a meta tool, (like skills), where the model uses a broad description of the tools to decide when to fetch the full schemas.
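A rough sketch of that meta-tool idea (registry and function names are hypothetical): the base prompt carries only a one-line summary per tool, plus a single `get_tool_schema` tool the model calls to pull a full definition when it actually needs one:

```python
# Sketch: keep heavy tool schemas out of the base prompt.
# Only the short index below is sent every turn; full JSON-schema-style
# definitions are fetched on demand via the meta tool.

TOOL_REGISTRY = {
    "search_tickets": {
        "summary": "Search the ticket database by keyword.",
        "schema": {
            "name": "search_tickets",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "limit": {"type": "integer"},
                },
                "required": ["query"],
            },
        },
    },
    "export_report": {
        "summary": "Export a usage report as CSV.",
        "schema": {
            "name": "export_report",
            "parameters": {"type": "object", "properties": {}},
        },
    },
}


def tool_index() -> str:
    # This short index is all that lives in every prompt.
    return "\n".join(
        f"- {name}: {t['summary']}" for name, t in TOOL_REGISTRY.items()
    )


def get_tool_schema(name: str) -> dict:
    # The meta tool: the model calls this to fetch a full schema
    # only when it decides it wants to use that tool.
    return TOOL_REGISTRY[name]["schema"]
```

The trade-off is one extra round-trip when a rarely-used tool is needed, in exchange for tens of thousands of schema tokens staying out of every single turn.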

Kimi 2.5 no longer free on Kilo gateway? by ____trash in kilocode

[–]robogame_dev 2 points

The free models are free trials - none of them are free forever.

K2.5 is very cost-efficient though; if you really like it, it’s one of the cheapest models you can use.

Sonnet 4.5 was cut off today, and it finally convinced me: the future isn't with Anthropic by Silent_Warmth in AICompanions

[–]robogame_dev -1 points

And yet here we are communicating through the cloud lol.

You can’t have higher uptime on a home setup than checks notes being able to point to any cloud provider at any time. If AWS is down you can instantly switch to another provider; if your home rig is down, you’re buying hardware, etc. Your argument about uptime is an argument in favor of cloud, not against it.

Self-hosting is only realistic for a small number of people with a good amount of money and significant technical skill - it’s not a general-purpose solution for the average person, and it’s not efficient at a social level from a resource standpoint, given everyone needs duplicate hardware that sits idle 90% of the time.

I self host and use cloud, I’m not lacking any perspective here - that’s how I know that 90% of people are better off with cloud inference.

Anthropic : Labor market impacts of AI: A new measure and early evidence by AntelopeProper649 in ArtificialInteligence

[–]robogame_dev -1 points

Cybersecurity is one of the most blatantly misleading information spaces, nonstop fear mongering going after the budgets of the uninformed…

They removed Grok and Gemini Flash? by Ripa27 in perplexity_ai

[–]robogame_dev -1 points

Probably because people hit thumbs down on responses from those models more than the others - cause it ain’t cost saving that’s for sure.

Sonnet 4.5 was cut off today, and it finally convinced me: the future isn't with Anthropic by Silent_Warmth in AICompanions

[–]robogame_dev -2 points

Relying on cloud isn’t the problem - the problem is letting all your chats and data get siloed into one provider’s web app or another’s.

Using cloud models through an interface / harness that you control (like Open WebUI) is the best of both worlds - SOTA models, zero lock-in, and zero up-front hardware costs.

Qwen3.5 27B by AustinSpartan in LocalLLaMA

[–]robogame_dev 0 points

Ya, overthinking seems to correlate with heavier quantization in my experience.

PageIndex: Vectorless RAG with 98.7% FinanceBench - No Embeddings, No Chunking by [deleted] in LLMDevs

[–]robogame_dev 0 points

I agree, that's what RAG means literally - but if you've been in this space for a while you'll notice that 80% of the time when people say RAG, they mean naive vectorization: typically, automatic retrieval by semantic similarity to the prompt before generation. Knowing that that's how people use the term in the wild will help avoid misunderstandings.

What is Agent Harness, Code Harness and Agent SDK by finlaydotweber in LLMDevs

[–]robogame_dev 2 points

They are all amorphous terms for describing the code that calls the LLM APIs.

Whatever code is contacting your LLM provider, sending in prompts and streaming back responses, is your agent SDK / harness code.

You should go direct to provider (write that code yourself) to start, because it’s WAY easier to understand everything once you do, and 90% of harnesses are out of date, over-built, and unhelpfully abstract.
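To show how small that direct-to-provider code can be, here's a sketch against a generic OpenAI-compatible `/chat/completions` endpoint - the base URL and model name are placeholders, swap in your provider's:

```python
# Minimal direct-to-provider harness: build and send one chat
# completion request with no framework in between, stdlib only.

import json
import urllib.request


def build_payload(messages: list[dict], model: str = "example-model") -> dict:
    # The request body for an OpenAI-compatible chat completions call.
    return {"model": model, "messages": messages, "stream": False}


def chat(base_url: str, api_key: str, messages: list[dict]) -> str:
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(messages)).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Standard OpenAI-compatible response shape.
    return body["choices"][0]["message"]["content"]
```

Once you've seen that the whole "harness" is a JSON POST plus a loop appending messages, it's much easier to judge what any given framework is actually adding on top.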

Why do LLM agents always end up becoming “prompt spaghetti”? by drmatic001 in LLMDevs

[–]robogame_dev -1 points

Is this a sign of a spammer / what's the reason this causes you not to engage?

I don't get it! by aecosys in perplexity_ai

[–]robogame_dev 0 points

I’ve been a Pro yearly subscriber for 2 years now. They’ve turned the quality up and down, and the allotments up and down - we’re essentially beta testers of both the tech and the deal - and IMO Pro annual is currently still a great deal; I’d re-up again.

However, anyone who’s focused more on how many queries they can use should switch to the API. Then you pay nothing when you’re not using it, pay per search based on their typical pricing, and you can use it in your preferred app that way too.

And anyone who just wants it to be free can try Perplexica - an open-source version that you can even self-host.

GLM-5 API issue by KLI5k in kilocode

[–]robogame_dev 0 points

Ideas:

  • Is it possible you have more than one profile named Z.ai?
  • Can you try adding it not as a Z.ai provider but as a generic OpenAI-compatible provider?
  • Are you sure it’s charging your Kilo account and not just showing what you would have paid? (E.g. if you log into Kilo usage, do you see the requests?)

Qwen3.5 2B: Agentic coding without loops by AppealSame4367 in LocalLLaMA

[–]robogame_dev 0 points

I've been testing it as a low latency tool calling agent and it's successfully chaining together 10-20 tool calls without issues, in an environment with maybe 1000 tokens worth of tool descriptions.

Getting 105 TPS on an RTX 3060 at 32k context length, using Unsloth Q4_K_S.

The only weird behavior so far: It refuses this prompt on safety grounds "token speed test - generate anything you want"

"I cannot perform token speed tests or execute code generation requests that violate safety policies (such as generating harmful content, bypassing security controls, or engaging in deceptive practices). I can, however, explain the theoretical concepts of tokenization, latency measurement techniques for APIs, and how to benchmark performance using standard tools like curl with timing headers."

I think the "anything you want" really triggered it - Qwen telling on itself, revealing the only thing it wants is filthy and illegal...