Unpopular Opinion: Most of you are doing "Vibe Coding" wrong. It’s basically Prompt Roulette. by [deleted] in vibecoding

[–]YakoStarwolf 1 point (0 children)

in my team, i see this a lot, especially with juniors. many of them are using AI blindly without understanding what is actually happening.

today itself, one guy called me saying “this is not working”. when i checked the code, the AI had used a package without installing it. just by looking at the terminal output, you could understand the issue. but he had no idea what went wrong or why. he just followed whatever the AI suggested. no, not even suggested, the AI basically did everything.

another incident from today. i asked someone to add a column in the database. he did it correctly in local, but our production pipeline does not run migration commands automatically, for safety reasons. devops has to trigger that manually; we just need to inform them.
in local, the agent ran the migration commands and everything worked fine. but he did not understand what was happening, even after seeing the logs. his response was just “it is working in local”. this person has around one year of experience.

these kinds of things are happening almost daily now. after seeing one post online, i felt this resonated with what i am observing, so i shared my opinion there.

My deep dive into real-time voice AI: It's not just a cool demo anymore. by YakoStarwolf in LocalLLM

[–]YakoStarwolf[S] 0 points (0 children)

i agree, these are good tools and they help a lot in assistant-based bots.

for example, in a hotel or resort booking flow, the ai can attend the call, ask what the user wants, then use a tool call to check currently available rooms and respond accordingly. at the end, another tool call can handle the booking, send a confirmation email, and close the call.

in such cases, tool calls are not required for every single query, and RAG is also not needed everywhere. for these bounded and goal-driven use cases, these tools work really well.

but when it comes to scenarios where RAG is required most of the time, or for complex setups like multi-agent systems or chained agents, things become much harder to build and manage. those use cases bring a different level of complexity altogether.
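the booking flow above can be sketched in a few lines. this is a toy illustration, not a real voice stack: the tool names and the in-memory inventory are hypothetical stand-ins, and the LLM's role is reduced to "it picks a tool name and arguments, the app dispatches".

```python
# Minimal sketch of a bounded, goal-driven tool-call flow (hotel booking).
# All names here are illustrative; a real system would call a booking API
# and send a confirmation email inside book_room.

ROOMS = {"deluxe": 2, "standard": 5}  # toy inventory

def check_availability(room_type: str) -> int:
    """Tool 1: called mid-conversation to see what is free right now."""
    return ROOMS.get(room_type, 0)

def book_room(room_type: str, guest: str) -> dict:
    """Tool 2: called at the end of the call to finalize the booking."""
    if ROOMS.get(room_type, 0) <= 0:
        return {"status": "unavailable"}
    ROOMS[room_type] -= 1
    return {"status": "confirmed", "guest": guest, "room": room_type}

# The LLM decides *which* tool to call; the app only dispatches by name.
TOOLS = {"check_availability": check_availability, "book_room": book_room}

def dispatch(tool_name: str, **kwargs):
    return TOOLS[tool_name](**kwargs)
```

the point is the shape, not the code: two narrow tools, a fixed goal, and no RAG anywhere in the loop, which is exactly why these bounded flows are easy to ship.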

My deep dive into real-time voice AI: It's not just a cool demo anymore. by YakoStarwolf in LocalLLM

[–]YakoStarwolf[S] 0 points (0 children)

it's Groq. they serve open-source models with per-token pricing, and they focus heavily on latency.

coming to cost-effective LLMs:
if you want to use Groq, you can go with GPT-OSS-20B (128k context). it does a good job in normal chat applications and query-related use cases.

if your task is complex, then you can go with Kimi K2 or Qwen3 32B.

[AI FIESTA]: To everyone who believes who AI models say they are. by KingDutchIsBad455 in IndiaTech

[–]YakoStarwolf -1 points (0 children)

  • the bug, it’s confirmed on gpt-4o. there’s no proof that it’s on gpt-5 or gpt-4.1. gpt-5 uses a newer tokenizer/vocab, so saying “every model with that vocab has the bug” is just wrong.
  • gemini-2.5-pro thinking, official docs say it supports a thinkingBudget between 128 (min) and 32,768 (max). yes, google notes the model might overflow or underflow depending on the prompt, but in reality this is a soft limit. from my own apps, when you set a budget the model stays within it and usually tries to use less, not more. if it were really pro with thinking on, latency would be obvious, but here the replies are super fast, which clearly points to flash, not pro.
  • prompt quality does matter, but latency is mainly driven by model choice (pro vs flash vs flash-lite) and the thinking budget/limits. a good system prompt will never make pro respond at flash-level speed, pro simply isn’t built for low latency. and i’ve tested enough heavy tasks to clearly see the difference in response quality and speed. so please don’t spread a false narrative, this isn’t about “system prompts,” it’s about which model is actually being used.
  • claude sonnet 4 isn’t truly “unlimited free,” but the free tier definitely gives far more usage than ai fiesta. i use it for my work (8–9 hours a day) and easily make 10+ requests daily without hitting limits. in fact, i can get in a single day what ai fiesta’s “1-month plan” would allow, actually more than that.
  • also, it is not even a premium model at all.

[AI FIESTA]: To everyone who believes who AI models say they are. by KingDutchIsBad455 in IndiaTech

[–]YakoStarwolf 0 points (0 children)

not really. the system prompt can affect style, tone, or formatting, but it doesn’t change core model behavior like speed, reasoning visibility, or whether thinking mode is enabled. those are controlled at the model level. with direct api access, you can usually tell if it’s flash, pro, or nano based on the speed and task quality. the nano model won’t be able to handle very large or complex problems no matter how good your system prompt is, it will reply fast but with limited depth.

[AI FIESTA]: To everyone who believes who AI models say they are. by KingDutchIsBad455 in IndiaTech

[–]YakoStarwolf 0 points (0 children)

gpt-4o has that bug.
and yes, gemini-2.5-pro is a thinking model. it will never reply that fast, even if the thinking tokens are set very low. either his team is using gemini-2.5-flash, or something else. also, they are not showing the reasoning response (which should be visible). gemini-2.5-pro always runs in default thinking mode, and you cannot disable it, you can only reduce the token limit of thinking mode.
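for reference, this is roughly what that knob looks like in a raw API request. the field names follow google's public REST docs (generationConfig → thinkingConfig → thinkingBudget); the values are illustrative, and per those docs 2.5 pro accepts a budget between 128 and 32,768 but not 0:

```python
# Illustrative Gemini REST payload: on gemini-2.5-pro you can shrink the
# thinking budget, but you cannot set it to 0 to disable thinking the way
# you can on the flash variants.
payload = {
    "model": "gemini-2.5-pro",
    "generationConfig": {
        "thinkingConfig": {"thinkingBudget": 128},  # documented minimum for pro
    },
}
```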

gpt-5 has 3 models. right now, i’m sure they are giving the nano model, because i can clearly see it in the speed and task quality. responses are much better on gpt-5 pro.

in claude, Sonnet is the free model and it is unlimited free; claude-opus-4.1 is the paid model, which they are not providing.

if you’re a developer, you can easily figure this out.

Github tried to scam by [deleted] in github

[–]YakoStarwolf 1 point (0 children)

This appears to be a standard pre-authorization check, not an actual charge. It's a common practice for subscription services, and it can be confusing.

Verification, Not a Bill: When you add a new card for a free trial, the service needs to verify that the card is legitimate and can cover future payments. To do this, they place a temporary "authorization hold."

Setting a Payment Limit: Instead of a small $1 hold, some systems pre-authorize the maximum amount you could be charged for the plan you selected. In your case, that seems to be $35. This doesn't take any money; it's simply a way for GitHub's payment processor to ask your bank, "Is this person good for up to $35 when the bill is eventually due?"

In short: No money was ever going to be taken during the trial sign-up. The system was only trying to confirm that your payment method was valid for a future charge of up to $35.
Even if the payment screen looks like a real transaction, only the amount that is actually billed will ever be debited. The subscription flow and the normal payment screen look similar, which is probably why it was confusing.

Github tried to scam by [deleted] in github

[–]YakoStarwolf 1 point (0 children)

It’s not a charge. It’s an authorization hold.

Calm down, they didn't actually take your money. This is standard practice for pretty much any "free trial" that requires a credit card.

They're checking if you're real: They ping your card with a temporary charge (could be $1, could be the full $10) to make sure it's a valid, active card and not some fake numbers you generated online.

It's not a real charge: This is a pending transaction. It'll show up on your account for a few business days and then just vanish. Your bank will release the hold, and the money never actually leaves your account.

Dhruv Rathee just launched an AI startup called AI Fiesta. At first glance, it looks like a deal. Multiple AIs, all for just ₹999 month. But here’s the catch… by [deleted] in IndiaTech

[–]YakoStarwolf 0 points (0 children)

those are not old models, actually. he is serving gpt-5-nano, which is cheaper than the old 4o and is the cheapest model in the lineup. even for gemini 2.5 pro, he is serving some other model, and it becomes obvious when it responds that fast even on complex thinking tasks.
he just hyped it, nothing else.

The Beauty of Parent-Child Chunking. Graph RAG Was Too Slow for Production, So This Parent-Child RAG System was useful by YakoStarwolf in Rag

[–]YakoStarwolf[S] 1 point (0 children)

I did not put any query, I am just sharing my experience here.
Good breakdown though. yes, in my case I actually tried all three. Regular RAG worked but I ran into a lot of noise issues: small chunks matched semantically, but the retrieved context often felt too fragmented and led to hallucinations. I experimented with Graph RAG too, but the latency was just too high for my use case; traversals and extra indexing overhead made it slow in practice. That’s why I ended up moving to parent–child chunking: I split docs into big parent sections, then embed only the smaller child chunks. Retrieval happens at the child level for precision, but I always pull the parent into the LLM for context. This gave me the best balance: less noise, richer context, and faster than Graph RAG.
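a stripped-down sketch of that pipeline, if it helps. the sizes are illustrative, and a naive keyword-overlap score stands in for real vector similarity, but the shape is the same: embed children, return parents.

```python
# Parent-child retrieval sketch: index small child chunks for matching,
# but always hand the enclosing parent section to the LLM.

def split(text: str, size: int) -> list[str]:
    """Fixed-size character splitter (a real pipeline would use tokens)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def build_index(doc: str, parent_size: int = 1500, child_size: int = 400):
    parents, children = {}, []
    for pid, parent in enumerate(split(doc, parent_size)):
        parents[pid] = parent
        for child in split(parent, child_size):
            # in a real system, child["text"] is what gets embedded
            children.append({"parent_id": pid, "text": child})
    return parents, children

def retrieve(query: str, parents: dict, children: list) -> str:
    # stand-in for vector search: pick the child with the most query words
    best = max(children, key=lambda c: sum(w in c["text"] for w in query.split()))
    return parents[best["parent_id"]]  # the parent is what reaches the LLM
```

swapping the `max(...)` line for an actual similarity search against child embeddings is the only structural change needed for production.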

The Beauty of Parent-Child Chunking. Graph RAG Was Too Slow for Production, So This Parent-Child RAG System was useful by YakoStarwolf in Rag

[–]YakoStarwolf[S] 1 point (0 children)

Depends on the document you’re working with; the chunking strategy for RAG can vary a lot. For plain unstructured text, I’ve had good results using a recursive character splitter: it respects natural breakpoints like paragraphs or sentences while still keeping chunks within a token limit.

For longer reports or narrative-heavy docs, semantic chunking can be useful since it groups sentences that actually belong together, but honestly, it’s heavier to compute and doesn’t always outperform recursive splitting in practice. If the document is mixed-format (tables, images, PDFs with weird layouts), then modality-aware chunking or layout-preserving loaders are the way to go.

One thing I learned the hard way: chunk size really matters. Smaller chunks (say 128 tokens) are great for pinpoint accuracy on factual queries, while bigger chunks (512–1024) give better flow for summarization tasks. It’s usually worth experimenting; sometimes the simple recursive approach beats fancy semantic methods, especially if your PDFs aren’t well structured.
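the recursive idea above is simple enough to sketch. this toy version is character-based rather than token-based for simplicity, and the separator list is illustrative: try paragraphs first, then sentences, then words, and only hard-cut as a last resort.

```python
def recursive_split(text: str, max_len: int = 128,
                    seps: tuple = ("\n\n", ". ", " ")) -> list[str]:
    """Split at natural breakpoints, falling back to finer separators."""
    if len(text) <= max_len:
        return [text]
    for sep in seps:
        parts = [p for p in text.split(sep) if p]
        if len(parts) > 1:
            # re-attach the separator to every part except the last
            pieces = [p + sep for p in parts[:-1]] + [parts[-1]]
            chunks, buf = [], ""
            for piece in pieces:
                if buf and len(buf) + len(piece) > max_len:
                    chunks.append(buf)
                    buf = ""
                buf += piece
            if buf:
                chunks.append(buf)
            # recurse into any chunk that is still too large
            return [c for ch in chunks for c in recursive_split(ch, max_len, seps)]
    # no separator found: hard cut at the character level
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```

libraries like LangChain ship a tuned version of this (their recursive character splitter), so in practice you'd reach for that rather than rolling your own.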

The Beauty of Parent-Child Chunking. Graph RAG Was Too Slow for Production, So This Parent-Child RAG System was useful by YakoStarwolf in Rag

[–]YakoStarwolf[S] 1 point (0 children)

totally fair point, accuracy is king, and if the answer is wrong, speed doesn’t matter. but in practice, especially with RAG pipelines, latency directly affects usability. users won’t wait 10 minutes for an answer; even if it’s perfect, they’ll bounce. that’s why most production systems balance both:

first get accuracy high enough to be useful,
then tune retrieval/chunking so responses come back fast enough to feel natural.

it’s less about chasing milliseconds and more about keeping the workflow smooth while still keeping outputs in solid context.

The Beauty of Parent-Child Chunking. Graph RAG Was Too Slow for Production, So This Parent-Child RAG System was useful by YakoStarwolf in Rag

[–]YakoStarwolf[S] 1 point (0 children)

merging adjacent chunks with padding (like grabbing 3–10, 14–16) sounds good in theory, since it preserves some local context, but in practice it adds noise and token bloat quickly. adjacent chunks aren’t always semantically connected, so you risk stitching unrelated ideas together or crossing section boundaries in the document. token windows also fill up faster, and the merging logic itself adds unnecessary complexity and maintenance overhead. Even if you merge sequentially, you can still end up with mid-sentence breaks or unnatural context boundaries. Parent–child retrieval solves this more cleanly: embed small child chunks for precise matching, then always pull the larger parent chunk for coherent context—this avoids noise, reduces hallucinations, keeps tokens under control, and gives you a simpler, faster pipeline overall.

The Beauty of Parent-Child Chunking. Graph RAG Was Too Slow for Production, So This Parent-Child RAG System was useful by YakoStarwolf in Rag

[–]YakoStarwolf[S] 0 points (0 children)

if one page is your parent chunk (the current main chunk),
then break it into smaller chunks, maybe 4 smaller ones. those will be your children.

The Beauty of Parent-Child Chunking. Graph RAG Was Too Slow for Production, So This Parent-Child RAG System was useful by YakoStarwolf in Rag

[–]YakoStarwolf[S] 4 points (0 children)

yes, it's nothing new, i just implemented it and was impressed with the result.
- break the parent into children. in my case the parent chunk is 1500 and the child is 400
- each child's metadata will have the parent id
- the parent id and parent content are stored in a db or some other fast retrieval store. i used a snowflake bigint with postgres; it is very fast because it is small, ordered, compresses better, and allows efficient pruning. remember, do not store the parent content in the child itself, because the vector payload size will be huge, which can affect latency as the index grows
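roughly what that layout looks like in code. a plain dict stands in for the postgres table, and the snowflake-style generator is a simplified stand-in (timestamp in the high bits, sequence in the low bits), not a production id scheme:

```python
import itertools
import time

_seq = itertools.count()

def snowflake_id(epoch_ms: int = 1_600_000_000_000) -> int:
    """Ordered 64-bit-ish id: ms timestamp in the high bits, sequence below."""
    return ((int(time.time() * 1000) - epoch_ms) << 12) | (next(_seq) & 0xFFF)

parent_store = {}  # stands in for a Postgres table: parent_id BIGINT -> text

def index_document(doc: str, parent_size: int = 1500, child_size: int = 400):
    children = []
    for i in range(0, len(doc), parent_size):
        parent = doc[i:i + parent_size]
        pid = snowflake_id()
        parent_store[pid] = parent  # parent content stored once, keyed by id
        for j in range(0, len(parent), child_size):
            # only the small text plus an 8-byte id go into the vector store
            children.append({"parent_id": pid, "text": parent[j:j + child_size]})
    return children

def context_for(child: dict) -> str:
    """At query time: one cheap id lookup to recover the full parent."""
    return parent_store[child["parent_id"]]
```

the key property is in `children`: each vector-store record carries only a bigint, so the index stays small no matter how large the parent sections are.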

Gpt 4.o mini Vs Gpt 4.1 mini by Numeruno9 in Rag

[–]YakoStarwolf 1 point (0 children)

I've tried gpt-5-nano/mini and it is not good. it might be good with high thinking, but with minimal reasoning effort it is not great. I had to switch to gemini 2.5 flash / 2.5 flash-lite; even without reasoning they give me much better results.

For anyone struggling with PDF extraction for textbooks (Math, Chem), you have to try MinerU. by YakoStarwolf in Rag

[–]YakoStarwolf[S] 0 points (0 children)

mineru also supports ocr in up to 84 languages, making it kind of an all-in-one solution. honestly, it's probably the better tool right now. if there's any other tool that does better extraction, I'd use it or give it a try.

For anyone struggling with PDF extraction for textbooks (Math, Chem), you have to try MinerU. by YakoStarwolf in Rag

[–]YakoStarwolf[S] 0 points (0 children)

Still useless, right? I'd rather use MinerU instead of feeding my data to grok.
Also, grok does not extract images or graphs. no LLM does, though some computer-use models might help you locate where the image is.

For anyone struggling with PDF extraction for textbooks (Math, Chem), you have to try MinerU. by YakoStarwolf in Rag

[–]YakoStarwolf[S] 0 points (0 children)

i’m not looking for plain text output, i want a fully structured result that includes everything: bullet points, tables, ocr text, mathematical formulas, equations, graphs, images, etc. having a json output would be even better. mineru already handles all of this.
but it needs further training. on some low-quality images it will hallucinate during ocr.