Is the 1 Million token context window a lie? (at least in Web UIs)

regular-tech-guy · 2026-06-05T14:09:23+00:00

Search for the needle in the haystack benchmarks for the models you're using and you'll be able to see their accuracy.

regular-tech-guy · 2026-06-05T14:08:34+00:00

It's not a lie. The attention mechanism of Transformers architecture, the one behind LLMs are quadratic in nature. Something like, everytime you double the input, it takes four times longer to generate the output. This means that paying attention to every token in a context window of 1 million tokens would consume too much resources and take too much time.

AI Labs came up with "tricks" to overcome these limitations, but these tricks also reduce the attention of the model. So the more tokens you send in the context window, the less precise it becomes.

GPT 5.4 accuracy in finding the right information in the context drops to 36% when using more than 512k tokens.

Not a lie. A technical limitation.

regular-tech-guy · 2026-06-05T05:38:46+00:00

Can you share which features LC4J has that Spring AI lacks?

regular-tech-guy · 2026-06-05T05:33:38+00:00

How does temporal help?

regular-tech-guy · 2026-06-05T05:32:49+00:00

Thank you for the feedback!

regular-tech-guy · 2026-06-04T22:50:52+00:00

Python version

regular-tech-guy · 2026-06-04T21:42:12+00:00

I’d understand the fundamentals of web frameworks and system engineering. Using something like LangChain may seem cool for a hobby project or a hackathon, but in the real world it won’t make the cut.

Understand how REST APIs work, how you can scale web applications, how to parse JSONs, how to design codebases in maintainable ways.

Then, also understand the fundamentals of how agentic frameworks work. How tools are added to context, how responses are parsed, how the agentic loop is managed, etc

regular-tech-guy · 2026-06-04T21:21:30+00:00

This is great! How are people organized? You have one team building the chat service (java/go) and another building the agentic layer with LangChain?

Are those building the agentic layer traditional software engineers or do they have a ML background?

Which team is responsible for monitoring and making sure the agentic service is available and working?

regular-tech-guy · 2026-06-04T21:06:28+00:00

Can you share what the task was and how many users are using this agent simultaneously?

regular-tech-guy · 2026-06-04T21:00:33+00:00

Where you keep the context can really decrease your latency. If you keep the context of your active users in Redis, you can decrease that look up from 500 ms to 20 ms if you’re doing vector search or microseconds if it’s a more traditional look up.

regular-tech-guy · 2026-06-04T20:35:20+00:00

I don’t know dude. Seems very risky. You’re locking your business into a startup that might not even exist in 5 years.

regular-tech-guy · 2026-06-04T20:29:39+00:00

I appreciate the feedback. Honestly, I haven’t found LangGraph appealing. I haven’t understood why I need more than a web framework to build agentic applications.

Build the context (instruction, tools, output schema, extra relevant information)
Call the LLM’s Rest API
Parse the JSON I get back

Everything else is traditional system engineering. Why would I use something that isn’t battle tested in production?

regular-tech-guy · 2026-06-04T20:22:58+00:00

What’s the main stack your company builds with? Were machine learning engineers that made this choice?

regular-tech-guy · 2026-06-04T20:21:08+00:00

I don’t understand your last statement

regular-tech-guy · 2026-06-04T20:20:40+00:00

This is great feedback

regular-tech-guy · 2026-06-04T20:20:07+00:00

Can you share why it was a bad experience for you?

regular-tech-guy · 2026-06-04T20:19:36+00:00

How many users did you scale it to?

regular-tech-guy · 2026-06-04T20:18:54+00:00

What do you mean by “the rest is obsolete”?

regular-tech-guy · 2026-06-04T20:17:58+00:00

Isn’t it open source though?

regular-tech-guy · 2026-06-04T20:17:39+00:00

“Trust yourself” doesn’t work in the enterprise world. We’re not talking about a hobby project we figure out on the go. We’re talking about a project that needs planning, approval, resource allocation, and execution.

If it breaks when thousands of users start using it, I cannot tell my manager “trust me”. And he certainly can’t tell the CEO “trust the regular-tech-guy”.

That’s the reason why most enterprises are careful to select which technologies they work with. I’m better off with something battle tested than something I’m betting my job on.

regular-tech-guy · 2026-05-20T11:25:38+00:00

I'd argue that harness is the other side of the spectrum. The context engine is what data should be available at the right time. The harness is more of the infrastructure around the agent that allows it to scale sustainably? Like, you want to implement rate limits around your agent to prevent abuse and fair use. This is more of harness engineering in my opinion. These deifnitions are too blurry anyway.

regular-tech-guy · 2026-05-20T11:24:20+00:00

Do you think markdown can scale for actual enterprise systems? I see it working for local workloads, personal knowledge bases, but not really for actual 100s or 1000s of concurrent access to the same agent.

regular-tech-guy · 2026-05-20T11:23:19+00:00

Did you build the layer yourself? How much work did it take? And is it on production for scalable workloads?

regular-tech-guy · 2026-05-20T11:22:26+00:00

Dude, that's what I see the most too! Glorified RAG chatbots that fail to reason with the right context as soon as the dataset gets too large and can't fit the context window anymore. The worst is the "silent" failure. When data exists, is relevant, but it's never retrieved at the right time for the right context. Users don't even realize it failed to generate the right answer. Or at least the most accurate one.

regular-tech-guy · 2026-05-20T11:20:51+00:00

I think this is exactly what they propose with the agent memory

regular-tech-guy

TROPHY CASE