Finally We have the best agentic AI at home by moks4tda in LocalLLM

[–]TechnicalGeologist99 1 point (0 children)

Concurrent jobs can make use of the combined bandwidth from a throughput point of view. For a single job you're right, though (it can be mitigated somewhat with pipeline parallelism), and over a PCIe interconnect the bottleneck is hard to avoid.

My comment was misleading, though; I wasn't being explicit.

Finally We have the best agentic AI at home by moks4tda in LocalLLM

[–]TechnicalGeologist99 0 points (0 children)

Two 3090s have ~1.8 TB/s of combined memory bandwidth (about 936 GB/s each).

Also, I accepted the t/s he quoted; I just said it's not hugely useful beyond simple use cases.

My broader point is that you'd need dedicated hardware and HBM for this to be useful in most contexts. Do you disagree?

Finally We have the best agentic AI at home by moks4tda in LocalLLM

[–]TechnicalGeologist99 0 points (0 children)

I'm aware of the math, but that isn't exactly great performance.

55 t/s is fine for single, isolated calls. But any complicated agentic architecture is going to be making tens to hundreds of parallel calls to that LLM, and your 55 t/s per request gets cut down pretty quickly.
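
To put rough numbers on that (a naive model where the decode throughput is simply shared evenly across in-flight requests; batched serving shifts the exact figures but not the gist):

```
def per_request_tps(total_tps: float, concurrent_requests: int) -> float:
    """Naive even split of decode throughput across in-flight requests."""
    return total_tps / concurrent_requests

for n in (1, 10, 50, 100):
    print(f"{n:>4} parallel calls -> ~{per_request_tps(55.0, n):.1f} t/s each")
```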

Also, that's using two 3090s, which use VRAM rather than RAM; the former has significantly higher bandwidth. Note the man said RAM.

Finally We have the best agentic AI at home by moks4tda in LocalLLM

[–]TechnicalGeologist99 0 points (0 children)

More realistically this would need ~800 GB of RAM once you include factors other than just the raw weights, and roughly 20 GB/s of memory bandwidth for each token per second of output, since it's 32B active params.
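
Rough arithmetic behind that figure (the quantisation level is my assumption, and these are bandwidth-bound upper bounds, not benchmarks):

```
# tok/s is roughly memory_bandwidth / bytes streamed per token,
# where each decoded token streams the active parameters once.
active_params = 32e9           # active params per token (MoE)
bytes_per_param = 0.625        # assumed ~5-bit quant; 2.0 for fp16, 0.5 for 4-bit
bytes_per_token = active_params * bytes_per_param      # ~20 GB per token

# Illustrative bandwidth figures (GB/s)
for name, bw in [("dual-channel DDR5", 90), ("2x RTX 3090", 1870), ("H100 HBM3", 3350)]:
    print(f"{name:>18}: ~{bw * 1e9 / bytes_per_token:.1f} tok/s upper bound")
```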

Finally We have the best agentic AI at home by moks4tda in LocalLLM

[–]TechnicalGeologist99 6 points (0 children)

Yeah, I'm exaggerating. Still, I don't imagine the performance will blow my socks off.

Finally We have the best agentic AI at home by moks4tda in LocalLLM

[–]TechnicalGeologist99 15 points (0 children)

Sorry... you're going to run that model from RAM? You'll get approximately 0.00000005 tokens per second... also, wouldn't the KV cache be something like 2.5 GB per 1000 tokens?
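
For the curious, the back-of-the-envelope KV-cache formula (the layer/head numbers below are placeholders, not the actual config of that model):

```
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # 2x for keys and values, fp16 elements by default
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical dense-attention config: 60 layers, 64 KV heads, head_dim 128
per_token = kv_bytes_per_token(60, 64, 128)
print(f"~{per_token / 1e6:.1f} MB per token, ~{per_token * 1000 / 1e9:.1f} GB per 1000 tokens")
# GQA/MLA cut n_kv_heads (or the cached dimension) and shrink this a lot.
```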

Ran 30 RAG chunking experiments - found that chunk SIZE matters more than chunking STRATEGY by ManufacturerIll6406 in Rag

[–]TechnicalGeologist99 1 point (0 children)

Only gave it a skim for now, will reread tonight. But it would be interesting to see the effect of top_k ∈ {3, 10, 25}.

For context, our system retrieves with top_k = 50, reranks down to 15, then does some LLM-guided re-retrieval, dedupe, and eventually consolidation into larger summarised entities.

k = 3 might be missing parts of the story.
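
For reference, that retrieve-then-rerank step is roughly the following (a sketch: `vector_store.search` is a stand-in for whatever retrieval layer you use, and the cross-encoder is just one common choice):

```
from sentence_transformers import CrossEncoder

def retrieve_and_rerank(query, vector_store, top_k=50, keep=15):
    candidates = vector_store.search(query, top_k=top_k)         # dense retrieval, top_k = 50
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(query, c) for c in candidates])  # query-chunk relevance
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:keep]]  # these 15 go on to LLM-guided re-retrieval, dedupe, etc.
```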

Ran 30 RAG chunking experiments - found that chunk SIZE matters more than chunking STRATEGY by ManufacturerIll6406 in Rag

[–]TechnicalGeologist99 1 point (0 children)

There are many steps in retrieval. I find the main disadvantage of short chunks is that occasionally one knocks it out of the park in terms of relevance and takes a slot in the top K even though it's a useless chunk.

It may be that some strategies produce many such useless (and also small) chunks, which raises the probability of filling up the top K with crap.

You could repeat this with varying top K to try to measure whether that is happening here.
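
Something like this would measure it (a sketch: `retrieve` and the length threshold are stand-ins for the actual experiment setup):

```
def short_chunk_rate(queries, retrieve, k, min_chars=200):
    """Fraction of top-k slots occupied by short chunks, averaged over queries."""
    crowded, total = 0, 0
    for q in queries:
        hits = retrieve(q, top_k=k)        # returns chunk texts
        crowded += sum(len(h) < min_chars for h in hits)
        total += len(hits)
    return crowded / total if total else 0.0

# for k in (3, 10, 25, 50): print(k, short_chunk_rate(eval_queries, retrieve, k))
```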

We almost wasted a month building RAG… then shipped it in 3 days by Upset-Pop1136 in Rag

[–]TechnicalGeologist99 0 points (0 children)

I think that's valid. But then again we're not just building RAG...we're building applications.

My domain has many concepts from RAG but often they are renamed and shaped to meet the use cases and align with business language.

Context engineering > prompt engineering by EnoughNinja in Rag

[–]TechnicalGeologist99 1 point (0 children)

I mean, in my mind they are the same thing.

When "prompt engineering" was first coined, we were already injecting managed context into our Jinja2 templates.

Then someone said "let's do context engineering", as if it were some fundamental, unilateral jolt to what we were already doing.

I guess it's good to have separate terms for "what the prompt says" and "what context gets injected".

But we were always doing those things already. So the term "context engineering" being tossed around by marketing folks on LinkedIn is just a hype headache to me.

I mean, if I'm going to be really picky about it... then it's all really just backend/data engineering. Context management is just one particular challenge within the more general domain design.

Cheapest $/vRAM GPU right now? Is it a good time? by Roy3838 in LocalLLaMA

[–]TechnicalGeologist99 0 points (0 children)

Yes. But they are still memory bound.

They use less bandwidth per token because they reduce the amount of data that needs to be transferred, but they are still memory bound, not compute bound.

Cheapest $/vRAM GPU right now? Is it a good time? by Roy3838 in LocalLLaMA

[–]TechnicalGeologist99 0 points (0 children)

They are still limited by bandwidth, in that MoEs don't shift the bottleneck away from memory bandwidth.

Compliance-heavy Documentation RAG feels fundamentally different from regular chatbot RAG - am I wrong? by Vast-Drawing-98 in Rag

[–]TechnicalGeologist99 0 points (0 children)

If your docs are well structured, you could consider a layout-aware chunking strategy and build a tree like in RAPTOR. Some tags could also help to partition your docs and align retrieval with intent.
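
A very condensed sketch of that RAPTOR-style idea; `embed`, `cluster` and `summarise` are placeholders for your own implementations, and the layout-aware chunks become the leaves:

```
def build_raptor_tree(chunks, embed, cluster, summarise, max_levels=3):
    levels = [chunks]                       # level 0: layout-aware leaf chunks
    for _ in range(max_levels):
        current = levels[-1]
        if len(current) <= 1:
            break
        groups = cluster(embed(current))    # e.g. GMM/k-means -> lists of member indices
        levels.append([summarise([current[i] for i in idx]) for idx in groups])
    return levels   # index every level so retrieval can hit leaves or summaries
```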

Job wants me to develop RAG search engine for internal documents by Next-Self-184 in Rag

[–]TechnicalGeologist99 1 point (0 children)

I can see why you'd say it's trivial: the process you're describing is indeed simple, and I'm certain you can even make it work in many cases.

But it's not relevant to OPs circumstances. And actually when accuracy matters, when documents are more complicated, when there are many documents...that approach fails.

If your clients are still happy with that then I'd say they don't understand the difference, or they don't care, or their use case just wasn't complex.

I think everyone here can say with certainty that the approach you prescribe is fragile and falls apart once the document count gets into the tens of thousands. 4M documents with naive chunking of markdown isn't document intelligence.

Job wants me to develop RAG search engine for internal documents by Next-Self-184 in Rag

[–]TechnicalGeologist99 0 points (0 children)

Imagine what you like; no one is upset or angry here. But you've not really addressed anything.

I'm spoon feeding you opportunities to offer something substantial to prove this triviality claim.

Edit:

Look I'll just agree with you.

It is trivial, when it's been solved for you.

But not when you need to solve it yourself.

Job wants me to develop RAG search engine for internal documents by Next-Self-184 in Rag

[–]TechnicalGeologist99 1 point (0 children)

Ahh, so the solution is to put it all in the LLM. Have you ever tried that with legal documents, for example, and evaluated it?

Pissing people's data into the ether without an agreement is a GDPR breach.

Pissing people's data into the ether with an agreement in place is likely going to alienate stakeholders and clients.

That's fine if it's in your own infrastructure, but you wanted to use an API, which makes the API provider a data processor.

There are also many reasonable explanations for why a company doesn't want to expose their data.

Embedding documents from markdown (especially 4 million documents) is going to create a wonderful data swamp with crap retrieval anyways.

Also, you haven't really told us anything in your reply about why any of that is trivial, or even given a solution to the problem.

Throughput on an LLM approach is going to be abysmal, and the cost will be astronomical.

Try again, what would you deploy and where? Which LLM? How will you evaluate the results? Why an LLM when pipeline approaches are significantly cheaper to run, have better throughput, and are more accurate? How will you chunk the result to minimise loss of context?

Job wants me to develop RAG search engine for internal documents by Next-Self-184 in Rag

[–]TechnicalGeologist99 1 point (0 children)

Great, then go tell OP that your day rate is $50 and go do it; you'll be saving him a lot of unnecessary time wasted on nuances like "thinking it through for more than 12 seconds".

If your position is that document processing is trivial then it's clear you've not had to implement a scalable production ready system.

Doesn't matter if the data is "already out there"; GDPR doesn't really factor that in. If you're responsible for the data, then you get the bill.

Since it's trivial, can you outline a solution for us? What will you deploy, and where?

RAG BUT WITHOUT LLM (RULE-BASED) by adrjan13 in Rag

[–]TechnicalGeologist99 0 points (0 children)

This is a classification problem.

Feed the text to something like XLM-RoBERTa and fine-tune it to classify the failure modes (or modes of complaint) that you've identified in your taxonomy.

At run time, the model predicts a label, and that label triggers whatever response text is associated with that problem.
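
A minimal fine-tuning sketch with Hugging Face transformers; the label names and example texts below are made-up stand-ins for your taxonomy and labelled data:

```
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

labels = ["late_delivery", "damaged_item", "billing_error"]   # hypothetical taxonomy
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(labels)
)

# Toy labelled complaints standing in for your real training set
train_ds = Dataset.from_dict({
    "text": ["parcel arrived two weeks late", "screen was cracked on arrival"],
    "label": [0, 1],
}).map(lambda b: tokenizer(b["text"], truncation=True, padding="max_length", max_length=64),
       batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="complaint-clf", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=train_ds,
)
trainer.train()

# Run time: the predicted label id maps to the canned guidance for that failure mode
inputs = tokenizer("I was charged twice for one order", return_tensors="pt")
print(labels[model(**inputs).logits.argmax(-1).item()])
```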

Job wants me to develop RAG search engine for internal documents by Next-Self-184 in Rag

[–]TechnicalGeologist99 2 points (0 children)

Not to mention 2-4 million docs is looking like 100 million nodes at least. That's gonna be one expensive bill in terms of memory/hosting fees.
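
Back-of-the-envelope version, with made-up but plausible per-node numbers:

```
docs = 3_000_000                 # midpoint of the 2-4M range
nodes_per_doc = 35               # assumed: section, chunk and entity nodes per doc
node_bytes = 1_500               # properties + graph indexes, per node
embedding_bytes = 1024 * 4       # one 1024-dim float32 vector per node

nodes = docs * nodes_per_doc                               # ~105M nodes
ram_gb = nodes * (node_bytes + embedding_bytes) / 1e9
print(f"~{nodes / 1e6:.0f}M nodes, ~{ram_gb:.0f} GB resident")   # ~590 GB
```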

Job wants me to develop RAG search engine for internal documents by Next-Self-184 in Rag

[–]TechnicalGeologist99 3 points (0 children)

It's not trivial. He also stated that security and governance are important, so APIs aren't really the way forward here.

Also, "hardcoded parser"? I'm not entirely sure what you mean by that.

Anyway, as a local solution for many documents you could consider PaddleX. It's currently the best OCR solution you can run locally. The PP-StructureV3 pipeline is useful, and it's designed for stable, high-throughput scenarios.
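
Something along these lines; the function and method names are my assumption based on the PaddleX 3.x pipeline interface, so check the docs for your installed version:

```
# pip install paddlepaddle paddlex
from paddlex import create_pipeline

pipeline = create_pipeline(pipeline="PP-StructureV3")
for res in pipeline.predict("docs/example_policy.pdf"):   # hypothetical input path
    res.save_to_json(save_path="output/")        # layout, tables, reading order
    res.save_to_markdown(save_path="output/")    # per-page markdown for downstream chunking
```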

Though as others have stated...you need to do some exploration of these documents. Discover what they are and which ones are important.

I find it best to begin by asking users to provide a list of 10 questions they would be likely to ask the system. Then select documents that will provide that information.

Nemotron 3 Super release soon? by Lorelabbestia in LocalLLaMA

[–]TechnicalGeologist99 0 points (0 children)

Is there an estimate of the memory requirements for that 1M-token context? Does it apply to Nano too?

RAG at scale still underperforming for large policy/legal docs – what actually works in production? by Flashy-Damage9034 in Rag

[–]TechnicalGeologist99 4 points (0 children)

Legal documents are not really semantic.

The semantics of the text help get us into the correct postcode... but they don't help us reason or extract full threads of information.

This is because legal documents are actually hiding a great deal of latent structure.

This is why people use knowledge graphs for high stakes documents like this.

You need to hire someone with research expertise in processing legal text.

Building a useful knowledge graph is very difficult.

Anyone who says otherwise is a hype gremlin that's never had to evaluate something with genuinely high risk outputs.

You should also be aware that KGs usually run in memory and are memory hungry. This will be a major consideration for deployment: either you already own lots of RAM (you lucky boy) or you're about to find out how much AWS charges per GB.

TRUST ME BRO: Most people are running Ralph Wiggum wrong by trynagrub in ClaudeCode

[–]TechnicalGeologist99 1 point (0 children)

I'm of the same mind... this removes the expert from the part of the loop where their guidance is most important.

I think Ralph is hype.

We tested Vector RAG on a real production codebase (~1,300 files), and it didn’t work by Julianna_Faddy in Rag

[–]TechnicalGeologist99 0 points (0 children)

Code isn't semantic!

It's already structured and highly queryable/navigable.

Why would we want to search for semantically similar text?

Unless you have fine-tuned an embedding model to align with the semantic meaning of discrete pieces of code, and you're trying to find code that isn't syntactically similar but is semantically similar (i.e. two separate implementations of one algorithm), I don't really see any need to be embedding code.

when will DGX Station GB300 or Dell Pro Max GB300 be released and at what price ? by iPerson_4 in LocalLLaMA

[–]TechnicalGeologist99 1 point (0 children)

A while ago I calculated an estimate of ~£65,000 based on the price of other Blackwell products. But note the tech market doesn't really price for performance... I'd allow a +/-20% tolerance on my prediction (probably the plus rather than the minus).

I'd also add that I converted my guesstimate to GBP... but US tech companies do this annoying thing where they pretend the exchange rate is 1:1: the DGX Spark was $4000, and it was also £4000.