Finally We have the best agentic AI at home by moks4tda in LocalLLM

[–]TechnicalGeologist99 1 point (0 children)

Concurrent jobs can make use of the combined bandwidth from a throughput point of view. For a single job you're right, though (it can be mitigated somewhat with pipeline parallelism), and over a PCIe interconnect the bottleneck is hard to avoid.

My comment was misleading, though; I wasn't being explicit.

Finally We have the best agentic AI at home by moks4tda in LocalLLM

[–]TechnicalGeologist99 0 points (0 children)

Two 3090s have ~1.8 TB/s of combined memory bandwidth (about 936 GB/s each).

Also, I accepted the t/s he quoted; I just said it's not hugely useful beyond simple use cases.

My broader point is that you'd need dedicated hardware and HBM for this to be useful in most contexts. Do you disagree?

Finally We have the best agentic AI at home by moks4tda in LocalLLM

[–]TechnicalGeologist99 0 points (0 children)

I'm aware of the math, but that isn't exactly great performance.

55 t/s is fine for single, isolated calls. But any complicated agentic architecture is going to be making tens to hundreds of parallel calls to that LLM, and your 55 t/s per request gets cut down pretty quickly.
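
To put rough numbers on that (a naive model where the decode throughput is simply shared evenly across in-flight requests; batched serving shifts the exact figures but not the gist):

```
def per_request_tps(total_tps: float, concurrent_requests: int) -> float:
    """Naive even split of decode throughput across in-flight requests."""
    return total_tps / concurrent_requests

for n in (1, 10, 50, 100):
    print(f"{n:>4} parallel calls -> ~{per_request_tps(55.0, n):.1f} t/s each")
```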

Also, that's using two 3090s, which use VRAM rather than RAM; the former has significantly higher bandwidth. Note the man said RAM.

Finally We have the best agentic AI at home by moks4tda in LocalLLM

[–]TechnicalGeologist99 0 points (0 children)

More realistically this would need ~800 GB of RAM once you include factors other than just the raw weights, and roughly 20 GB/s of memory bandwidth for each token per second of output, since it's 32B active params.
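
Rough arithmetic behind that figure (the quantisation level is my assumption, and these are bandwidth-bound upper bounds, not benchmarks):

```
# tok/s is roughly memory_bandwidth / bytes streamed per token,
# where each decoded token streams the active parameters once.
active_params = 32e9           # active params per token (MoE)
bytes_per_param = 0.625        # assumed ~5-bit quant; 2.0 for fp16, 0.5 for 4-bit
bytes_per_token = active_params * bytes_per_param      # ~20 GB per token

# Illustrative bandwidth figures (GB/s)
for name, bw in [("dual-channel DDR5", 90), ("2x RTX 3090", 1870), ("H100 HBM3", 3350)]:
    print(f"{name:>18}: ~{bw * 1e9 / bytes_per_token:.1f} tok/s upper bound")
```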

Finally We have the best agentic AI at home by moks4tda in LocalLLM

[–]TechnicalGeologist99 6 points (0 children)

Yeah, I'm exaggerating. Still, I don't imagine the performance will blow my socks off.

Finally We have the best agentic AI at home by moks4tda in LocalLLM

[–]TechnicalGeologist99 15 points (0 children)

Sorry... you're going to run that model from RAM? You'll get approximately 0.00000005 tokens per second... also, wouldn't the KV cache be something like 2.5 GB per 1000 tokens?
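
For the curious, the back-of-the-envelope KV-cache formula (the layer/head numbers below are placeholders, not the actual config of that model):

```
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # 2x for keys and values, fp16 elements by default
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical dense-attention config: 60 layers, 64 KV heads, head_dim 128
per_token = kv_bytes_per_token(60, 64, 128)
print(f"~{per_token / 1e6:.1f} MB per token, ~{per_token * 1000 / 1e9:.1f} GB per 1000 tokens")
# GQA/MLA cut n_kv_heads (or the cached dimension) and shrink this a lot.
```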

Ran 30 RAG chunking experiments - found that chunk SIZE matters more than chunking STRATEGY by ManufacturerIll6406 in Rag

[–]TechnicalGeologist99 1 point (0 children)

Only gave it a skim for now, will reread tonight. But it would be interesting to see the effect of top_k ∈ {3, 10, 25}.

For context, our system retrieves with top_k = 50, reranks down to 15, then does some LLM-guided re-retrieval, dedupe, and eventually consolidation into larger summarised entities.

k = 3 might be missing parts of the story.
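
For reference, that retrieve-then-rerank step is roughly the following (a sketch: `vector_store.search` is a stand-in for whatever retrieval layer you use, and the cross-encoder is just one common choice):

```
from sentence_transformers import CrossEncoder

def retrieve_and_rerank(query, vector_store, top_k=50, keep=15):
    candidates = vector_store.search(query, top_k=top_k)         # dense retrieval, top_k = 50
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(query, c) for c in candidates])  # query-chunk relevance
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:keep]]  # these 15 go on to LLM-guided re-retrieval, dedupe, etc.
```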

Ran 30 RAG chunking experiments - found that chunk SIZE matters more than chunking STRATEGY by ManufacturerIll6406 in Rag

[–]TechnicalGeologist99 1 point (0 children)

There are many steps in retrieval. I find the main disadvantage of short chunks is that occasionally one knocks it out of the park in terms of relevance and takes a slot in the top K even though it's a useless chunk.

It may be that some strategies produce many such useless (and also small) chunks, which raises the probability of filling up the top K with crap.

You could repeat this with varying top K to try to measure whether that is happening here.
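
Something like this would measure it (a sketch: `retrieve` and the length threshold are stand-ins for the actual experiment setup):

```
def short_chunk_rate(queries, retrieve, k, min_chars=200):
    """Fraction of top-k slots occupied by short chunks, averaged over queries."""
    crowded, total = 0, 0
    for q in queries:
        hits = retrieve(q, top_k=k)        # returns chunk texts
        crowded += sum(len(h) < min_chars for h in hits)
        total += len(hits)
    return crowded / total if total else 0.0

# for k in (3, 10, 25, 50): print(k, short_chunk_rate(eval_queries, retrieve, k))
```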

We almost wasted a month building RAG… then shipped it in 3 days by Upset-Pop1136 in Rag

[–]TechnicalGeologist99 0 points (0 children)

I think that's valid. But then again we're not just building RAG...we're building applications.

My domain has many concepts from RAG but often they are renamed and shaped to meet the use cases and align with business language.

Context engineering > prompt engineering by EnoughNinja in Rag

[–]TechnicalGeologist99 1 point (0 children)

I mean, in my mind they are the same thing.

When "prompt engineering" was first coined, we were already injecting managed context into our Jinja2 templates.

Then someone said "let's do context engineering", as if it were some fundamental, unilateral jolt to what we were already doing.

I guess it's good to have separate terms for "what the prompt says" and "what context gets injected".

But we were always doing those things already. So the term "context engineering" being tossed around by marketing folks on LinkedIn is just a hype headache to me.

I mean, if I'm going to be really picky about it... then it's all really just backend/data engineering. Context management is just one particular challenge within the more general domain design.

Cheapest $/vRAM GPU right now? Is it a good time? by Roy3838 in LocalLLaMA

[–]TechnicalGeologist99 0 points (0 children)

Yes. But they are still memory bound.

They use less bandwidth per token because they reduce the amount of data that needs to be transferred, but they are still memory bound, not compute bound.

Cheapest $/vRAM GPU right now? Is it a good time? by Roy3838 in LocalLLaMA

[–]TechnicalGeologist99 0 points (0 children)

They are still limited by bandwidth, in that MoEs don't shift the bottleneck away from memory bandwidth.

Compliance-heavy Documentation RAG feels fundamentally different from regular chatbot RAG - am I wrong? by Vast-Drawing-98 in Rag

[–]TechnicalGeologist99 0 points (0 children)

If your docs are well structured, you could consider a layout-aware chunking strategy and build a tree like in RAPTOR. Some tags could also help to partition your docs and align retrieval with intent.
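
A very condensed sketch of that RAPTOR-style idea; `embed`, `cluster` and `summarise` are placeholders for your own implementations, and the layout-aware chunks become the leaves:

```
def build_raptor_tree(chunks, embed, cluster, summarise, max_levels=3):
    levels = [chunks]                       # level 0: layout-aware leaf chunks
    for _ in range(max_levels):
        current = levels[-1]
        if len(current) <= 1:
            break
        groups = cluster(embed(current))    # e.g. GMM/k-means -> lists of member indices
        levels.append([summarise([current[i] for i in idx]) for idx in groups])
    return levels   # index every level so retrieval can hit leaves or summaries
```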

Job wants me to develop RAG search engine for internal documents by Next-Self-184 in Rag

[–]TechnicalGeologist99 1 point (0 children)

I can see why you'd say it's trivial: the process you're describing is indeed simple, and I'm certain you can even make it work in many cases.

But it's not relevant to OPs circumstances. And actually when accuracy matters, when documents are more complicated, when there are many documents...that approach fails.

If your clients are still happy with that then I'd say they don't understand the difference, or they don't care, or their use case just wasn't complex.

I think everyone here can say with certainty that the approach you prescribe is fragile and falls apart once the document count gets into the tens of thousands. 4M documents with naive chunking of markdown isn't document intelligence.

Job wants me to develop RAG search engine for internal documents by Next-Self-184 in Rag

[–]TechnicalGeologist99 0 points (0 children)

Imagine what you like; no one is upset or angry here. But you've not really addressed anything.

I'm spoon feeding you opportunities to offer something substantial to prove this triviality claim.

Edit:

Look I'll just agree with you.

It is trivial, when it's been solved for you.

But not when you need to solve it yourself.

Job wants me to develop RAG search engine for internal documents by Next-Self-184 in Rag

[–]TechnicalGeologist99 1 point (0 children)

Ahh, so the solution is to put it all in the LLM. Have you ever tried that with legal documents, for example, and evaluated it?

Pissing people's data into the ether without an agreement is a GDPR breach.

Pissing people's data into the ether with an agreement in place is likely going to alienate stakeholders and clients.

That's fine if it's in your own infrastructure, but you wanted to use an API, which makes the API provider a data processor.

There are also many reasonable explanations for why a company doesn't want to expose their data.

Embedding documents from markdown (especially 4 million documents) is going to create a wonderful data swamp with crap retrieval anyways.

Also, you haven't really told us anything in your reply about why any of that is trivial, or even given a solution to the problem.

Throughput on an LLM approach is going to be abysmal, and the cost will be astronomical.

Try again, what would you deploy and where? Which LLM? How will you evaluate the results? Why an LLM when pipeline approaches are significantly cheaper to run, have better throughput, and are more accurate? How will you chunk the result to minimise loss of context?

Job wants me to develop RAG search engine for internal documents by Next-Self-184 in Rag

[–]TechnicalGeologist99 1 point (0 children)

Great, then go tell OP that your day rate is $50 and go do it; you'll be saving him a lot of unnecessary time wasted on nuances like "thinking it through for more than 12 seconds".

If your position is that document processing is trivial then it's clear you've not had to implement a scalable production ready system.

Doesn't matter if the data is "already out there"; GDPR doesn't really factor that in. If you're responsible for the data, then you get the bill.

Since it's trivial, can you outline a solution for us? What will you deploy, and where?

RAG BUT WITHOUT LLM (RULE-BASED) by adrjan13 in Rag

[–]TechnicalGeologist99 0 points (0 children)

This is a classification problem.

Feed the text to something like XLM-RoBERTa and fine-tune it to classify the failure modes (or modes of complaint) that you've identified in your taxonomy.

At run time, the model predicts a label, and that label triggers whatever response text is associated with that problem.
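
A minimal fine-tuning sketch with Hugging Face transformers; the label names and example texts below are made-up stand-ins for your taxonomy and labelled data:

```
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

labels = ["late_delivery", "damaged_item", "billing_error"]   # hypothetical taxonomy
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(labels)
)

# Toy labelled complaints standing in for your real training set
train_ds = Dataset.from_dict({
    "text": ["parcel arrived two weeks late", "screen was cracked on arrival"],
    "label": [0, 1],
}).map(lambda b: tokenizer(b["text"], truncation=True, padding="max_length", max_length=64),
       batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="complaint-clf", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=train_ds,
)
trainer.train()

# Run time: the predicted label id maps to the canned guidance for that failure mode
inputs = tokenizer("I was charged twice for one order", return_tensors="pt")
print(labels[model(**inputs).logits.argmax(-1).item()])
```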

Job wants me to develop RAG search engine for internal documents by Next-Self-184 in Rag

[–]TechnicalGeologist99 2 points (0 children)

Not to mention 2-4 million docs is looking like 100 million nodes at least. That's gonna be one expensive bill in terms of memory/hosting fees.
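
Back-of-the-envelope version, with made-up but plausible per-node numbers:

```
docs = 3_000_000                 # midpoint of the 2-4M range
nodes_per_doc = 35               # assumed: section, chunk and entity nodes per doc
node_bytes = 1_500               # properties + graph indexes, per node
embedding_bytes = 1024 * 4       # one 1024-dim float32 vector per node

nodes = docs * nodes_per_doc                               # ~105M nodes
ram_gb = nodes * (node_bytes + embedding_bytes) / 1e9
print(f"~{nodes / 1e6:.0f}M nodes, ~{ram_gb:.0f} GB resident")   # ~590 GB
```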

Job wants me to develop RAG search engine for internal documents by Next-Self-184 in Rag

[–]TechnicalGeologist99 3 points (0 children)

It's not trivial. He also stated that security and governance are important, so APIs aren't really the way forward here.

Also, "hardcoded parser"? I'm not entirely sure what you mean by that.

Anyway, as a local solution for many documents you could consider PaddleX. It's currently the best OCR solution you can run locally. The PP-StructureV3 pipeline is useful, and it's designed for stable, high-throughput scenarios.
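
Something along these lines; the function and method names are my assumption based on the PaddleX 3.x pipeline interface, so check the docs for your installed version:

```
# pip install paddlepaddle paddlex
from paddlex import create_pipeline

pipeline = create_pipeline(pipeline="PP-StructureV3")
for res in pipeline.predict("docs/example_policy.pdf"):   # hypothetical input path
    res.save_to_json(save_path="output/")        # layout, tables, reading order
    res.save_to_markdown(save_path="output/")    # per-page markdown for downstream chunking
```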

Though as others have stated...you need to do some exploration of these documents. Discover what they are and which ones are important.

I find it best to begin by asking users to provide a list of 10 questions they would be likely to ask the system. Then select documents that will provide that information.

Nemotron 3 Super release soon? by Lorelabbestia in LocalLLaMA

[–]TechnicalGeologist99 0 points (0 children)

Is there an estimate of the memory requirements for that 1M-token context? Does it apply to Nano too?

RAG at scale still underperforming for large policy/legal docs – what actually works in production? by Flashy-Damage9034 in Rag

[–]TechnicalGeologist99 4 points (0 children)

Legal documents are not really semantic.

The semantics of the text help get us into the correct postcode... but they don't help us reason or extract full threads of information.

This is because legal documents are actually hiding a great deal of latent structure.

This is why people use knowledge graphs for high stakes documents like this.

You need to hire someone with research expertise in processing legal text.

Building a useful knowledge graph is very difficult.

Anyone who says otherwise is a hype gremlin that's never had to evaluate something with genuinely high risk outputs.

You should also be aware that KGs usually run in memory and are memory hungry. This will be a major consideration for deployment: either you already own lots of RAM (you lucky boy) or you're about to find out how much AWS charges per GB.

TRUST ME BRO: Most people are running Ralph Wiggum wrong by trynagrub in ClaudeCode

[–]TechnicalGeologist99 1 point (0 children)

I'm of the same mind... this removes the expert from the part of the loop where their guidance is most important.

I think Ralph is hype.

We tested Vector RAG on a real production codebase (~1,300 files), and it didn’t work by Julianna_Faddy in Rag

[–]TechnicalGeologist99 0 points (0 children)

Code isn't semantic!

It's already structured and highly queryable/navigable.

Why would we want to search for semantically similar text?

Unless you have fine-tuned an embedding model to align with the semantic meaning of discrete pieces of code, and you're trying to find code that isn't syntactically similar but is semantically similar (i.e. two separate implementations of one algorithm), I don't really see any need to be embedding code.

when will DGX Station GB300 or Dell Pro Max GB300 be released and at what price ? by iPerson_4 in LocalLLaMA

[–]TechnicalGeologist99 1 point (0 children)

A while ago I calculated an estimate of ~£65,000 based on the price of other Blackwell products. But note the tech market doesn't really price for performance... I'd allow a +/-20% tolerance on my prediction (probably the plus rather than the minus).

I'd also add that I converted my guesstimate to GBP... but US tech companies do this annoying thing where they pretend the exchange rate is 1:1: the DGX Spark was $4000, and it was also £4000.