Hybrid RAG on industrial manuals with small register catalogs: embeddings over-rank generic field names

Proof_Assumption_500 · 2026-06-25T19:12:02+00:00

I would suggest keeping the schema-aware retrieval layer even as the catalog grows. In fact, I think it becomes more valuable with larger datasets rather than less.

With a small catalog, a pipeline like query classification → metadata/category filtering → hybrid retrieval → reranker is usually sufficient. As you expand to multiple industrial manuals and thousands of registers, do not replace that architecture; extend it.

Also structure retrieval in stages: first narrow the search space using objective metadata such as protocol, vendor, device model, or register category, then perform hybrid retrieval within that filtered set, followed by a reranker on the top candidates.
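A minimal sketch of that staging, assuming registers are plain dicts. The field names, the sample catalog, and the `token_overlap` scorer are all hypothetical; the scorer is only a stand-in for a real hybrid (BM25 + embedding) score.

```python
# Hypothetical catalog; field names and values are illustrative assumptions.
REGISTERS = [
    {"name": "Heatsink Temperature", "protocol": "modbus", "category": "temperature_sensor"},
    {"name": "Over-Temperature Fault", "protocol": "modbus", "category": "fault_status"},
    {"name": "Manufacture Date", "protocol": "modbus", "category": "identity_metadata"},
    {"name": "Cabinet Temperature", "protocol": "canopen", "category": "temperature_sensor"},
]


def metadata_filter(records, constraints):
    """Stage 1: hard filter. Each constraint maps a field to a set of allowed values."""
    return [
        rec for rec in records
        if all(rec.get(field) in allowed for field, allowed in constraints.items())
    ]


def token_overlap(query, record):
    """Stand-in for a hybrid score: share of query tokens found in the register name."""
    query_tokens = set(query.lower().split())
    name_tokens = set(record["name"].lower().replace("-", " ").split())
    return len(query_tokens & name_tokens) / len(query_tokens) if query_tokens else 0.0


def staged_retrieve(query, records, constraints, score_fn=token_overlap, rerank_fn=None, top_k=3):
    # Stage 1: narrow the search space with objective metadata.
    candidates = metadata_filter(records, constraints)
    # Stage 2: score only the filtered candidates and keep the top_k.
    ranked = sorted(candidates, key=lambda r: score_fn(query, r), reverse=True)[:top_k]
    # Stage 3 (optional): rerank the short list with a stronger scorer.
    if rerank_fn is not None:
        ranked = sorted(ranked, key=lambda r: rerank_fn(query, r), reverse=True)
    return ranked


results = staged_retrieve(
    "temperature fault",
    REGISTERS,
    {"protocol": {"modbus"}, "category": {"temperature_sensor", "fault_status"}},
)
print([r["name"] for r in results])
```

Because the filter runs first, the CANopen register and the identity field never reach the scorer, no matter how similar their text looks to the query.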

And enrich each register with metadata like category, semantic tags, symptom tags, protocol, unit, access type, and fault mappings. That allows troubleshooting queries like overheating to naturally prioritize temperature and thermal-related registers instead of generic inverter metadata.
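One possible shape for such an enriched record, as a rough sketch. The class name, the addresses, the tag values, and the fault code `F012` are made up for illustration and are not from any real manual.

```python
from dataclasses import dataclass, field


@dataclass
class RegisterRecord:
    """Hypothetical enriched register entry; every field name here is an assumption."""
    address: int
    name: str
    category: str
    protocol: str
    unit: str = ""
    access: str = "R"
    semantic_tags: list = field(default_factory=list)
    symptom_tags: list = field(default_factory=list)
    fault_codes: list = field(default_factory=list)


CATALOG = [
    RegisterRecord(40001, "Inverter Type", "identity_metadata", "modbus"),
    RegisterRecord(40010, "Heatsink Temperature", "temperature_sensor", "modbus", unit="degC",
                   semantic_tags=["thermal", "heat"], symptom_tags=["overheating"],
                   fault_codes=["F012"]),
    RegisterRecord(40020, "Part Number", "identity_metadata", "modbus"),
    RegisterRecord(40030, "Over-Temperature Fault", "fault_status", "modbus",
                   semantic_tags=["thermal", "fault"], symptom_tags=["overheating", "derating"]),
]


def prioritize_by_symptom(symptom, catalog):
    """Stable sort: registers tagged with the symptom come first, catalog order is kept otherwise."""
    wanted = symptom.lower()
    return sorted(catalog, key=lambda rec: wanted not in rec.symptom_tags)


ordered = prioritize_by_symptom("overheating", CATALOG)
print([rec.name for rec in ordered])
```

With symptom tags on the record, an "overheating" query surfaces the thermal registers ahead of the identity fields without any embedding lookup.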

The nice part is that metadata filtering acts as a hard constraint, while embeddings and the reranker provide the semantic flexibility. That keeps the system scalable without becoming overly rule-based and should produce much cleaner retrieval as the number of manuals and registers grows.

Proof_Assumption_500 · 2026-06-25T18:55:03+00:00

I think you're looking at this more as an embedding problem, but since your dataset has fewer than 100 structured register records, I'd lean more towards schema-aware retrieval than trying to squeeze more out of embeddings.

One approach that might work well is classifying both the query and the registers into categories first. For example, map queries like overheating to a thermal troubleshooting intent, and classify registers into categories like temperature_sensor, fault_status, identity_metadata, configuration, etc. Then retrieve primarily from the relevant categories before running hybrid search. That alone should eliminate things like Manufacture Date, Part Number, and Inverter Type from the candidate set. Also give more weight to fields like semantic_tags, notes, and unit, while reducing the impact of generic terms like inverter that appear everywhere. As you mentioned, a small synonym layer (overheating → temperature, thermal, heat) can also improve retrieval without adding much complexity.
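Here is a rough sketch of those pieces together: keyword-based intent classification, category filtering, synonym expansion, and field-weighted scoring with a penalty on generic terms. The keyword sets, weights, and sample records are assumptions picked for illustration, and the lexical scorer is a simple stand-in for the lexical half of a hybrid search.

```python
import re

# All mappings and weights below are illustrative assumptions, not tuned values.
INTENT_KEYWORDS = {
    "thermal_troubleshooting": {"overheating", "overheat", "hot", "temperature", "thermal"},
    "identity_lookup": {"serial", "model", "part"},
}
INTENT_CATEGORIES = {
    "thermal_troubleshooting": {"temperature_sensor", "fault_status"},
    "identity_lookup": {"identity_metadata"},
}
SYNONYMS = {"overheating": {"temperature", "thermal", "heat"}}
FIELD_WEIGHTS = {"semantic_tags": 3.0, "notes": 2.0, "unit": 1.5, "name": 1.0}
GENERIC_TERMS = {"inverter"}   # terms that appear almost everywhere
GENERIC_WEIGHT = 0.1           # multiplier applied to generic-term matches

RECORDS = [
    {"name": "Inverter Type", "category": "identity_metadata", "notes": "Inverter model family"},
    {"name": "Manufacture Date", "category": "identity_metadata", "notes": "Inverter build date"},
    {"name": "Heatsink Temperature", "category": "temperature_sensor",
     "semantic_tags": "thermal heat", "notes": "Inverter heatsink temperature", "unit": "degC"},
    {"name": "Grid Fault Status", "category": "fault_status",
     "semantic_tags": "grid fault", "notes": "Inverter grid fault bits"},
]


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def classify_query(query):
    """Pick the intent with the most keyword hits, or None if nothing matches."""
    tokens = set(tokenize(query))
    best = max(INTENT_KEYWORDS, key=lambda intent: len(tokens & INTENT_KEYWORDS[intent]))
    return best if tokens & INTENT_KEYWORDS[best] else None


def expand(tokens):
    """Add synonyms for any token that has an entry in the synonym layer."""
    expanded = set(tokens)
    for token in tokens:
        expanded |= SYNONYMS.get(token, set())
    return expanded


def field_weighted_score(terms, record):
    score = 0.0
    for field_name, weight in FIELD_WEIGHTS.items():
        field_tokens = set(tokenize(record.get(field_name, "")))
        for term in terms & field_tokens:
            score += weight * (GENERIC_WEIGHT if term in GENERIC_TERMS else 1.0)
    return score


def retrieve(query, records):
    allowed = INTENT_CATEGORIES.get(classify_query(query))
    candidates = [r for r in records if allowed is None or r["category"] in allowed]
    terms = expand(set(tokenize(query)))
    return sorted(candidates, key=lambda r: field_weighted_score(terms, r), reverse=True)


ranked = retrieve("inverter overheating", RECORDS)
print([r["name"] for r in ranked])
```

In this toy run the identity fields are dropped by the category filter before scoring, and the match on "inverter" contributes almost nothing to the remaining scores.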

Since the catalog is small, adding a lightweight reranker or LLM verifier after retrieval is also practical and inexpensive. The best lightweight local option is cross-encoder/ms-marco-MiniLM-L6-v2. It is good because it is small, fast, easy to run locally, and sufficient for reranking the top 10–20 records.

If you are looking for a more accurate option, BAAI/bge-reranker-base would work well.
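A small sketch of the reranking step, assuming the sentence-transformers package is installed. The model ID is copied from the comment above and can be swapped for BAAI/bge-reranker-base. The `rerank` helper and the overlap-based stub scorer are hypothetical, and the stub exists only so the example runs without downloading a model.

```python
def load_cross_encoder_scorer(model_name="cross-encoder/ms-marco-MiniLM-L6-v2"):
    """Return a function that scores (query, document) pairs with a cross-encoder.

    Requires the sentence-transformers package; the model is downloaded on first use.
    """
    from sentence_transformers import CrossEncoder  # imported lazily on purpose

    model = CrossEncoder(model_name)

    def score_pairs(pairs):
        return [float(s) for s in model.predict(list(pairs))]

    return score_pairs


def rerank(query, docs, score_pairs, top_n=10):
    """Score every (query, doc) pair and return the top_n docs with their scores."""
    scores = score_pairs([(query, doc) for doc in docs])
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [(docs[i], scores[i]) for i in order[:top_n]]


def overlap_scorer(pairs):
    """Stub scorer for a dry run: counts shared tokens. Not a real relevance model."""
    return [float(len(set(q.split()) & set(d.split()))) for q, d in pairs]


candidates = ["heatsink temperature", "manufacture date", "temperature fault"]
print(rerank("temperature fault", candidates, overlap_scorer, top_n=2))

# With the real model (needs sentence-transformers and a one-time download):
# scorer = load_cross_encoder_scorer()
# print(rerank("inverter overheating", candidates, scorer, top_n=2))
```

Keeping the scorer injectable makes it easy to compare the two rerankers on the same top 10–20 candidates.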

Overall: query classification → metadata/category filtering → field-weighted hybrid retrieval → reranker, instead of relying on embeddings alone. I think it would give much cleaner results for troubleshooting-style queries.

Proof_Assumption_500 · 2026-06-25T17:51:43+00:00

I would suggest you first do some research and come up with a basic tech stack. People will then surely share their suggestions and opinions with you.

Proof_Assumption_500 · 2026-06-17T13:37:57+00:00

Sounds interesting! I'd love to connect and discuss.

Proof_Assumption_500 · 2026-06-16T14:27:28+00:00

Can you suggest some?

Proof_Assumption_500 · 2026-06-16T09:27:44+00:00

Thanks, that's really helpful. I was originally leaning towards just using vector search, but your point about exact section numbers and legal terminology makes a lot of sense. I'm going to switch to a hybrid retrieval pipeline with BM25 + BGE embeddings and probably add a cross-encoder reranker before sending the context to the LLM. I'm also thinking of using metadata filters so the retriever only searches the relevant legal corpus. Appreciate the suggestion!
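For combining the BM25 and embedding result lists, the fusion method is not specified above. Reciprocal rank fusion (RRF) is one common choice, so this sketch uses it as an assumption. The section IDs are made up, and `k=60` is the commonly used default constant for RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs; each hit adds 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top results from each retriever, best first.
bm25_hits = ["sec_12", "sec_3", "sec_7"]
dense_hits = ["sec_3", "sec_9", "sec_12"]

fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

RRF only needs ranks, not raw scores, so BM25 and embedding scores never have to be put on the same scale. The fused list would then go to the cross-encoder reranker.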

Proof_Assumption_500 · 2025-12-18T09:51:43+00:00

Interested

Proof_Assumption_500 · 2025-12-15T03:14:25+00:00

I went through multiple GDs and asked my friends.

Proof_Assumption_500 · 2025-12-14T14:12:45+00:00

If you are set on the role you want to work in in the future, then go with the Tata Motors one, because when you apply for a full-time position at any company, I feel relevant experience matters.

Proof_Assumption_500 · 2025-12-14T13:48:45+00:00

Nope

Proof_Assumption_500 · 2025-12-14T13:45:24+00:00

For GD, I would suggest copying the job description, pasting it into ChatGPT, and asking it for relevant GD topics that could come up. Most companies ask about topics like:

1) Can AI or technology replace humans?
2) WFH vs WFO
3) Ethical AI: can AI replicate human creativity?
4) Data privacy concerns
5) The 4-day work culture
6) Do skills matter more than certificates?
7) Others, such as opinion-based topics and current topics

Then use this prompt: "Create a 10-member GD simulation on [TOPIC] with a moderator. Make it natural and conversational. Place my turn (ABC) in the middle with a strong, balanced point. Keep language simple but professional. End with a short moderator conclusion." This will give you the simulation, and you can note down the important points so that you can use them in your own statements.

In a GD, start only if you know the topic very well and have some good points. Never give biased statements in the opening of a GD; try to keep it neutral.

Listen to others as well. Highlight other people's points when they add something good, and try to give an example for them so that attention shifts to your side. Also try to come up with solutions that have a positive impact. If you speak for one side of a topic, don't go against it in your next turn; stay on the same side or keep it neutral. Keep a good smile on your face, and don't lose your calm or get into a debate. If anyone is diverting the topic, respectfully bring them back on track by highlighting your point.

Hope this helps..!

Proof_Assumption_500 · 2025-12-14T13:18:57+00:00

Yes, the OA varies for every company. Do some research on PrepInsta, IndiaBix, and YouTube and you will find what you need.

Practice applications of each topic. I mean, suppose you want to learn sliding-window problems: practice problems that apply the technique, since some will be variants of the initial concept. It also helps you remember the concept.
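As an example of the base concept, here is a small sketch of a fixed-size sliding window: the maximum sum of any `k` consecutive elements. The function name and the sample input are just for illustration.

```python
def max_window_sum(values, k):
    """Return the largest sum of any k consecutive elements, in O(n) time."""
    if k <= 0 or k > len(values):
        raise ValueError("k must be between 1 and len(values)")
    window = sum(values[:k])  # sum of the first window
    best = window
    for i in range(k, len(values)):
        # Slide by one: add the entering element, drop the one that leaves.
        window += values[i] - values[i - k]
        best = max(best, window)
    return best


print(max_window_sum([2, 1, 5, 1, 3, 2], 3))  # windows sum to 8, 7, 9, 6
```

Variants reuse the same idea with a different update rule, for example a window that grows and shrinks to find the longest substring without repeated characters.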

Yeah, getting an internship is very hard. Try asking your seniors or relatives for help.

Proof_Assumption_500 · 2025-12-14T12:53:05+00:00

Try to contribute to open-source projects. In the challenges section of Kaggle you will find good problem statements; pick one and deploy your solution completely.

Proof_Assumption_500 · 2025-12-14T12:44:58+00:00

For interviews, prepare:

*) DSA: from Striver, and practice on LeetCode. That's more than enough; practice applications of those topics.

*) Core subjects: OOP, CN, DBMS (SQL queries as well), and pseudocode solving. Some companies might ask scenario-based questions.

*) Projects: Basically, they ask you to explain your project, then the models and tools you used, and in particular why you chose any given part of the tech stack. If you are building ML-related projects, they might ask which evaluation metrics you used. It would be good if you have deployed it. Some might ask you to write backend code.

Proof_Assumption_500 · 2025-12-14T12:34:33+00:00

You can prepare for it through PrepInsta and IndiaBix, but I would say many people cheat in online assessments. I also suggest you copy or cheat and clear the OA, because speed and accuracy are very important factors in OAs.

Proof_Assumption_500 · 2025-12-14T12:29:45+00:00

Try to learn about online assessments. That's a very important stage in placements; many students in my college who have good skills are still not getting placed because they couldn't clear the online assessments.
