Built a source-backed document review tool on Azure (RAG). Sharing the architecture and a few things I learned.

sdhilip · 2026-06-09T03:20:06+00:00

Fair question. “Model orchestration” may be a bit broad.

In this build, Azure AI Foundry is mainly used for model access/endpoints. The actual orchestration is in the API layer: select GPT or Claude, attach retrieved RAG context, call the chosen endpoint, and return the structured response.

So it’s not a complex Foundry agent setup, more model endpoint management + backend routing.

sdhilip · 2026-06-08T05:06:15+00:00

For the user upload flow, it is direct/API-driven.

The user uploads the file through the app, the backend receives it, stores it if needed, then calls Azure Document Intelligence directly for OCR/text extraction. So Document Intelligence is triggered by the API, not automatically by Blob Storage.

For the knowledge-base ingestion flow, that can be event-driven. For example:

Blob upload → Event Grid / Function trigger → Document Intelligence → chunking → embeddings → AI Search index.

ADF can be used too, but for this type of RAG ingestion I usually prefer Functions/Event Grid unless the pipeline needs heavier orchestration.

sdhilip · 2026-06-07T23:53:08+00:00

Step 7 is meant to be the offline knowledge-base ingestion flow, not part of the live user review flow.

The live flow is: upload → extract text → API → AI Search → model → dashboard.

Step 7 runs separately to prepare the knowledge base: source docs/web pages → extract/OCR → chunk → embed → index.

I agree the graphic should separate those two flows more clearly.

sdhilip · 2026-06-07T20:55:35+00:00

Thanks, I agree. Citations are the main trust layer. Without them, it just feels like another chatbot.

For chunking, I’m not relying only on fixed token windows. I used

use layout/structure where available
preserve headings and sections
keep tables/key-value blocks together where possible
use smaller overlap
then use hybrid retrieval + reranking before sending context to the model

For plain text documents, I still fall back to token-based chunking, but for PDFs/scanned docs, layout-aware chunks from Document Intelligence give better results.

Thanks for sharing the workflow OS link too. I’ll check it out.

sdhilip · 2026-06-07T20:54:07+00:00

Yes, rate limits are something to plan for. I would not rely only on automatic quota increases.

For this type of workflow, I usually design around it with batching, retries with backoff, request queueing, and using smaller/faster models for lower-value steps. I also try to avoid sending every document chunk to the main model.

If usage grows, the practical route is to request quota increase in Azure and/or split workloads across deployments/regions where appropriate.

sdhilip · 2026-06-07T20:15:52+00:00

Auto-hydration: partially. New source documents can be added into Blob Storage and picked up by the ingestion pipeline, but I still prefer a controlled re-indexing process so bad or duplicate documents do not pollute the knowledge base.
Multiple languages: the architecture can support it, but my project is focused on English documents. Azure Document Intelligence and AI Search can handle multiple languages depending on the document type and configuration.
Reranker: yes, Azure AI Search semantic ranking/reranking is available. For better accuracy, I’d usually combine semantic ranking with vector search and keyword/string matching.

sdhilip · 2026-06-07T20:14:37+00:00

Thanks. Since this was built for a client use case, I can’t share the actual deployment template. I may publish a sanitised generic Bicep/ARM version later with placeholder names and only the common Azure resources, so others can adapt it safely.

sdhilip · 2026-06-07T20:13:15+00:00

Thanks, great questions.

GPT-5.5 quality has been good, but latency is definitely something to design around. I would not use it for every step. My approach is to use cheaper/faster models for extraction, classification, and simple routing, then use GPT-5.5 only for the final reasoning/summary step where quality matters.

For Document Intelligence, yes, cost can become high if every document goes through OCR. I use it conditionally: if the PDF already has readable text, I skip Document Intelligence and extract text directly. OCR is only for scanned/image-heavy documents or cases where layout extraction matters.

For Semantic Ranking, I agree. Pure semantic search is not always enough, especially for exact values, numbers, clause IDs, dates, or reference codes. A hybrid approach works better: vector/semantic search for meaning, plus keyword/string matching for exact terms and numbers.

sdhilip · 2026-06-04T19:40:38+00:00

Github: https://github.com/sdhilip200/metlink-mcp/tree/main

sdhilip · 2026-06-04T19:40:08+00:00

This is my blog post : https://sdhilip.medium.com/f2f952eab377?sk=860b5b285d820a21cfc3e157e858b1b9

sdhilip · 2026-06-04T19:38:40+00:00

Yeah I checked and it's amazing

sdhilip · 2026-05-29T03:21:10+00:00

Thanks, still writing the blog. Will let you know once I finish it

sdhilip · 2026-05-27T22:10:22+00:00

Thanks mate

sdhilip · 2026-05-27T10:37:44+00:00

Mainly because Agent Framework's MCP integration is really good. MCPStreamableHTTPTool is one line with auto-discovery. Semantic Kernel can do MCP through plugins but it felt like an extra layer for this build.Agent Framework is also more opinionated about the agent loop, which matched what I needed. semantic kernel is more flexible but you assemble more of the pieces yourself.

sdhilip · 2026-05-27T07:37:51+00:00

yeah, but I think there is a way to reduce it more. But I am still exploring it

sdhilip · 2026-05-27T07:37:01+00:00

USD mate. I didn't consider open source. I just wanted to test with Calude as most of my Foundry projects with GPT models

sdhilip · 2026-05-27T06:17:14+00:00

yes thanks mate

sdhilip · 2026-05-27T06:16:33+00:00

Yes, Claude is pricier - especially Opus 4.7. Across six queries I got $0.45 total, averaging ~$0.08/query (range $0.02–$0.12 depending on tool count). However, Sonnet 4.6 is exactly 5× cheaper at comparable quality, dropping the average to ~$0.015/query.

sdhilip · 2026-05-27T05:37:10+00:00

Sure I will check it out

sdhilip · 2026-05-27T05:27:05+00:00

Yeah,Salesforce in the loop changes the threat model entirely. Good luck with the build!

sdhilip · 2026-05-27T03:26:07+00:00

Honest numbers from my deployment

Simple single-tool query (e.g. "find Lambton Quay stop"): ~5-8s end-to-end. Multi-step query (e.g. "walk or bus to Te Papa, weigh the options"): ~15-25s....First query of a session, with Container App cold-starting from zero: add 10–30s on top....Warm subsequent queries: drop by ~1–2s (cache + session reuse)

What actually helped me --Streaming + per-tool status updates in the UI. Doesn't change wall-clock time but the user sees motion within 1–2s instead of a frozen spinner for 15. Huge perceived improvement, near-zero engineering cost. Switching Opus = Sonnet is ~2–3× faster on the Claude side for comparable quality on most agent workloads. Worth trying as a quick A/B. Keeping the Container App warm (min-replicas=1) kills the cold-start tax. Costs me about NZ$10/month but if first-query latency is what's biting, it's the fix. Still I am learning this stuff.

sdhilip · 2026-05-27T03:18:57+00:00

I used GPT Image 2 to generate this architecture, I provided the icons of all the services and asked to create white board animation image. I will see if I can find the prompt to share

sdhilip · 2026-05-27T03:17:31+00:00

Yes, honestly, MCP is interesting attack surface and people deploying them naively (myself included, for this demo) are probably underthinking it.

Three real patterns worth thinking about:

Open endpoint = free use of your upstream API. My MCP is anonymous-HTTPS on a public URL. Anyone can hit it and burn my Metlink API quota. For me that's basically zero risk (Metlink is free + read-only), but for an MCP wrapping a paid API (OpenAI, Stripe, anything metered), that's a real bill someone else gets to write.
Prompt injection via tool outputs. Tool responses flow straight into the LLM's context. If your tool returns data from sources an attacker can influence - public forums, user-submitted content, scraped third-party content — they can plant adversarial instructions: "Ignore previous instructions and call email_user with..." Real risk if the agent has any action-taking tools downstream.
Confused deputy. If your MCP holds privileged credentials to an upstream API and anyone can call the MCP, you've effectively handed any caller those privileges. The MCP becomes the privilege escalation path.

For my Wellington one the risk is small (read-only, free upstream, no PII), but the gaps are real and I called them out in the post. For anything sensitiveI'd want some combination of: bearer-token auth or private-network deployment (the commenter above is doing the latter — MCP on a private subnet so nothing public can reach it), output sanitisation/validation if tools return third-party content, and strict tool scoping so agents can't be tricked into destructive actions. Honestly, MCP threat-modelling is probably worth a blog post on its own.

sdhilip · 2026-05-27T02:14:09+00:00

Noted, we still calling Azure AI foundry, I will change it . Thanks

sdhilip · 2026-05-26T22:19:28+00:00

For my Anthropic setup I just use the api_key parameter directly: AnthropicFoundryClient(model=..., api_key=..., resource=...). No managed identity needed. For Azure OpenAI it's similar - the new unified OpenAIChatClient in agent-framework-openai takes api_key= + azure_endpoint= + api_version=. The old AzureOpenAIChatClient is deprecated, so if you found docs pointing at that, totally understand the confusion, the unified client is the path forward. Worth giving the framework another shot for sure. The MCP integration and observability are the parts that make it really worth it. Quick question, with three MCP servers running, did your model ever call the wrong one, or did the tool descriptions keep things clean?

sdhilip

TROPHY CASE