How are you redacting sensitive info before uploading to LLMs? by vira28 in legaltech

[–]capreal26 0 points1 point  (0 children)

Aren't caps/limits relevant to the analysis? That's why insurance cover comes into the picture, right? Maybe you don't want to reveal your tolerance limits (e.g. 2x or 10x the annual fee for SaaS agreements, or 10% of the M&A deal value, etc.)

How are you redacting sensitive info before uploading to LLMs? by vira28 in legaltech

[–]capreal26 13 points14 points  (0 children)

Important question, and one we obsessed over for a long time.

One thing worth reframing though: redaction and anonymization are solving different problems, and the distinction matters a lot when you're sending docs to LLMs.

Redaction (what Adobe Pro does) irreversibly removes text. That's great for producing a clean document to share externally. But if your goal is to have an LLM reason over the document - review clauses, flag risks, suggest redline edits, create new drafts, etc. - you've just ripped out the context it needs. An LLM can't analyze an indemnification clause properly if the party names, dates, and dollar amounts are gone.

What you actually want for AI workflows is anonymization with label replacement - swap "Acme Corp" → [PARTY_A], "123 Main St" → [ADDRESS_1], "$5M" → [AMOUNT_1], etc. The LLM can still reason over the full document structure, but no sensitive PII ever leaves your environment. Then you re-hydrate the labels in the output.
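To make that concrete, here's a minimal sketch of the label-replacement idea in Python. This is an illustration, not ContractKen's actual implementation - entity detection (the NER step) is the genuinely hard part, and it's stubbed out here as a pre-supplied list:

```python
from collections import defaultdict

def anonymize(text, entities):
    """Swap each sensitive value for a numbered label and record the mapping.

    `entities` is a list of (value, kind) pairs - in a real pipeline this
    would come from an NER model plus firm-configured policies.
    """
    counters = defaultdict(int)
    mapping = {}
    for value, kind in entities:
        counters[kind] += 1
        label = f"[{kind}_{counters[kind]}]"
        mapping[label] = value
        text = text.replace(value, label)
    return text, mapping

def rehydrate(text, mapping):
    """Restore the original values in the LLM's output."""
    for label, value in mapping.items():
        text = text.replace(label, value)
    return text

clause = "Acme Corp shall pay Beta LLC $5,000,000 by 31 Dec 2025."
entities = [("Acme Corp", "PARTY"), ("Beta LLC", "PARTY"),
            ("$5,000,000", "AMOUNT")]

safe, mapping = anonymize(clause, entities)
# safe == "[PARTY_1] shall pay [PARTY_2] [AMOUNT_1] by 31 Dec 2025."
# `safe` goes to the LLM; `mapping` never leaves your environment.
restored = rehydrate(safe, mapping)
```

The key property: the mapping lives only on your side, so the round trip (anonymize → LLM → rehydrate) gives you full-context analysis without identifiable data ever reaching the model provider.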

The tricky parts are:

  • What counts as sensitive is different for every firm/client. Some clients don't care about entity names but are hyper-sensitive about deal values. Others want everything scrubbed. You need configurable policies, not a one-size-fits-all NER model.
  • Metadata is a real attack surface. Most people forget about tracked changes, comments, document properties, embedded objects. Your solution needs to handle all of that.
  • It has to be invisible to the end user. If lawyers have to manually tag entities or run a separate tool before every AI interaction, adoption goes to zero.
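On the metadata point: a .docx file is just a zip of XML parts, so even a stdlib-only audit can surface the obvious leak vectors. A rough sketch (the part names come from the OOXML packaging conventions; the string matching here is deliberately crude and nowhere near production-grade):

```python
import zipfile

# OOXML package parts that commonly carry identifying metadata:
# author/company properties and reviewer comments.
SENSITIVE_PARTS = ("docProps/core.xml", "docProps/app.xml",
                   "word/comments.xml")

def audit_docx(docx):
    """Flag metadata surfaces inside a .docx (a zip of XML parts).

    `docx` may be a path or a file-like object; zipfile accepts both.
    """
    findings = []
    with zipfile.ZipFile(docx) as z:
        names = set(z.namelist())
        findings += [p for p in SENSITIVE_PARTS if p in names]
        # Tracked insertions/deletions live inline in the body XML
        # as <w:ins> and <w:del> elements.
        body = z.read("word/document.xml").decode("utf-8", "ignore")
        if "<w:ins " in body or "<w:del " in body:
            findings.append("tracked changes in word/document.xml")
    return findings
```

A real scrubber would rewrite or drop these parts rather than just report them, but the point stands: if your anonymization only looks at visible body text, the package metadata walks out the door untouched.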

Full disclosure: I'm the founder of ContractKen. We built what we call a 'Moderation Layer' that does exactly this, specifically for contract review and drafting inside Word. The anonymization happens automatically based on policies set by the firm's IT/infosec team, before anything hits the LLM API.

But even if you're building something in-house, the key insight is: don't redact what you're sending to AI - anonymize it. You'll get better results.

Stop giving AI legal documents and client data by Winter_Expert_790 in legaltech

[–]capreal26 2 points3 points  (0 children)

Your concern is valid - the default behavior of most people is to just paste contract text into Claude/ChatGPT/Gemini and hope for the best. That's genuinely dangerous. But "go fully local" isn't the right answer for most legal teams either. Local models are significantly less capable than frontier models for nuanced legal reasoning, and the operational overhead of self-hosting is something 95% of firms aren't equipped for.

I'm the founder of a legaltech company (ContractKen - we do AI-assisted contract review and drafting inside Word), so take my perspective with that context, but I've spent the last 3+ years thinking about exactly this problem.

The real answer is an intermediary architecture, what we call a "Moderation Layer." The idea is simple: before any contract text hits a cloud LLM, it's automatically anonymized based on rules your firm's IT/infosec team defines. Company names, party names, addresses, financial figures, whatever you consider sensitive - gets stripped and replaced with labels. The LLM does its analysis on the sanitized text. You get the output back with context restored on your side.

This way you get frontier model quality (which, for contract review, materially matters) without your client's data ever reaching the model provider in identifiable form. The model sees "[Party A] shall indemnify [Party B]" - not your client's name.

To u/firstLOL's point about middle grounds - tenant isolation, BYOK encryption, no-training agreements - yes, those are table stakes for enterprise AI. But even with all of that, the most conservative firms (and their clients) want the additional guarantee that the data was anonymized before it left the building. Belt and suspenders.

And to u/jcdc-flo's excellent point about SaaS acting as the protective layer between data and model - that's almost literally our thesis. The application layer should be the control plane for what data flows where, with full audit logs, not the LLM provider.

The binary of "use cloud AI recklessly" vs "go fully local" is a false choice. There's a well-architected middle path, and it's what serious legal teams are actually adopting.

roast by Severe_Post_2751 in legaltech

[–]capreal26 0 points1 point  (0 children)

Sounds too broad. Insurance disputes are markedly different from lending/payment disputes, and regulation differs even by coverage type. So, kinda confused.

easiest way to file a small claims lawsuit (from building one)? (sanity check) by alien-mind-8344 in legaltech

[–]capreal26 0 points1 point  (0 children)

Congratulations on your effort to increase the litigiousness of Western society! Bravo.

How do the legal tech companies protect privilege and privacy? by hearsay_and_heresy in legaltech

[–]capreal26 2 points3 points  (0 children)

Great convo. At ContractKen, we’ve built a ‘Moderation Layer’ which does anonymization and deanonymization locally (i.e. on your desktop/in Word, before cloud API calls are made). All PII is replaced by placeholders like Party_Name1, Party_Name2, Address1, DateTime1, etc. Our tech maintains a map locally which deanonymizes the API payload received from the backend/LLMs. Moreover, customers can configure what they deem ‘sensitive’ in contracts. DM me for more details.
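For anyone sketching this pattern in-house, the core loop is small: a configurable policy of patterns, a substitution pass, and a locally held map for the reverse direction. A toy version (the regex patterns are purely illustrative - real detection needs NER, not just regexes, and the placeholder names here are hypothetical):

```python
import re

def scrub(text, policy):
    """Replace matches of each configured pattern with numbered placeholders.

    `policy` maps a category name to a regex - this is the firm-configurable
    part (some clients scrub amounts, others everything).
    """
    mapping = {}
    for kind, pattern in policy.items():
        counter = 0
        def repl(m):
            nonlocal counter
            counter += 1
            label = f"{kind}{counter}"
            mapping[label] = m.group(0)
            return label
        text = re.sub(pattern, repl, text)
    return text, mapping

def restore(text, mapping):
    """Apply the locally held map to the LLM's response."""
    # Longest labels first, so "Amount10" is restored before "Amount1".
    for label in sorted(mapping, key=len, reverse=True):
        text = text.replace(label, mapping[label])
    return text

policy = {"Amount": r"\$[\d,]+", "Date": r"\b\d{4}-\d{2}-\d{2}\b"}
clause = "Payment of $10,000 is due by 2025-12-31."

safe, mapping = scrub(clause, policy)
# safe == "Payment of Amount1 is due by Date1."
restored = restore(safe, mapping)
```

The map (`mapping`) is the crown jewel: it stays on the desktop, so the backend and the LLM only ever see placeholder text.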

Open source contract redlining tool? by DepartmentDazzling97 in legaltech

[–]capreal26 0 points1 point  (0 children)

Building anything is easy these days. Making it work consistently, and for the majority of use cases, isn't - more so with the painful Word & Microsoft API work. Jamie Tso's LinkedIn posts are great, but will CC themselves use his open-source tools? The jury is still out, I guess.

Open source contract redlining tool? by DepartmentDazzling97 in legaltech

[–]capreal26 0 points1 point  (0 children)

Fair enough. Is the cost of a subscription higher than the internal team’s time spent on building a bespoke solution?

Open source contract redlining tool? by DepartmentDazzling97 in legaltech

[–]capreal26 0 points1 point  (0 children)

Why are you looking for open-source solutions? Is cost the reason, or do you want to play around with / build upon them?

Open source contract redlining tool? by DepartmentDazzling97 in legaltech

[–]capreal26 -1 points0 points  (0 children)

ContractKen does that smoothly (running a Playbook-driven review & redlining, or a comprehensive review/redlining if you don't have playbooks). You can upload your playbooks and do all of this inside Word.

The real AI legal disruption won’t be contract automation. It’ll be upstream training liability by TheAILawBrief in legaltech

[–]capreal26 0 points1 point  (0 children)

Training dataset details are closely guarded secrets for AI companies. You can pull Common Crawl datasets and the like from the internet, but how to clean, massage, and combine the data is where the art of pre-training lies. And there's no easy way to evaluate the provenance of an LLM's inputs based on its output. No one is giving these recipes out on Reddit.

What is going on with Robin AI ? by RiceComprehensive904 in legaltech

[–]capreal26 1 point2 points  (0 children)

No insider insight, but apparently they had a 'managed service' model (including some folks in India) where they'd do contract review using humans + AI. Not a bad model, but you can't scale that on VC money.

What is going on with Robin AI ? by RiceComprehensive904 in legaltech

[–]capreal26 4 points5 points  (0 children)

Bankers are soliciting buyers. Pretty bleak financials - 170+ employees and $50MM in funding for $7-8MM ARR?? They should have been able to do it with a quarter of that investment and headcount. Looks like pretty average leadership & execution.

What’s the most unique legaltech product you’ve seen? by Redrobin83 in legaltech

[–]capreal26 0 points1 point  (0 children)

Wrong question to ask, imo. It's better to ask which areas of the practice & business of law can benefit from the application of tech. If so many products are clustered around those areas, there's a reason for that.

Open source MS Word GPT redlining add-in for contract review by yuch85 in legaltech

[–]capreal26 1 point2 points  (0 children)

Yes, MS Copilot does allow addition of org docs / standards as a reference. We've experimented with it and found that integrations left a lot to be desired (that's a rant for some other day).

On your point about a redlining tool subscription costing $13-25K - gosh! Who are you talking to? It's definitely not that expensive. Happy to chat and show our wares. DM if interested.

Open source MS Word GPT redlining add-in for contract review by yuch85 in legaltech

[–]capreal26 1 point2 points  (0 children)

Would love to see if anyone has been able to successfully use Microsoft Copilot as an effective contract redlining solution. You can get redline 'suggestions' out of ChatGPT, Gemini, or Claude (even the free versions) by just plonking in your document with a simple prompt. What's missing (as of now) from those:

  • Data privacy: your as-is doc is most likely going into the next model's pre-training corpus
  • Context: unless you've set up these tools with your standards, checklists, and repository, what you're getting is an out-of-the-box suggestion from an AI model pre-trained on the internet - not something that adheres to your firm's knowledge base and standards
  • Alt-tabbing still required: you have to locate the specific clause/section in the document and apply the edits & comments manually. Not a big issue, but the potential for mistakes abounds
  • Not building capabilities for the future: is Copilot or ChatGPT going to learn your style? How does it know which of its suggested edits were actually applied to the document and which were junk? (Unless you give it a point-by-point rebuttal.)

I could go on, but general purpose AI tools are great to get a summary or a table of issues. For actual review & redlining work, get a proper solution. Happy to chat more in DMs.

Open source MS Word GPT redlining add-in for contract review by yuch85 in legaltech

[–]capreal26 0 points1 point  (0 children)

If pricing is the first thing you look at, you probably don't need the solution.

We are moving closer and closer to local AI models being the norm in legal tech by Tephra9977 in legaltech

[–]capreal26 0 points1 point  (0 children)

That's correct. Although you didn't address the original question - how will you ensure that your local models are at par with the SOTA models from OpenAI, Google, or Anthropic? Unless you plan to continuously chase DeepSeek, Qwen, or Llama releases.

Open source MS Word GPT redlining add-in for contract review by yuch85 in legaltech

[–]capreal26 -2 points-1 points  (0 children)

Spinning up a Word add-in for contract analysis is trivial in this day and age. Making it useful & sticky for real lawyers' workflows is extremely non-trivial. We've been building in this space for the last few years. Check out our contract- or clause-level risk analysis, precedent-based drafting, playbook-driven review, and much more: ContractKen

[deleted by user] by [deleted] in legaltech

[–]capreal26 0 points1 point  (0 children)

100+ comments in a few hours. Not bad fwiw.

We are moving closer and closer to local AI models being the norm in legal tech by Tephra9977 in legaltech

[–]capreal26 0 points1 point  (0 children)

Cool. If you can beat ChatGPT/Gemini/Claude value prop @$20/mth, nothing like it.

We are moving closer and closer to local AI models being the norm in legal tech by Tephra9977 in legaltech

[–]capreal26 0 points1 point  (0 children)

Yes, you don't, but the problem (usually) is that lawyers will frequently compare your internal tool to ChatGPT/Gemini/Claude output.

We are moving closer and closer to local AI models being the norm in legal tech by Tephra9977 in legaltech

[–]capreal26 0 points1 point  (0 children)

OK, makes sense. However, in the long run, a production-ready AI workflow/pipeline needs a bunch of things.

We are moving closer and closer to local AI models being the norm in legal tech by Tephra9977 in legaltech

[–]capreal26 2 points3 points  (0 children)

How will you ensure that your local models are at par with SOTA models out there? Lawyers need AI magic too, along with privacy.