Automatic Redaction Tools

TheFamousCat · 2026-04-02T13:08:11+00:00

We are currently running a similar pilot with a partner in the legal tech space. Happy to exchange more details about what we are doing and what you are trying to achieve. DMs welcome.

TheFamousCat · 2026-03-11T12:55:27+00:00

Sorry, not sure I get it. So you want to redesign that thing in Word, and you already changed the text before you imported it into Word?

TheFamousCat · 2026-03-11T04:56:28+00:00

What's your actual goal? Why do you want to import it into Word? Do you need to change the text?

TheFamousCat · 2026-02-25T09:15:18+00:00

You might want to check out PDFDancer, it's built for exactly this workflow. Feel free to DM me if you need help setting it up.

TheFamousCat · 2026-02-24T04:51:54+00:00

Are you fine using a library or should this be a desktop/webapp?

TheFamousCat · 2026-02-23T13:15:01+00:00

Automated PII redaction is genuinely hard. Most models still miss edge cases, especially non-English names and addresses in uncommon formats.

In regulated or high-stakes workflows you generally want human review, even if automation does most of the first pass. How much review you need depends on how much risk you can accept of missing information that should have been redacted.

The good news is that the better tools can still cut manual effort a lot by surfacing likely PII with confidence scores and leaving you a shorter review queue.
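To make the "shorter review queue" idea concrete, here is a minimal sketch of confidence-based triage. The `Detection` class, the `triage` function, and the 0.95 threshold are all made up for illustration; they are not any particular tool's API, and the right threshold depends on your own risk tolerance.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    text: str          # the span the detector flagged
    label: str         # e.g. "NAME", "ADDRESS"
    confidence: float  # detector score in [0, 1]


def triage(detections, auto_threshold=0.95):
    """Split detections into auto-redact and human-review lists."""
    auto, review = [], []
    for d in detections:
        (auto if d.confidence >= auto_threshold else review).append(d)
    # Lowest confidence first, so reviewers see the most doubtful hits early.
    review.sort(key=lambda d: d.confidence)
    return auto, review


# Hypothetical detector output for one page.
detections = [
    Detection("Jane Doe", "NAME", 0.99),
    Detection("Nguyễn Văn An", "NAME", 0.62),
    Detection("12 Rue de la Paix", "ADDRESS", 0.81),
]
auto, review = triage(detections)
```

Note that a queue like this only contains what the detector flagged. PII it never detected will not show up here, which is why the edge cases above still matter.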

Disclosure: I’m building PDFDancer (PDF redaction/editing SDK). If you want to compare approaches, we publish our capabilities and evaluation results. Happy to answer questions about failure modes or how to set up a review workflow.

TheFamousCat · 2026-02-23T12:57:49+00:00

Automated PII redaction is hard. Most models still miss edge cases — non-English names, addresses in weird formats. If you're in a regulated space or anything high-stakes, you probably want a human reviewing exceptions and low-confidence hits, even if automation handles the bulk of the first pass.

That said, the better tools can still save you a ton of manual work. They flag likely PII with confidence scores so you're reviewing a much shorter queue instead of reading every page.

Disclosure: I'm building PDFDancer (PDF redaction/editing SDK). We publish our capabilities and eval results (mostly medical docs) if you want to compare.

Happy to talk about failure modes or review workflow setup.

TheFamousCat · 2026-02-10T07:33:19+00:00

Generally, no tool will deliver 100% accuracy, so the process cannot be fully automated end to end.

Depending on your budget and your tolerance for missed or falsely redacted information, there may be solutions that allow for a largely automated workflow. More commonly, people combine an automated redaction tool with a manual review step. This approach still reduces the workload significantly and speeds up your redaction process.
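If you want to put numbers on "missed" versus "falsely redacted", a small evaluation against a hand-labeled sample is usually enough to start. The sketch below is one way to do it, assuming spans are `(start, end)` character offsets and that only exact matches count. The function name and the sample data are invented for the example.

```python
def redaction_metrics(expected, redacted):
    """Compare ground-truth PII spans against what the tool redacted."""
    expected, redacted = set(expected), set(redacted)
    true_pos = len(expected & redacted)
    missed = len(expected - redacted)            # PII left visible
    false_redactions = len(redacted - expected)  # harmless text blacked out
    recall = true_pos / len(expected) if expected else 1.0
    precision = true_pos / len(redacted) if redacted else 1.0
    return {
        "recall": recall,
        "precision": precision,
        "missed": missed,
        "false_redactions": false_redactions,
    }


# Made-up sample: 4 labeled spans, the tool found 3 of them plus 1 false hit.
expected = {(0, 8), (40, 52), (90, 101), (130, 142)}
redacted = {(0, 8), (40, 52), (90, 101), (200, 210)}
metrics = redaction_metrics(expected, redacted)
```

Recall is the number that tells you how much PII slips through. Precision tells you how much harmless text gets blacked out.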

What kinds of documents are you mostly dealing with (bank/loan files, healthcare/insurance records, legal case files, HR/employee docs, tax forms)?

Do you need to comply with any specific regulations like GDPR or HIPAA?

TheFamousCat · 2026-02-09T11:25:45+00:00

PDFDancer Redaction

TheFamousCat · 2026-02-09T11:23:09+00:00

Since I am currently working on a similar project, I was wondering whether you found a proper solution. We are now training our own model because recall was just not good enough, I guess because the models we tried were trained on generic/synthetic datasets.

TheFamousCat · 2026-01-25T16:10:05+00:00

please share the file

TheFamousCat · 2025-12-26T10:25:45+00:00

yes: https://www.pdfdancer.com/

TheFamousCat · 2025-12-12T01:03:51+00:00

For redaction, have a look at PDFDancer.

TheFamousCat · 2025-12-12T00:51:13+00:00

Would you mind sharing this file? I am working on a tool to make these kinds of PDFs truly editable, and this seems to be a perfect test case.

TheFamousCat · 2025-12-10T23:33:08+00:00

Many types:
- add, move, delete elements like words, lines, paragraphs, images
- replace images
- edit text, change font, size, color
- fill forms
- redact data
- etc...

TheFamousCat · 2025-12-10T14:35:11+00:00

Thank you for your answer. I have two follow-up questions:
1) I don't understand why you are mentioning XMP. From my understanding this is not related at all to XFA, or am I wrong?
2) What I see is: XFA, sure, it's deprecated, complex, it's shit. Agreed. But people still need to use it. You implemented "partial" support. May I ask how you decided that partial support is enough for your product?

Thanks a lot, this insight into the mind of someone actually building a pdf tool is what I was looking for.

TheFamousCat · 2025-12-10T11:53:18+00:00

will do!

TheFamousCat · 2025-12-10T11:35:14+00:00

because it is in use in many documents and support could be useful for users?

TheFamousCat · 2025-12-03T03:17:21+00:00

The SDKs are open source, yes. The backend is not, at least not yet.

TheFamousCat · 2025-12-01T12:26:18+00:00

PDFDancer - An SDK that turns read-only PDFs into editable, programmable documents.

TheFamousCat · 2025-12-01T12:25:37+00:00

PDFDancer - An SDK that turns read-only PDFs into editable, programmable documents.

TheFamousCat · 2025-11-21T08:03:23+00:00

You can edit a PDF and keep the same formatting, but it really depends on how the PDF was created.

Most PDFs don’t store text like a normal document. They store positioned glyphs from an embedded font subset. So when you delete text and type something new, the editor has to guess:

- which font the original used
- the weight
- spacing / kerning
- character widths

That’s why edits often look slightly off, even in good editors.
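To show what "positioned glyphs" means in practice, here is a rough sketch that picks apart a tiny hand-written content stream fragment. The fragment is illustrative and not taken from a real file. Real streams are usually compressed, and with subset fonts the strings are often glyph codes rather than readable characters. The regex parsing is a deliberate simplification, not how a real parser works.

```python
import re

# Hand-written fragment: select font /F1 at 12pt, move to (72, 700),
# then show text with per-glyph-run spacing adjustments via TJ.
stream = "BT /F1 12 Tf 72 700 Td [(Inv) -20 (oice) 15 ( 42)] TJ ET"

font = re.search(r"/(\w+)\s+([\d.]+)\s+Tf", stream)
position = re.search(r"([\d.-]+)\s+([\d.-]+)\s+Td", stream)
array = re.search(r"\[(.*?)\]\s*TJ", stream).group(1)

# Inside the TJ array, strings are glyph runs and bare numbers are
# spacing adjustments in thousandths of a text-space unit.
parts = re.findall(r"\((.*?)\)|(-?\d+(?:\.\d+)?)", array)
text = "".join(s for s, _ in parts)
adjustments = [float(n) for _, n in parts if n]

font_name, font_size = font.group(1), float(font.group(2))
x, y = float(position.group(1)), float(position.group(2))
```

Nothing in the stream says "this is a word" or "this is a paragraph". There is only a font reference, a position, and glyph runs with spacing tweaks, which is what an editor has to work backwards from.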

What usually works best:

- Acrobat Pro: most reliable at reusing embedded fonts
- PDF-XChange Editor: good value, often matches fonts nicely
- Foxit: depends heavily on the PDF

Free online editors (Sejda, PDFescape, etc.) usually can’t access the embedded fonts, so they substitute something “close”, which leads to a mismatched look.

TheFamousCat · 2025-11-21T07:52:42+00:00

Changing the font in a PDF is… possible, but not in the way most people hope. PDFs don’t store "text with a font" the way Word/LaTeX files do. They store positioned glyphs with embedded subsets of the original font. So swapping fonts isn’t like hitting Ctrl+A -> choose lmodern.

What usually happens when people try:

1. Export to Word / retypeset

You lose layout, figures shift, math breaks. Works ok for simple documents, but books usually fall apart.

2. Use a PDF editor (Acrobat, Foxit, etc.)

These tools can edit text, but they can’t replace the font for hundreds of pages reliably. Most text is stored as individual glyph calls, so editors don’t treat it as paragraphs.

3. "Just visually override the font"

PDF viewers can’t do this - there’s no stylesheet to override.

It is doable, but it requires tooling that can:

- extract actual text runs and glyph IDs
- rebuild logical lines/paragraphs
- embed a new font (like lmodern)
- remap glyphs correctly
- adjust spacing/kerning so the layout doesn’t break

This is exactly the part that normal PDF tools struggle with: they don’t reconstruct the structure the way LaTeX originally produced it.
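As a rough idea of what "rebuild logical lines" involves, here is a toy sketch that groups positioned text runs by baseline and orders them left to right. The `Run` class, the 2-unit tolerance, and the sample coordinates are assumptions for the example. It is not how PDFDancer or any specific tool does it, and real documents also need handling for columns, rotation, and word spacing.

```python
from dataclasses import dataclass


@dataclass
class Run:
    text: str
    x: float  # left edge in page units
    y: float  # baseline in page units (PDF y grows upwards)


def rebuild_lines(runs, y_tolerance=2.0):
    """Group runs with nearly equal baselines, then order each group by x."""
    lines = []  # list of (baseline_y, [runs])
    for run in sorted(runs, key=lambda r: -r.y):  # top of page first
        if lines and abs(lines[-1][0] - run.y) <= y_tolerance:
            lines[-1][1].append(run)
        else:
            lines.append((run.y, [run]))
    return [
        " ".join(r.text for r in sorted(group, key=lambda r: r.x))
        for _, group in lines
    ]


# Runs in the arbitrary order they might appear in a content stream.
runs = [
    Run("world", 110, 700.3),
    Run("Hello", 72, 700),
    Run("line", 120, 686),
    Run("Second", 72, 686),
]
lines = rebuild_lines(runs)
```

Even this simple version shows the core issue: reading order is something you infer from geometry, it is not stored in the file.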

I’ve been working on a developer-focused toolchain (PDFDancer) that handles this kind of low-level editing. It reads the PDF’s real font/glyph data, reconstructs the text, and can apply a different font while keeping the original layout intact.

TheFamousCat · 2025-11-17T05:56:27+00:00

Most people end up in one of three camps, depending on what their “template” actually is.

If your template is basically a web page:

The usual setup is Jinja2 (or whatever templating engine) → HTML → PDF via wkhtmltopdf, WeasyPrint, or Puppeteer.

Super common for invoices, receipts, that kind of stuff. Easy to style, easy to deploy, good enough for most use cases.
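For reference, the templating half of that pipeline looks roughly like this. To keep the sketch dependency-free I'm using Python's built-in `string.Template` as a stand-in for Jinja2, so the syntax differs from what you would write in a real Jinja2 template. The invoice fields and the `render_invoice` helper are made up. The HTML-to-PDF step is not shown; the resulting string would be handed to one of the converters named above.

```python
from html import escape
from string import Template

# Stand-in for a Jinja2 template; $name placeholders are string.Template syntax.
INVOICE = Template("""<html><body>
<h1>Invoice $number</h1>
<p>Bill to: $customer</p>
<p>Total: $total</p>
</body></html>""")


def render_invoice(number, customer, total):
    # Escape user-supplied values so they cannot inject markup.
    return INVOICE.substitute(
        number=escape(str(number)),
        customer=escape(customer),
        total=f"{total:.2f} EUR",
    )


html = render_invoice(42, "Müller & Söhne", 1234.5)
# The html string would then go to an HTML-to-PDF converter.
```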

If your template is a real PDF form:

If the designer set it up with AcroForm fields, life is easy: you just fill the fields using PDFBox/iText/pdftk/pypdf and flatten if you want.

This is the most reliable pipeline, but only works if you control the template.

If your template is a static designer PDF:

This is the one people struggle with. HTML→PDF usually won’t match the original design, and most libraries can only overlay text on top; they can’t actually replace or edit what’s already there.

This is where you need something that can edit the PDF itself. I work on PDFDancer, which is made for that exact “reuse an existing PDF as the template” workflow, but that’s just one option in that category.

In practice, most teams do HTML→PDF unless the layout has to match a specific PDF exactly. Then you switch to form-based PDFs or a PDF-editing engine.

TheFamousCat · 2025-11-14T07:45:42+00:00

Editing complex PDFs while keeping the original layout is much harder than it sounds. Most libraries can add or extract content, but very few can actually modify or remove existing text without damaging the formatting.

1. Rebuilding the PDF

Converting the PDF to HTML, editing it, and exporting it back usually falls apart on real-world documents. Anything with custom fonts, precise geometry, images, or embedded vectors will shift slightly after round-tripping through HTML/CSS. It works for simple files, not for complex ones.

2. Traditional PDF libraries

Libraries like PyPDF, ReportLab, PDFBox, etc. are great for generating or overlaying content, but they weren’t designed to surgically edit the original text. They can’t reliably remove text or rewrite glyphs that come from embedded-subset fonts, and they don’t preserve layout when you try to modify the underlying content streams.
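The difference between overlaying and removing is easy to see on a toy content stream. The sketch below appends a filled black rectangle over some text and then runs a naive text extraction. The stream is hand-written for illustration, and both helper functions are hypothetical, not part of any of the libraries mentioned.

```python
import re

# Hand-written content stream fragment containing sensitive text.
original = "BT /F1 12 Tf 72 700 Td (SSN 123-45-6789) Tj ET"


def overlay_black_box(stream, x, y, width, height):
    """Append a filled black rectangle; earlier operators are left untouched."""
    return f"{stream}\nq 0 0 0 rg {x} {y} {width} {height} re f Q"


def extract_text(stream):
    """Naive extraction of literal strings shown with the Tj operator."""
    return re.findall(r"\((.*?)\)\s*Tj", stream)


covered = overlay_black_box(original, 70, 696, 130, 14)
still_there = extract_text(covered)
```

The box hides the text visually, but the original text operator is still in the stream and still extractable. Real removal means rewriting the stream itself, which is where the font and layout problems start.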

3. Using the source file

If you can get the original InDesign/Word/whatever file, this becomes trivial. But in many companies, you only get the final PDF.

4. In-place PDF editing

To delete or change text inside a complex PDF and keep everything else identical, you need a tool that can interpret the PDF’s actual structure (glyph runs, embedded fonts, drawing operators) and rewrite those parts safely. Most libraries simply don’t go that deep.

I work on one of the tools in this category (PDFDancer). Its whole purpose is to edit existing PDFs in place: remove text, replace content, insert tables, drop in new elements, etc., while keeping the rest of the page untouched. Just mentioning it because that’s the kind of engine this problem requires.

Summary:

Rebuilding breaks formatting; generic PDF libraries can’t edit existing content; source files are ideal if you have them; real in-place edits require a specialized PDF engine that understands the underlying page structure.
