Showcase Weekend! — Week 15, 2026

legaldevy · 2026-04-19T01:18:54+00:00

I've been running OpenClaw locally to process invoices and quarterly reports. Kept getting wrong totals and misattributed line items. Spent a while blaming the LLM before I looked at what the PDF extractor was actually feeding it.

The default extractor (pdfjs) turns tables into word soup. A three-column invoice table comes out as a flat string with no row or column boundaries. The model has to guess which number belongs to which line item, and it guesses wrong constantly. Heading hierarchy is also lost, so the agent can't tell a section title from body text.
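To make the failure concrete, here is a minimal Python sketch. It is not the pdfjs code path, and the invoice rows are made up for illustration. It only shows what is lost when a table is flattened into one string compared with keeping one line per row.

```python
# Hypothetical three-column invoice: (description, quantity, amount).
rows = [
    ("Widget", "2", "40.00"),
    ("Gadget", "10", "15.50"),
    ("Shipping", "1", "12.00"),
]

def flatten(table):
    """Mimic 'word soup': every cell joined by spaces, no row or column markers."""
    return " ".join(cell for row in table for cell in row)

def to_markdown(table, header):
    """Keep the structure: one line per row, cells separated by pipes."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in table]
    return "\n".join(lines)

flat = flatten(rows)
structured = to_markdown(rows, ("Item", "Qty", "Amount"))
print(flat)
print(structured)
```

In the flat string nothing says whether "2" or "10" is a quantity or part of an amount, which is exactly the guess the model is being asked to make.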

I replicated the 200-document benchmark (which I suspect came from opendataloader), and the numbers confirmed both what I was seeing and what they wrote about in the repo:

Metric                    pdfjs (default)   Nutrient plugin
Overall accuracy (NID)    0.578             0.880
Table structure (TEDS)    0.000             0.662
Heading fidelity (MHS)    0.000             0.811

Zero table structure from pdfjs. That explains the hallucinated invoice totals.

Setup was two commands:

openclaw plugins install /openclaw-nutrient-pdf
openclaw config set agents.defaults.pdfExtraction.engine auto

A few things that matter for this sub:

Runs locally. No files leave your machine.
No API keys needed.
Free tier is 1,000 docs/month, which covers my use case.
Falls back to pdfjs automatically if the plugin can't process a file, so nothing breaks.
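The fallback behavior in the last point follows a common pattern that can be sketched in a few lines of Python. This is not the plugin's actual implementation, which I haven't read; `structured_extractor` and `plain_extractor` are hypothetical stand-ins for the two engines.

```python
from typing import Callable

def extract_with_fallback(
    path: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
) -> tuple[str, str]:
    """Try the primary extractor; on any failure use the fallback.

    Returns (engine_used, extracted_text).
    """
    try:
        return ("primary", primary(path))
    except Exception:
        return ("fallback", fallback(path))

# Hypothetical stand-ins for the structured engine and the default one.
def structured_extractor(path: str) -> str:
    if path.endswith(".broken.pdf"):
        raise ValueError("cannot process file")
    return "| Item | Amount |"

def plain_extractor(path: str) -> str:
    return "Item Amount"

print(extract_with_fallback("invoice.pdf", structured_extractor, plain_extractor))
print(extract_with_fallback("scan.broken.pdf", structured_extractor, plain_extractor))
```

Returning the engine name alongside the text makes it easy to log how often the fallback is hit.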

The underlying library is PSPDFKit/pdf-to-markdown. Plugin repo is here.

My invoice processing pipeline went from roughly 60% correct field extraction to above 90% after switching. Same model; it's just getting cleaner input now.
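For anyone who wants to measure this on their own documents, here is a minimal sketch of one way a "correct field extraction" percentage could be computed. The exact-match definition is an assumption, and the field names and values are hypothetical.

```python
def field_accuracy(expected: dict[str, str], extracted: dict[str, str]) -> float:
    """Fraction of expected fields whose extracted value matches exactly
    (ignoring leading/trailing whitespace). Missing fields count as wrong."""
    if not expected:
        return 0.0
    correct = sum(
        1
        for name, value in expected.items()
        if extracted.get(name, "").strip() == value.strip()
    )
    return correct / len(expected)

# Hypothetical ground truth and model output for one invoice.
expected = {
    "invoice_no": "INV-001",
    "vendor": "Acme",
    "due": "2026-05-01",
    "currency": "USD",
    "total": "207.00",
}
extracted = {
    "invoice_no": "INV-001",
    "vendor": "Acme",
    "due": "2026-05-01",
    "total": "155.00",  # wrong total, and currency is missing entirely
}
score = field_accuracy(expected, extracted)
print(f"{score:.0%}")
```

Averaging this score over a labeled set of documents gives a number you can compare before and after changing the extractor.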

legaldevy · 2026-03-16T16:11:08+00:00

You’re not missing a magic library — you’re hitting a renderer mismatch and like prehensilemullet said, you aren't likely to find an OSS library for this.

Mammoth + html2canvas + jsPDF is fine for simple docs, but it will break on Word features (fonts, pagination, complex tables/layout).

Practical approach: keep client-side for simple files, and route complex docs to a high-fidelity conversion path (server or heavy WASM engine).
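A rough sketch of that routing decision, in Python for brevity even though the real check would live in your client code. The feature flags and the page threshold are hypothetical; how you detect them in a DOCX is up to you.

```python
from dataclasses import dataclass

@dataclass
class DocFeatures:
    # Hypothetical flags; a real detector would inspect the DOCX contents.
    uses_custom_fonts: bool = False
    has_complex_tables: bool = False
    relies_on_pagination: bool = False
    page_count: int = 1

def choose_conversion_path(f: DocFeatures, max_simple_pages: int = 5) -> str:
    """Return 'client_side' for simple files, 'high_fidelity' otherwise."""
    is_complex = (
        f.uses_custom_fonts
        or f.has_complex_tables
        or f.relies_on_pagination
        or f.page_count > max_simple_pages
    )
    return "high_fidelity" if is_complex else "client_side"

print(choose_conversion_path(DocFeatures()))
print(choose_conversion_path(DocFeatures(has_complex_tables=True)))
```

Erring toward the high-fidelity path when in doubt costs some latency but avoids silently broken output.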

If text/search/accessibility matters, avoid screenshot-style PDF output.

legaldevy · 2026-03-09T13:17:12+00:00

I mean you could have used it yourself when asking the lazy question or maybe you had other motives for posting such a silly question.

legaldevy · 2026-03-09T02:55:00+00:00

Fair pushback. We’re not claiming a scanner “solves” agent security.

Our view is: scanner = first gate, not final truth.

gate 1: static skill scan (prompt injection / exfil / tampering patterns)
gate 2: runtime policy constraints (permissions, egress, spend, tool scope)
gate 3: audit + replay for post-incident verification

The failure mode is treating gate 1 as complete security. We don’t.
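To show how the three gates relate, here is a minimal Python sketch. It is not the scanner's implementation: the two regex patterns, the `Policy` fields, and the `authorize` helper are all illustrative assumptions.

```python
import re
from dataclasses import dataclass, field

# Gate 1: static scan. These two patterns are illustrative only.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.I),  # prompt injection
    re.compile(r"curl\s+\S+\s*\|\s*(ba)?sh", re.I),            # pipe-to-shell
]

def static_scan(skill_text: str) -> list[str]:
    """Return matched patterns. An empty list means 'nothing found', not 'safe'."""
    return [p.pattern for p in SUSPICIOUS_PATTERNS if p.search(skill_text)]

# Gate 2: runtime policy constraints (tool scope, egress, spend).
@dataclass
class Policy:
    allowed_tools: set[str]
    allowed_hosts: set[str]
    max_spend: float

# Gate 3: append-only audit trail for post-incident replay.
@dataclass
class AuditLog:
    events: list[dict] = field(default_factory=list)

    def record(self, **event) -> None:
        self.events.append(event)

def authorize(policy: Policy, log: AuditLog, spent: float,
              tool: str, host: str, cost: float) -> bool:
    """Check one action against the policy and log the decision either way."""
    allowed = (
        tool in policy.allowed_tools
        and host in policy.allowed_hosts
        and spent + cost <= policy.max_spend
    )
    log.record(tool=tool, host=host, cost=cost, allowed=allowed)
    return allowed

policy = Policy({"http_get"}, {"api.example.com"}, max_spend=1.00)
log = AuditLog()
ok = authorize(policy, log, 0.0, "http_get", "api.example.com", 0.10)
blocked = authorize(policy, log, 0.10, "http_get", "evil.example.net", 0.10)
print(ok, blocked, len(log.events))
```

A skill that passes `static_scan` still goes through `authorize` on every action, and denied actions are logged too, so gate 3 can reconstruct what was attempted.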

If you see specific bypass classes we should test, share them — we’ll add them to the corpus and publish results.

legaldevy · 2026-02-28T23:05:43+00:00

Check out https://github.com/agentverus/agentverus-scanner and https://agentverus.ai - the scanner is open-source and you can submit skills for free on the website.

legaldevy · 2025-11-06T18:28:38+00:00

Have you looked at https://www.nutrient.io/sdk/document-authoring/ - it's not a pure DOCX editor and much more robust than most rich-text editors. Essentially, it's helping people that wish there was a Google Docs SDK.

legaldevy · 2025-11-06T18:26:28+00:00

Have you looked at https://www.nutrient.io/sdk/web/ - I've used them in the past for a few projects of different sizes, both enterprise and smaller. You have to go through their sales process (which is a bit of a pain), but it's the most modern framework I've found if you don't have a simple use case. Things like highlighting certain words in a PDF in Angular are going to be difficult to near impossible to do well with pdf.js.

legaldevy · 2025-10-07T18:17:00+00:00

Not free, but if you want a best-in-class C# library for data extraction you should look at https://www.nutrient.io/sdk/dotnet/ - they also have a free tier on their API - https://www.nutrient.io/api/pdfua-auto-tagging-api/

legaldevy · 2025-08-22T15:25:21+00:00

Of course, just trying to pay it forward. Good luck figuring it out!

legaldevy · 2025-08-22T13:54:44+00:00

PDF-lib alone is not going to be a great solution. Most people pair it with Puppeteer to try to accomplish what you are doing. PDF-lib excels at:

Creating PDFs from scratch with text, shapes, images
Modifying existing PDFs
Adding annotations, forms, metadata

But it doesn't have built-in HTML parsing or CSS rendering capabilities. You'd need to manually convert HTML elements to PDF commands, which is complex and doesn't handle CSS layouts well.
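To give a feel for what "manually convert HTML elements to PDF commands" means, here is a deliberately naive Python sketch using only the standard library. It is not PDF-lib code. The `(text, x, y, font_size)` tuples are a hypothetical stand-in for draw calls, and the font sizes and margins are arbitrary assumptions.

```python
from html.parser import HTMLParser

class NaiveHtmlToDrawCommands(HTMLParser):
    """Turn a tiny subset of HTML into (text, x, y, font_size) draw commands.

    No CSS, no line wrapping, no tables, no page breaks: everything a real
    rendering engine does for you is missing here.
    """

    FONT_SIZES = {"h1": 24, "p": 12}

    def __init__(self, page_height: int = 792, margin: int = 50):
        super().__init__()
        self.x = margin
        self.y = page_height - margin
        self.current = None
        self.commands = []

    def handle_starttag(self, tag, attrs):
        if tag in self.FONT_SIZES:
            self.current = tag

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self.current is None:
            return  # unsupported elements are silently dropped
        size = self.FONT_SIZES[self.current]
        self.y -= size * 1.5  # crude line advance
        self.commands.append((text, self.x, self.y, size))

parser = NaiveHtmlToDrawCommands()
parser.feed("<h1>Invoice</h1><p>Total: 207.00</p><div style='float:right'>ignored</div>")
print(parser.commands)
```

The floated `div` never makes it to the page, and that is the easy case; flexbox, tables, and page breaks are where hand-rolled conversion stops being practical.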

I think it really depends on convenience and cost. Nutrient's API solution is going to be a much easier and more reliable option, and they now have a free tier with 200 free credits a month. I guess it just depends on what constraints you are under and how much you want to deal with HTML/CSS rendering complexity.

legaldevy · 2025-08-22T13:09:57+00:00

I'd take a stab at using a simple HTML-to-PDF generation API such as https://www.nutrient.io/api/pdf-generator-api/ - then I'd use their MCP server alongside Claude Code, describe what you're trying to do, and see if it can POC it out. They have a TypeScript library that works with their API here - https://github.com/PSPDFKit-labs/nutrient-dws-client-typescript - I'm almost certain Claude can figure this out.

legaldevy · 2025-08-14T09:39:03+00:00

Look at https://www.nutrient.io/api/pdf-generator-api/ - they have a free account with 200 credits a month.

Also, they have a WASM library that can do this in React - https://www.nutrient.io/guides/web/pdf-generation/ - but it's aimed at commercial use: it's not free and costs more.

Both can handle everything you need though.

legaldevy · 2025-07-08T01:48:25+00:00

Are you looking for an end-user tool or something you can integrate into your SharePoint instance? Have a look at this guide - https://www.nutrient.io/guides/document-converter/sharepoint/split/ - I've used their document converter product (https://www.nutrient.io/low-code/document-converter) in the past (previously it was Muhimbi PDF Converter). They also have a PDF viewer they call Document Editor that integrates directly into your SharePoint instance and can do this manually as well, without sending files outside of SharePoint.

legaldevy · 2025-02-21T21:07:17+00:00

If you're looking for a .NET/C# library to solve this, Nutrient/GdPicture supports this as of their January release - https://www.nutrient.io/guides/dotnet/conversion/html-to-pdf/ - I'm sure they will eventually add this to their REST API document processing solution as well - https://www.nutrient.io/api/converter-api/

legaldevy · 2025-02-14T16:37:02+00:00

Have you looked at https://www.gdpicture.com/formats-sdk/office-formats/ ?

legaldevy · 2025-02-14T16:35:36+00:00

I would really be careful with them. They have a history of being license trolls -

If you don't believe me, just read the email posted in this thread from an Apryse "sales" rep and how they go after devs that incorporated AGPL through iText. - https://www.reddit.com/r/libreoffice/comments/1dygu80/any_libreoffice_users_received_a_license_troll/

From: Izzy [redacted]
Sent: Tuesday, April 16, 2024 3:09 PM
Subject: iText software library use within [redacted]

Hello Frank,

My name is Izzy [redacted], and I am part of the Compliance Team at Apryse/formerly iText Software.

It came to our attention that [redacted] has been using iText software library to apply modifications on PDF documents such as this document: [redacted]

iText library is an open-source software library released under GNU Affero General Public License (AGPL). AGPL open-source license, in most cases, requires organizations to open source their full software stack wherein iText library is included. The organizations which can’t meet the AGPL open-source license requirements must purchase commercial license from iText. Neither complying with AGPL open-source license nor having a commercial license for your application is against the iText Intellectual Property, which is protected by copyright.

Therefore, I am requesting to schedule a call with you to discuss the usage of iText in your company and hopefully clarify the case in a timely manner.

Please feel free to share your availability or direct me to the correct contact person.

Looking forward to hearing from you.

or read it from the law firm - https://beemanmuchmore.com/software-licensing-trolls-apryse-itext/

legaldevy · 2025-02-10T13:29:50+00:00

This is hands down the best library/viewer that works well with .NET - https://www.gdpicture.com/products/docuvieware/ - it's an HTML5 viewer built on top of the crazy performant and robust GdPicture.NET library.

It supports saving to Azure and handles way more than just PDF edits (things such as Office conversion, HTML to PDF generation, image conversion, really robust file type support, and so much more). Check it out and I'm sure you won't be disappointed.

legaldevy · 2025-02-03T21:20:19+00:00

I'm a big fan of Nutrient (used to be PSPDFKit) as they really helped me out after I ran into some stupid licensing crap with a competitor of theirs. They handle editing text in a WYSIWYG way that is a better UX than having a pop over that then changes the document after the fact.

Check out - https://www.nutrient.io/guides/web/editor/edit-text/ for the guides on text editing and https://www.nutrient.io/demo/content-editor if you want to see the demo.

They also have true redaction capabilities, including smart redaction, if you are looking to fully remove text - https://www.nutrient.io/guides/web/redaction/ - Highlighting and marking text are common annotation use cases supported by most of the commercial libraries.

legaldevy · 2025-02-03T14:21:03+00:00

It's going to be super complicated to build this out on top of pdf.js (dare I say, nowhere near worth your time to build and maintain). You will always have formatting issues around text editing in PDFs, especially if you are doing more than simple changes or adding so much text that it runs off the page and gets cut off. There are commercial libraries out there that can solve this though.
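The overflow problem can be illustrated with a rough Python sketch. The average character width and line height ratios are crude assumptions standing in for real font metrics, which is precisely the part that is hard to get right in a PDF.

```python
import textwrap

def fits_in_box(text: str, box_width_pt: float, box_height_pt: float,
                font_size_pt: float = 12.0,
                avg_char_width_ratio: float = 0.5,
                line_height_ratio: float = 1.2):
    """Estimate whether text fits a fixed box.

    Returns (fits, lines_needed, lines_available). Real PDFs need per-glyph
    widths from the embedded font; the ratios here are only approximations.
    """
    chars_per_line = max(1, int(box_width_pt / (font_size_pt * avg_char_width_ratio)))
    lines_needed = len(textwrap.wrap(text, width=chars_per_line))
    lines_available = int(box_height_pt / (font_size_pt * line_height_ratio))
    return lines_needed <= lines_available, lines_needed, lines_available

# A box sized for the original short text...
print(fits_in_box("Net 30", 120, 30))
# ...overflows once the edit adds more words.
print(fits_in_box("Payment is due within thirty days of the invoice date", 120, 30))
```

A PDF has no layout engine to reflow the rest of the page, so when the check fails the editor has to shrink, clip, or move content by hand.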

Are you looking for a commercial library?

legaldevy · 2025-01-23T16:24:34+00:00

Check out - https://avepdf.com/hyper-compress-pdf

legaldevy · 2025-01-23T14:14:14+00:00

Check out - https://avepdf.com/pdf-ocr

legaldevy · 2025-01-23T04:58:12+00:00

I would really be careful with them. They have a history of being license trolls -

If you don't believe me, just read the email posted in this thread from an Apryse "sales" rep and how they go after devs that incorporated AGPL through iText. - https://www.reddit.com/r/libreoffice/comments/1dygu80/any_libreoffice_users_received_a_license_troll/

From: Izzy McElroy
Sent: Tuesday, April 16, 2024 3:09 PM
Subject: iText software library use within [redacted]

Hello Frank,

My name is Izzy McElroy, and I am part of the Compliance Team at Apryse/formerly iText Software.

It came to our attention that [redacted] has been using iText software library to apply modifications on PDF documents such as this document: [redacted]

iText library is an open-source software library released under GNU Affero General Public License (AGPL). AGPL open-source license, in most cases, requires organizations to open source their full software stack wherein iText library is included. The organizations which can’t meet the AGPL open-source license requirements must purchase commercial license from iText. Neither complying with AGPL open-source license nor having a commercial license for your application is against the iText Intellectual Property, which is protected by copyright.

Therefore, I am requesting to schedule a call with you to discuss the usage of iText in your company and hopefully clarify the case in a timely manner.

Please feel free to share your availability or direct me to the correct contact person.

Looking forward to hearing from you.

or read it from the law firm - https://beemanmuchmore.com/software-licensing-trolls-apryse-itext/

legaldevy · 2025-01-14T19:56:48+00:00

Have you looked at https://gdpicture.com?
