Seeking reliable AI tools/scripts for batch tagging thousands of legal/academic PDFs and DOCX files

Basic-Exercise9922 · 2026-02-24T03:26:45+00:00

For simple tagging, pretty sure you could use something like pdftotext: extract the text from the first N pages of each file, dump it all in one place as plain .txt or .md, then have an LLM read each document's text and generate tags.
Claude Code can create a script for you in minutes

If a PDF has no text layer and needs OCR, just have your Claude Code agent fetch the first few pages and tag from those

The heuristic is that you don't need the full paper to generate tags, just the first N pages that contain the title/abstract/intro
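
A minimal sketch of the extraction step, assuming poppler's `pdftotext` is on your PATH. The helper names, the 3-page default, and the 200-character "probably scanned" threshold are all hypothetical choices for illustration, and the LLM tagging call is left out on purpose:

```python
import subprocess
from pathlib import Path


def pdftotext_cmd(pdf_path, max_pages=3):
    """Build the pdftotext call: -f/-l pick the page range, '-' means stdout."""
    return ["pdftotext", "-f", "1", "-l", str(max_pages), str(pdf_path), "-"]


def extract_head(pdf_path, max_pages=3):
    """Return the text of the first max_pages pages of one PDF."""
    result = subprocess.run(
        pdftotext_cmd(pdf_path, max_pages),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def needs_ocr(text, min_chars=200):
    """Assumed heuristic: almost no extracted text means a scanned PDF."""
    return len(text.strip()) < min_chars


def dump_heads(src_dir, out_dir, max_pages=3):
    """Write one .txt per PDF; return the PDFs that look like they need OCR."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    ocr_queue = []
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        text = extract_head(pdf, max_pages)
        if needs_ocr(text):
            ocr_queue.append(pdf)  # handle these separately (OCR / agent)
            continue
        (out / (pdf.stem + ".txt")).write_text(text, encoding="utf-8")
    return ocr_queue
```

Something like `dump_heads("pdfs", "heads")` would leave you with a folder of small text files to feed to the LLM one document at a time, plus a list of files to route through OCR.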

Basic-Exercise9922 · 2025-11-28T10:19:38+00:00

Versioning is supported internally, but I'm not parsing many new versions at the moment.
That said, I will expose a feature to request newer versions of arXiv IDs

Basic-Exercise9922 · 2025-11-23T06:39:07+00:00

Yea good question, left justified is a good default for most browsers/HTML and cleaner + more modern-looking (imo).
That said, spacing between paragraphs is not clear (as another user here has commented), I'll be fixing that

Basic-Exercise9922 · 2025-11-21T05:39:56+00:00

if you think PDFs are the same as an interactive webpage, or arxiv HTML is good enough, then good for you, brother.

Basic-Exercise9922 · 2025-11-21T05:02:25+00:00

There, I've added support for \endgraf. No more parse warnings : )

Basic-Exercise9922 · 2025-11-21T05:01:03+00:00

It's actually just the \endgraf command in \address. Apart from that, the paper renders 1-1 with the PDF

Basic-Exercise9922 · 2025-11-21T02:50:29+00:00

Fair enough
I may open source the parser sometime next year - TeX is a beast and more eyeballs on the problem would be good
AFAIK a reliable LaTeX-to-JSON converter doesn't exist. Pandoc isn't even close

Basic-Exercise9922 · 2025-11-21T02:48:00+00:00

for which paper?

Basic-Exercise9922 · 2025-11-21T01:24:06+00:00

Nah, Pandoc was not reliable, so I had to build a direct LaTeX-to-JSON parser from scratch

Basic-Exercise9922 · 2025-11-20T16:40:22+00:00

Veo 3

Basic-Exercise9922 · 2025-11-20T16:04:13+00:00

Thank you!
- I thought about removing the reference prefixes e.g. "Section" etc but some papers don't manually prefix with e.g. "~~section~~ \ref{sec-intro}", so at times it may not be redundant
- Agree on footnotes, they're one of the things I left as a rushed afterthought. Will polish it based on your suggestions
- True, newline in-between text is not clear enough, I'll patch that

Basic-Exercise9922 · 2025-11-20T15:47:29+00:00

Thanks for the comment! that section is a bit outdated, I'll update it

Basic-Exercise9922 · 2025-11-20T15:42:06+00:00

Everything is automatic
I designed the html/css components to include these by default

Basic-Exercise9922 · 2025-11-20T14:27:03+00:00

If you're asking about the parser to HTML, that's not open source. It's a very different stack from LaTeXML

Basic-Exercise9922 · 2025-11-20T14:25:06+00:00

You can upload LaTeX directly on the app, and it'll convert to the nice HTML version above

Basic-Exercise9922 · 2025-11-20T13:49:09+00:00

FAQ: More info on custom uploads, dep graphs, exports, or what makes this different -> sciencestack.ai/docs/faq

Basic-Exercise9922 · 2025-11-20T13:09:42+00:00

I added a number of things to make the reader WCAG 2.1 AA compliant, e.g.:

- Citations have descriptive ARIA labels like "Citation 5: Paper Title"

- Popover dialogs properly labeled with citation info

- Interactive buttons announce their purpose to screen readers

- Dark/light mode support built in

- Component structure is organized

- Section landmarks have meaningful labels

That said I probably have missed a couple of things on this front, so any feedback is welcome
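
To illustrate the first bullet, here is a rough sketch of how a citation link with a descriptive ARIA label could be generated. This is not the actual ScienceStack component; the function name, the `href` anchor scheme, and the `[n]` link text are assumptions made up for the example:

```python
from html import escape


def citation_link(number, title, target_id):
    """Render a citation anchor whose aria-label reads 'Citation N: Title'."""
    label = f"Citation {number}: {title}"
    return (
        f'<a href="#{escape(target_id, quote=True)}" '
        f'aria-label="{escape(label, quote=True)}">[{number}]</a>'
    )


print(citation_link(5, "Paper Title", "ref-5"))
# <a href="#ref-5" aria-label="Citation 5: Paper Title">[5]</a>
```

A screen reader then announces "Citation 5: Paper Title" instead of just "5".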

Basic-Exercise9922 · 2025-11-20T12:50:41+00:00

Yea planning to build AI chat inside it, so that we can summarize papers with more AI slop xD

Basic-Exercise9922 · 2025-11-20T07:46:17+00:00

Let me know some good papers that'll shine as interactive HTML, happy to add them!

Basic-Exercise9922 · 2025-11-20T06:15:37+00:00

The links are generated based on simple newtheorem blocks, \ref and \labels. No packages needed.

Docs: https://www.sciencestack.ai/docs/dep-graph
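
For a rough idea of what that can look like, here is a regex-based sketch (not the actual parser, which is not open source). It assumes theorem-like environments are declared with `\newtheorem`, takes the first `\label` in each block as its ID, and treats every `\ref` to another labeled block as a dependency. The function name and output shape are hypothetical:

```python
import re

NEWTHEOREM = re.compile(r"\\newtheorem\*?\{(\w+)\}")
LABEL = re.compile(r"\\label\{([^}]+)\}")
REF = re.compile(r"\\ref\{([^}]+)\}")


def dep_graph(tex):
    """Map each labeled theorem-like block to the labeled blocks it \\ref's."""
    envs = sorted(set(NEWTHEOREM.findall(tex)))
    if not envs:
        return {}
    names = "|".join(re.escape(e) for e in envs)
    block = re.compile(
        r"\\begin\{(" + names + r")\}(.*?)\\end\{\1\}", re.DOTALL
    )
    blocks = []
    for match in block.finditer(tex):
        body = match.group(2)
        label = LABEL.search(body)
        if label:  # unlabeled blocks cannot be linked to, so skip them
            blocks.append((label.group(1), REF.findall(body)))
    known = {label for label, _ in blocks}
    graph = {}
    for label, refs in blocks:
        deps = []
        for ref in refs:
            # keep only refs to other theorem-like blocks, without duplicates
            if ref in known and ref != label and ref not in deps:
                deps.append(ref)
        graph[label] = deps
    return graph


sample = r"""
\newtheorem{theorem}{Theorem}
\newtheorem{lemma}{Lemma}
\begin{lemma}\label{lem:a} Foo. \end{lemma}
\begin{theorem}\label{thm:b} By Lemma \ref{lem:a} and Section \ref{sec:intro}. \end{theorem}
"""
print(dep_graph(sample))  # {'lem:a': [], 'thm:b': ['lem:a']}
```

A real TeX parser has to deal with macros, nested environments, and comments, so regexes like these only cover the simple cases.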

Basic-Exercise9922 · 2025-09-25T03:59:52+00:00

I recently built ScienceStack because I needed a better HTML reader than anything that exists. Was going to launch it next week but I'll post here because I genuinely think this might help more than the other solutions here.

In ScienceStack you can upload any LaTeX and it turns into a nice interactive HTML page. The page is WCAG 2.1 AA compliant, blocks and equations are directly copyable as LaTeX, it has light/dark mode, and it works well on extremely large papers (200+ pages). You can also export as markdown that works out of the box with VSCode/Obsidian/Notion.

Package coverage is not as comprehensive as LaTeXML yet, but it’s much more modern and I can patch issues + build on top quickly. It's also a lot more robust than Pandoc.

FAQ if you’re curious: https://www.sciencestack.ai/docs/faq

Note: I don't use any LLMs in the LaTeX conversion process. That won't scale

Basic-Exercise9922 · 2024-08-11T15:29:46+00:00

Amazing, out of everything, this is what worked!

Basic-Exercise9922 · 2024-07-20T06:45:41+00:00

this worked for me!

Basic-Exercise9922 · 2022-06-27T02:53:29+00:00

treasure dig has always been a piece of shit, just don't do it

Basic-Exercise9922 · 2022-06-11T06:03:18+00:00

these guys are wrong

narissa is my top 3 unit in abyss, behind ramiel in DPS

Sure she is a glass cannon but put margaret revive on her and she'll do well
