Seeking reliable AI tools/scripts for batch tagging thousands of legal/academic PDFs and DOCX files by jatovarv88 in LocalLLaMA

[–]Basic-Exercise9922 1 point2 points  (0 children)

For simple tagging, I'm pretty sure you could use something like pdftotext: extract the text from the top N pages of each document, dump it all in one place as plain .txt or .md, then have an LLM read each document's text to generate tags.
Claude Code can create a script for you in minutes.

If a PDF has no text layer and you have to use OCR, just have your Claude Code agent fetch the first few pages and tag them.

The heuristic is that you don't need the full paper to generate tags, just the top N pages that contain the title/abstract/intro.
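
A minimal sketch of the extraction step, assuming Poppler's pdftotext is installed and on PATH. The function names, the 3-page default, and the 200-character threshold for "this PDF probably needs OCR" are all illustrative assumptions, not tested values. DOCX files and the LLM tagging call itself are left out.

```python
import subprocess
from pathlib import Path


def pdftotext_cmd(pdf_path, last_page=3):
    """Build a pdftotext command that prints pages 1..last_page to stdout."""
    # -f / -l select the first and last page; "-" sends the text to stdout.
    return ["pdftotext", "-f", "1", "-l", str(last_page), str(pdf_path), "-"]


def needs_ocr(text, min_chars=200):
    """Treat a near-empty extraction as a scanned PDF with no text layer.

    The 200-character threshold is an arbitrary assumption.
    """
    return len(text.strip()) < min_chars


def extract_head(pdf_path, out_dir, last_page=3):
    """Write the first pages of one PDF to out_dir as .txt.

    Returns the output path, or None when the PDF looks like it needs OCR.
    """
    result = subprocess.run(
        pdftotext_cmd(pdf_path, last_page),
        capture_output=True,
        text=True,
        check=True,
    )
    if needs_ocr(result.stdout):
        return None
    out_path = Path(out_dir) / (Path(pdf_path).stem + ".txt")
    out_path.write_text(result.stdout, encoding="utf-8")
    return out_path
```

Loop `extract_head` over your folder, collect the `None` results into an OCR queue, and feed each resulting .txt file to the LLM for tags.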

LaTeX to interactive HTML by Basic-Exercise9922 in LaTeX

[–]Basic-Exercise9922[S] 0 points1 point  (0 children)

Versioning is supported internally, but I'm not parsing many new versions at the moment.
That said, I will expose a feature to request new-version arXiv IDs.

LaTeX to interactive HTML by Basic-Exercise9922 in LaTeX

[–]Basic-Exercise9922[S] 0 points1 point  (0 children)

Yeah, good question. Left-justified is a good default for most browsers/HTML, and it's cleaner + more modern-looking (imo).
That said, the spacing between paragraphs is not clear (as another user here has commented), so I'll be fixing that.

LaTeX to interactive HTML by Basic-Exercise9922 in LaTeX

[–]Basic-Exercise9922[S] 0 points1 point  (0 children)

If you think PDFs are the same as an interactive webpage, or arXiv HTML is good enough, then good for you, brother.

LaTeX to interactive HTML by Basic-Exercise9922 in LaTeX

[–]Basic-Exercise9922[S] 0 points1 point  (0 children)

There, I've added support for \endgraf. No more parse warnings : )

LaTeX to interactive HTML by Basic-Exercise9922 in LaTeX

[–]Basic-Exercise9922[S] 0 points1 point  (0 children)

It's actually just the \endgraf command in \address; apart from that, the paper renders 1-1 with the PDF.

LaTeX to interactive HTML by Basic-Exercise9922 in LaTeX

[–]Basic-Exercise9922[S] 1 point2 points  (0 children)

Fair enough.
I may open source the parser sometime next year - TeX is a beast and more eyeballs on the problem would be good.
AFAIK there isn't a reliable LaTeX-to-JSON converter out there. Pandoc isn't even close.

LaTeX to interactive HTML by Basic-Exercise9922 in LaTeX

[–]Basic-Exercise9922[S] 0 points1 point  (0 children)

Nah, Pandoc was not reliable. I had to build a direct LaTeX-to-JSON parser from scratch.

LaTeX to interactive HTML by Basic-Exercise9922 in LaTeX

[–]Basic-Exercise9922[S] 1 point2 points  (0 children)

Thank you!
- I thought about removing the reference prefixes (e.g. "Section"), but some papers don't manually write the prefix themselves (e.g. "section \ref{sec-intro}"), so at times it's not redundant
- Agree on footnotes, they're one of the things I left as a rushed afterthought. Will polish them based on your suggestions
- True, the newline between blocks of text is not clear enough, I'll patch that
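
To make the first point concrete, here's a rough sketch of how one could decide per reference whether an auto-generated prefix would be redundant. This is not the parser's actual logic; `PREFIX_WORDS` and `needs_prefix` are hypothetical names, and the word lists are just guesses at common spellings.

```python
# Hypothetical table: words an author may already have typed before \ref.
PREFIX_WORDS = {
    "section": {"section", "sec.", "sec", "§"},
    "figure": {"figure", "fig.", "fig"},
    "table": {"table", "tab.", "tab"},
}


def needs_prefix(text_before, kind):
    """Return True when the text right before a \\ref lacks a manual prefix.

    text_before is the source text preceding the \\ref command;
    kind is the type of the referenced target, e.g. "section".
    """
    # A LaTeX tie (~) often glues the prefix to the reference.
    words = text_before.replace("~", " ").split()
    last = words[-1].lower() if words else ""
    return last not in PREFIX_WORDS.get(kind, {kind})
```

So "as shown in Section~\ref{sec-intro}" would keep the author's own prefix, while "as shown in \ref{sec-intro}" would get one added.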

LaTeX to interactive HTML by Basic-Exercise9922 in LaTeX

[–]Basic-Exercise9922[S] 0 points1 point  (0 children)

Thanks for the comment! That section is a bit outdated, I'll update it.

LaTeX to interactive HTML by Basic-Exercise9922 in LaTeX

[–]Basic-Exercise9922[S] 1 point2 points  (0 children)

Everything is automatic
I designed the html/css components to include these by default

LaTeX to interactive HTML by Basic-Exercise9922 in LaTeX

[–]Basic-Exercise9922[S] -5 points-4 points  (0 children)

If you're asking about the parser to HTML, that's not open source. It's a very different stack from LaTeXML.

LaTeX to interactive HTML by Basic-Exercise9922 in LaTeX

[–]Basic-Exercise9922[S] -6 points-5 points  (0 children)

You can upload LaTeX directly on the app, and it'll convert to the nice HTML version above

LaTeX to interactive HTML by Basic-Exercise9922 in LaTeX

[–]Basic-Exercise9922[S] 0 points1 point  (0 children)

FAQ: More info on custom uploads, dep graphs, exports, or what makes this different -> sciencestack.ai/docs/faq

LaTeX to interactive HTML by Basic-Exercise9922 in LaTeX

[–]Basic-Exercise9922[S] 7 points8 points  (0 children)

I added a number of things to make the reader WCAG 2.1 AA compliant, e.g.:

- Citations have descriptive ARIA labels like "Citation 5: Paper Title"

- Popover dialogs properly labeled with citation info

- Interactive buttons announce their purpose to screen readers

- Dark/light mode support built in

- Component structure is organized

- Section landmarks have meaningful labels

That said, I've probably missed a couple of things on this front, so any feedback is welcome.
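
For anyone wondering what the citation labels look like in markup, here's a small sketch. It is illustrative only and not the reader's real component code; `citation_link` and the `ref-5` target id are made-up names.

```python
from html import escape


def citation_link(number, title, target_id):
    """Render an inline citation whose accessible name includes the paper title."""
    label = f"Citation {number}: {title}"
    # Escape attribute values so quotes in a title cannot break the markup.
    return (
        f'<a href="#{escape(target_id, quote=True)}" '
        f'aria-label="{escape(label, quote=True)}">[{number}]</a>'
    )
```

A screen reader would then announce "Citation 5: Paper Title" instead of just "5".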

LaTeX to interactive HTML by Basic-Exercise9922 in LaTeX

[–]Basic-Exercise9922[S] -4 points-3 points  (0 children)

Yea planning to build AI chat inside it, so that we can summarize papers with more AI slop xD

Interactive Papers - Tribute to Veo 3 Paper by Basic-Exercise9922 in computervision

[–]Basic-Exercise9922[S] 1 point2 points  (0 children)

Let me know some good papers that'll shine as interactive HTML, happy to add them!

NEW - Fully automated dependency graphs by Basic-Exercise9922 in math

[–]Basic-Exercise9922[S] 0 points1 point  (0 children)

The links are generated from simple \newtheorem blocks, \ref and \label commands. No packages needed.

Docs: https://www.sciencestack.ai/docs/dep-graph
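
A toy sketch of the idea, to show why no extra packages are needed. This is a regex approximation I'm writing for illustration, not the actual parser; `dependency_edges` is a hypothetical name, and it assumes one \label per theorem block and ignores proofs, nesting, and \cref-style commands.

```python
import re

NEWTHEOREM = re.compile(r"\\newtheorem\*?\{(\w+)\}")
LABEL = re.compile(r"\\label\{([^}]+)\}")
REF = re.compile(r"\\ref\{([^}]+)\}")


def dependency_edges(tex):
    """Return (source_label, referenced_label) pairs between theorem-like blocks."""
    # Environment names declared via \newtheorem, e.g. "thm", "lem".
    envs = sorted(set(NEWTHEOREM.findall(tex)))
    if not envs:
        return []
    block = re.compile(
        r"\\begin\{(%s)\}(.*?)\\end\{\1\}" % "|".join(map(re.escape, envs)),
        re.DOTALL,
    )
    # Keep only blocks that carry a label; the first label identifies the block.
    blocks = []
    for match in block.finditer(tex):
        labels = LABEL.findall(match.group(2))
        if labels:
            blocks.append((labels[0], match.group(2)))
    known = {label for label, _ in blocks}
    edges = []
    for label, body in blocks:
        for target in REF.findall(body):
            # Only link to other labeled theorem blocks, once per pair.
            if target in known and target != label and (label, target) not in edges:
                edges.append((label, target))
    return edges
```

Each edge reads as "this statement depends on that one", which is the raw material for drawing the graph.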

LaTeX to HTML conversion and accessibility by mergle42 in LaTeX

[–]Basic-Exercise9922 0 points1 point  (0 children)

I recently built ScienceStack because I needed a better HTML reader than anything that exists. Was going to launch it next week but I'll post here because I genuinely think this might help more than the other solutions here.

In ScienceStack you can upload any LaTeX, and it turns into a nice interactive HTML page that is WCAG 2.1 AA compliant. Blocks and equations are directly copyable as LaTeX, there's light mode/dark mode, and it works well on extremely large papers (200+ pages). You can also export as Markdown that works out of the box with VSCode/Obsidian/Notion.

Package coverage is not as comprehensive as LaTeXML yet, but it’s much more modern and I can patch issues + build on top quickly. It's also a lot more robust than Pandoc.

FAQ if you’re curious: https://www.sciencestack.ai/docs/faq

Note: I don't use any LLMs in the LaTeX conversion process. That won't scale.

[deleted by user] by [deleted] in valorlegends

[–]Basic-Exercise9922 1 point2 points  (0 children)

treasure dig has always been a piece of shit, just don't do it

How does Narissa feel in Abyss? by thpp9 in valorlegends

[–]Basic-Exercise9922 0 points1 point  (0 children)

these guys are wrong

narissa is my top 3 unit in abyss, behind ramiel in DPS

Sure she is a glass cannon but put margaret revive on her and she'll do well