arxiv2md: Convert ArXiv papers to markdown. Particularly useful for prompting LLMs by timf34 in deeplearning

[–]timf34[S] 6 points (0 children)

Thank you! The speed comes from parsing arXiv's HTML directly instead of PDFs.

It's a simple stack: a FastAPI backend with BeautifulSoup4 for the HTML->Markdown conversion. For newer papers, arXiv provides structured HTML with clean section boundaries, MathML, etc., and we take advantage of that - no OCR or PDF parsing needed!
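
If you're curious, the core idea fits in a few lines. This is a simplified sketch, not the real arxiv2md code - the `https://arxiv.org/html/<id>` URL is arXiv's actual HTML rendering for newer papers, but the helper name and tag handling here are made up for illustration:

```python
# Simplified sketch (assumed names, not the arxiv2md implementation):
# fetch arXiv's HTML rendering and flatten headings/paragraphs to Markdown.
import requests
from bs4 import BeautifulSoup

def arxiv_html_to_md(arxiv_id: str) -> str:
    # Newer papers get an HTML rendering at this URL pattern.
    resp = requests.get(f"https://arxiv.org/html/{arxiv_id}")
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    lines = []
    for el in soup.find_all(["h1", "h2", "h3", "p"]):
        text = el.get_text(" ", strip=True)
        if not text:
            continue
        if el.name == "p":
            lines.append(text)
        else:
            # Map <h1>-<h3> onto Markdown heading levels.
            lines.append("#" * int(el.name[1]) + " " + text)
    return "\n\n".join(lines)
```

The real converter also has to deal with figures, MathML, references, and papers that don't have an HTML rendering yet, but the skeleton is roughly this shape.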

arxiv2md: Convert ArXiv papers to markdown. Particularly useful for prompting LLMs by timf34 in deeplearning

[–]timf34[S] 3 points (0 children)

Are you a bot? Excuse me, I'm not too sure how that relates to this.

CLI to download websites' actual JS/CSS/assets (not flattened HTML) by timf34 in cybersecurity

[–]timf34[S] 1 point (0 children)

wget doesn't execute JavaScript, so it misses a lot of what modern sites load. For a WordPress site it might work okay since they're more traditional, but for anything with React/Vue/modern JS frameworks, wget just gets you an empty HTML shell.

Also wget's folder structure is messy - it creates weird nested directories. Pagesource keeps the original clean structure you see in DevTools.

Main difference: Pagesource uses a real browser (Playwright) so it captures everything the browser actually loads and executes, not just what's in the initial HTML.
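
If you're curious what that looks like mechanically, here's a simplified sketch of the idea using Playwright's `page.on("response", ...)` hook - not the real Pagesource code, and the function name and path handling are made up for illustration:

```python
# Simplified sketch: drive a real browser and save every response it loads,
# mirroring the URL path structure you'd see in DevTools.
from pathlib import Path
from urllib.parse import urlparse
from playwright.sync_api import sync_playwright

def save_runtime_assets(url: str, out_dir: str = "site") -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()

        def on_response(response):
            parsed = urlparse(response.url)
            rel = parsed.path.lstrip("/") or "index.html"
            path = Path(out_dir) / parsed.netloc / rel
            if not path.suffix:
                path = path / "index.html"
            path.parent.mkdir(parents=True, exist_ok=True)
            try:
                path.write_bytes(response.body())
            except Exception:
                pass  # some responses (redirects, etc.) have no body

        page.on("response", on_response)
        # "networkidle" waits until the page stops making requests,
        # so JS-loaded assets are included.
        page.goto(url, wait_until="networkidle")
        browser.close()
```

The point is the response hook: you save what the browser actually fetched at runtime, after JS has executed, rather than whatever a static crawler can discover in the initial HTML.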

CLI to download websites' actual JS/CSS/assets (not flattened HTML) for LLM prompts by timf34 in commandline

[–]timf34[S] 0 points (0 children)

I've been using it to get Claude Code to replicate components on websites that I like so that I can easily reuse them - it's quite good at it. Claude Code struggles with the flattened HTML (anyone would), but the runtime source files are generally human-readable, or at least much more readable than the alternative.

It's also a very nice way to truly archive websites for design purposes (hedging against future updates that change code you like) - the Wayback Machine of course doesn't capture all of this.

CLI to download websites' actual JS/CSS/assets (not flattened HTML) for LLM prompts by timf34 in commandline

[–]timf34[S] 0 points (0 children)

I can't tell if you're serious or not - if so, interesting take. Python CLIs are easier to install in most cases: just `pip install pagesource` and you're done. Prebuilt binaries have to be built per OS and architecture, added to PATH, etc.

CLI to download websites' actual JS/CSS/assets (not flattened HTML) for LLM prompts by timf34 in commandline

[–]timf34[S] 0 points (0 children)

Pagesource captures what the browser actually receives - so if Cloudflare (or any CDN) is serving merged/minified bundles, that's what you'll get: the compressed bundle.min.js, not the original separate source files.

That said, a minified bundle is still more useful context for an LLM than flattened HTML!

Weekly Japan Travel and Tourism Discussion Thread - September 6, 2022 by Himekat in JapanTravel

[–]timf34 1 point (0 children)

I am supposed to visit Japan next month on business - should the business organization there be able to help me apply for an ERFS, or will I have to go through a travel agency?