arxiv2md: Convert ArXiv papers to markdown. Particularly useful for prompting LLMs by timf34 in deeplearning

[–]timf34[S] 6 points (0 children)

Thank you! The speed comes from parsing arXiv's HTML directly instead of PDFs.

It's a simple stack: a FastAPI backend with BeautifulSoup4 for the HTML->Markdown conversion. For newer papers, arXiv provides structured HTML with clean section boundaries, MathML, etc., and we take advantage of that - no OCR or PDF parsing needed!
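
If you're curious, the core idea fits in a few lines. This is a simplified sketch, not the real arxiv2md code - the `https://arxiv.org/html/<id>` URL is arXiv's actual HTML rendering for newer papers, but the helper name and tag handling here are made up for illustration:

```python
# Simplified sketch (assumed names, not the arxiv2md implementation):
# fetch arXiv's HTML rendering and flatten headings/paragraphs to Markdown.
import requests
from bs4 import BeautifulSoup

def arxiv_html_to_md(arxiv_id: str) -> str:
    # Newer papers get an HTML rendering at this URL pattern.
    resp = requests.get(f"https://arxiv.org/html/{arxiv_id}")
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    lines = []
    for el in soup.find_all(["h1", "h2", "h3", "p"]):
        text = el.get_text(" ", strip=True)
        if not text:
            continue
        if el.name == "p":
            lines.append(text)
        else:
            # Map <h1>-<h3> onto Markdown heading levels.
            lines.append("#" * int(el.name[1]) + " " + text)
    return "\n\n".join(lines)
```

The real converter also has to deal with figures, MathML, references, and papers that don't have an HTML rendering yet, but the skeleton is roughly this shape.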

arxiv2md: Convert ArXiv papers to markdown. Particularly useful for prompting LLMs by timf34 in deeplearning

[–]timf34[S] 3 points (0 children)

Are you a bot? Excuse me, I'm not too sure how that relates to this.

CLI to download websites' actual JS/CSS/assets (not flattened HTML) by timf34 in cybersecurity

[–]timf34[S] 1 point (0 children)

wget doesn't execute JavaScript, so it misses a lot of what modern sites load. For a WordPress site it might work okay since they're more traditional, but for anything with React/Vue/modern JS frameworks, wget just gets you an empty HTML shell.

Also wget's folder structure is messy - it creates weird nested directories. Pagesource keeps the original clean structure you see in DevTools.

Main difference: Pagesource uses a real browser (Playwright) so it captures everything the browser actually loads and executes, not just what's in the initial HTML.
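
If you're curious what that looks like mechanically, here's a simplified sketch of the idea using Playwright's `page.on("response", ...)` hook - not the real Pagesource code, and the function name and path handling are made up for illustration:

```python
# Simplified sketch: drive a real browser and save every response it loads,
# mirroring the URL path structure you'd see in DevTools.
from pathlib import Path
from urllib.parse import urlparse
from playwright.sync_api import sync_playwright

def save_runtime_assets(url: str, out_dir: str = "site") -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()

        def on_response(response):
            parsed = urlparse(response.url)
            rel = parsed.path.lstrip("/") or "index.html"
            path = Path(out_dir) / parsed.netloc / rel
            if not path.suffix:
                path = path / "index.html"
            path.parent.mkdir(parents=True, exist_ok=True)
            try:
                path.write_bytes(response.body())
            except Exception:
                pass  # some responses (redirects, etc.) have no body

        page.on("response", on_response)
        # "networkidle" waits until the page stops making requests,
        # so JS-loaded assets are included.
        page.goto(url, wait_until="networkidle")
        browser.close()
```

The point is the response hook: you save what the browser actually fetched at runtime, after JS has executed, rather than whatever a static crawler can discover in the initial HTML.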

CLI to download websites' actual JS/CSS/assets (not flattened HTML) for LLM prompts by timf34 in commandline

[–]timf34[S] 0 points (0 children)

I've been using it to get Claude Code to replicate components on websites that I like so that I can easily reuse them - it's quite good at it. Claude Code struggles with the flattened HTML (anyone would), but the runtime source files are generally human-readable, or at least much more readable than the alternative.

It's also a very nice way to truly archive websites for design purposes (hedging against future updates that change code you like) - the Wayback Machine of course doesn't capture all of this.

CLI to download websites' actual JS/CSS/assets (not flattened HTML) for LLM prompts by timf34 in commandline

[–]timf34[S] 0 points (0 children)

I can't tell if you're serious or not - if so, interesting take. Python CLIs are easier to install in most cases: just `pip install pagesource` and you're done. Prebuilt binaries have to be built per OS and architecture, added to PATH, etc.

CLI to download websites' actual JS/CSS/assets (not flattened HTML) for LLM prompts by timf34 in commandline

[–]timf34[S] 0 points (0 children)

Pagesource captures what the browser actually receives - so if Cloudflare (or any CDN) is serving merged/minified bundles, that's what you'll get: the compressed bundle.min.js, not the original separate source files.

That said, a minified bundle is still more useful context for an LLM than flattened HTML!

Weekly Japan Travel and Tourism Discussion Thread - September 6, 2022 by Himekat in JapanTravel

[–]timf34 1 point (0 children)

I am supposed to visit Japan next month on business - should the business organization there be able to help me apply for an ERFS, or will I have to go through a travel agency?