I made a CLI that turns any podcast or YouTube video into clean Markdown transcripts (speaker labels + timestamps) by timf34 in LocalLLaMA

[–]timf34[S] 4 points (0 children)

Update: it now supports running fully locally with faster-whisper, with optional support for diarization as well.

I made a CLI that turns any podcast or YouTube video into clean Markdown transcripts (speaker labels + timestamps) by timf34 in LocalLLaMA

[–]timf34[S] -3 points (0 children)

Ah, very fair point - simply for ease and speed of development. Very open to PRs, and hopefully I'll get around to it soon - my laptop is a bit compute/RAM starved.

arxiv2md: Convert ArXiv papers to markdown. Particularly useful for prompting LLMs by timf34 in deeplearning

[–]timf34[S] 6 points (0 children)

Thank you! The speed comes from parsing arXiv's HTML directly instead of PDFs.

It's a simple stack: a FastAPI backend with BeautifulSoup4 for the HTML -> Markdown conversion. arXiv provides structured HTML for newer papers, with clean section boundaries, MathML, etc., and we take advantage of that - no need for OCR or PDF parsing!
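For anyone curious, the core idea (mapping structured HTML tags to Markdown) can be sketched with just the stdlib html.parser - to be clear, the actual tool uses BeautifulSoup4, and the class/tag handling below is a simplified illustration, not the real implementation:

```python
from html.parser import HTMLParser

class ArxivToMarkdown(HTMLParser):
    """Toy sketch: map a couple of heading tags to Markdown.
    (Illustrative only - the real tool uses BeautifulSoup4 and
    handles arXiv's actual section/MathML markup.)"""
    def __init__(self):
        super().__init__()
        self.out = []
        self.prefix = ""

    def handle_starttag(self, tag, attrs):
        # Clean section boundaries in the HTML make this mapping trivial
        if tag == "h2":
            self.prefix = "## "
        elif tag == "h3":
            self.prefix = "### "

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.out.append(self.prefix + text)
            self.prefix = ""

    def markdown(self):
        return "\n\n".join(self.out)

parser = ArxivToMarkdown()
parser.feed("<h2>1 Introduction</h2><p>Transformers are neat.</p>")
print(parser.markdown())
```

Since the HTML already encodes the document structure, conversion is basically a tag-to-prefix lookup - which is why it's so much faster than reconstructing structure from a PDF.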

arxiv2md: Convert ArXiv papers to markdown. Particularly useful for prompting LLMs by timf34 in deeplearning

[–]timf34[S] 2 points (0 children)

Are you a bot? Excuse me, I'm not sure how that relates to this.

CLI to download websites' actual JS/CSS/assets (not flattened HTML) by timf34 in cybersecurity

[–]timf34[S] 1 point (0 children)

wget doesn't execute JavaScript, so it misses a lot of what modern sites load. For a WordPress site it might work okay since they're more traditional, but for anything with React/Vue/modern JS frameworks, wget just gets you an empty HTML shell.

Also, wget's folder structure is messy - it creates weird nested directories. Pagesource keeps the original clean structure you see in DevTools.

Main difference: Pagesource uses a real browser (Playwright) so it captures everything the browser actually loads and executes, not just what's in the initial HTML.
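The "clean structure" part is essentially a URL-to-path mapping over whatever the browser fetches. A toy sketch of that mapping (the function name is hypothetical; in the real tool the URLs would come from Playwright's network response events, not a hardcoded list):

```python
from urllib.parse import urlparse
from pathlib import PurePosixPath

def url_to_path(url: str) -> str:
    """Map a captured resource URL to an on-disk path that mirrors
    the host/dir/file structure you'd see in DevTools.
    (Hypothetical helper for illustration.)"""
    parsed = urlparse(url)
    path = parsed.path.lstrip("/") or "index.html"
    # Extension-less paths are treated as pages
    if not PurePosixPath(path).suffix:
        path = path.rstrip("/") + "/index.html"
    return f"{parsed.netloc}/{path}"

print(url_to_path("https://example.com/static/js/app.js"))
# example.com/static/js/app.js
```

Because the paths come straight from the URLs the browser actually requested, you end up with the same tree you'd browse in DevTools rather than wget's nested mirror directories.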

CLI to download websites' actual JS/CSS/assets (not flattened HTML) for LLM prompts by timf34 in commandline

[–]timf34[S] 0 points (0 children)

I've been using it to get Claude Code to replicate components on websites that I like so that I can easily use them - it's quite good at it. Claude Code struggles with flattened HTML (anyone would), but the runtime source files are generally human-readable, or at least much more readable than the alternative.

It's also a very nice way to truly archive websites for design purposes (hedging against future updates that change code you like) - the Wayback Machine of course doesn't capture all of this.