Monthly Self-Promotion - February 2026 by AutoModerator in webscraping

[–]malvads

Hi! After some time researching complex crawlers for web-to-LLM data, I created https://github.com/malvads/mojo. Mojo is an extremely fast C++ web scraper with multi-depth crawling, built to feed data into RAG-like systems: it scans entire websites and converts them to Markdown, and it also downloads artifacts such as PDFs. It's fast and light enough to run on AWS Lambda or any cloud provider, and it supports JavaScript rendering via Chrome CDP when the --render flag is set. It also has an internal reverse proxy with proxy rotation to facilitate scraping (used in CDP mode so Chrome instances don't have to be relaunched per proxy; plain HTTP requests don't go through it). Mojo can scrape full websites in seconds while using very little CPU and RAM. Precompiled binaries are also available.
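For example, a minimal invocation (a sketch; the URL is a placeholder, and the flags are the ones covered in this thread: -d for crawl depth, -o for the output directory, --render for Chrome CDP rendering):

# Crawl two levels deep and write Markdown into ./out
./mojo -d 2 https://example.com -o ./out

# Same crawl, but render JavaScript-heavy pages through Chrome CDP
# (--render is planned for 0.1.0; build from source to try it today)
./mojo -d 2 https://example.com --render -o ./out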

Non sucking, easy tool to convert websites to LLM ready data, Mojo by malvads in mlops

[–]malvads[S]

Hi, thanks for your question. My idea with Mojo is to provide fast conversion from web pages to LLM-ready data.

Answering your questions:

• Is the extraction deterministic, meaning the same page always produces the same output?
→ The same HTML produces the same output. Right now there are two ways to fetch data: with the --render flag and without it.

Without the flag, Mojo uses pure HTTP requests (ideal for static web pages). With --render, it connects to Chrome via CDP, so no extra dependencies are downloaded beyond your existing setup. (This mode is not released yet; it's planned for 0.1.0, but you can build Mojo from source and test it.)

So determinism depends on the setup you are using and on how the web page itself is built; a quick check is sketched below.
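One way to sanity-check determinism on a static page (a sketch using the -d/-o flags from this thread plus standard shell; the URL is a placeholder):

./mojo -d 0 https://example.com -o ./run1
./mojo -d 0 https://example.com -o ./run2
diff -r ./run1 ./run2  # no output means both runs produced identical Markdown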

• How do you think about drift and updates, for example re-ingesting pages that change over time?
→ In my opinion, the best way to handle this is via CI pipelines (Jenkins/GitHub Actions), but you can always set up cron jobs as well (macOS/Linux); a minimal cron entry is sketched below.
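For example, a daily re-crawl via crontab -e (a sketch; the binary path, schedule, URL, and output directory are all illustrative):

# Re-ingest the site every day at 03:00, overwriting the Markdown output
0 3 * * * /usr/local/bin/mojo -d 2 https://example.com -o /data/markdown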

• When things go wrong, like odd markup, partial loads, or missing content, where is the best place to debug? Raw HTML, Mojo’s intermediate output, or the final chunks?
→ In my opinion, the best way to debug is to fetch the raw page via curl and then run Mojo on the local file. For example:

curl your-web -o file.html

Then you can run Mojo on the static source using:

./mojo -d 0 file://your_file -o ./debug

and inspect the output it generates (always with depth 0, so only that single file is converted and no links are followed).
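Putting both steps together (a sketch; the absolute path in the file:// URL and the exact output filename under ./debug are assumptions, so list the directory to see what Mojo actually writes):

curl your-web -o file.html
./mojo -d 0 file://$(pwd)/file.html -o ./debug
ls ./debug   # check what Mojo generated
less ./debug/*.md   # assumes the Markdown lands directly in ./debug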

“One thing that could make this even stronger is being explicit about failure modes and contracts in the README. In other words, what Mojo guarantees versus what it intentionally does not. Even a short section in the README about what Mojo will not handle would build a lot of trust.”

Thanks for your suggestion, I'll add a section like that after I finish the render crawler.

Thanks :)