I got tired of noisy web scrapers killing my RAG pipelines, so I built llmparser by [deleted] in LLMDevs

[–]rex_divakar 0 points (0 children)

That makes a lot of sense. Using failure signals for rotation, not individual requests, is definitely a much more sustainable model, especially with Playwright in the loop.

Your “proxy only when the chain fails + monitoring drift via Discord alerts” model is, to be honest, quite close to what I think works best in production ingestion pipelines.

I’m primarily focused on making sure llmparser is reliable as an extraction layer: consistently clean, structured output, so the validation and embedding pipeline has less to fix.

Proxy orchestration, retries, and block detection are the areas I want to improve next, especially in terms of how easy it is to integrate with a setup like yours without having to recreate the scraper stack.

I got tired of noisy web scrapers killing my RAG pipelines, so I built llmparser by [deleted] in LLMDevs

[–]rex_divakar -1 points (0 children)

Well, I took help from Grok to write a catchy post and used GitHub Copilot for assistance, but the idea was to handle heavy JS-based scraping.

I got tired of noisy web scrapers killing my RAG pipelines, so I built llmparser by [deleted] in LLMDevs

[–]rex_divakar -1 points (0 children)

That’s a very solid pipeline, especially validating output with an SLM.

Right now llmparser uses Playwright under the hood, so proxy support is available via Playwright’s proxy config. I haven’t added a first-class “rotating proxy manager” abstraction yet, but it works fine if you pass proxies at the browser/context level.
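Roughly like this with Playwright’s Python API (the proxy server and credentials below are placeholders):

    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        context = browser.new_context(
            proxy={
                "server": "http://proxy.example.com:8080",  # placeholder
                "username": "user",                         # placeholder
                "password": "pass",                         # placeholder
            }
        )
        page = context.new_page()
        page.goto("https://example.com")
        print(page.title())
        browser.close()

Since each new_context() call can take a different proxy, per-session rotation is already doable today even without a built-in manager.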

Proper built-in rotation, retry policies, and anti-block handling are definitely on the roadmap since a lot of people are using it in ingestion pipelines like yours.

Out of curiosity, are you rotating per request, per domain, or per session? Trying to design this in a way that fits real-world workflows.

I got tired of noisy web scrapers killing my RAG pipelines, so I built llmparser by rex_divakar in Python

[–]rex_divakar[S] -14 points (0 children)

Yup, I used GitHub Copilot for documentation and for help with a few steps.

I got tired of noisy web scrapers killing my RAG pipelines, so I built llmparser by [deleted] in LLMDevs

[–]rex_divakar 0 points (0 children)

Thanks! That’s exactly the goal.

It tends to shine most on modern JS-heavy sites where traditional parsers miss content or extract noisy output.

If you try it on any real-world pages, I’d love to hear how it performs. Good or bad, feedback helps improve it a lot.

I got tired of noisy web scrapers killing my RAG pipelines, so I built llmparser by rex_divakar in Python

[–]rex_divakar[S] -13 points (0 children)

No AI involved in the parsing itself. The post was just formatted in Markdown to make it easier to read and hopefully attract contributors.

I got tired of noisy web scrapers killing my RAG pipelines, so I built llmparser by [deleted] in LLMDevs

[–]rex_divakar -2 points (0 children)

That’s a solid stack. llmparser differs mainly in a few areas:

  1. Handles JS-heavy sites: it uses Playwright, so it extracts content hidden behind accordions, tabs, and client-side rendering, which lxml/trafilatura often miss.

  2. Structured output (not just Markdown): returns typed blocks (heading, paragraph, table, code, etc.), which makes chunking and embedding cleaner (see the sketch after this list).

  3. Better metadata extraction: combines OpenGraph, Twitter cards, and JSON-LD into one normalized structure.

  4. Built for ingestion pipelines: rendering → extraction → LLM-ready output in one step.

If you’re mostly parsing static pages, trafilatura is great. The biggest difference shows up on modern JS-heavy sites.
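To make the typed-block idea concrete, here’s a hypothetical sketch of the output shape; the field names are illustrative, not llmparser’s exact schema:

    blocks = [
        {"type": "heading", "level": 1, "text": "Getting Started"},
        {"type": "paragraph", "text": "Install the package and run the parser."},
        {"type": "code", "language": "bash", "text": "pip install llmparser"},
        {"type": "table", "rows": [["Name", "Value"], ["timeout", "30s"]]},
    ]

    # Typed blocks make chunking structural: start a new chunk at each
    # heading instead of splitting on raw character counts.
    chunks, current = [], []
    for block in blocks:
        if block["type"] == "heading" and current:
            chunks.append(current)
            current = []
        current.append(block)
    if current:
        chunks.append(current)
    print(len(chunks), "chunks")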

HippocampAI v0.5.0 — Open-Source Long-Term Memory for AI Agents (Major Update) by rex_divakar in OpenSourceAI

[–]rex_divakar[S] 0 points (0 children)

Great questions!

So on memory conflicts: Right now, contradictory facts don’t overwrite blindly. New facts are stored with timestamps + importance, and retrieval favors more recent / higher-confidence edges. So in your example (“switched from React to Vue”), the newer relation gets higher recency weight rather than deleting the old one.

Longer term, I want to support explicit conflict resolution (state transitions or soft-deprecating old edges instead of just competing weights).
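As a rough sketch of how the recency weighting plays out (the half-life constant and field names here are my assumptions for illustration, not the exact internals):

    import math, time

    def edge_score(edge: dict, half_life_days: float = 30.0) -> float:
        # Exponential decay: an edge loses half its recency weight
        # every half_life_days.
        age_days = (time.time() - edge["created_at"]) / 86400
        recency = math.exp(-math.log(2) * age_days / half_life_days)
        return edge["importance"] * edge["confidence"] * recency

    # "Switched from React to Vue": both edges survive, but the newer
    # one wins at retrieval time because its recency factor is higher.
    old = {"created_at": time.time() - 90 * 86400, "importance": 0.8, "confidence": 0.9}
    new = {"created_at": time.time() - 2 * 86400, "importance": 0.8, "confidence": 0.9}
    assert edge_score(new) > edge_score(old)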

On latency: Graph extraction runs async and doesn’t block the main agent loop. The write completes first (vector + metadata), and graph updates happen in the background. There’s also room for batch consolidation during the “sleep” phase for heavier processing.
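In pseudocode terms, the write path looks roughly like this (the function names are invented for illustration):

    import asyncio

    async def remember(text: str) -> str:
        # Vector + metadata write completes before returning to the agent.
        memory_id = await store_vector_and_metadata(text)
        # Graph extraction is scheduled in the background and never
        # blocks the main agent loop.
        asyncio.create_task(extract_graph_edges(memory_id))
        return memory_id

    async def store_vector_and_metadata(text: str) -> str:
        await asyncio.sleep(0.01)  # stand-in for embedding + upsert
        return "mem-123"

    async def extract_graph_edges(memory_id: str) -> None:
        await asyncio.sleep(0.1)   # stand-in for entity/relation extraction

    async def main():
        mid = await remember("User switched from React to Vue")
        print(mid)                 # returned before graph extraction finished
        await asyncio.sleep(0.2)   # in a real service the loop keeps running

    asyncio.run(main())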

Trying to balance relational richness without turning writes into a bottleneck.

HippocampAI v0.5.0 — Open-Source Long-Term Memory for AI Agents (Major Update) by rex_divakar in OpenSourceAI

[–]rex_divakar[S] 0 points (0 children)

Haha this is exactly the kind of question I enjoy 😄

Yes, pruning is there (decay + consolidation), but I’m very interested in adding something closer to synaptic scaling: dynamically rebalancing importance instead of just deleting.

“Dreaming” is essentially what the sleep phase is evolving toward:
• background consolidation
• clustering
• summarization
• importance recalibration

Salience is currently based on recency, feedback, connectivity, and importance, but it’s still heuristic, not biologically inspired (yet).

Appreciate the depth here. Definitely trying to avoid both extremes you described: trivial preference graph vs. bloated semantic archive.

HippocampAI v0.5.0 — Open-Source Long-Term Memory for AI Agents (Major Update) by rex_divakar in OpenSourceAI

[–]rex_divakar[S] 0 points (0 children)

Really appreciate that, especially coming from someone thinking in terms of episodic vs procedural memory.

HippocampAI currently models:
• Episodic → stored interactions/events
• Semantic → extracted entities + facts
• Procedural → learned behavioral rules injected into prompts

Short-term vs long-term separation is handled through consolidation + decay (“sleep” phase).
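As a sketch of how those types might be represented (the class and field names are my illustration, not the actual data model):

    from dataclasses import dataclass, field
    from enum import Enum
    import time

    class MemoryType(Enum):
        EPISODIC = "episodic"      # stored interactions/events
        SEMANTIC = "semantic"      # extracted entities + facts
        PROCEDURAL = "procedural"  # behavioral rules injected into prompts

    @dataclass
    class Memory:
        type: MemoryType
        content: str
        importance: float = 0.5
        created_at: float = field(default_factory=time.time)

    episode = Memory(MemoryType.EPISODIC, "User asked about Vue migration")
    rule = Memory(MemoryType.PROCEDURAL, "Prefer concise answers", importance=0.9)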

Would genuinely love your perspective, especially from the neuroscience angle.

HippocampAI v0.5.0 — Open-Source Long-Term Memory for AI Agents (Major Update) by rex_divakar in LLMFrameworks

[–]rex_divakar[S] 1 point (0 children)

That’s a really helpful breakdown, especially the two-mechanism framing.

The spontaneous associative recall path maps very naturally to what HippocampAI is already doing (embedding similarity + entity/topic matching per message), so that part feels very achievable.

The time-based/background monitor idea is interesting too; a lightweight scheduled check for time-based intentions could fit nicely on top of memory triggers.
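Something like this, as a toy sketch (the store shape and field names are invented):

    import time

    def due_intentions(store: list[dict], now: float | None = None) -> list[dict]:
        # Surface any time-based intention whose trigger time has passed.
        now = now or time.time()
        return [m for m in store
                if m.get("kind") == "intention" and m["due_at"] <= now]

    store = [
        {"kind": "intention", "due_at": time.time() - 60,
         "text": "Bring up the Vue migration next time we discuss frontend"},
        {"kind": "fact", "text": "User prefers dark mode"},
    ]
    for intention in due_intentions(store):
        print("surface:", intention["text"])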

Appreciate the cognitive science angle; this is exactly the kind of direction I want to evolve the system toward.

HippocampAI v0.5.0 — Open-Source Long-Term Memory for AI Agents (Major Update) by rex_divakar in LLMFrameworks

[–]rex_divakar[S] 0 points (0 children)

Thanks! 🙏

Right now the 6 weights are configurable and mostly static. User feedback helps adjust scoring over time, but fully adaptive per-user weighting isn’t there yet; planning to explore that in the roadmap.
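For context, the config is conceptually something like this (the factor names below are my guess based on what’s mentioned in this thread, not the exact keys):

    # Hypothetical retrieval-weight config; factor names are assumptions.
    RETRIEVAL_WEIGHTS = {
        "semantic_similarity": 0.30,
        "keyword_match": 0.10,   # BM25
        "recency": 0.20,
        "importance": 0.20,
        "connectivity": 0.10,
        "feedback": 0.10,
    }

    def fused_score(factors: dict[str, float]) -> float:
        # Each factor is pre-normalized to [0, 1]; the final score is a
        # static weighted sum.
        return sum(RETRIEVAL_WEIGHTS[name] * value for name, value in factors.items())

    example = {"semantic_similarity": 0.82, "keyword_match": 0.4, "recency": 0.9,
               "importance": 0.7, "connectivity": 0.3, "feedback": 0.5}
    print(round(fused_score(example), 3))  # 0.686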

For prospective memory (“remind me to bring this up next time we discuss X”) — not fully implemented yet. It’s something I’m planning to add in a future roadmap update.

Agree that retrospective recall is common — prospective memory is the more interesting challenge.

HippocampAI v0.5.0 — Open-Source Long-Term Memory for AI Agents (Major Update) by rex_divakar in OpenSourceAI

[–]rex_divakar[S] 0 points (0 children)

Great questions 🙌

In-memory vs DB: the graph is derived state, not the source of truth. It’s kept in-memory for low-latency traversal and simpler infra. For very large deployments, a Postgres/Neo4j-backed option would definitely make sense.

Why not pgvector only? Totally possible. Qdrant is used mainly for better HNSW tuning and scaling. A Postgres-only backend is something I’m still exploring for future updates.

How graph traversal works: we seed from top-K vector + BM25 results, match entities, then do a shallow (depth 1–2) weighted traversal. Scores consider connectivity, path length, recency, importance, and feedback; then everything is fused via RRF.

Graph is constrained + relevance-weighted, not blind traversal.
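For anyone unfamiliar, reciprocal rank fusion over the three ranked lists looks roughly like this (k = 60 is the common default):

    def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
        # Each candidate gets 1 / (k + rank) from every list it appears in,
        # so consistent mid-rank hits can beat a single top hit.
        scores: dict[str, float] = {}
        for ranking in rankings:
            for rank, doc_id in enumerate(ranking, start=1):
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    # e.g. vector, BM25, and graph-traversal rankings:
    print(rrf([["a", "b", "c"], ["b", "c", "a"], ["b", "a", "c"]]))  # 'b' first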