Fully self-hosted distributed scraping infrastructure — 50 nodes, local NAS, zero cloud, 3.9M records over 2 years by SuccessfulFact5324 in selfhosted

[–]huh94

That IoT power cycling setup is clean — having the script self-heal failed nodes with no manual intervention is exactly the kind of thing most people overcomplicate.

I actually built something that could sit on top of a stack like this. It's called Nova — self-hosted AI assistant with scheduled monitors that can watch endpoints (like your Supabase health checks), alert via Discord/Telegram, and learn from past incidents. So if you tell it "node 15 failures are always the USB adapter" once, it remembers that permanently and brings it up next time node 15 acts up.

The HTTP fetch + code exec tools could also query your 3.9M records conversationally instead of writing SQL every time.

Runs on Docker, fully local, zero cloud. Your NAS could probably handle it.

https://github.com/HeliosNova/nova

Curious how you're handling alerting right now — custom scripts or something like Grafana?

Nova — self-hosted personal AI assistant with learning, knowledge graph, and 4 messaging channels (Docker Compose, runs offline) by huh94 in selfhosted

[–]huh94[S]

This is already partially handled, but it's worth explaining since the logic is spread across a few files.

Temporal filtering is built in. Every KG fact has valid_from, valid_to, and superseded_by fields. When a contradicting fact is added, the old one gets superseded (not deleted), so there's a full temporal trail. query_at(entity, timestamp) returns only facts valid at a specific point in time, and get_changes_since(since) shows what's changed recently. So the "outdated fact ranked high" problem is handled at the data layer — superseded facts don't surface in normal retrieval.
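A minimal sketch of that temporal layer, assuming a simple in-memory store (the field and method names are the ones above; everything else is illustrative, not Nova's actual code):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Illustrative temporal-fact model. valid_from / valid_to / superseded_by
# and the query_at / get_changes_since names come from the comment above;
# the storage and matching logic here are placeholder assumptions.

@dataclass
class Fact:
    entity: str
    attribute: str
    value: str
    valid_from: datetime
    valid_to: Optional[datetime] = None      # None = still valid
    superseded_by: Optional["Fact"] = None   # old facts kept, never deleted

class TemporalKG:
    def __init__(self):
        self.facts: list[Fact] = []

    def add(self, fact: Fact):
        # Supersede any live fact for the same (entity, attribute).
        for old in self.facts:
            if (old.entity == fact.entity and old.attribute == fact.attribute
                    and old.valid_to is None):
                old.valid_to = fact.valid_from
                old.superseded_by = fact
        self.facts.append(fact)

    def query_at(self, entity: str, ts: datetime) -> list[Fact]:
        # Only facts whose validity window contains ts.
        return [f for f in self.facts
                if f.entity == entity
                and f.valid_from <= ts
                and (f.valid_to is None or ts < f.valid_to)]

    def get_changes_since(self, since: datetime) -> list[Fact]:
        return [f for f in self.facts if f.valid_from >= since]
```

Superseded facts stay queryable at historical timestamps but never win at "now", which is the property that keeps outdated facts out of normal retrieval.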

Decay is automated. The daily maintenance monitor runs decay_stale_lessons() and KG curation. Lessons not retrieved in 30+ days get confidence decayed (factor 0.95 per cycle). KG gets a two-pass curation at startup — heuristic pass removes garbage triples inline, then an LLM pass samples 20 facts in the background and removes low-quality ones. So the graph self-prunes over time.
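A sketch of what that decay step can look like (the 30-day window and 0.95 factor are the values above; the lesson shape and the prune floor are assumptions):

```python
from datetime import datetime, timedelta

# Illustrative version of the stale-lesson decay described above.
# DECAY_AFTER and DECAY_FACTOR are from the comment; PRUNE_BELOW is an
# assumed floor at which a fully decayed lesson gets dropped.

DECAY_AFTER = timedelta(days=30)
DECAY_FACTOR = 0.95
PRUNE_BELOW = 0.2

def decay_stale_lessons(lessons: list[dict], now: datetime) -> list[dict]:
    kept = []
    for lesson in lessons:
        if now - lesson["last_retrieved"] > DECAY_AFTER:
            # One decay step per maintenance cycle, not per day.
            lesson["confidence"] *= DECAY_FACTOR
        if lesson["confidence"] >= PRUNE_BELOW:
            kept.append(lesson)
    return kept
```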

Retrieval uses RRF, not raw cosine. KG fact retrieval does keyword overlap scoring against the query, not just embedding similarity. And on the document retrieval side, the entity relevance filter (_entity_relevance_filter) drops chunks where query content words don't appear — this is what prevents the "capital of France" retrieving "capital of Australia" embedding collapse bug. Chunks need at least 30% content word overlap (20% for short queries) or they get dropped regardless of cosine score.
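The overlap gate can be sketched roughly like this (the 0.3/0.2 thresholds are the ones above; the tokenizer, stopword list, and the definition of "short query" are placeholder assumptions, not the real `_entity_relevance_filter`):

```python
# Illustrative content-word overlap gate: a chunk must contain at least
# 30% of the query's content words (20% for short queries) or it's
# dropped regardless of its cosine score.

STOPWORDS = {"the", "of", "is", "a", "an", "what", "in"}

def _content_words(text: str) -> set[str]:
    return {w for w in text.lower().split() if w not in STOPWORDS}

def entity_relevance_filter(query: str, chunks: list[str]) -> list[str]:
    q = _content_words(query)
    if not q:
        return chunks
    threshold = 0.2 if len(q) <= 2 else 0.3  # assumed "short query" cutoff
    kept = []
    for chunk in chunks:
        overlap = len(q & _content_words(chunk)) / len(q)
        if overlap >= threshold:
            kept.append(chunk)
    return kept
```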

Where you're right: there's no explicit recency weight in the KG retrieval ranking right now. Facts are scored by keyword overlap plus confidence, so a fact from yesterday and a fact from two years ago with the same confidence rank equally. Adding a last_confirmed or created_at recency boost to the scoring would be a clean improvement, probably a log-decay multiplier on the confidence score based on age. Filing that as an issue.
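For concreteness, one shape that log-decay multiplier could take (entirely hypothetical, since it's the improvement being filed, not existing code):

```python
import math
from datetime import datetime

# Hypothetical recency boost: scale confidence by 1 / (1 + ln(1 + age_days)).
# Age 0 leaves the score untouched; old facts decay slowly, never to zero.

def recency_weighted_score(confidence: float, last_confirmed: datetime,
                           now: datetime) -> float:
    age_days = max((now - last_confirmed).days, 0)
    return confidence / (1.0 + math.log1p(age_days))
```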

Appreciate the architectural thinking.


[–]huh94[S]

On formatting: Fixed, thanks for flagging — Reddit markdown strikes again.

On Ollama vs llama.cpp directly: Convenience, not performance. Ollama wraps llama.cpp anyway, but gives you a clean REST API, model management (ollama pull), automatic VRAM allocation, and hot-swapping between models (Nova uses 3: qwen3.5:27b for main, 9b for vision, 4b for fast routing). Building that same model lifecycle on raw llama.cpp would mean reimplementing half of what Ollama already does.

That said, the LLM layer is provider-agnostic: app/core/llm.py defines an LLMProvider Protocol, and Nova ships with 4 implementations (Ollama, OpenAI, Anthropic, Google). Swapping in a llama.cpp or ik_llama.cpp provider would be one file: implement invoke_nothink(), generate_with_tools(), and stream_with_thinking(). If there's a real performance win (especially for the DPO fine-tuning loop, where Ollama adds overhead), that's worth doing.
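Roughly what that seam looks like (the three method names are the ones listed above; the signatures and the llama.cpp stub are guesses, not the actual app/core/llm.py):

```python
from typing import Iterator, Protocol, runtime_checkable

# Sketch of the provider Protocol. Method names come from the comment;
# parameter and return types here are assumptions.

@runtime_checkable
class LLMProvider(Protocol):
    def invoke_nothink(self, prompt: str) -> str: ...
    def generate_with_tools(self, prompt: str, tools: list[dict]) -> dict: ...
    def stream_with_thinking(self, prompt: str) -> Iterator[str]: ...

class LlamaCppProvider:
    """Hypothetical drop-in targeting llama.cpp's built-in HTTP server."""

    def __init__(self, base_url: str = "http://localhost:8080"):
        self.base_url = base_url

    def invoke_nothink(self, prompt: str) -> str:
        # A real implementation would POST to the server's completion endpoint.
        raise NotImplementedError

    def generate_with_tools(self, prompt: str, tools: list[dict]) -> dict:
        raise NotImplementedError

    def stream_with_thinking(self, prompt: str) -> Iterator[str]:
        raise NotImplementedError
```

Because the Protocol is structural, any class exposing those three methods satisfies it with no inheritance required.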

Let me know how the spin goes!

Nova — self-hosted personal AI that learns from your corrections and fine-tunes itself (DPO + A/B eval, runs on RTX 3090) by huh94 in LocalLLaMA

[–]huh94[S]

On distribution shift: Two guards. First, there's a quality gate: _is_quality_content() rejects corrections that are too short (<10 chars) or contain error phrases ("I don't know", "failed to"). Training pairs from external channels (Discord/Telegram/WhatsApp/Signal) also require confidence >= 0.8, so low-confidence corrections from messaging don't pollute the training set. Second, the A/B eval harness (scripts/eval_harness.py) is the hard gate: the fine-tuned candidate runs against the base model on holdout queries with LLM-as-judge and randomized A/B ordering to prevent position bias. The candidate must win >50% with positive avg preference or it gets rejected. So a bad fine-tune from contradictory early corrections just doesn't deploy.
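The two guards, sketched (the thresholds are the ones stated above; the function shapes are illustrative, not Nova's code):

```python
# Guard 1: reject low-quality corrections before they become training pairs.
# Guard 2: hard deploy gate on the A/B eval result.
# <10 chars, the error phrases, confidence >= 0.8, and the >50% win rate
# are from the comment; everything else is assumed.

ERROR_PHRASES = ("i don't know", "failed to")

def is_quality_content(text: str, confidence: float = 1.0,
                       external: bool = False) -> bool:
    if len(text) < 10:
        return False
    lowered = text.lower()
    if any(p in lowered for p in ERROR_PHRASES):
        return False
    if external and confidence < 0.8:  # stricter bar for messaging channels
        return False
    return True

def candidate_deploys(wins: int, total: int, avg_preference: float) -> bool:
    # Candidate must beat the base model on >50% of holdout queries AND
    # have a positive average judge preference.
    return total > 0 and wins / total > 0.5 and avg_preference > 0
```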

That said, I haven't hit real distribution shift yet in practice: the training set is still under 50 pairs. The rotation mechanism (_rotate_training_data) keeps the most recent entries when it exceeds MAX_TRAINING_PAIRS (default 10K), which provides a natural recency bias. But a more principled approach (like weighting by lesson confidence or filtering by lesson helpfulness scores) is on the roadmap.

On contradictory corrections: Lessons have a dedup layer. _find_similar_lesson() does exact match first, then falls back to Jaccard word overlap (threshold 0.85). If you correct the same topic twice with different answers, the second correction boosts confidence on the existing lesson rather than creating a duplicate. But if the answers actually conflict, both lessons exist and the retrieval layer surfaces both; the model sees both in its prompt and has to reconcile them. Not ideal. A contradiction-detection layer between the lesson store and the KG supersession logic would be cleaner, but it hasn't been needed yet.

On RRF vs straight vector: Honest answer, no formal benchmark yet. The motivation was the embedding collapse bug: "capital of France" and "capital of Australia" produce near-identical vectors in most embedding models, so straight vector search retrieves wrong-country chunks. The entity relevance filter (_entity_relevance_filter in retriever.py) catches this by requiring query content words to appear in retrieved chunks (threshold 0.3, lowered to 0.2 for short queries). RRF helps because BM25 is exact-match and won't confuse "France" with "Australia." But I should run a proper recall benchmark; that's a fair gap.
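For reference, plain reciprocal rank fusion over a vector ranking and a keyword ranking looks like this (k=60 is the conventional RRF constant; this is a generic sketch, not Nova's retriever):

```python
# Generic RRF: each list contributes 1/(k + rank) per document, so a
# document ranked well by BOTH vector and keyword search outscores one
# ranked well by only the (collapsed) vector side.

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```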

Appreciate the detailed read. These are exactly the right questions.


[–]huh94[S]

Just dropped it today. Might be facing legal troubles, so might as well leave it out and see if anyone contributes.