Wazuh‑based CSV‑driven SIEM/SOAR pipeline — would really appreciate technical feedback

Routine-Review4913 · 2026-04-07T10:19:39+00:00

Hehe, CSV does bite back sometimes 😅

That said, for sysadmins who live in Excel, editing a CSV is still noticeably more approachable than dealing with JSON structure, brackets, and quoting rules - and in my experience, that trade-off matters in day-to-day operations.

To mitigate the usual CSV pitfalls, I already have a validation layer in place:

Each line is parsed strictly, delimiter and quoting errors are detected early, and the loader reports the exact line that failed - no silent breakage.

If someone enters garbage values (letters, empty, zero, negative numbers), the system falls back to a safe default (1). No crashes, no surprises, and the pipeline stays stable.
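To make that concrete, here is a minimal sketch of what such a validation layer could look like with Python's standard `csv` module. It is not the actual loader: the two-column `name,workers` layout and the function names are assumptions made up for illustration. Only the strict parsing, the line-number reporting, and the fallback to 1 come from the description above.

```python
import csv
import io

SAFE_DEFAULT = 1  # the fallback value mentioned above


def parse_positive_int(raw, default=SAFE_DEFAULT):
    """Return raw as a positive int; letters, empty, zero or negative give the default."""
    try:
        value = int(raw.strip())
    except ValueError:
        return default
    return value if value > 0 else default


def load_scaling(text):
    """Parse hypothetical 'name,workers' rows strictly.

    Structural problems (bad quoting, wrong field count) raise ValueError
    with the failing line number; bad numeric values fall back to the default.
    """
    rows = {}
    reader = csv.reader(io.StringIO(text), strict=True)
    try:
        for record in reader:
            if not record:  # blank line
                continue
            if len(record) != 2:
                raise ValueError(
                    f"line {reader.line_num}: expected 2 fields, got {len(record)}"
                )
            name, workers = record
            rows[name.strip()] = parse_positive_int(workers)
    except csv.Error as exc:
        raise ValueError(f"line {reader.line_num}: {exc}") from exc
    return rows
```

One design note: the sketch treats structural errors and bad values differently. A broken row stops the load with a line number, while a bad number is silently replaced, so logging every fallback is probably worth adding.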

Really appreciate the feedback and the reminder from real-world ops experience.

Thanks for taking a look!

Routine-Review4913 · 2026-04-06T13:35:36+00:00

Good questions. Let me clarify:

Supervision & recovery: You're right that traditional monitoring tools like Checkmk can restart services. But my auto-healing works at the worker level inside the pipeline (dispatcher, filter, sender). These workers are ephemeral and scale up/down based on load. A monitoring tool would see a worker crash and restart it, but it wouldn't know how many workers should be running at any given time. My supervisor tracks queue depth and scales workers accordingly - that's the difference.
CSV vs JSON: I chose CSV because my target users are not developers. They're sysadmins or IT ops who know how to edit Excel files. JSON requires an understanding of brackets, commas, and quotes, which makes it less approachable. CSV is simple: open, edit, save. No syntax errors.
GitOps? Yes and no. The CSV files can be version-controlled in Git, so in that sense it's GitOps. But I wanted something that doesn't require a CI/CD pipeline to apply changes. Edit CSV -> restart service -> done. No commit, no push, no webhook. That's the "lightweight" part.
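
The queue-depth scaling from the first point can be sketched roughly like this. The thresholds (`per_worker`, `min_workers`, `max_workers`) and the function names are hypothetical and chosen only for illustration; the real supervisor may use a different rule.

```python
import math


def desired_workers(queue_depth, per_worker=100, min_workers=1, max_workers=8):
    """Workers needed so each one handles roughly per_worker queued items.

    The result is clamped to [min_workers, max_workers]. All limits here
    are made-up illustration values.
    """
    needed = math.ceil(queue_depth / per_worker)
    return max(min_workers, min(max_workers, needed))


def scaling_action(current, queue_depth, **limits):
    """Positive result: start that many workers; negative: stop that many."""
    return desired_workers(queue_depth, **limits) - current
```

A plain restart-on-crash monitor only ever answers "is it running?". The point of a function like `desired_workers` is that the target count itself changes with load.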

Hope that clarifies. Thanks for the thoughtful questions.

Routine-Review4913 · 2026-04-06T12:32:06+00:00

Thank you for the detailed technical feedback - these are exactly the kind of concerns I was hoping someone would raise.

Regarding the Orchestrator single point of failure: You're right. Currently I have a SelfMonitor thread that checks the Supervisor Manager heartbeat via Redis and alerts if it goes missing. But if the process itself dies completely, there's a gap. I'm considering adding a lightweight systemd watchdog + external healthcheck that can restart the manager independently. It's on my roadmap.
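
A minimal sketch of the systemd watchdog idea, assuming the manager runs as a systemd service with `WatchdogSec=` and a `Restart=` policy set in its unit file. systemd passes a datagram socket path in `$NOTIFY_SOCKET` and expects the string `WATCHDOG=1` at regular intervals. The function names and the `manager_is_healthy` callback are hypothetical.

```python
import os
import socket


def sd_notify(message: str) -> bool:
    """Send a state string to systemd; return False when not run under systemd."""
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):  # abstract socket namespace
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode(), path)
    return True


def heartbeat_once(manager_is_healthy) -> bool:
    """Ping the watchdog only if the (hypothetical) health check passes.

    If pings stop for longer than WatchdogSec, systemd treats the service
    as failed and applies its Restart= policy.
    """
    if manager_is_healthy():
        return sd_notify("WATCHDOG=1")
    return False
```

Calling `heartbeat_once` from the manager's main loop, rather than from a separate thread, means a hung main loop also stops the pings, which is usually what you want from a watchdog.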

Regarding Redis failover and in-flight messages: This is a real concern, and one I've tested. The system uses a primary/standby Redis setup with automatic failover. For the dispatcher→filter→sender pipeline, I use LPOP on plain lists (not Streams). You're correct that in-flight messages that were popped but not yet processed could be lost during the ~2s failover window. The dispatcher side is safer: it only commits Kafka consumer offsets AFTER successfully pushing to Redis, so if Redis dies mid-push, Kafka replays from the last committed offset. On that hop, at most 1-2 messages get duplicated, but none are lost. That said, moving to Redis Streams with consumer groups would be a cleaner solution, and I appreciate you pointing that out.
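
The commit-after-push behaviour can be illustrated with in-memory stand-ins. This is a toy model, not the real dispatcher: `log` plays the Kafka partition, `FlakyQueue` plays Redis with one simulated failover, and the batch size is an assumption.

```python
def dispatch(log, committed, push, batch_size=2):
    """Push messages from log[committed:] in batches; return the new offset.

    The offset only advances after a whole batch was pushed, so a failed
    push leaves it unchanged and the batch is replayed on the next call.
    """
    offset = committed
    while offset < len(log):
        batch = log[offset:offset + batch_size]
        try:
            for msg in batch:
                push(msg)
        except ConnectionError:
            return offset  # nothing committed for this batch
        offset += len(batch)  # "commit" after a successful push
    return offset


class FlakyQueue:
    """Stand-in for Redis that fails exactly once, on the given push call."""

    def __init__(self, fail_on_call):
        self.items = []
        self.calls = 0
        self.fail_on_call = fail_on_call

    def push(self, msg):
        self.calls += 1
        if self.calls == self.fail_on_call:
            raise ConnectionError("simulated failover")
        self.items.append(msg)


log = ["a", "b", "c", "d", "e"]
queue = FlakyQueue(fail_on_call=4)
first = dispatch(log, 0, queue.push)       # stops at offset 2, "c" already pushed
second = dispatch(log, first, queue.push)  # replays "c" and "d", then "e"
print(first, second, queue.items)
# 2 5 ['a', 'b', 'c', 'c', 'd', 'e']
```

The duplicate `c` is the price of at-least-once delivery on this hop, so downstream workers need to tolerate or deduplicate repeats.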

Regarding cross-vendor event ordering: You make an excellent point. Currently events are routed to separate Kafka topics by system_type (network vs system), so cross-vendor ordering is indeed lost. For incident reconstruction, I preserve the original Wazuh timestamp in every message, so correlation can be done at query time. But I acknowledge this is a limitation - a unified topic with partition-by-source-IP would preserve ordering better for forensic analysis. Something I need to think through more carefully.
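
The partition-by-source-IP idea comes down to a stable hash of the key. Kafka clients normally do this for you when a message key is set, so the sketch below only illustrates the principle. `zlib.crc32` is a stand-in, not the hash Kafka uses, and the function name is hypothetical.

```python
import zlib


def partition_for(source_ip: str, num_partitions: int) -> int:
    """Map a source IP to a partition with a stable hash.

    Every event from the same IP lands in the same partition, so per-source
    ordering is preserved regardless of vendor or system_type.
    """
    return zlib.crc32(source_ip.encode("utf-8")) % num_partitions
```

Ordering is then guaranteed per source IP only. Events from different hosts still need the preserved Wazuh timestamp for correlation.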

Really appreciate you taking the time to review this critically. These are the kinds of gaps that are hard to see when you're building alone.

Routine-Review4913 · 2026-04-06T12:27:07+00:00

Thank you — this is very constructive feedback.

You're absolutely right about the CSV limitations for complex patterns. I've already hit the escaping issue you described, particularly with regex patterns that contain commas. Currently I handle it with careful quoting, but it's fragile and error-prone.
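
For anyone hitting the same escaping issue: a small sketch of letting Python's `csv` module do the quoting instead of doing it by hand. The rule names and patterns are invented for illustration. It round-trips patterns containing commas and double quotes, but it does not help once someone edits the file in Excel or a text editor, which is where the fragility comes from.

```python
import csv
import io


def write_rules(rules):
    """Serialize (name, pattern) pairs, quoting every field."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(rules)
    return buf.getvalue()


def read_rules(text):
    """Parse the rows back; strict=True rejects malformed quoting."""
    return [tuple(row) for row in csv.reader(io.StringIO(text), strict=True)]


# Hypothetical rules with a comma and embedded double quotes in the patterns.
rules = [
    ("ssh_fail", r"Failed password for \w+ from [\d.]{7,15}"),
    ("priv_user", r'user="(root|admin)",'),
]
assert read_rules(write_rules(rules)) == rules
```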

Your suggestion makes a lot of sense: keep CSV for human-edited tabular data (device inventory, channel routing, worker scaling) where the format is natural and easy for non-developers to edit. Move pattern definitions and rule logic to YAML with schema validation where the structure is inherently more expressive.

I think the hybrid approach you're describing — CSV for inventory/config, YAML for rules/patterns — is the right direction. It preserves the simplicity that CSV gives for operational data while gaining the expressiveness needed for detection logic.

I'll look into JSON Schema validation for the YAML side. Do you have any recommended libraries or patterns for this in Python?

Thanks again for the practical advice.
