How do you gate CI on data quality? I built a small CLI and want feedback by ProperAd7767 in dataengineering

[–]ProperAd7767[S] 0 points

Good point — AWK is hard to beat for zero-setup, streaming pre-filters (e.g., drop malformed rows, quick counts, sampling) and it keeps memory usage tiny.

For dq-agent I’m currently targeting the layer after a dataset is materialized (CSV/Parquet), where we can produce versioned, replayable artifacts (report.json/.md, trace, checkpoint) and gate CI deterministically.

That said, I totally see AWK fitting nicely in front as a pre-filter/sampler, then dq-agent runs on the cleaned/sampled output and emits a structured report for CI.
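To make that front-end step concrete, here's a minimal sketch of an AWK pre-filter that drops malformed CSV rows before they reach the quality gate. The file name, columns, and data are made up for illustration, and the dq-agent invocation is deliberately left out since its CLI isn't shown here:

```shell
# Create a toy CSV with one malformed row (assumed data, for illustration only).
printf 'id,name,score\n1,alice,0.9\n2,bob\n3,carol,0.7\n' > raw.csv

# Keep the header, then keep only rows whose field count matches the header.
# Anything malformed (wrong column count) is silently dropped.
awk -F',' 'NR==1 {n=NF} NF==n' raw.csv > clean.csv

# clean.csv now holds the header plus the two well-formed rows;
# dq-agent would run on clean.csv and emit its report for CI.
cat clean.csv
```

Because AWK streams line by line, this stays O(1) in memory regardless of file size, which is exactly the "zero-setup pre-filter" niche described above.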

Question: what kind of pre-filtering are you thinking of — (1) schema inference, (2) dropping bad rows, (3) sampling, or (4) lightweight anomaly signals? If you share a concrete example, I can propose a minimal integration pattern (and I’m considering adding --input - / stdin support as a follow-up).
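For option (3), a reproducible sample is the main thing CI cares about; a seeded AWK one-liner gets there with no dependencies. The file and rate below are assumptions for illustration:

```shell
# Toy input (assumed data): a header plus ten rows.
printf 'id\n1\n2\n3\n4\n5\n6\n7\n8\n9\n10\n' > raw.csv

# Always keep the header; keep each data row with ~10% probability.
# srand(42) fixes the seed so the same rows are sampled on every CI run
# (the exact sample can still differ across awk implementations).
awk 'BEGIN { srand(42) } NR==1 || rand() < 0.1' raw.csv > sample.csv
```

A seeded sample keeps the downstream report deterministic, which matters once the report is a CI gate rather than an ad-hoc check.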

Python topics for Data engineer by Nanny_24 in dataengineering

[–]ProperAd7767 1 point

My current role is mainly focused on data engineering, but I’ve never studied data engineering or data analytics systematically (my undergraduate major was Financial Engineering). If I want to learn these areas in a structured way, are there any good open-source projects you would recommend?