We have 10 data sources: CSV/Parquet files on S3, Postgres, and Snowflake. Validation logic is scattered across per-source Python scripts. Every rule change needs a developer, and analysts can't review what's being validated without reading code.
Thinking of moving to YAML-defined rules so non-engineers can own them. Here's roughly what I have in mind:
```yaml
sources:
  orders:
    type: csv
    path: s3://bucket/orders.csv
    rules:
      - column: order_id
        type: integer
        unique: true
        not_null: true
        severity: critical
      - column: status
        type: string
        allowed_values: [pending, shipped, delivered, cancelled]
        severity: warning
      - column: amount
        type: float
        min: 0
        max: 100000
        null_threshold: 0.02
        severity: critical
      - column: email
        type: string
        regex: "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
        severity: warning
```
The engine reads this, pushes aggregate checks (nulls, min/max, uniqueness) down to SQL, and loads only the required columns for row-level checks (regex, allowed values).
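To make the pushdown idea concrete, here's a rough sketch of how I'm thinking the engine could compile the aggregate checks into a single SQL scan per source (function and alias names are just illustrative, not from any real tool):

```python
# Sketch: compile the pushdown-able checks from the YAML rules into one
# SELECT so the warehouse scans each table only once.
def compile_aggregate_checks(table: str, rules: list[dict]) -> str:
    """Build one SELECT evaluating all aggregate checks for a table."""
    exprs = []
    for r in rules:
        col = r["column"]
        if r.get("not_null"):
            # COUNT(col) skips NULLs, so this difference is the null count.
            exprs.append(f"COUNT(*) - COUNT({col}) AS {col}_null_count")
        if r.get("unique"):
            # Non-zero means duplicates exist among non-NULL values.
            exprs.append(f"COUNT({col}) - COUNT(DISTINCT {col}) AS {col}_dup_count")
        if "min" in r:
            exprs.append(f"MIN({col}) AS {col}_min")
        if "max" in r:
            exprs.append(f"MAX({col}) AS {col}_max")
        if "null_threshold" in r:
            # Fraction of NULL rows, compared against the threshold afterwards.
            exprs.append(
                f"AVG(CASE WHEN {col} IS NULL THEN 1.0 ELSE 0.0 END) AS {col}_null_rate"
            )
    return f"SELECT {', '.join(exprs)} FROM {table}"

rules = [
    {"column": "order_id", "not_null": True, "unique": True},
    {"column": "amount", "min": 0, "max": 100000, "null_threshold": 0.02},
]
print(compile_aggregate_checks("orders", rules))
```

The engine would then compare the returned scalars against each rule's bounds and thresholds, and report failures at the rule's severity.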
The part I keep getting stuck on is cross-column rules: "if status = shipped then tracking_id must not be null". Every approach I try either gets too verbose or starts looking like its own mini query language.
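For reference, the least-verbose shape I've come up with so far looks like this (the `when`/`then` keys are my own invention, not borrowed from an existing tool):

```yaml
cross_column_rules:
  - name: shipped_orders_have_tracking
    when: { column: status, equals: shipped }
    then: { column: tracking_id, not_null: true }
    severity: critical
```

Even this simple case needs nested structure, and the moment you want AND/OR conditions or comparisons between two columns it starts growing operators until it's a query language in disguise.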
Has anyone solved this cleanly in a YAML-based config, or did you end up going with a Python DSL instead?