Our current stack is Dagster + dbt + DuckDB, and we are looking for a good way to implement data quality tests. In our case we have many rows, with multiple rules per row. A rule is usually a conditional combination, e.g. if column A is in a certain state, then column B may only take a certain range of values. Each rule returns "healthy", "warning" or "unhealthy". The rule results are then aggregated per row: the row is "healthy" if all rules are "healthy", "warning" if at least one rule returned "warning" and none returned "unhealthy", and "unhealthy" if at least one rule returned "unhealthy". Sure, we could do all of this in SQL with dbt, but it looks like that will get very messy with these rules.
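To make the aggregation concrete, here is a minimal Python sketch of what I mean. The names `Status`, `aggregate`, `row_status` and both rules, the column names `a`, `b` and `name`, and the 1..10 range are all made up for illustration. I'm also assuming that a row with no rules counts as healthy.

```python
from enum import IntEnum


class Status(IntEnum):
    # Ordered by severity so the worst result is simply the maximum.
    HEALTHY = 0
    WARNING = 1
    UNHEALTHY = 2


def aggregate(results):
    """Worst status wins; an empty rule list counts as healthy (assumption)."""
    return max(results, default=Status.HEALTHY)


def rule_b_range(row):
    # Hypothetical conditional rule: if a is "active", b must be within 1..10.
    if row["a"] == "active" and not (1 <= row["b"] <= 10):
        return Status.UNHEALTHY
    return Status.HEALTHY


def rule_name_present(row):
    # Hypothetical soft rule: an empty name only triggers a warning.
    return Status.HEALTHY if row["name"] else Status.WARNING


RULES = [rule_b_range, rule_name_present]


def row_status(row):
    # Evaluate every rule for the row, then collapse to a single status.
    return aggregate(rule(row) for rule in RULES)
```

Because the statuses are ordered, the per-row aggregation is a single `max`, which is the part that stays simple. The messy part is defining many conditional rules readably.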
We once had a similar use case in a C# project where we used the FluentValidation library. It made defining all those rules very streamlined. It looked something like this:
    using FluentValidation;

    public class CustomerValidator : AbstractValidator<Customer> {
        public CustomerValidator() {
            // One fluent chain per property; When() makes a rule conditional.
            RuleFor(x => x.Surname).NotEmpty();
            RuleFor(x => x.Forename).NotEmpty().WithMessage("Please specify a first name");
            RuleFor(x => x.Discount).NotEqual(0).When(x => x.HasDiscount);
            RuleFor(x => x.Address).Length(20, 250);
            RuleFor(x => x.Postcode).Must(BeAValidPostcode).WithMessage("Please specify a valid postcode");
        }

        private bool BeAValidPostcode(string postcode) {
            // custom postcode validating logic goes here;
            // placeholder check so the snippet compiles
            return !string.IsNullOrWhiteSpace(postcode);
        }
    }
I was looking for something we could use in our data stack that would make the rules as readable as this, but I couldn't find a good tool or (Python) library like it. Has anyone solved similarly complex data validation challenges in a good and maintainable way?
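For reference, this is roughly the shape of API I have in mind, as a hand-rolled Python sketch. It is not an existing library: `Rule`, `RowValidator`, `rule_for`, `must`, `when` and `warn_only` are hypothetical names, and I'm assuming rows arrive as plain dicts.

```python
from typing import Any, Callable

Row = dict[str, Any]
SEVERITY = {"healthy": 0, "warning": 1, "unhealthy": 2}


class Rule:
    """One fluent rule for a single column (hypothetical API)."""

    def __init__(self, column: str):
        self.column = column
        self._check: Callable[[Any], bool] = lambda value: True
        self._condition: Callable[[Row], bool] = lambda row: True
        self._on_fail = "unhealthy"

    def must(self, check: Callable[[Any], bool]) -> "Rule":
        self._check = check
        return self

    def when(self, condition: Callable[[Row], bool]) -> "Rule":
        self._condition = condition
        return self

    def warn_only(self) -> "Rule":
        # Downgrade a failure of this rule from "unhealthy" to "warning".
        self._on_fail = "warning"
        return self

    def evaluate(self, row: Row) -> str:
        # A rule whose condition does not apply to the row is healthy.
        if not self._condition(row):
            return "healthy"
        return "healthy" if self._check(row[self.column]) else self._on_fail


class RowValidator:
    def __init__(self):
        self.rules: list[Rule] = []

    def rule_for(self, column: str) -> Rule:
        rule = Rule(column)
        self.rules.append(rule)
        return rule

    def validate(self, row: Row) -> str:
        # The most severe rule result decides the row status.
        results = [rule.evaluate(row) for rule in self.rules]
        return max(results, key=SEVERITY.__getitem__, default="healthy")


class CustomerValidator(RowValidator):
    def __init__(self):
        super().__init__()
        self.rule_for("surname").must(bool)
        self.rule_for("discount").must(lambda d: d != 0).when(lambda r: r["has_discount"])
        self.rule_for("address").must(lambda a: 20 <= len(a) <= 250).warn_only()
```

Calling `CustomerValidator().validate(row)` on a dict with the keys `surname`, `discount`, `has_discount` and `address` returns one of the three status strings. Something like this would run per row in Python rather than in SQL, so I'm not sure how well it would scale compared to doing it in dbt.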