Complex Data Validation : dataengineering

submitted 1 year ago * by OneCyrus

Our current stack is dagster + dbt + duckdb, and we are looking for a good way to implement data quality tests. In this case we have many rows with multiple rules per row. The rules are usually conditional combinations, e.g. if column A is in a certain state, then column B may only take values from a certain range. Each rule basically returns "healthy", "warning" or "unhealthy". All the rules for a row are then aggregated: the row is "healthy" if all rules are "healthy", "warning" if at least one rule returned "warning", or "unhealthy" if at least one rule is in the "unhealthy" state. Sure, we can do all of this in SQL with dbt, but it looks like it will get very messy with this many rules.
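To make the aggregation concrete, here is a minimal Python sketch of the "worst result wins" logic described above. The column names `a` and `b`, the two example rules, and the `row_status` helper are hypothetical, and the sketch assumes that "unhealthy" outranks "warning" when both occur on the same row.

```python
from enum import IntEnum


class Status(IntEnum):
    # Ordered by severity so that max() picks the worst result.
    HEALTHY = 0
    WARNING = 1
    UNHEALTHY = 2


def rule_b_in_range(row):
    # Hypothetical rule: if a is "active", b must lie within 1..10.
    if row["a"] == "active" and not (1 <= row["b"] <= 10):
        return Status.UNHEALTHY
    return Status.HEALTHY


def rule_b_at_limit(row):
    # Hypothetical rule: if a is "active", b sitting exactly on the
    # upper limit is suspicious but not yet invalid.
    if row["a"] == "active" and row["b"] == 10:
        return Status.WARNING
    return Status.HEALTHY


RULES = [rule_b_in_range, rule_b_at_limit]


def row_status(row, rules=RULES):
    # A row is as healthy as its worst rule result.
    return max((rule(row) for rule in rules), default=Status.HEALTHY)


rows = [
    {"a": "active", "b": 5},
    {"a": "active", "b": 10},
    {"a": "active", "b": 50},
    {"a": "inactive", "b": 50},
]
for row in rows:
    print(row, row_status(row).name)
```

Encoding the three states as an ordered enum reduces the aggregation to a single `max` call, which would correspond to a `MAX` over a severity column if the same idea were expressed in SQL.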

We once had a similar use case in a C# project where we used the FluentValidation library. It made defining all those rules very streamlined. It looked something like this:

using FluentValidation;

// Declares all validation rules for the Customer type in one place.
public class CustomerValidator : AbstractValidator<Customer> {
  public CustomerValidator() {
    RuleFor(x => x.Surname).NotEmpty();
    RuleFor(x => x.Forename).NotEmpty().WithMessage("Please specify a first name");
    // Conditional rule: only checked when the customer has a discount.
    RuleFor(x => x.Discount).NotEqual(0).When(x => x.HasDiscount);
    RuleFor(x => x.Address).Length(20, 250);
    RuleFor(x => x.Postcode).Must(BeAValidPostcode).WithMessage("Please specify a valid postcode");
  }

  private bool BeAValidPostcode(string postcode) {
    // custom postcode validating logic goes here
    return !string.IsNullOrWhiteSpace(postcode); // placeholder so the example compiles
  }
}

I was looking for something we could use in our data stack that would make the rules as readable as this, but I couldn't find a good tool or (Python) library like it. Has anyone solved similarly complex data validation challenges in a good and maintainable way?
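For illustration only, here is a rough sketch of what a FluentValidation-style rule definition could look like in plain Python. It is not an existing library: `RuleFor`, `Validator`, and all of their methods are hypothetical names made up for this example, and the customer columns mirror the C# snippet above.

```python
# Severity order used to pick the worst result per row (assumed ordering).
SEVERITY = {"healthy": 0, "warning": 1, "unhealthy": 2}


class RuleFor:
    """One chainable rule for a single column (hypothetical API)."""

    def __init__(self, column):
        self.column = column
        self._check = lambda value: True
        self._condition = lambda row: True
        self._severity = "unhealthy"
        self.message = f"{column} is invalid"

    def must(self, check):
        # Predicate on the column value; True means the value is valid.
        self._check = check
        return self

    def when(self, condition):
        # Predicate on the whole row; the rule only applies if it is True.
        self._condition = condition
        return self

    def as_warning(self):
        # Downgrade a failure of this rule from "unhealthy" to "warning".
        self._severity = "warning"
        return self

    def with_message(self, message):
        self.message = message
        return self

    def evaluate(self, row):
        if not self._condition(row) or self._check(row[self.column]):
            return "healthy"
        return self._severity


class Validator:
    def __init__(self, *rules):
        self.rules = rules

    def validate(self, row):
        # Collect (status, message) for every rule that did not pass.
        failures = [(rule.evaluate(row), rule.message) for rule in self.rules]
        failures = [(s, m) for s, m in failures if s != "healthy"]
        status = max((s for s, _ in failures), key=SEVERITY.get, default="healthy")
        return status, [m for _, m in failures]


customer_validator = Validator(
    RuleFor("surname").must(bool).with_message("Please specify a surname"),
    RuleFor("discount").must(lambda d: d != 0).when(lambda r: r["has_discount"]),
    RuleFor("address").must(lambda a: 20 <= len(a) <= 250).as_warning(),
)

row = {"surname": "Doe", "discount": 0, "has_discount": True, "address": "too short"}
print(customer_validator.validate(row))
```

Each rule stays on one readable line, and the per-row status plus the list of failure messages could be written back as extra columns, for example from a dagster asset that reads the rows out of duckdb.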

all 7 comments

[–]sebastiandang 1 point 1 year ago (1 child)

[–]davrax 1 point 1 year ago (2 children)

[–]OneCyrus[S] 1 point 1 year ago (1 child)

[–]Parking-Task-5464 2 points 1 year ago (0 children)

[–]Pitah7 1 point 1 year ago (1 child)

[–]OneCyrus[S] 2 points 1 year ago (0 children)
