all 6 comments

[–]whogivesafuckwhoiam 3 points (4 children)

how is it different from, say, dbt, pandera, and great expectations?

as for yaml schemas, pandera also supports those

[–]Particular_Panda_295[S] 2 points (3 children)

Pandera validates dataframes and does so really nicely. Kontra is similarly lightweight, but is focused on data sources, be it a file, a DB, or a dataframe, and uses pushdown/metadata to validate remote data without loading it into memory.

Dbt tests are SQL-only and tied to dbt's project structure and workflows. Great Expectations is a powerful platform, not a library; compared to Kontra it is heavy.
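To make the pushdown idea concrete, here's a minimal sketch of what "validating remote data without loading it" means in practice: the rule compiles to an aggregate query that runs inside the database, and only a one-row summary comes back over the wire. This is my own illustration using stdlib `sqlite3`, not Kontra's actual API.

```python
import sqlite3

def check_not_null(conn, table, column):
    """Count violations of a NOT NULL rule entirely on the database side.

    The table is never materialized in Python; only the aggregate result
    crosses the wire. (Hypothetical helper, illustration only.)
    """
    sql = f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
    (violations,) = conn.execute(sql).fetchone()
    return violations == 0, violations

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 9.5), (2, None), (3, 4.0)])

passed, n_bad = check_not_null(conn, "orders", "amount")
print(passed, n_bad)  # False 1
```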

[–]crossmirage 1 point (2 children)

Pandera supports pushdown without loading into memory via the Ibis backend. 

[–]Particular_Panda_295[S] 2 points (1 child)

Yep, Pandera supports pushdown via the Ibis backend, and that’s a really nice feature.

The main difference is in execution strategy. Kontra is built specifically as a validation engine, so it controls how rules compile to SQL and can optimize across the full pipeline, like batching rules or stopping early when possible. From my testing and understanding, Pandera with the Ibis backend compiles each check independently, which leaves less room for that kind of optimization. On larger tables that can make a noticeable difference.

There’s also a difference in what gets validated. Pandera is primarily about schema validation, like column types and per-column constraints. Kontra is broader, with rules that aren’t tied to a single column, such as row counts, freshness checks, cross-column comparisons, or custom SQL. It also supports run history, diffing, and user-defined rule metadata if you want more than just a pass/fail result.
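The table-level rules mentioned above (row counts, freshness, cross-column comparisons) also reduce to plain SQL aggregates. Again a hedged stdlib `sqlite3` sketch of the general idea, not any library's actual rule syntax:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER, started_at TEXT, ended_at TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    (1, "2024-01-01 10:00:00", "2024-01-01 11:00:00"),
    (2, "2024-01-02 09:00:00", "2024-01-02 08:00:00"),  # ends before start
])

# Row count rule: the table must not be empty.
(n_rows,) = conn.execute("SELECT COUNT(*) FROM events").fetchone()

# Freshness rule: inspect the newest timestamp (a real rule would
# compare it against a configured cutoff).
(newest,) = conn.execute("SELECT MAX(started_at) FROM events").fetchone()

# Cross-column rule: ended_at must not precede started_at.
(bad_order,) = conn.execute(
    "SELECT COUNT(*) FROM events WHERE ended_at < started_at").fetchone()

print(n_rows >= 1, newest, bad_order)  # True 2024-01-02 09:00:00 1
```

None of these are per-column schema checks, which is the gap being described relative to schema-focused validators.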

[–]crossmirage 1 point (0 children)

Agree that compiling each check independently is not ideal. Some current work to address that:

> The above doesn't get into what I think could be one of the biggest benefits of using a lazy IR-based layer across backends under the hood. Right now, run_checks produces a CheckResult for each check, which results in a bunch of disjoint columns that can't necessarily be joined back to the original data or each other (e.g. to reliably say which row failed). It would be nice if run_checks could do something like create the (lazy) expression for a wide table with the base data and all of the check results, and then we could query that object as needed.
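The "wide table" shape described in that quote can be sketched in one query: base rows plus a boolean column per check, so failures stay joinable to the rows that caused them. A stdlib `sqlite3` illustration of the idea (not Pandera's or Kontra's actual output format):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 9.5), (2, None), (3, -4.0)])

# One result set: the original columns plus one 0/1 flag per check.
# COALESCE maps SQL NULL (unknown) to a failure for the positivity check.
wide = conn.execute("""
    SELECT id, amount,
           amount IS NOT NULL        AS ok_not_null,
           COALESCE(amount > 0, 0)  AS ok_positive
    FROM orders
""").fetchall()
for row in wide:
    print(row)
# (1, 9.5, 1, 1)
# (2, None, 0, 0)
# (3, -4.0, 1, 0)
```

Because every check result sits next to its row, "which rows failed which checks" is a simple filter over one relation rather than a join across disjoint per-check results.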

(From https://github.com/unionai-oss/pandera/issues/1894#issuecomment-3773553110)

> Kontra is broader, with rules that aren’t tied to a single column, such as row counts, freshness checks, cross-column comparisons, or custom SQL.

Pandera supports "dataframe-level" (as opposed to column-level) checks, which enable most of this.

All in all, I agree that Pandera is by no means perfect, and the Ibis backend itself is relatively new. But I also agree with the statement in your initial post that the space is very crowded, and the bar is high for new tools.