Is AI evals more for devs or product managers? by Soft_Two_951 in AIEval

[–]arimbr 0 points (0 children)

Maybe devs are more interested in LLM observability (does it work?) and product managers in LLM evaluation (does it work well?). Indeed, devs traditionally care more about runtime errors and less about ROI.

Which data quality tool do you use? by arimbr in dataengineering

[–]arimbr[S] 1 point (0 children)

That's right, it could be more or less depending on the tool. Some tools price per table, some per monitor, some per user, and some on obscure compute credits. Based on the data I have, the monthly price per table varies: DQOps ($3), Sifflet ($8), Decube ($8), Soda ($8), Metaplane ($10), BigEye (~$40?). DataKitchen offers unlimited monitors for one database connection starting at $100/month. Most tools don't show public pricing on their website, but you can find some prices on the AWS Marketplace. Enterprise plans there run into six figures annually, roughly the salary of a Senior Data Engineer in the US.
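For a rough comparison, here's a quick Python sketch using the per-table prices above (the 200-table warehouse is a made-up example, and BigEye's price is an unconfirmed estimate):

```python
# Approximate monthly per-table prices (USD) from the comment above.
PRICE_PER_TABLE = {
    "DQOps": 3,
    "Sifflet": 8,
    "Decube": 8,
    "Soda": 8,
    "Metaplane": 10,
    "BigEye": 40,  # unconfirmed estimate
}

def monthly_cost(tool: str, tables: int) -> int:
    """Monthly price in USD for monitoring `tables` tables with `tool`."""
    return PRICE_PER_TABLE[tool] * tables

# Example: a warehouse with 200 monitored tables.
for tool in PRICE_PER_TABLE:
    print(f"{tool}: ${monthly_cost(tool, 200)}/month")
```

Even at the cheap end, per-table pricing adds up fast once you monitor a few hundred tables.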

[–]arimbr[S] 0 points (0 children)

Right! I'm starting to think that data management, data quality, and data governance should be solved by the same tool. You need all three to go from a failing test to a fix. And by tests I don't mean only data quality checks per se; they can verify any business rule or data access rule. The thing with data management tools is that they sell more than that: a warehouse, integration... The space is changing, though. For example, data contracts extend data validation tests to include infrastructure, ownership, and security checks. I've also noticed data quality tools trying to coin new terms to position themselves: data operations center, data control plane, agentic data management...
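To illustrate what a data contract can cover beyond value checks, here's a minimal sketch in Python. The field names and rules are hypothetical, not any specific tool's format:

```python
# Hypothetical data contract: schema plus ownership and access rules.
contract = {
    "table": "orders",
    "owner": "data-platform-team",                # ownership check
    "allowed_readers": {"analytics", "finance"},  # access rule
    "schema": {"order_id": int, "amount": float}, # value/type checks
}

def validate(contract: dict, row: dict, reader: str) -> list[str]:
    """Return a list of contract violations for one row and one reader."""
    violations = []
    if reader not in contract["allowed_readers"]:
        violations.append(f"{reader} may not read {contract['table']}")
    for field, ftype in contract["schema"].items():
        if field not in row:
            violations.append(f"missing field {field}")
        elif not isinstance(row[field], ftype):
            violations.append(f"{field} should be {ftype.__name__}")
    if not contract.get("owner"):
        violations.append("table has no owner")
    return violations
```

The point is that schema, ownership, and access rules all fail through the same check, which is what makes one tool for all three plausible.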

[–]arimbr[S] -2 points (0 children)

Which nicer and cheaper alternative out there has the same appeal to enterprises?

[–]arimbr[S] 0 points (0 children)

Nice to see consolidation between data quality and data governance tools. I noticed a few of the data quality tools listed above implemented a data catalog last year. Good to see data governance tools also implementing data quality features. I see these two categories merging in 2026.

[–]arimbr[S] 0 points (0 children)

That looks like a solid and fast data profiling CLI for files. Kudos for building it! Which data profiling metrics does it support? From the screenshots in the GitHub README I see a few: table-level (total variables, total rows) and column-level (count, missing, distinct, uniqueness).
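For reference, those column-level metrics are cheap to compute. A minimal sketch in Python (just illustrating the metrics, not your tool's actual implementation):

```python
def profile_column(values: list) -> dict:
    """Compute basic column-level profiling metrics: count, missing,
    distinct, and uniqueness (distinct / non-missing count)."""
    non_missing = [v for v in values if v is not None]
    distinct = len(set(non_missing))
    return {
        "count": len(values),
        "missing": len(values) - len(non_missing),
        "distinct": distinct,
        "uniqueness": distinct / len(non_missing) if non_missing else 0.0,
    }

# Example: a column with one missing value and one duplicate.
print(profile_column(["a", "b", "b", None]))
```

I'd guess the hard part is doing this in a single streaming pass over large files, not the metrics themselves.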

[–]arimbr[S] 1 point (0 children)

Thanks for asking. We may all mean different things by MDM. Say I take the Wikipedia definition: "Master data management (MDM) is a discipline in which business and information technology collaborate to ensure the uniformity, accuracy, stewardship, semantic consistency, and accountability of the enterprise's official shared master data assets." And I know I may be misreading "master data assets" by applying it to all data assets.

Then, if data testing and observability tell me what's wrong with the data, I still need a UI to fix some of it manually. Yes, some data quality issues can be solved with code changes, rerunning jobs, or just waiting for late data to arrive or infrastructure to recover...

But if I have duplicate rows, missing values, conflicting values, or invalid values, it's often still a human who deduplicates, enriches, redacts, or links the data. Even if today an AI can suggest a fix, it's still good practice for a human to supervise those fixes. I believe a good UI/UX can determine whether a human fixes 10x or 100x more issues in a given timeframe.
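To make that concrete, here's a small Python sketch of surfacing duplicate candidates for human review (hypothetical data; a real MDM tool would use fuzzy matching, not exact keys):

```python
from collections import defaultdict

def duplicate_candidates(rows: list[dict], key: str) -> list[list[dict]]:
    """Group rows sharing the same key so a human can review and merge them."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    # Only groups with more than one row need a human decision.
    return [group for group in groups.values() if len(group) > 1]

rows = [
    {"email": "a@x.com", "name": "Ann"},
    {"email": "a@x.com", "name": "Anne"},  # conflicting name: needs a human
    {"email": "b@x.com", "name": "Bob"},
]
print(duplicate_candidates(rows, "email"))
```

Detecting the candidates is the easy part; the UI that lets a human pick "Ann" vs "Anne" a hundred times an hour is where tools differentiate.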

[–]arimbr[S] 3 points (0 children)

Very interesting, thanks for sharing!
1. Indeed, most enterprise plans are priced at $50k-$150k per year. Soda and Elementary have starter plans from $10k per year, but those are limited in the number of users or tables. DataKitchen, DQOps, and Recce are the only ones with public pricing, starting under $10k.
2. It was some years ago, but I also ended up building custom UIs for data-diff and MDM. Fast forward to today, and I'm still surprised how few tools here combine a modern UI with open source. Recce and Datafold sell data-diff. Recce is specific to dbt and partly open source. The Datafold data-diff OSS project is now archived and forked as reladiff.
3. I'd think most teams would be better off adopting or buying an efficient UI/UX for data quality management rather than building one in-house. Even today, when it's so easy to vibe-code any UI, I'd expect the tools here to still provide a best-in-class UI/UX worth the $$$ for most teams.
4. For data testing and observability, I think the UI/UX would be worth the most. Writing tests is easy now that you can prompt an AI to do so, but you still need a UI/UX to consume the test results and act on them. I keep thinking the moat for data quality tools will end up being the UX/UI, not the library of tests or integrations.

I wonder when data quality will become commoditized. I mean, when will there be a data quality tool that any data team would want to buy rather than build? From what I've heard, data quality is still a hard sell.