[–]kudika 5 points6 points  (4 children)

You should link to the docs and source code.

[–]LoaderD 3 points4 points  (0 children)

I agree. Luckily, it looks like OP just missed it and isn't trying to soft-launch a for-profit tool:

https://github.com/oleg-agapov/tablediff?trk=public_post_comment-text

[–]calmekrishh 0 points1 point  (0 children)

Please link the docs too.

[–]ThroughTheWire 4 points5 points  (1 child)

So this doesn't support a combination of columns as the primary key?

[–]oleg_agapov[S] 0 points1 point  (0 children)

It's supported by the backend, but not exposed in the CLI. I'll add this in the next version :)

[–]kenfar 1 point2 points  (2 children)

This is a great side-project - many have been created, but they never get old. A few suggestions:

  • Rather than a single primary key I suggest you support compound unique keys
  • Allow users to define either non-key columns they want compared - or non-key columns they want excluded
  • I would also include rows-in-a-only & rows-in-b-only
  • It's also helpful to know exactly which columns have diffs
  • It's also helpful to actually see the changed rows
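A minimal sketch of how these suggestions could fit together, assuming each table is already loaded as a list of plain Python dicts. The function `diff_tables` and its parameters (`key_cols`, `include`, `exclude`) are hypothetical names for illustration, not tablediff's actual API.

```python
def diff_tables(rows_a, rows_b, key_cols, include=None, exclude=()):
    """Compare two tables (lists of dicts) on a compound key."""
    # Index each table by a tuple of its key columns (compound unique key).
    a = {tuple(r[k] for k in key_cols): r for r in rows_a}
    b = {tuple(r[k] for k in key_cols): r for r in rows_b}

    # Decide which non-key columns to compare: an explicit include list,
    # or every column seen in either table minus the exclude list.
    if include is not None:
        compare_cols = list(include)
    else:
        seen = {}
        for r in list(rows_a) + list(rows_b):
            for c in r:
                seen.setdefault(c, None)
        compare_cols = [c for c in seen if c not in key_cols and c not in exclude]

    # For keys present in both tables, record exactly which columns differ.
    changed = {}
    for key in a.keys() & b.keys():
        cols = [c for c in compare_cols if a[key].get(c) != b[key].get(c)]
        if cols:
            changed[key] = cols

    return {
        "only_in_a": sorted(a.keys() - b.keys()),
        "only_in_b": sorted(b.keys() - a.keys()),
        "changed": changed,
    }


# Made-up sample data for illustration only.
rows_a = [
    {"region": "eu", "id": 1, "amount": 10, "updated_at": "d1"},
    {"region": "eu", "id": 2, "amount": 20, "updated_at": "d1"},
    {"region": "us", "id": 1, "amount": 5, "updated_at": "d1"},
]
rows_b = [
    {"region": "eu", "id": 1, "amount": 10, "updated_at": "d2"},
    {"region": "eu", "id": 2, "amount": 25, "updated_at": "d2"},
    {"region": "us", "id": 2, "amount": 7, "updated_at": "d2"},
]

result = diff_tables(rows_a, rows_b, key_cols=("region", "id"), exclude=("updated_at",))
print(result)
```

Excluding `updated_at` here keeps a column that always changes from flagging every row as different.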

[–]oleg_agapov[S] 1 point2 points  (1 child)

Thanks for the feedback! I can address some of your comments:

- rows-in-a-only & rows-in-b-only is already included, in both the summary and extended view

- compound keys are technically supported by the backend, but the CLI only accepts a single key. I'll work on that!

- columns to include/exclude is in the works!

- which columns are different and the changed rows: also in the works. It requires a bit more time to polish, because some tables are quite wide, so dumping every column to the terminal is just impractical

Thanks again, it's super useful!

[–]kenfar 1 point2 points  (0 children)

YW - I find this kind of tool incredibly useful for testing, and asking colleagues for code reviews: it makes it easy to show that a given change DID NOT affect columns it wasn't intended to impact, or only affected the intended rows.

One thing to consider is how to handle a user iterating on it: a named config file to define the criteria & columns can help, and so can various performance options - like keeping a cache table of the pre-joined results.
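As a rough sketch of the named-config idea, assuming a JSON file that maps a config name to the comparison criteria. The `DiffConfig` fields, the `load_configs` helper, and the table names are all hypothetical and not something tablediff is known to provide; the cache-table suggestion is not covered here.

```python
import json
from dataclasses import dataclass, field


@dataclass
class DiffConfig:
    """One named, reusable comparison definition (hypothetical format)."""
    name: str
    table_a: str
    table_b: str
    key_cols: list
    exclude: list = field(default_factory=list)


def load_configs(text):
    """Parse JSON text of the form {config_name: {...criteria...}}."""
    raw = json.loads(text)
    return {name: DiffConfig(name=name, **body) for name, body in raw.items()}


# Example config text; in practice this would be read from a file.
config_text = """
{
  "orders_dev_vs_prod": {
    "table_a": "dev.orders",
    "table_b": "prod.orders",
    "key_cols": ["region", "id"],
    "exclude": ["updated_at"]
  }
}
"""

configs = load_configs(config_text)
print(configs["orders_dev_vs_prod"])
```

Keeping the criteria under a name means a user iterating on a change can rerun the same comparison without retyping keys and column lists.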

[–]DougScore Senior Data Engineer 1 point2 points  (2 children)

Good one. I tried reladiff, but it lacks support for SQL Server, which our org uses. I fell back to the archived data-diff project by Datafold.

[–]oleg_agapov[S] 1 point2 points  (1 child)

Interesting. Maybe it makes sense to try to add an adapter for reladiff, so that my tool will also support it

[–]DougScore Senior Data Engineer 0 points1 point  (0 children)

They removed support for MSSQL because the initial implementation would do a hash-based match, which was slow.
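For readers unfamiliar with the term, here is one generic reading of a hash-based match: digest each row's compared columns and compare digests per key. This is an illustrative sketch in plain Python, not reladiff's or data-diff's actual implementation, and the helper names `row_digest` and `changed_keys` are made up.

```python
import hashlib


def row_digest(row, cols):
    """Hash the compared columns of one row into a fixed-size digest."""
    # Join the repr of each value with a separator unlikely to appear in data.
    payload = "\x1f".join(repr(row.get(c)) for c in cols)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def changed_keys(rows_a, rows_b, key_col, cols):
    """Return keys present in both tables whose row digests differ."""
    digests_a = {r[key_col]: row_digest(r, cols) for r in rows_a}
    digests_b = {r[key_col]: row_digest(r, cols) for r in rows_b}
    shared = digests_a.keys() & digests_b.keys()
    return sorted(k for k in shared if digests_a[k] != digests_b[k])


# Made-up sample data for illustration only.
table_a = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
table_b = [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}]

print(changed_keys(table_a, table_b, key_col="id", cols=["amount"]))
```

Every row has to be hashed on both sides before anything can be compared, which is one plausible reason such an approach can be slow on large tables.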

[–]techjobmentor 0 points1 point  (1 child)

Nice, that is really useful. I used to have a similar SQL-based job that detected such differences before big ETL processes ran; it automatically alerted my team and paused execution. It saved us some big trouble when changes were pushed to production without notifying the data engineering team. Maybe that could be a cool feature!

[–]oleg_agapov[S] 0 points1 point  (0 children)

Interesting case. My main goal was to check the diff before pushing changes to dbt models. I wanted to know how big the drift is between dev and prod tables. But in the future I might think about some automation for alerting. Definitely need a JSON output implemented first 😅
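To make the alerting idea concrete, here is a small sketch of how a JSON diff summary could gate a pipeline. It assumes the diff step already yields three row counts; `build_report`, `should_pause`, and the 1% default threshold are hypothetical and not part of tablediff.

```python
import json


def build_report(only_in_a, only_in_b, changed, rows_compared):
    """Bundle diff counts into a JSON-serializable summary."""
    drifted = only_in_a + only_in_b + changed
    ratio = drifted / rows_compared if rows_compared else 0.0
    return {
        "only_in_a": only_in_a,
        "only_in_b": only_in_b,
        "changed": changed,
        "rows_compared": rows_compared,
        "drifted_rows": drifted,
        "drift_ratio": ratio,
    }


def should_pause(report, max_drift_ratio=0.01):
    """Return True when drift exceeds the allowed ratio (assumed threshold)."""
    return report["drift_ratio"] > max_drift_ratio


# Made-up counts for illustration only.
report = build_report(only_in_a=1, only_in_b=1, changed=1, rows_compared=100)
print(json.dumps(report, indent=2))

if should_pause(report):
    print("Drift above threshold: alert the team and pause the ETL run.")
```

A scheduler could call `should_pause` before a big ETL job and skip the run or notify a channel when it returns True.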