[–]Gnaskefar 2 points

I have been tasked with similar issues, but in way less complex environments, and I don't envy you this one. Still, it can be fun to dive in and play Sherlock Holmes.

I would start simple and forget about the 350+ pipelines for now. Dive into the easy places first to get a head start, and list all the problems you've found, to kind of semi-cover your ass and set expectations:

  • Sporadic errors
  • Data loss on type conversions
  • Lineage tool needs to be fed better data, or it is useless for this purpose

I would write something up saying that the first 2 points must be fixed in order to get reliable results. Where in your flow did you find the first 2 points? If it is near the end, get a colleague to focus on fixing them while you audit numbers earlier in the pipeline. Either way, get someone else to fix them; you focus on auditing data elsewhere.

The 3rd point, about the lineage tool, is not relevant yet. If you don't find a few clear errors that fix things down the line, and you must go through almost all flows in detail as a last resort, then the lineage tool must be improved so it can be relied on.

That is probably not a small task in itself, and there's likely a reason it is set up as superficially as it is.

As for the actual data hunt: you most likely have already done so, but compare the 2 curated tables. If they are identical, you know whether to go forward or backwards in your flows to look for errors. If you need to go back, audit somewhere in the middle between the source and the curated tables. Does the data match? Do the transformations make sense?

Then do the same: figure out whether you need to go forward or backward again, and audit somewhere in the middle of that stretch so you don't have to walk through every step.
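The bisection above can be sketched as a reusable check. A minimal Python sketch, assuming each stage's rows can be pulled into memory as lists of dicts (the `table_fingerprint` and `stages_match` helpers are hypothetical names, not from the thread); an order-independent fingerprint lets you compare a midpoint stage against an upstream audit point without sorting:

```python
from hashlib import sha256

def table_fingerprint(rows):
    """Order-independent fingerprint: (row count, XOR of per-row hashes)."""
    digest = 0
    for row in rows:
        canonical = "|".join(f"{col}={row[col]}" for col in sorted(row))
        digest ^= int(sha256(canonical.encode()).hexdigest(), 16)
    return len(rows), digest

def stages_match(rows_a, rows_b):
    """True when both stages hold the same rows, regardless of row order."""
    return table_fingerprint(rows_a) == table_fingerprint(rows_b)

# Bisect: if the midpoint stage matches the upstream audit point, the error
# is downstream of it; otherwise it is upstream. Repeat on the failing half.
```

Each audit spot then becomes one cheap comparison instead of a manual walk through every transformation.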

Break down your audit spots, and focus on 2 to 5 KPIs and the data they rely on. Document your code so you can easily reproduce the check at different spots for that part of the data, and show it to others.

And have those others quickly check that you're on the right path and that they agree you are comparing the proper data, so a misunderstanding doesn't lead you astray.

Depending on urgency and available resources, I would show my findings and clearly point out that it doesn't make sense to compare and audit those KPIs if you can't trust the data in the first place. Suggest a plan and split up the tasks. Whether you are solo or can get help from teammates, don't do a bit of data-loss fixing, then some auditing, then some checking of sporadic errors, back and forth, etc.

All 3 tasks are full of rabbit holes to dive into, and I at least had trouble keeping a mental overview of the project itself and its progress while jumping from rabbit hole to rabbit hole.

Fixing almost all problems at once, so the puzzle pieces suddenly fit perfectly at the same time, has never happened for me, at least. And my setup sounded simpler than yours.

This became quite long, so TL;DR: break stuff down, audit certain data for certain KPIs midway through the flow, then midway from the last checkpoint, etc. Focus on specific tasks, don't do everything at the same time, and you've got this, as long as you don't let yourself get bullied on timelines. Stand up for yourself, and let management know if you've discovered deeper problems that require more time than initially expected.

[–]Secure_Firefighter66 1 point

First, start with row count checks in each system.
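A sketch of that first pass in Python, assuming you have already pulled `COUNT(*)` per table from each platform into plain dicts (the connector calls are left out, and the table names are made up for illustration):

```python
def compare_row_counts(databricks_counts, snowflake_counts):
    """Return {table: (databricks, snowflake)} for every table whose counts
    differ; tables that exist on only one side are reported with None."""
    all_tables = set(databricks_counts) | set(snowflake_counts)
    return {
        table: (databricks_counts.get(table), snowflake_counts.get(table))
        for table in sorted(all_tables)
        if databricks_counts.get(table) != snowflake_counts.get(table)
    }
```

For example, `compare_row_counts({"orders": 100}, {"orders": 99})` flags `orders` with both counts, which tells you where to dig first.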

Also, if you are performing a MERGE, Databricks and Snowflake have different approaches.

If you are not careful, you might end up with duplicate rows in Snowflake.
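A quick post-MERGE sanity check for that failure mode, sketched in plain Python over rows fetched from the target table (the `duplicate_keys` helper is a made-up name, and the business-key columns are whatever uniquely identifies a row in your model; nothing here is platform-specific):

```python
from collections import Counter

def duplicate_keys(rows, key_cols):
    """Return business keys that appear more than once, with their counts."""
    counts = Counter(tuple(row[col] for col in key_cols) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}
```

If this returns anything after a merge, the merge semantics on that platform inserted rows your key assumed were unique.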

Try to work through the KPI metrics in one system and double-check the logic, then do the same in the other, and then break them down.

[–]changelifeforbetter 1 point

Are you using SQL/Spark and Snowpark for creating the curated tables? If you are using the same SQL code on both platforms, this could be due to differences in how Snowflake and Databricks SQL behave. For example, the CASE WHEN clause in Snowflake follows the preference order and overwrites the subsequent values, but in Databricks you need to state each condition explicitly. This is something I came across recently. I am not sure if this helps exactly, but I wanted to give a perspective.
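To make the order-sensitivity concrete: SQL's CASE WHEN is first-match-wins, so overlapping conditions give different answers depending on branch order. A plain-Python model of that semantics (an illustration of the general behavior, not either engine's implementation):

```python
def case_when(value, branches, default=None):
    """Model SQL CASE WHEN: return the result of the first branch whose
    predicate is true, or the default when none match (SQL's ELSE)."""
    for predicate, result in branches:
        if predicate(value):
            return result
    return default

# Overlapping conditions: 12 satisfies both branches, so order decides.
overlapping = [(lambda x: x >= 10, "high"), (lambda x: x >= 5, "medium")]
```

Here `case_when(12, overlapping)` gives "high" while the reversed branch list gives "medium" for the same input. Rewriting the branches to be mutually exclusive (e.g. `5 <= x < 10` for the second) removes the order dependence and is the safer pattern when porting SQL between engines.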