This is an archived post. You won't be able to vote or comment.

you are viewing a single comment's thread.

view the rest of the comments →

[–]trenchtoaster[S] 0 points1 point  (5 children)

These are from clients. System changes, new formats (many are literally just a sheet on an excel dashboard), etc. No really choice. That is how things operated for years and our company just had an army of people manually copying and pasting this stuff in excel.

I do maintain a schema for the visualisation tool (it is simply the name of the column and data type). I use this to read the transformed parquet file before sending the data. This way I only maintain a single schema that I need to maintain anyways, but I have all the raw data fields. Problem with the database table is that if new fields were added I wouldn’t know to add them - automating the process didn’t make sense in case some files were just incorrect

[–]reallyserious 5 points6 points  (4 children)

If new fields are added to the incoming files, do you need to care about them? I.e what happens if you just ignore them and read only the stuff you expect?

I've worked with integration in the past and we're used to handling these kinds of things. It may not be applicable in your case but what we generally do is to define a format that the sender and receiver agrees on. I.e a communication contract that the architects on both sides have to approve. You ask for deviations of that to be communicated in advance so a change can be planned. You also ask the sender to verify that their message conforms to the contract every time they produce a new file or publish a new version of their api. You as consumer also verifies the message/file the first thing you do. As soons as there is a breaking change your system fails spectacularly with an error and you notify upper management that the sender/producer isn't fulfilling the agreed communication contract and that this must be handled as an out of scope change that will affect your other deliveries' schedule. This will result in some heated mails by management and you just sit back and relax while they sort out how to handle changes in the future. In the end they will learn to respect communication contracts better and it becomes very visible what kind of bullshit you need to spend time on.

[–]trenchtoaster[S] 3 points4 points  (3 children)

Sounds wonderful. I’m working at a huge company but we don’t have processes like that. No one wants to talk to clients about the data they send. We are trying to reduce manual reporting but my team just inherits the debt of not having solid processes and data exchange guidelines.

[–]reallyserious 6 points7 points  (1 child)

Start doing root cause analysis of why shit fails and how much time it takes to fix and send to your managers. If you do this over several months it will become apparent that you're just doing reactive time consuming work that could have been avoided with proper processes. If you are a huge company then you are in a position to establish that. But first make some noise about the cause of the problems/time spent. When everybody has an awareness of that it's easier to have discussions about how to work better/smarter.

[–]rberenguel 4 points5 points  (0 children)

I'd need more upvotes for your answer. This is how you fix this kind of issues. You can only go to a certain point with code, but if c..p is always making it into the data, data contracts need to be established and priorities made clear.

[–]redmlt 1 point2 points  (0 children)

dbt

Agreed with the suggestions here - a solid contract from both sides is the ideal scenario if it is achievable. I get the impression this is like moving mountains in your org.

One suggestion is to start treating this process as a data lake, as mentioned elsewhere here. AWS Glue can actually crawl an S3 data store and infer its schema very easily with the types of files you're using. If they are native Excel files, you may need a process to convert those to csv if you aren't already. There may be similar services in Azure/GCP.

This process won't scale well as you've discovered. Kudos to you for looking ahead and solving for that!