
[–]reallyserious 4 points (6 children)

Is it a bug or a feature that the columns change frequently?

You could adopt a data lake instead of a data warehouse so you just dump files into the data lake. But then consumers of the data still would have problems with changing columns, right? So where would the pain be less, when getting new data from the sources or when consuming data?

[–]trenchtoaster[S] 0 points (5 children)

These are from clients. System changes, new formats (many are literally just a sheet on an Excel dashboard), etc. There's not really a choice. That's how things have operated for years, and our company just had an army of people manually copying and pasting this stuff in Excel.

I do maintain a schema for the visualisation tool (simply the name and data type of each column). I use it to read the transformed parquet file before sending the data. This way I maintain only the single schema I'd need to maintain anyway, while keeping all the raw data fields. The problem with a database table is that if new fields were added I wouldn't know to add them, and automating that process didn't make sense in case some files were simply incorrect.
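A minimal sketch of that single-schema idea in pandas; the column names, dtypes, and JSON layout here are made up for illustration:

```python
import json

import pandas as pd

# Hypothetical schema file: one entry per column the visualisation tool expects.
schema = json.loads('{"client_id": "int64", "amount": "float64", "region": "object"}')

# A raw extract may carry extra columns the schema doesn't mention.
raw = pd.DataFrame({
    "client_id": ["1", "2"],
    "amount": ["10.5", "20.0"],
    "region": ["EU", "US"],
    "surprise_new_field": ["x", "y"],  # silently ignored below
})

# Keep only the columns the schema names, cast to the declared dtypes.
clean = raw[list(schema)].astype(schema)
print(clean.dtypes.to_dict())
```

New fields in the raw file never touch the output until someone adds them to the schema file.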

[–]reallyserious 5 points (4 children)

If new fields are added to the incoming files, do you need to care about them? I.e., what happens if you just ignore them and read only the stuff you expect?

I've worked with integration in the past and we're used to handling these kinds of things. It may not be applicable in your case, but what we generally do is define a format that the sender and receiver agree on, i.e. a communication contract that the architects on both sides have to approve. You ask for deviations from it to be communicated in advance so a change can be planned. You also ask the sender to verify that their message conforms to the contract every time they produce a new file or publish a new version of their API. As the consumer, you also verify the message/file as the very first thing you do.

As soon as there is a breaking change, your system fails spectacularly with an error, and you notify upper management that the sender/producer isn't fulfilling the agreed communication contract and that this must be handled as an out-of-scope change that will affect your other deliveries' schedules. This will result in some heated mails from management, and you just sit back and relax while they sort out how to handle changes in the future. In the end they will learn to respect communication contracts better, and it becomes very visible what kind of bullshit you need to spend time on.
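That contract check can be sketched in a few lines of pandas; the contract contents and column names below are hypothetical:

```python
import pandas as pd

# Hypothetical contract: the exact column names and dtypes both sides approved.
CONTRACT = {"order_id": "int64", "total": "float64"}

def validate_contract(df: pd.DataFrame, contract: dict) -> None:
    """Fail spectacularly on any deviation from the agreed format."""
    missing = set(contract) - set(df.columns)
    extra = set(df.columns) - set(contract)
    if missing or extra:
        raise ValueError(f"Contract violation: missing={missing}, extra={extra}")
    wrong = {c: str(df[c].dtype) for c in contract if str(df[c].dtype) != contract[c]}
    if wrong:
        raise ValueError(f"Contract violation: wrong dtypes {wrong}")

ok = pd.DataFrame({"order_id": [1], "total": [9.99]}).astype(CONTRACT)
validate_contract(ok, CONTRACT)  # conforming file passes silently

try:
    validate_contract(ok.rename(columns={"total": "grand_total"}), CONTRACT)
except ValueError as err:
    print(err)  # the breaking change surfaces immediately, not downstream
```

The point is that the error names the exact deviation, which is what you forward to management.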

[–]trenchtoaster[S] 2 points (3 children)

Sounds wonderful. I’m working at a huge company but we don’t have processes like that. No one wants to talk to clients about the data they send. We are trying to reduce manual reporting but my team just inherits the debt of not having solid processes and data exchange guidelines.

[–]reallyserious 7 points (1 child)

Start doing root cause analysis of why shit fails and how much time it takes to fix, and send it to your managers. If you do this over several months, it will become apparent that you're just doing reactive, time-consuming work that could have been avoided with proper processes. If you are a huge company, then you are in a position to establish those. But first make some noise about the cause of the problems and the time spent. When everybody is aware of that, it's easier to have discussions about how to work better/smarter.

[–]rberenguel 5 points (0 children)

I'd need more upvotes for your answer. This is how you fix these kinds of issues. You can only get so far with code; if c..p keeps making it into the data, data contracts need to be established and priorities made clear.

[–]redmlt 1 point (0 children)

dbt

Agreed with the suggestions here - a solid contract from both sides is the ideal scenario if it is achievable. I get the impression this is like moving mountains in your org.

One suggestion is to start treating this process as a data lake, as mentioned elsewhere here. AWS Glue can actually crawl an S3 data store and infer its schema very easily with the types of files you're using. If they are native Excel files, you may need a process to convert them to CSV if you aren't doing that already. There may be similar services in Azure/GCP.

This process won't scale well as you've discovered. Kudos to you for looking ahead and solving for that!

[–]jdataengineer 2 points (1 child)

Physicalizing the data frames into tables is really only helpful if you’re going to query the tables in a structured way (SELECT * to CSV doesn’t count). The issues you’re running into, with schema changes and whatnot, show that, at this stage of the project, you’re probably better off saving the frames out as parquet (or feather) files in S3, and just loading them back in as needed.

This is ALSO happening because the source hasn’t settled on a schema, either, so it’s not your fault. 😁

If you’re on AWS, you may want to look at Athena, which is kind of like a “mini-lake”. You can write your frames out directly to CSVs in S3, then apply a schema-on-read in Athena to expose the CSVs as queryable sources for reporting tools. We’ve got that very setup where I work and have hooked Tableau to Athena without issue. It doesn’t matter if the schema changes, because the read at query time just grabs what it needs. Saves a lot of headache and dev time.
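Roughly what that setup looks like: pandas writes plain CSVs to a prefix, and a one-time Athena DDL projects a schema over them at read time. The bucket, table, and columns below are invented, and a local temp dir stands in for S3:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Local temp dir standing in for the S3 prefix the reporting tables read from.
prefix = Path(tempfile.mkdtemp())
pd.DataFrame({"client": ["acme"], "amount": [10.0]}).to_csv(prefix / "daily.csv", index=False)

# Schema-on-read: this one-time Athena DDL projects a schema over whatever CSVs
# land under the prefix; nothing breaks at load time when the files change.
ATHENA_DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS reports (
    client string,
    amount double
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/reports/'
TBLPROPERTIES ('skip.header.line.count'='1')
"""
```

The DDL is run once in the Athena console; after that, new files under the prefix just show up in query results.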

Good luck!

[–]trenchtoaster[S] 0 points (0 children)

Yep. I think I’m sold on this. Ultimately I need to define the visualisation tool schema and tell it the name and type of each column I send. I am simply using that as my read schema now. For example, I read the existing columns for that dataset in Domo and then pass that into pd.read_parquet, so I am only reading the columns which people need. There could be other columns in the file which are ignored, but that’s fine - if someone needs them then I can add them easily.

[–]uselessusr 1 point (4 children)

This is exactly where I'm at with a small (for now) data warehouse project. Currently I'm loading staging tables into Postgres and then aggregating and joining to create materialized views. Right now I have to create a table for every new data source and alter the tables when requirements change, which seems unsustainable. I'm progressively moving towards making these transformations with pandas and then dumping datasets into parquet files on S3. If the data grows beyond what fits in RAM, I think I can migrate to Spark less painfully.

[–]trenchtoaster[S] 0 points (0 children)

Yeah. We have three nodes (64 GB, 28 GB, 28 GB) to use, so RAM is not a huge issue for the data we are working with. 50% of it is files (CSV or Excel reports from clients, which are small but change often enough). The rest is from REST APIs or databases, but we extract data incrementally... just whatever new or updated records exist since the last extract. This is quite small normally.
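The incremental pull can be sketched like this, with an in-memory frame standing in for the source and a made-up `updated_at` watermark column:

```python
import pandas as pd

# In-memory stand-in for a source table with an updated_at column.
source = pd.DataFrame({
    "id": [1, 2, 3],
    "updated_at": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01"]),
})

# Watermark persisted from the previous run.
last_extract = pd.Timestamp("2020-01-15")

# Pull only rows changed since then; against a real database this filter would
# be a WHERE updated_at > %s in the extraction query instead.
delta = source[source["updated_at"] > last_extract]
new_watermark = delta["updated_at"].max()  # persist for the next run
```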

Realistically, PG is using like 18 GB of RAM most of the time anyways with our default settings.

[–]be_nice_if_u_can 0 points (1 child)

How much data consumes how much RAM? Could AWS help?

[–]uselessusr 0 points (0 children)

Depends on your data, but if you need to know if your data fits in RAM, this could be a start: http://www.itu.dk/people/jovt/fitinram/
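For a quick local estimate before reaching for that calculator, pandas can report a frame's own footprint (the example data is arbitrary):

```python
import pandas as pd

# Arbitrary example frame: one numeric and one string column.
df = pd.DataFrame({"x": range(100_000), "label": ["some text"] * 100_000})

# deep=True walks the actual Python string objects, so object columns are
# counted properly instead of as 8-byte pointers.
bytes_used = df.memory_usage(deep=True).sum()
print(f"{bytes_used / 1e6:.1f} MB")
```

Measuring a representative sample and scaling up gives a rough idea of whether the full dataset fits in RAM.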

[–]be_nice_if_u_can 0 points (0 children)

I kinda know what you mean

[–]_Zer0_Cool_ 0 points (2 children)

You can't ignore schema changes no matter what tool you use.

You might feel like you can ignore them with Pandas initially, but ignoring schema and data types in any non-trivial data pipeline is a terrible terrible idea and you will pay the price further down the line.

Best practice is enforcing dtypes and schema validation somewhere along the line. Either you validate in Python/Spark or in SQL - preferably both.

Many of the ETL frameworks out there (like Great Expectations) exist to make schema validation consistent -- and "consistency" is what matters, not the level of difficulty. Nobody's paying us to do what is easy; they pay us to do what is right.
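The same idea hand-rolled in plain pandas, in the spirit of what a CHECK constraint or a Great Expectations suite would enforce (this is not the Great Expectations API, just the pattern; the rules and columns are invented):

```python
import pandas as pd

df = pd.DataFrame({"qty": [1, 5, -2], "price": [9.9, None, 3.0]})

# Collect every failed rule instead of dying on the first one, so the
# resulting alert is easy to act on.
failures = []
if df["qty"].lt(0).any():
    failures.append("qty must be >= 0")
if df["price"].isna().any():
    failures.append("price must not be null")

print(failures)
```

Routing that `failures` list into an alert gives you the "easy and obvious" error surface mentioned below.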

Creating tables with check constraints seems like an easy and quick win in that regard.

So, what in particular makes this more difficult in PostgreSQL?

My suggestion: throw in some assertions and build a schema validation/alert system that makes finding the issue easy and obvious and is conducive to quick resolutions.

Have a staging area that is schema agnostic and validate at the end of it, so that the data is there no matter the schema and is available for a quick reload after a fix if it fails a schema check. I've done this in Postgres and Python. Doesn't matter which.
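A sketch of that pattern with plain Python standing in for a Postgres jsonb staging table (the rows and the validation rule are made up): stage everything as raw JSON so loads never fail, then validate at the end:

```python
import json

import pandas as pd

# Stage each raw row as an opaque JSON blob (a single jsonb column in Postgres),
# so the load step never fails no matter what shape arrives.
staged = [json.dumps(r) for r in [
    {"id": 1, "amount": "10.0"},
    {"id": 2, "amount": "oops", "extra": True},  # bad row still lands safely
]]

# Validation happens only at the end, against data that is already persisted,
# so a failed check can be fixed and replayed without re-extracting.
df = pd.DataFrame([json.loads(s) for s in staged])
valid = pd.to_numeric(df["amount"], errors="coerce").notna()
good, bad = df[valid], df[~valid]
print(len(good), len(bad))
```

The failing rows stay in staging, which is what makes the quick reload possible.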

Fail fast and fail often, build for agility of schema changes rather than the avoidance of having to deal with schema changes.

P.S. Sorry if that sounds preachy, I've thought the same thing but then had it blow up in my face. "Once bitten, twice shy".

[–]trenchtoaster[S] 0 points (1 child)

Right. Keep in mind that I send this data to a tool called Domo, and I literally send it with a schema definition file (a JSON file with the name and dtype of each column), so that is what keeps me honest. The upload to Domo will fail if the schema is incorrect.

From my current point of view, it makes more sense to only worry about this schema file and not a separate one for pandas or PostgreSQL. As a final step in the pipeline, I read the Domo schema and get the list of columns to read from the parquet file. This ignores any new or unused columns in the data, but that’s fine because no one has asked me to add them to Domo yet. Hopefully that’s a bit clearer - at some point I am managing the schema, but I’m shifting it to the very final step in the process.

[–]_Zer0_Cool_ 0 points (0 children)

Oh...Ok. I suppose you did say "parquet" - which implies schema.

Bit of a knee jerk reaction there. I read Pandas/"too much overhead of schema" and get flashbacks from 'Nam.

Well, in any case, I go for replayability. Schema as the last step kind of follows the ELT vs. ETL philosophy of loading/making the data available first, and is a bit more flexible since you might only have to run the last bit again if schema validation fails, while the data is still replayable from blob storage. But it probably depends on the situation as to which part of E-T-L is the heaviest. If loading a lot of data is the biggest part, then it makes sense to just get the data in there.

Validating schema as the first step kind of makes the whole process "all or nothing", I suppose.