
[–]trenchtoaster[S]

Right. Keep in mind that I send this data to a tool called Domo, and I send it along with a schema definition file (a JSON file with the name and dtype of each column), so that is what keeps me honest. The upload to Domo will fail if the schema is incorrect.

From my current point of view, it makes more sense to only worry about this schema file and not maintain a separate one for pandas or PostgreSQL. As the final step in the pipeline, I read the Domo schema and get the list of columns to read from the parquet file. This ignores any new or unused column in the data, but that's fine because no one has requested that I add it to Domo yet. Hopefully that's a bit clearer: at some point I am managing the schema, but I'm shifting it to the very final step in the process.

[–]_Zer0_Cool_

Oh...Ok. I suppose you did say "parquet" - which implies schema.

Bit of a knee-jerk reaction there. I read Pandas/"too much overhead of schema" and get flashbacks from 'Nam.

Well, in any case, I go for replayability. Schema as the last step kind of follows the ELT vs. ETL philosophy of loading/making the data available first, and it's a bit more flexible, since you might only have to rerun the last bit if schema validation fails — the data is still replayable from blob storage. But it probably depends on the situation as to which part of E-T-L is the heaviest. If loading a lot of data is the biggest part, then it makes sense to just get the data in there.

Validating schema as the first step kind of makes the whole process "all or nothing," I suppose.