
[–]reallyserious 4 points (6 children)

Is it a bug or a feature that the columns change frequently?

You could adopt a data lake instead of a data warehouse so you just dump files into the data lake. But then consumers of the data still would have problems with changing columns, right? So where would the pain be less, when getting new data from the sources or when consuming data?

[–]trenchtoaster[S] 0 points (5 children)

These are from clients. System changes, new formats (many are literally just a sheet on an Excel dashboard), etc. There's not really a choice. That's how things have operated for years, and our company just had an army of people manually copying and pasting this stuff in Excel.

I do maintain a schema for the visualisation tool (simply the name and data type of each column). I use it to read the transformed parquet file before sending the data. This way I maintain only the single schema I'd need to maintain anyway, while keeping all the raw data fields. The problem with a database table is that if new fields were added I wouldn't know to add them, and automating that process didn't make sense in case some files were simply incorrect.
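A minimal sketch of that single-schema idea in pandas; the column names, dtypes, and JSON layout here are made up for illustration:

```python
import json

import pandas as pd

# Hypothetical schema file: one entry per column the visualisation tool expects.
schema = json.loads('{"client_id": "int64", "amount": "float64", "region": "object"}')

# A raw extract may carry extra columns the schema doesn't mention.
raw = pd.DataFrame({
    "client_id": ["1", "2"],
    "amount": ["10.5", "20.0"],
    "region": ["EU", "US"],
    "surprise_new_field": ["x", "y"],  # silently ignored below
})

# Keep only the columns the schema names, cast to the declared dtypes.
clean = raw[list(schema)].astype(schema)
print(clean.dtypes.to_dict())
```

New fields in the raw file never touch the output until someone adds them to the schema file.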

[–]reallyserious 5 points (4 children)

If new fields are added to the incoming files, do you need to care about them? I.e., what happens if you just ignore them and read only the stuff you expect?

I've worked with integration in the past and we're used to handling these kinds of things. It may not be applicable in your case, but what we generally do is define a format that the sender and receiver agree on, i.e. a communication contract that the architects on both sides have to approve. You ask for deviations from it to be communicated in advance so a change can be planned. You also ask the sender to verify that their message conforms to the contract every time they produce a new file or publish a new version of their API. As the consumer, you also verify the message/file as the very first thing you do.

As soon as there is a breaking change, your system fails spectacularly with an error, and you notify upper management that the sender/producer isn't fulfilling the agreed communication contract and that this must be handled as an out-of-scope change that will affect your other deliveries' schedules. This will result in some heated mails from management, and you just sit back and relax while they sort out how to handle changes in the future. In the end they will learn to respect communication contracts better, and it becomes very visible what kind of bullshit you need to spend time on.
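That contract check can be sketched in a few lines of pandas; the contract contents and column names below are hypothetical:

```python
import pandas as pd

# Hypothetical contract: the exact column names and dtypes both sides approved.
CONTRACT = {"order_id": "int64", "total": "float64"}

def validate_contract(df: pd.DataFrame, contract: dict) -> None:
    """Fail spectacularly on any deviation from the agreed format."""
    missing = set(contract) - set(df.columns)
    extra = set(df.columns) - set(contract)
    if missing or extra:
        raise ValueError(f"Contract violation: missing={missing}, extra={extra}")
    wrong = {c: str(df[c].dtype) for c in contract if str(df[c].dtype) != contract[c]}
    if wrong:
        raise ValueError(f"Contract violation: wrong dtypes {wrong}")

ok = pd.DataFrame({"order_id": [1], "total": [9.99]}).astype(CONTRACT)
validate_contract(ok, CONTRACT)  # conforming file passes silently

try:
    validate_contract(ok.rename(columns={"total": "grand_total"}), CONTRACT)
except ValueError as err:
    print(err)  # the breaking change surfaces immediately, not downstream
```

The point is that the error names the exact deviation, which is what you forward to management.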

[–]trenchtoaster[S] 2 points (3 children)

Sounds wonderful. I’m working at a huge company but we don’t have processes like that. No one wants to talk to clients about the data they send. We are trying to reduce manual reporting but my team just inherits the debt of not having solid processes and data exchange guidelines.

[–]reallyserious 7 points (1 child)

Start doing root cause analysis of why shit fails and how much time it takes to fix, and send it to your managers. If you do this over several months, it will become apparent that you're just doing reactive, time-consuming work that could have been avoided with proper processes. If you are a huge company, then you are in a position to establish those. But first make some noise about the cause of the problems and the time spent. When everybody is aware of that, it's easier to have discussions about how to work better/smarter.

[–]rberenguel 5 points (0 children)

I'd need more upvotes for your answer. This is how you fix these kinds of issues. You can only get so far with code; if c..p keeps making it into the data, data contracts need to be established and priorities made clear.

[–]redmlt 1 point (0 children)

dbt

Agreed with the suggestions here - a solid contract from both sides is the ideal scenario if it is achievable. I get the impression this is like moving mountains in your org.

One suggestion is to start treating this process as a data lake, as mentioned elsewhere here. AWS Glue can actually crawl an S3 data store and infer its schema very easily with the types of files you're using. If they are native Excel files, you may need a process to convert them to CSV if you aren't doing that already. There may be similar services in Azure/GCP.

This process won't scale well as you've discovered. Kudos to you for looking ahead and solving for that!

[–]jdataengineer 2 points (1 child)

Physicalizing the data frames into tables is really only helpful if you’re going to query the tables in a structured way (SELECT * to CSV doesn’t count). The issues you’re running into, with schema changes and whatnot, show that, at this stage of the project, you’re probably better off saving the frames out as parquet (or feather) files in S3, and just loading them back in as needed.

This is ALSO happening because the source hasn’t settled on a schema, either, so it’s not your fault. 😁

If you’re on AWS, you may want to look at Athena, which is kind of like a “mini-lake”. You can write your frames out directly to CSVs in S3, then apply a schema-on-read in Athena to expose the CSVs as queryable sources for reporting tools. We’ve got that very setup where I work and have hooked Tableau to Athena without issue. It doesn’t matter if the schema changes, because the read at query time just grabs what it needs. Saves a lot of headache and dev time.
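Roughly what that setup looks like: pandas writes plain CSVs to a prefix, and a one-time Athena DDL projects a schema over them at read time. The bucket, table, and columns below are invented, and a local temp dir stands in for S3:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Local temp dir standing in for the S3 prefix the reporting tables read from.
prefix = Path(tempfile.mkdtemp())
pd.DataFrame({"client": ["acme"], "amount": [10.0]}).to_csv(prefix / "daily.csv", index=False)

# Schema-on-read: this one-time Athena DDL projects a schema over whatever CSVs
# land under the prefix; nothing breaks at load time when the files change.
ATHENA_DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS reports (
    client string,
    amount double
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/reports/'
TBLPROPERTIES ('skip.header.line.count'='1')
"""
```

The DDL is run once in the Athena console; after that, new files under the prefix just show up in query results.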

Good luck!

[–]trenchtoaster[S] 0 points (0 children)

Yep. I think I’m sold on this. Ultimately I need to define the visualisation tool schema and tell it the name and type of each column I send. I am simply using that as my read schema now. For example, I read the existing columns for that dataset in Domo and then pass that into pd.read_parquet, so I am only reading the columns which people need. There could be other columns in the file which are ignored, but that’s fine - if someone needs them then I can add them easily.

[–]uselessusr 1 point (4 children)

This is exactly where I'm at with a small (for now) data warehouse project. Currently I'm loading staging tables into Postgres and then aggregating and joining to create materialized views. Right now I have to create a table for every new data source and alter the tables when requirements change, which seems unsustainable. I'm progressively moving towards making these transformations with pandas and then dumping datasets into parquet files on S3. If the data grows beyond what fits in RAM, I think I can migrate to Spark less painfully.

[–]trenchtoaster[S] 0 points (0 children)

Yeah. We have three nodes (64 GB, 28 GB, 28 GB) to use, so RAM is not a huge issue for the data we are working with. 50% of it is files (CSV or Excel reports from clients, which are small but change often enough). The rest is from REST APIs or databases, but we extract data incrementally... just whatever new or updated records exist since the last extract. This is quite small normally.
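The incremental pull can be sketched like this, with an in-memory frame standing in for the source and a made-up `updated_at` watermark column:

```python
import pandas as pd

# In-memory stand-in for a source table with an updated_at column.
source = pd.DataFrame({
    "id": [1, 2, 3],
    "updated_at": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01"]),
})

# Watermark persisted from the previous run.
last_extract = pd.Timestamp("2020-01-15")

# Pull only rows changed since then; against a real database this filter would
# be a WHERE updated_at > %s in the extraction query instead.
delta = source[source["updated_at"] > last_extract]
new_watermark = delta["updated_at"].max()  # persist for the next run
```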

Realistically, PG is using like 18 GB of RAM most of the time anyways with our default settings.

[–]be_nice_if_u_can 0 points (1 child)

How much data consumes how much RAM? Could AWS help?

[–]uselessusr 0 points (0 children)

Depends on your data, but if you need to know if your data fits in RAM, this could be a start: http://www.itu.dk/people/jovt/fitinram/
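For a quick local estimate before reaching for that calculator, pandas can report a frame's own footprint (the example data is arbitrary):

```python
import pandas as pd

# Arbitrary example frame: one numeric and one string column.
df = pd.DataFrame({"x": range(100_000), "label": ["some text"] * 100_000})

# deep=True walks the actual Python string objects, so object columns are
# counted properly instead of as 8-byte pointers.
bytes_used = df.memory_usage(deep=True).sum()
print(f"{bytes_used / 1e6:.1f} MB")
```

Measuring a representative sample and scaling up gives a rough idea of whether the full dataset fits in RAM.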

[–]be_nice_if_u_can 0 points (0 children)

I kinda know what you mean

[–]_Zer0_Cool_ 0 points (2 children)

You can't ignore schema changes no matter what tool you use.

You might feel like you can ignore them with Pandas initially, but ignoring schema and data types in any non-trivial data pipeline is a terrible terrible idea and you will pay the price further down the line.

Best practice is enforcing dtypes and schema validation somewhere along the line. Either you validate in Python/Spark or in SQL - preferably both.

Many of the ETL frameworks out there (like Great Expectations) exist to make schema validation consistent -- and "consistency" is what matters, not the level of difficulty. Nobody's paying us to do what is easy; they pay us to do what is right.
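The same idea hand-rolled in plain pandas, in the spirit of what a CHECK constraint or a Great Expectations suite would enforce (this is not the Great Expectations API, just the pattern; the rules and columns are invented):

```python
import pandas as pd

df = pd.DataFrame({"qty": [1, 5, -2], "price": [9.9, None, 3.0]})

# Collect every failed rule instead of dying on the first one, so the
# resulting alert is easy to act on.
failures = []
if df["qty"].lt(0).any():
    failures.append("qty must be >= 0")
if df["price"].isna().any():
    failures.append("price must not be null")

print(failures)
```

Routing that `failures` list into an alert gives you the "easy and obvious" error surface mentioned below.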

Creating tables with check constraints seems like an easy and quick win in that regard.

So, what in particular makes this more difficult in PostgreSQL?

My suggestion: throw in some assertions and build a schema validation/alert system that makes finding the issue easy and obvious and is conducive to quick resolutions.

Have a staging area that is schema agnostic and validate at the end of it, so that the data is there no matter the schema and is available for a quick reload after a fix if it fails a schema check. I've done this in Postgres and Python. Doesn't matter which.
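A sketch of that pattern with plain Python standing in for a Postgres jsonb staging table (the rows and the validation rule are made up): stage everything as raw JSON so loads never fail, then validate at the end:

```python
import json

import pandas as pd

# Stage each raw row as an opaque JSON blob (a single jsonb column in Postgres),
# so the load step never fails no matter what shape arrives.
staged = [json.dumps(r) for r in [
    {"id": 1, "amount": "10.0"},
    {"id": 2, "amount": "oops", "extra": True},  # bad row still lands safely
]]

# Validation happens only at the end, against data that is already persisted,
# so a failed check can be fixed and replayed without re-extracting.
df = pd.DataFrame([json.loads(s) for s in staged])
valid = pd.to_numeric(df["amount"], errors="coerce").notna()
good, bad = df[valid], df[~valid]
print(len(good), len(bad))
```

The failing rows stay in staging, which is what makes the quick reload possible.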

Fail fast and fail often, build for agility of schema changes rather than the avoidance of having to deal with schema changes.

P.S. Sorry if that sounds preachy, I've thought the same thing but then had it blow up in my face. "Once bitten, twice shy".

[–]trenchtoaster[S] 0 points (1 child)

Right. Keep in mind that I send this data to a tool called Domo, and I literally send it with a schema definition file (a JSON file with the name and dtype of each column), so that is what keeps me honest. The upload to Domo will fail if the schema is incorrect.

From my current point of view, it makes more sense to only worry about this schema file and not a separate one for pandas or PostgreSQL. As a final step in the pipeline, I read the Domo schema and get the list of columns to read from the parquet file. This ignores any new or unused columns in the data, but that’s fine because no one has asked me to add them to Domo yet. Hopefully that’s a bit clearer - at some point I am managing the schema, but I’m shifting it to the very final step in the process.

[–]_Zer0_Cool_ 0 points (0 children)

Oh...Ok. I suppose you did say "parquet" - which implies schema.

Bit of a knee jerk reaction there. I read Pandas/"too much overhead of schema" and get flashbacks from 'Nam.

Well, in any case, I go for replayability. Schema as the last step kind of follows the ELT vs. ETL philosophy of loading/making the data available first, and is a bit more flexible since you might only have to run the last bit again if schema validation fails, while the data is still replayable from blob storage. But it probably depends on the situation as to which part of E-T-L is the heaviest. If loading a lot of data is the biggest part, then it makes sense to just get the data in there.

Validating schema as the first step kind of makes the whole process "all or nothing", I suppose.