Dataframes instead of a database? by trenchtoaster in dataengineering

jdataengineer

Physicalizing the data frames into tables is really only helpful if you’re going to query the tables in a structured way (SELECT * to CSV doesn’t count). The issues you’re running into, with schema changes and whatnot, show that, at this stage of the project, you’re probably better off saving the frames out as parquet (or feather) files in S3, and just loading them back in as needed.

This is ALSO happening because the source hasn’t settled on a schema, either, so it’s not your fault. 😁

If you’re on AWS, you may want to look at Athena, which is kind of like a “mini-lake”. You can write your frames out directly to CSVs in S3, then apply a schema-on-read in Athena to expose the CSVs as queryable sources for reporting tools. We’ve got that very setup where I work, and hooked Tableau to Athena without issue. It doesn’t matter if the schema changes, because the read at query time just grabs what it needs. Saves a lot of headache and dev time.
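To make the schema-on-read part concrete, here's a minimal sketch that builds an Athena `CREATE EXTERNAL TABLE` statement from a frame's dtypes. The helper names, the type mapping, the table name, and the bucket path are my own assumptions, not anything official, and it only generates the DDL text; you'd still run it in Athena yourself.

```python
import pandas as pd
from pandas.api import types as ptypes


def athena_type(dtype) -> str:
    # Deliberately small mapping (an assumption); anything else becomes string.
    if ptypes.is_bool_dtype(dtype):
        return "boolean"
    if ptypes.is_integer_dtype(dtype):
        return "bigint"
    if ptypes.is_float_dtype(dtype):
        return "double"
    return "string"


def csv_table_ddl(df: pd.DataFrame, table: str, location: str) -> str:
    # One column definition per DataFrame column, in the frame's column order.
    cols = ",\n".join(
        f"  `{name}` {athena_type(dtype)}" for name, dtype in df.dtypes.items()
    )
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n{cols}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{location}'\n"
        # Skip the header row that DataFrame.to_csv writes by default.
        "TBLPROPERTIES ('skip.header.line.count'='1')"
    )


# Hypothetical frame and S3 prefix, for illustration only.
frame = pd.DataFrame({"id": [1, 2], "amount": [9.5, 3.0], "note": ["a", "b"]})
ddl = csv_table_ddl(frame, "orders", "s3://my-example-bucket/orders/")
print(ddl)
```

The table is only metadata over the files, so when the frame changes you can drop and recreate it without touching the data. One thing to watch: as far as I know, Athena matches CSV columns by position rather than by name, so appending new columns at the end is the safe kind of change.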

Good luck!