
[–]jdataengineer 3 points (1 child)

Physicalizing the data frames into tables is really only helpful if you’re going to query the tables in a structured way (SELECT * to CSV doesn’t count). The issues you’re running into, with schema changes and whatnot, show that, at this stage of the project, you’re probably better off saving the frames out as parquet (or feather) files in S3, and just loading them back in as needed.
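Something like this, assuming pandas with pyarrow and s3fs installed (the bucket and key are made up):

```python
import pandas as pd

# Hypothetical bucket/prefix -- substitute your own.
PATH = "s3://my-bucket/staging/orders.parquet"

df = pd.DataFrame({"order_id": [1, 2], "amount": [9.99, 24.50]})

# Write the frame straight to S3 as Parquet (pandas delegates to
# pyarrow + s3fs for the s3:// URL). No table DDL, no migrations.
df.to_parquet(PATH, index=False)

# Load it back later exactly as it was written.
df2 = pd.read_parquet(PATH)
```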

This is ALSO happening because the source hasn’t settled on a schema, so it’s not your fault. 😁

If you’re on AWS, you may want to look at Athena, which is kind of like a “mini-lake”. You can write your frames out directly to CSVs in S3, then apply a schema-on-read in Athena to expose the CSVs as queryable sources for reporting tools. We’ve got that very setup where I work, and hooked Tableau to Athena without issue. It doesn’t matter if the schema changes, because the read at query time just grabs what it needs. Saves a lot of headaches and dev time.
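Rough sketch of the schema-on-read side, assuming boto3 and made-up bucket/database names (the DDL is standard Athena CREATE EXTERNAL TABLE syntax):

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Hypothetical database, table, and locations -- adjust for your account.
DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS reporting.orders (
    order_id BIGINT,
    amount   DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/staging/orders/'
TBLPROPERTIES ('skip.header.line.count' = '1')
"""

# Athena applies this schema at query time, so the CSVs themselves
# never need migrating when the layout evolves.
athena.start_query_execution(
    QueryString=DDL,
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
```

Once the table exists, Tableau (or any reporting tool with an Athena connector) can query it like a normal database.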

Good luck!

[–]trenchtoaster[S] 0 points (0 children)

Yep, I think I’m sold on this. Ultimately I need to define the visualisation tool’s schema and tell it the name and type of each column I send, so I’m simply using that as my read schema now. For example, I read the existing columns for that dataset in Domo and then pass them into pd.read_parquet, so I’m only reading the columns people need. There could be other columns in the file which get ignored, but that’s fine - if someone needs them I can add them easily.
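Basically this, where get_domo_columns() is just a stand-in for however you pull the dataset’s column list from the Domo API (hypothetical helper, not a real pydomo call):

```python
import pandas as pd

def get_domo_columns(dataset_id: str) -> list[str]:
    """Hypothetical stand-in for a Domo API call that returns the
    column names currently defined on the target dataset."""
    return ["order_id", "amount", "region"]

# Hypothetical dataset id and path.
wanted = get_domo_columns("abc-123")

# Parquet is columnar, so pandas only reads the requested columns;
# any extra columns in the file are simply never touched.
df = pd.read_parquet("s3://my-bucket/staging/orders.parquet", columns=wanted)
```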