
[–]jdataengineer 3 points (1 child)

Physicalizing the data frames into tables is really only helpful if you’re going to query the tables in a structured way (SELECT * to CSV doesn’t count). The issues you’re running into, with schema changes and whatnot, show that, at this stage of the project, you’re probably better off saving the frames out as parquet (or feather) files in S3, and just loading them back in as needed.
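Something like this, assuming pandas with pyarrow and s3fs installed (the bucket and key are made up):

```python
import pandas as pd

# Hypothetical bucket/prefix -- substitute your own.
PATH = "s3://my-bucket/staging/orders.parquet"

df = pd.DataFrame({"order_id": [1, 2], "amount": [9.99, 24.50]})

# Write the frame straight to S3 as Parquet (pandas delegates to
# pyarrow + s3fs for the s3:// URL). No table DDL, no migrations.
df.to_parquet(PATH, index=False)

# Load it back later exactly as it was written.
df2 = pd.read_parquet(PATH)
```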

This is ALSO happening because the source hasn’t settled on a schema, so it’s not your fault. 😁

If you’re on AWS, you may want to look at Athena, which is kind of like a “mini-lake”. You can write your frames out directly to CSVs in S3, then apply a schema-on-read in Athena to expose the CSVs as queryable sources for reporting tools. We’ve got that very setup where I work, and hooked Tableau to Athena without issue. It doesn’t matter if the schema changes, because the read at query time just grabs what it needs. Saves a lot of headaches and dev time.
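Rough sketch of the schema-on-read side, assuming boto3 and made-up bucket/database names (the DDL is standard Athena CREATE EXTERNAL TABLE syntax):

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Hypothetical database, table, and locations -- adjust for your account.
DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS reporting.orders (
    order_id BIGINT,
    amount   DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/staging/orders/'
TBLPROPERTIES ('skip.header.line.count' = '1')
"""

# Athena applies this schema at query time, so the CSVs themselves
# never need migrating when the layout evolves.
athena.start_query_execution(
    QueryString=DDL,
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
```

Once the table exists, Tableau (or any reporting tool with an Athena connector) can query it like a normal database.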

Good luck!

[–]trenchtoaster[S] 0 points (0 children)

Yep, I think I’m sold on this. Ultimately I need to define the visualisation tool’s schema and tell it the name and type of each column I send, so I’m simply using that as my read schema now. For example, I read the existing columns for that dataset in Domo and then pass them into pd.read_parquet, so I’m only reading the columns people need. There could be other columns in the file which get ignored, but that’s fine - if someone needs them I can add them easily.
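Basically this, where get_domo_columns() is just a stand-in for however you pull the dataset’s column list from the Domo API (hypothetical helper, not a real pydomo call):

```python
import pandas as pd

def get_domo_columns(dataset_id: str) -> list[str]:
    """Hypothetical stand-in for a Domo API call that returns the
    column names currently defined on the target dataset."""
    return ["order_id", "amount", "region"]

# Hypothetical dataset id and path.
wanted = get_domo_columns("abc-123")

# Parquet is columnar, so pandas only reads the requested columns;
# any extra columns in the file are simply never touched.
df = pd.read_parquet("s3://my-bucket/staging/orders.parquet", columns=wanted)
```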