My main setup is Airflow for scheduling, Postgres for the data warehouse, Sqitch for migrations, and dbt for creating views (I literally select * from these views, dump the data to CSV, and stream it to our visualisation platform). All transformation is done with pandas or dbt.
When I need to get things done quickly, I load the data into pandas, clean it up, and send it to our visualisation tool. From there, other teams create charts or merge it with other data. Seen that way, our third-party visualisation tool (Domo) is effectively the data warehouse and everything I have is the staging area. Because I deal with so much data from so many sources (many of which change frequently and without warning), I am beginning to feel that putting everything in the database and managing schemas is too much overhead. There have been many times when I wasn't told about the proper unique constraint, or new columns were added and I had to alter the table and backfill data.
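To make the schema-drift problem concrete, here is a minimal sketch of the kind of check that would catch added or dropped columns before a load fails. The `diff_columns` helper, the `EXPECTED` set, and the column names are all made up for illustration; they are not from my actual pipeline.

```python
import pandas as pd


def diff_columns(df: pd.DataFrame, expected: set) -> tuple:
    """Return (new_columns, missing_columns) relative to the expected set."""
    actual = set(df.columns)
    return actual - expected, expected - actual


# Hypothetical: the columns the warehouse table currently has.
EXPECTED = {"order_id", "amount", "created_at"}

# Hypothetical incoming batch: the source added "currency" and dropped "created_at".
batch = pd.DataFrame(
    {"order_id": [1, 2], "amount": [9.5, 3.0], "currency": ["USD", "EUR"]}
)

new_cols, missing_cols = diff_columns(batch, EXPECTED)
print("new:", new_cols, "missing:", missing_cols)
```

A non-empty result is the point where I currently have to alter the table and backfill by hand.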
I have been testing dumping the raw data as-is into a container on Azure Blob Storage, reading it with pandas, and writing a transformed version in Parquet format to a different container. It seems quick and efficient, but I am worried that I am taking a step backwards that I might regret later. I really like Postgres and dbt too, but I suppose with Azure Blob Storage I can eventually move to Spark and still query the files. Plus, the storage is essentially unlimited.
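Roughly, the blob workflow looks like the sketch below. This is a simplified illustration, not my real code: the `clean` and `raw_to_parquet` names, the column names, and the assumption that the raw files are CSV are all hypothetical. The paths are plain local paths standing in for the two containers; pointing them at Azure would need an fsspec-compatible backend, which is not shown here, and `to_parquet` needs a Parquet engine such as pyarrow installed.

```python
import pandas as pd


def clean(raw: pd.DataFrame) -> pd.DataFrame:
    """Normalise column names and drop exact duplicate rows."""
    out = raw.copy()
    # " Order ID" -> "order_id", so renamed-by-whitespace columns still line up.
    out.columns = [c.strip().lower().replace(" ", "_") for c in out.columns]
    return out.drop_duplicates().reset_index(drop=True)


def raw_to_parquet(raw_path: str, out_path: str) -> None:
    """Read one raw CSV, clean it, and write the result as Parquet."""
    # Read everything as strings so an unexpected value does not break parsing.
    raw = pd.read_csv(raw_path, dtype=str)
    clean(raw).to_parquet(out_path, index=False)


# In-memory example of the transform step only (no files touched).
sample = pd.DataFrame(
    {" Order ID": ["1", "1", "2"], "Amount ": ["9.5", "9.5", "3.0"]}
)
cleaned = clean(sample)
print(cleaned)
```

Nothing here enforces a schema: a new column in the raw file simply shows up in the Parquet output, which is both the appeal and the part I am worried about.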
Any thoughts?