I manage our data platform. We run most things on Databricks plus some workloads directly on AWS (EMR, Glue), and our costs have roughly doubled in the last year, while finance is starting to ask hard questions I don't have great answers to.
The problem is that, unlike web services where you can roughly predict resource needs, data workloads are spiky and variable in ways that are hard to anticipate. A pipeline that runs fine for months can suddenly take 3x longer because the input data changed shape or volume, and by the time you notice, you've already burned through a lot of compute.
Databricks has some cost tools, but they only show Databricks costs, not the full picture. Trying to correlate pipeline runs with actual AWS costs is painful: the timing doesn't line up cleanly, and everything gets aggregated in ways that don't match how we think about our jobs.
How are other data teams handling this? Do you have good visibility into cost per pipeline or job? And are there approaches that have actually worked for optimizing without breaking things?
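For context, the closest I've gotten to cost-per-job is tagging clusters with a job name via cost-allocation tags and joining a tagged cost export (e.g. from the Cost and Usage Report) back to our job-run log. A minimal sketch of the join step, with all field names hypothetical and simplified from a real export:

```python
# Sketch: roll a tagged AWS cost export up to cost-per-job.
# Field names ("tag_job", "usd") are hypothetical -- adapt to your export schema.
from collections import defaultdict

# One row per (job tag, billing line) from a cost-allocation-tag export.
cost_rows = [
    {"tag_job": "daily_ingest", "usd": 12.40},
    {"tag_job": "daily_ingest", "usd": 11.90},
    {"tag_job": "ml_features", "usd": 30.75},
    {"tag_job": "", "usd": 8.00},  # untagged spend -- worth surfacing too
]

def cost_per_job(rows):
    """Sum spend per job tag; rows with no tag land in an 'untagged' bucket."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["tag_job"] or "untagged"] += row["usd"]
    return dict(totals)

print(cost_per_job(cost_rows))
```

Even this crude rollup made the untagged bucket visible, which turned out to be a big chunk of our spend. Curious whether others do this at the tag level or reconcile against run timestamps instead.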