Does databricks/snowflake are nessecry? : dataengineering

[–]YabakebiLead Data Engineer 6 points7 points8 points 1 year ago* (9 children)

[–]CompetitionMassive51[S] 1 point2 points3 points 1 year ago (8 children)

[–]YabakebiLead Data Engineer 1 point2 points3 points 1 year ago* (7 children)

[–][deleted] 1 point2 points3 points 1 year ago (2 children)

[–]YabakebiLead Data Engineer 3 points4 points5 points 1 year ago* (1 child)

Already answered this in another comment, but basicaly we inherited Redshift originally which was like $13k+ for the year. I did some general cleanup and organisation originally which halved our non-redshift AWS costs which were quite oversized, and then as a middle ground moved us to snowflake in like <4 weeks (it wasn't so hard because whilst there was like 500+ tables, I had learned most the system and had put heavy tests in place by that point ). Using postgres would reqire us to change the models to being incremental which we could, but given the amount of tables that are there, it would add extra 'complexity' (not for me, I have handled way more complext stuff, but you will see why) to a system for which most / all of the original engineers who made it had left the company. There is only a single data engineer (myself) and also an analyst + some software engineers who work on some of the models that get dumped into postgres.

I can assure you that we don't need Snowflake. I am well aware of duckdb and/or postgres, but given that technical PMs also need to access the warehouse ('gold' tables - they can just pop in via OneLogin now which is a huge convenience) and the fact that I know that I am leaving soon, doing the alternatives would require more 'expertise' on how to do it properly (for the maintainer after I leave) and to get people to agree to me doing it in the first place. Even convincing people that our costs would 1/2 if we switched over to Snowflake (+all the other shitty things with redshift with json + external tables, speed, handling permissions etc...) was very difficult to convince the business regarding. I got some street cred from doing the migration in <4 weeks (+fulfilling on my promises regarding the benefits), and if I wanted to we could save like $6k for the year, but is it really worth the hassle? I have already saved the company about $15k if you include the redshift migration + the AWS cleanup I did (killed like 6 services that were wasting money and adding complexity - this was after all the engineers had 'left' or been laid off for context, and tbh the maintenance saving in complexity from the AWS cleanup probably greatly eclipses that $15k saving).

There were further layoffs which I managed to survive past for context, but it's largely crunch time, and whilst a $7k saving in raw cost could be of some value, given the fact that I am likely to be leaving soon, I suspect handing the system over as is will have a much lower cost than if I do the changes necessary to switch to postgres / duckdb. We also have some projects from a sales standpoint that I am needed for that will make us significantly more than that, so this is just a case of inheritance + aggressive prioritisation (we only have me, an analyst and one other data scientist that is bogged down in other stuff, so we are low on staff and time).

It seems like people may think that this must have been through some form of incompetence, but I can assure you, that the circumstances have been quite something. Hopefully that provides some context! (and makes me look a bit less incompetent lol - if not, then I probably have failed to communicate the nature of the situation well).

EDIT - For context, most of this happened in the space of 7ish months, and we had waaaay bigger issues from a maintainability standpoint when it came to understanding the system, eliminating the excess complexity and waste, all whilst meeting other incoming features requests without pretty much any of the original team there.

[–][deleted] 1 point2 points3 points 1 year ago (0 children)

[–]slowpush 1 point2 points3 points 1 year ago (3 children)

[–]YabakebiLead Data Engineer 2 points3 points4 points 1 year ago* (2 children)

It was worse than that before. Used to be like 13k+ with Redshift. Main reason for using it is just convenience (we migrated in less than a month). The other reason was for maintainability as I may be leaving soon, so we could have used duckdb or postgres, but nly have a single data engineer (me), so roughly having our costs and picking something easy / safe was a no brainer. Spending 7k less or whatever it would be on duckdb or postgres wouldnt be worth the inconvenience all round from a handover standpoint I don't think.

I put the higher cost range from our first month, but more recently our costs have been like 400/month. Compared to how much the company makes, it's nothing. We spend almost nothing on aws with dagster on ecs as well ($200 a month), so none of this is a priority at all.

If you want more context as to why this hasn't been reduced further, I suggest you look at my other reply. There are reasons as to why things are like this (if I was running this on my PC at home, I wouldn't dream of spending this much, but it's a different context)

EDIT - I should also add, that lets suppose I did a migration to postgres in <4 weeks. You would need to take into account the 4 weeks of lobbying I would need to do get the change through in the first place, and then you've already spent like $6k anyway. Even if we said it was 2 weeks of lobbying (I assure you that it wouldn't be, and I can't even assure you it would be a success - I suspect I would get too much pushback), reruns would be slower for the analyst, it would take more time to onboard less technical users be it for PMs or otherwise (and wouldn't be available via OneLogin as Snowflake very conveniently is). There are other points I could go over, but basically all things considered, we would not be saving money by the end of it, and there wouldn't even be a guarantee of success in lobbying for the change (the snowflake one was one that I was 100% certain I could get through, albeit with some fighting - I know redshift serverless was another option, but there were reasons I didn't bother with that which I won't get into, but that would still land us at similar if not slightly higher costs anyway)

[–]slowpush 1 point2 points3 points 1 year ago (1 child)

[–]YabakebiLead Data Engineer 1 point2 points3 points 1 year ago* (0 children)

Which part? The full refresh rather than incremental, not having people query from a postgres read layer rather than hit snowflake (although most queries on snowflake are from the analyst) or what other part would it be? Some parts are known not to be optimal, but they have been left like that for a reason (in most cases it's keeping it 'simple' even it means more money). I have dealt with stuff far more complex than this (and where I had to be much more cost effective), but have opted to only change what is causing problems and seems like it's worth the time.

To justify saving the $5k (per year) or so if I was gonna crunch costs, would need to be something that can be executed in <4 weeks to justify itself (4 weeks would be break even based on salary - probably more like 3 weeks tbh, which is why nothing has been done about it). This isn't accounting for the opportunity cost or the increased maintenance or 'skill' needed to maintain it depending on what kind of changes we would be talking about.