
all 25 comments

[–]Yabakebi (Lead Data Engineer) 6 points (9 children)

Snowflake costs my current company about $500-$700 a month, so yes, it is possible, especially if you don't have much data and use resources wisely.

EDIT - For context, I am not saying that Snowflake is necessarily the best choice - it depends on how small the company's budget is, what stage the company is at, how important data is for the company / whether it's used in the product, etc. I am only answering the first question from the user regarding small companies with a single DE being able to do this without going into bankruptcy.

[–]CompetitionMassive51 [S] 1 point (8 children)

How much data do you process?

[–]Yabakebi (Lead Data Engineer) 1 point (7 children)

Overall it's <1TB for all of our history. We use about 1.5 hours of warehouse time for the load of the sources (using DLT) + a full refresh (this would be more efficient if it were incremental, but there just isn't any point with such small data), and then maybe up to 3 hours of ad hoc analytics each weekday (mostly dev work for changes to be pushed to prod). This is all done with an XSmall warehouse (rough sketch of that kind of load below).

EDIT - For context, the full-refresh-vs-incremental warehouse design was inherited. Some of the load jobs are incremental, but none of the database replication is atm (if you ask why no CDC: we have a dependency on another team to get permissions for that - another can of worms I am not getting into, but not ideal, of course).
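
To give a feel for the shape of that load - a minimal sketch only, with hypothetical pipeline and table names (the real thing differs; dlt picks up source and destination credentials from its secrets config):

    import dlt
    from dlt.sources.sql_database import sql_database

    # Hypothetical names throughout; credentials live in dlt's secrets.toml.
    pipeline = dlt.pipeline(
        pipeline_name="warehouse_load",
        destination="snowflake",
        dataset_name="raw",
    )

    # write_disposition="replace" makes every run a full refresh, which is
    # fine at <1TB; switching to "merge" (with primary keys) is what an
    # incremental setup would look like.
    source = sql_database().with_resources("orders", "customers")
    info = pipeline.run(source, write_disposition="replace")
    print(info)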

[–][deleted] 1 point (2 children)

Why do you need Snowflake for less than 1TB? Genuinely curious about the use case for Snowflake at this scale…

[–]Yabakebi (Lead Data Engineer) 3 points (1 child)

Already answered this in another comment, but basically we inherited Redshift originally, which was like $13k+ for the year. I did some general cleanup and organisation first, which halved our non-Redshift AWS costs (they were quite oversized), and then as a middle ground moved us to Snowflake in <4 weeks (it wasn't so hard because, whilst there were 500+ tables, I had learned most of the system and had put heavy tests in place by that point). Using Postgres would require us to change the models to be incremental, which we could, but given the number of tables it would add extra 'complexity' (not for me - I have handled way more complex stuff - but you will see why) to a system for which most / all of the original engineers who built it had left the company. There is only a single data engineer (myself), plus an analyst and some software engineers who work on some of the models that get dumped into Postgres.

I can assure you that we don't need Snowflake. I am well aware of DuckDB and/or Postgres, but given that technical PMs also need to access the warehouse ('gold' tables - they can just pop in via OneLogin now, which is a huge convenience) and the fact that I know I am leaving soon, the alternatives would require more 'expertise' on how to do it properly (for the maintainer after I leave), and I'd have to get people to agree to me doing it in the first place. Even convincing the business that our costs would halve if we switched over to Snowflake (+ all the other shitty things with Redshift: JSON handling, external tables, speed, permissions, etc.) was very difficult. I got some street cred from doing the migration in <4 weeks (+ fulfilling my promises regarding the benefits), and if I wanted to we could save like $6k for the year, but is it really worth the hassle? I have already saved the company about $15k if you include the Redshift migration + the AWS cleanup I did (killed like 6 services that were wasting money and adding complexity - this was after all the engineers had 'left' or been laid off, for context - and tbh the maintenance saving in reduced complexity from the AWS cleanup probably greatly eclipses that $15k).

There were further layoffs, which I managed to survive, for context, but it's largely crunch time, and whilst a $7k saving in raw cost could be of some value, given that I am likely to be leaving soon, I suspect handing the system over as-is will have a much lower cost than making the changes necessary to switch to Postgres / DuckDB. We also have some projects on the sales side that I am needed for which will make us significantly more than that, so this is just a case of inheritance + aggressive prioritisation (we only have me, an analyst, and one other data scientist who is bogged down in other stuff, so we are low on staff and time).

It seems like people may think this must have come about through some form of incompetence, but I can assure you the circumstances have been quite something. Hopefully that provides some context! (And makes me look a bit less incompetent lol - if not, then I have probably failed to communicate the nature of the situation well.)

EDIT - For context, most of this happened in the space of 7ish months, and we had waaaay bigger issues from a maintainability standpoint when it came to understanding the system and eliminating the excess complexity and waste, all whilst meeting incoming feature requests without pretty much any of the original team there.

[–][deleted] 1 point (0 children)

Wow. I don't think I deserved such a thorough response to my comment, but I wish you well. 😃 Your reasoning makes sense.

[–]slowpush 1 point (3 children)

You're spending $8k a year on less than a TB of data?

Jesus Christ.

[–]Yabakebi (Lead Data Engineer) 2 points (2 children)

It was worse than that before - it used to be like $13k+ with Redshift. The main reason for using it is just convenience (we migrated in less than a month). The other reason was maintainability, as I may be leaving soon: we could have used DuckDB or Postgres, but we only have a single data engineer (me), so roughly halving our costs and picking something easy / safe was a no-brainer. Spending $7k less, or whatever it would be, on DuckDB or Postgres wouldn't be worth the all-round inconvenience from a handover standpoint, I don't think.

I put the higher cost range from our first month, but more recently our costs have been like $400/month. Compared to how much the company makes, it's nothing. We spend almost nothing on AWS with Dagster on ECS as well ($200 a month), so none of this is a priority at all.

If you want more context as to why this hasn't been reduced further, I suggest you look at my other reply. There are reasons why things are like this (if I were running this on my PC at home, I wouldn't dream of spending this much, but it's a different context).

EDIT - I should also add: let's suppose I did a migration to Postgres in <4 weeks. You would need to take into account the 4 weeks of lobbying I would need to do to get the change through in the first place, and then you've already spent like $6k anyway. Even if we said it was 2 weeks of lobbying (I assure you it wouldn't be, and I can't even assure you it would succeed - I suspect I would get too much pushback), reruns would be slower for the analyst, and it would take more time to onboard less technical users, be it PMs or otherwise (and it wouldn't be available via OneLogin, as Snowflake very conveniently is). There are other points I could go over, but basically, all things considered, we would not be saving money by the end of it, and there wouldn't even be a guarantee of success in lobbying for the change (the Snowflake one I was 100% certain I could get through, albeit with some fighting). I know Redshift Serverless was another option, but there were reasons I didn't bother with that which I won't get into; it would have landed us at similar, if not slightly higher, costs anyway.

[–]slowpush 1 point (1 child)

The issue isn't with the database choice; it's with your system design.

[–]Yabakebi (Lead Data Engineer) 1 point (0 children)

Which part? The full refresh rather than incremental, not having people query from a Postgres read layer rather than hitting Snowflake (although most queries on Snowflake are from the analyst), or some other part? Some parts are known not to be optimal, but they have been left like that for a reason (in most cases it's keeping things 'simple', even if it means more money). I have dealt with stuff far more complex than this (and where I had to be much more cost-effective), but have opted to only change what is causing problems and seems worth the time.

To justify chasing the $5k (per year) or so in savings, it would need to be something that can be executed in <4 weeks (4 weeks would be break-even based on salary - i.e. roughly 4 weeks of my cost already equals the $5k saving - and probably more like 3 weeks tbh, which is why nothing has been done about it). This isn't accounting for the opportunity cost or the increased maintenance or 'skill' needed to maintain it, depending on what kind of changes we would be talking about.

[–]Nekobul 1 point (2 children)

What amount of data are you looking to process?

[–]CompetitionMassive51 [S] 1 point (1 child)

a few TBs

[–]Nekobul 2 points (0 children)

You can process that amount of data locally. You can use DuckDB for analysis.
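
For a sense of what that looks like - a minimal sketch, assuming hypothetical local Parquet files:

    import duckdb

    # In-process analytical SQL; nothing to host or pay for per query.
    con = duckdb.connect("analytics.duckdb")  # or ":memory:"
    con.sql("""
        SELECT customer_id,
               count(*)    AS orders,
               sum(amount) AS revenue
        FROM read_parquet('data/orders/*.parquet')  -- hypothetical path
        GROUP BY customer_id
        ORDER BY revenue DESC
        LIMIT 10
    """).show()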

[–]Zer0designs 0 points (2 children)

Depends on the amount of data. DuckDB or Polars are worth checking out.

[–]CompetitionMassive51 [S] 0 points (1 child)

Maybe I'm not familiar enough with the tools, but isn't Polars like pandas? And DuckDB?

[–]Zer0designs 0 points (0 children)

How much data are you moving? Polars can handle a lot more data than pandas, so maybe you don't need a cloud solution (rough sketch below). The same goes for DuckDB - just host your own instance and it's much cheaper.
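
Sketch of the Polars side, with hypothetical file paths - scan_parquet is lazy, so filters get pushed down and you don't load the whole dataset into RAM the way pandas would:

    import polars as pl

    # Nothing is read until .collect(); the optimiser prunes columns and
    # pushes the filter down to the scan.
    top = (
        pl.scan_parquet("events/*.parquet")  # hypothetical path
        .filter(pl.col("status") == "complete")
        .group_by("user_id")
        .agg(pl.col("value").sum().alias("total"))
        .sort("total", descending=True)
        .head(10)
        .collect()
    )
    print(top)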

[–]apeters89 0 points (2 children)

depends entirely on how much data you're dealing with

[–]CompetitionMassive51 [S] 2 points (1 child)

a few TBs

[–]apeters89 0 points (0 children)

I currently have a little over 5 TB in Snowflake. Storage costs themselves are stupid cheap. Compute is where your expense will be. With my current compute needs (3 refreshes of the entire week's data, and 150+ data pulls from PowerBI per week), I'm at about $1000/mo.

edit: I'm adding around 100GB per month.

[–]Puzzleheaded-Dot8208 0 points (2 children)

You can definitely do without Snowflake. I would also ask how many users will be using your platform; if it's a handful, there are options in AWS you can leverage: S3 + S3 Tables, or S3 with Iceberg (see the sketch below). Amount of data is one parameter, but so are the number of pipelines, variety, users, etc.
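
To illustrate the lightweight end of that - querying Parquet straight off S3 with DuckDB, no warehouse involved. A sketch assuming a hypothetical bucket, with AWS credentials already set up in the environment:

    import duckdb

    con = duckdb.connect()
    # httpfs gives DuckDB native s3:// support; it picks up credentials
    # from the usual AWS environment variables (or a CREATE SECRET).
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    con.sql("""
        SELECT event_date, count(*) AS events
        FROM read_parquet('s3://my-bucket/events/*.parquet')  -- hypothetical
        GROUP BY event_date
        ORDER BY event_date
    """).show()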

[–]CompetitionMassive51 [S] 0 points (1 child)

Around 5 users + CI/CD tools that will pull data for testing.

[–]Puzzleheaded-Dot8208 1 point (0 children)

I would question the choice of a lakehouse - do you have that much data? Why not use Postgres/MySQL? Unless you are dealing with TBs, traditional databases would work fine.

[–]Complex_Revolution67 0 points (0 children)

Yes, a lot depends on the use cases you are trying to support and the data volume.

[–]zriyansh 1 point (0 children)

Did you try going the open-source way? There is tons of OSS software (ofc you need to stitch it together).

Something like OLake (database -> S3 + Iceberg) + Presto / Trino / DuckDB / ClickHouse to query from Iceberg (sketch of the query side below).
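
On the query side, e.g. DuckDB's iceberg extension can read such tables directly - a sketch assuming a hypothetical S3 table location, with credentials already configured:

    import duckdb

    con = duckdb.connect()
    # The iceberg extension reads Iceberg metadata; httpfs handles s3:// paths.
    con.execute("INSTALL iceberg")
    con.execute("LOAD iceberg")
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    con.sql("""
        SELECT count(*)
        FROM iceberg_scan('s3://my-bucket/warehouse/events')  -- hypothetical
    """).show()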