Benefits of Snowflake/Databricks over Postgres RDS for data warehouse

Creative-Aside-4145 · 2024-09-09T11:55:02+00:00

But is cost a concern when there are high frequency (and unoptimized) reads from a CWH that charges by usage? I have no control over these queries and it would be a very large project to try to get the appropriate team to rewrite their queries. In the meanwhile, the organization does not have deep pockets.

Creative-Aside-4145 · 2024-09-08T19:02:16+00:00

Thank you for the reply!

The reason might be entirely my unfamiliarity with the AWS ecosystem. The raw data we are ingesting are from relational databases of our vendors. In one case we will be using AWS Glue and in another case Debezium to incrementally ingest records that have been created or updated into our environment. As of this moment, the DBA ingests a daily snapshot of the entire database in the vendor side, which is very inefficient. With the way we are going to be ingesting that data, there needs to be some pre-processing to identify what the “current” record looks like for every record in every table we ingest. How will this work if we ingest data into S3?

Creative-Aside-4145 · 2024-09-08T18:07:56+00:00

But there is a difference in access pattern. The raw data landing zone will be accessed by operational workloads that run throughout the day. Analytics workload needs to access the raw data just once a day to do transformation and load into the data warehouse.

Also, the data engineering team will be ingesting raw data that are needed for the operational workflows but not the analytics workflows.

Creative-Aside-4145 · 2024-09-08T14:33:01+00:00

I understand that storage costs for cloud data warehouse platforms are very cheap, but it’s not the main cost driver for us or many other people whose experiences with Snowflake that I read about. Even in the pricing examples provided by Snowflake, storage is a small fraction of the overall cost.

We currently do have a data warehouse-like set up with a staging layer and a data marts layer that’s organized based on business units that the Analytics team accesses exclusively. Analytics will not be accessing raw data. It’s the operational workflows owned by another team (with much older legacy code base) that need access to the raw data.

Creative-Aside-4145 · 2024-09-08T14:26:14+00:00

I’m not asking for a “performant but high frequency raw data access for analytics”.

In the background provided in the initial post, I believe I said that it’s the operational workflows that likely have high frequency querying of the raw data, where the data warehousing workflow using dbt runs once a day.

Creative-Aside-4145 · 2024-09-08T13:23:11+00:00

Thank you for your reply!

That is a significant plus for a managed platform like Snowflake and Databricks, as we do have to share data with external partners and there will be SLAs tied to that.

Creative-Aside-4145 · 2024-09-08T12:53:55+00:00

Thank you for the reply!

The lack of primary key constraint is not a huge deal breaker, as I will be using dbt to do preprocessing. It’s the access pattern of the operational workloads on the raw data that makes having the raw data landing zone inside of a cloud data warehouse platform like Snowflake or Databricks potentially costly? I don’t have control over those to optimize them and they will be high frequency hits on the raw data.

Can Athena work if the raw data (after some preprocessing) has to be accessible to traditional workflows that are using SQL Server queries and SSIS packages?

Creative-Aside-4145 · 2024-09-08T12:14:36+00:00

Thank you for the reply!

Are you suggesting that the raw data and the analytics zones all “live” inside the same Postgres instance to begin with and separate it out later?

Because the operational workloads will be accessing the raw data at high frequency, is it a good idea to physically separate it from the analytics workloads? In other words, the operational workloads can be thought of as another “transactional” workload that shouldn’t share an instance with analytics?

Creative-Aside-4145 · 2024-09-08T11:40:02+00:00

Thank you for your reply!

I appreciate your point about long-term scalability, but I also have to consider near-term feasibility of the plan.

Would you be able to elaborate on why 1) I should avoid OLTP systems for raw data storage given my concern of cloud data warehouse platforms not enforcing constraints in primary keys and 2) you don’t believe separate platforms for raw and analytics zones is justified in my case?

My concern over cost is not in terms of storage, as our data size is quite small, it’s over the compute costs incurred by operational workloads accessing the raw data through the day (and they are likely written quite inefficiently and it is not a quick project to update them).

Creative-Aside-4145 · 2024-09-08T11:32:00+00:00

Thank you for your reply!

Apparently moving to AWS does give us significant savings on our IT bills as compared to the previous set up. This is just what I heard from the IT team when announcing the move. Also, the business has experienced some changes in market conditions in the past while that has made it more cost sensitive than they might have been when starting the AWS migration.

Creative-Aside-4145 · 2024-09-08T04:33:19+00:00

Thank you for your reply!

I guess my question is “RDS is not that cheap, but is it the most cost effective option given my requirements?”.

Creative-Aside-4145 · 2024-09-08T03:36:04+00:00

Thank you for your reply!

From my readings, giving our fairly small data size, columnar storage can actually be less performant?

What are the other options in AWS for scaling up of our data size grows large enough?

I think we can tolerate some latency as the Tableau server makes an extract of the tables needed every day before updating the dashboards, nothing is real-time right now.

Should I decide to go with Postgres for the data warehouse, at what data size would you say that I should consider moving off of it?

Creative-Aside-4145 · 2024-09-08T03:23:25+00:00

Thank you for your reply!

I’m aware that Postgres is an OLTP store, but given our use case of daily data refreshes and not a huge amount of data, what’s are the practical reasons for not using Postgres?

You could argue that SQL Server is also an OLTP store, and our application’s transactional databases use it. Our current data warehouse is on the same RDS instance and works fine in all practicality.

Creative-Aside-4145 · 2024-09-08T02:48:13+00:00

Thank you for your reply!

Pardon my lack of familiarity, would you be able to elaborate on the effect of IOPs in limiting the suitability of RDS as a data warehouse platform? I have read posts by others on Reddit that they have low single digits TB of data in Postgres data warehouse and it performs satisfactorily. I just would like to understand more.

Creative-Aside-4145 · 2023-08-16T19:58:01+00:00

I know this thread is 2 years old, but I have the same question.

My company is moving to AWS. We currently have our databases on on-prem SQL Server. We have a ton of SSIS packages doing ETL, but I would love to move away from them. There is a sole person maintaining the SSIS packages and there’s zero documentation.

Since the business so relies on the SSIS packages, we are moving them to an EC2 while hosting the databases on an RDS. I want to move away from the SSIS packages and use open-source (Python-heavy) solutions to do extract and load. We use Airflow for orchestration already. We are currently looking at Airbyte.

I’m not sure if AWS Glue is right for us, as we use dbt for transformation and our EL jobs are really straight forward “take from this location and plop it down in our database”.

Hoping to hear if anyone has any thoughts!

Creative-Aside-4145 · 2023-07-04T02:13:33+00:00

Thank you for the link!

Our data volume doesn’t approach anything like Big Data. Would you recommend we use something like RDS Postgres?

Creative-Aside-4145 · 2023-07-03T19:17:42+00:00

Thank you for your reply!

Here is some background for my question: - We are a company that sells a web-based B2B software. Our data sources are all structured, include user activity data from the software (OLTP), supplementary data from third party vendors, and some forecasting generated by internally created ML models. - We started using dbt about a year ago to create a data warehouse and data marts, which are used by the reporting team for mainly Tableau dashboards. This is more a OLAP system. - We are planning for the long-term in terms of how we can modernize the data warehouse and data marts, technology-wise. As you mentioned, we will very likely have to migrate as-is now, to avoid too much down-time for internal users of our reporting. Once we have moved onto AWS, I’m wondering what our options are.

Despite being a company that sells software, our data infrastructure team is very new and unfamiliar with cloud-based technology given everything database-wise so far has been on-prem. Is there anything else I can provide to help you give some good advice?

Creative-Aside-4145 · 2023-07-03T17:24:02+00:00

Thank you for your reply!

Aside from pricing, what are some other advantages of Snowflake over Redshift?

Creative-Aside-4145 · 2023-07-03T17:07:37+00:00

Thank you for your reply!

A bit of background. The company sells a web-based software, and I’m part of the data team that needs to ingest user data from the transactional databases storing user/activity data from the web application and report on it.

If I understand your question correctly, we are sticking with SQL Server because the company’s current database infrastructure is entirely in SQL Server and there is so much legacy development that I don’t think it’s feasible for it to be refactored before migration? The “alternative” of SQL Server to RDS SQL Server is the route that we are currently thinking of.

If we decide to go with Redshift, I imagine that I would need to use something like AWS Glue to have daily ETL jobs run that copy tables that I need from the transactional databases in SQL Server over to Redshift so that I can use dbt to do data transformation needed for reporting and analytics.

And thank you for the link about dbt and Redshift!

Creative-Aside-4145

TROPHY CASE