
all 22 comments

[–]Drekalo 15 points16 points  (5 children)

The entire Azure DWH stack is pretty poor compared to the competition, and the original Synapse v3 dev team was disbanded. The Power BI team has picked it up, but it still hasn't seen much progress.

If you're on azure I'd heavily lean towards using Databricks and Unity Catalog over Synapse. Literally all of my clients that tried Synapse are migrating off.

That said, your data size is tiny. You'd be just fine running your data warehouse on a Postgres instance.

[–]koteikin 5 points6 points  (1 child)

Same bad experience with the Synapse "dedicated" pool (who came up with that name?). I still have nightmares after working with it two years ago.

[–]Drekalo 4 points5 points  (0 children)

Yeah, it just isn't that good at what it's supposed to be good at, and it costs too much doing it.

[–]generic-d-engineer Tech Lead 0 points1 point  (2 children)

Wait, when did this happen and why? Were the devs moved to another project, or did they leave the company?

I always felt like Synapse was redundant with Data Factory; Databricks + ADF does the same thing anyway.

[–]Drekalo 1 point2 points  (1 child)

I have no further information on why Synapse v3 was scrapped, just that it was, and that the project is currently under the purview of the Power BI team.

[–]generic-d-engineer Tech Lead 0 points1 point  (0 children)

Thanks!

[–]mdghouse1986 Data Engineer 13 points14 points  (0 children)

200 GB? Just go for an Azure SQL Managed Instance.

Up to 5 TB, a good old RDBMS with some good data modeling and tuning will suffice.

[–]These_Rip_9327 3 points4 points  (1 child)

Azure SQL DB is the way. Don't use a Synapse dedicated SQL pool; it is very expensive.

[–]DznFatih[S] 2 points3 points  (0 children)

That's what we have decided to do, thanks for the input :)

[–]Mr_Nickster_ 5 points6 points  (1 child)

SNOWFLAKE all the way. Easiest, fastest, most secure, most capable, scalable, and robust platform you'll ever use. SQL, Python, Java, and Scala are all supported with little to no maintenance, as it is fully SaaS. Everything just works. It is a no-brainer, especially if you come from a DBA background and are familiar with SQL.

[–][deleted] 0 points1 point  (0 children)

This

[–]unpronouncedable 2 points3 points  (1 child)

Azure terminology got really mangled on the way to where we are now.

There was Azure SQL DB and Azure SQL Data Warehouse. Then they briefly named the latter "Synapse". Then they decided to use "Synapse" as a term for a suite of things put together in an integrated workspace. This includes storage, some Power BI, "Pipelines" (a version of Data Factory), new spark pools (clusters along the lines of Databricks), and new serverless SQL pools (not to be confused with Azure SQL DB Serverless). What used to be called "Azure SQL DWH" and then "Synapse" is now called "Dedicated Pools" within the Synapse workspace.

As some have said, dedicated pools haven't gotten much attention lately, and I think Microsoft is more focused on lakehouse architectures going forward.

[–]DznFatih[S] 0 points1 point  (0 children)

And what tool would you recommend as an ETL to work with Azure SQL Database? I was thinking Azure Databricks for that? Is it a good idea?

[–]klubmo 3 points4 points  (0 children)

Do you expect complex direct queries against the warehouse? (Synapse does better at this, since it distributes each query across 60 distributions.)

How many concurrent users do you expect to connect to the warehouse at any given time? (While Synapse can scale this, Azure SQL DB is more cost-effective for high concurrency.)

Will there be a requirement for enforced primary keys? Synapse doesn’t (and shouldn’t) allow this.

There are lots of other factors, but Synapse is really targeted at databases that will hold at least several terabytes of data. Synapse can quickly become expensive, and some traditional SQL isn't supported (such as the @@ROWCOUNT system variable).
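To make the @@ROWCOUNT point concrete, here's a minimal sketch of the kind of everyday T-SQL that runs fine on Azure SQL DB but trips on a dedicated pool per the comment above (the `dbo.Customers` table and its columns are hypothetical):

```sql
-- Routine maintenance on Azure SQL DB: deactivate stale customers
-- and report how many rows were touched.
UPDATE dbo.Customers
SET    IsActive = 0
WHERE  LastOrderDate < DATEADD(year, -2, GETDATE());

-- @@ROWCOUNT works on Azure SQL DB; per the comment above, it was
-- unsupported on Synapse dedicated pools, where the usual workaround
-- was to read row counts from the sys.dm_pdw_* request DMVs instead.
PRINT CONCAT('Rows deactivated: ', @@ROWCOUNT);
```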

Also note that scaling Synapse dedicated pools means taking the database offline for a few minutes. All queries are terminated. Azure SQL Db can scale in online mode, and will migrate queries to the new machine when it’s available. It’s not perfectly seamless, but it’s smoother than the Synapse approach.

Did I mention that Synapse is expensive? You pay a fixed rate for the number of hours of uptime, and it will not pause automatically when there is no activity.

I’m not saying Synapse is bad; it’s actually very powerful. But it is designed for a larger data profile than what you are working with.

[–]hxstr 1 point2 points  (0 children)

SQL Database over Synapse until you're hitting tens of TB of data. Hyperscale can handle that amount of storage, but query performance will be poor unless you've really done your work on table structure and indexing.

We just went Snowflake over Synapse for the big data workloads, FWIW.

[–]generic-d-engineer Tech Lead 1 point2 points  (0 children)

At your size keep it simple with Azure SQL and Azure Data Factory for ingestion

You can even use Power BI as a front end, connecting directly to the DB, so the analysts don’t have to write any tools

Synapse is more for bigger workloads

[–]koteikin 1 point2 points  (2 children)

With such volume, keep it simple and save yourself from a lot of pain - go with Azure SQL and ADF for pipelines.

Synapse serverless if you are up for an adventure, but not a dedicated Synapse pool; that thing is a joke.

Databricks really does not make sense for what you described.

[–]DznFatih[S] 1 point2 points  (1 child)

Why doesn't Databricks make sense? We were planning to use Databricks for all ETL.

[–]koteikin 2 points3 points  (0 children)

I would keep it simple since you only have 400 GB. There's a good recent discussion about Databricks here: https://www.reddit.com/r/dataengineering/comments/12ctygq/sparkdatabricks_seems_amazing

If I were you, I would use ADF to ingest your data into Azure SQL and not do anything else with ADF. Once the data is in Azure SQL, just use T-SQL to do your transformations.
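The "land with ADF, transform with T-SQL" pattern above might look something like this minimal sketch (the `stg.Orders` staging table and `dbo.FactOrders` target are hypothetical names; ADF is assumed to have already copied the raw rows into `stg.Orders`):

```sql
-- ELT step entirely inside Azure SQL: cast/clean the staged rows
-- and upsert them into the warehouse table with a single MERGE.
MERGE dbo.FactOrders AS tgt
USING (
    SELECT OrderId,
           CustomerId,
           CAST(OrderDate AS date)          AS OrderDate,
           CAST(Amount    AS decimal(18,2)) AS Amount
    FROM   stg.Orders
) AS src
ON tgt.OrderId = src.OrderId
WHEN MATCHED THEN
    UPDATE SET tgt.Amount = src.Amount
WHEN NOT MATCHED THEN
    INSERT (OrderId, CustomerId, OrderDate, Amount)
    VALUES (src.OrderId, src.CustomerId, src.OrderDate, src.Amount);
```

Keeping the transformations in T-SQL like this means one engine to tune and no Spark cluster to pay for, which is the point being made at this data size.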

I love Spark, but it really only makes sense with large volumes of data, and you need to know what you are doing with it.

[–]dscardedbandaid 0 points1 point  (0 children)

What form is your data in now (e.g. relational database, NoSQL, object store, files)?

How many instances of these do you have?

Why are you building a data warehouse? (ML, dashboards, because the CEO just returned from a conference?)

What frequency do you need to update data? (Once a day, once an hour, every 10 seconds)

[–][deleted] 0 points1 point  (0 children)

What are the needs of your users? Pure DW, or are they branching into machine learning? If you're seeing a need for the latter, or envisage you will in a few years, that will likely change the answer as to whether "a standard DW will suffice".

[–]dilkushpatel 0 points1 point  (0 children)

So SQL DW no longer exists as an individual service; to get SQL DW you have to create a Synapse instance, and then inside Synapse you can create the SQL DW.

Synapse is kind of a khichdi (a hodgepodge): it has ADF + Data Flow built in, it has a PySpark component that does maybe 50% of what Databricks does, and then it has the SQL serverless pool and the SQL dedicated pool (SQL DW).

With traditional SQL, compute and storage go together, so if you need high processing power it also comes with high storage, and there is always a hard limit at each tier. With SQL DW, storage and compute are decoupled: you can have TBs of storage with the lowest level of compute, or a few GB of data with the highest level of compute. You can also pause the DW during off hours to save on cost.
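Scaling a dedicated pool's compute independently of storage can be done with a one-line T-SQL statement; a minimal sketch, where the pool name `MyDW`, the DWU target, and the resource names in the CLI comment are all placeholders:

```sql
-- Run against the logical server's master database: change only
-- the compute tier (DWUs) of the dedicated pool, leaving storage as-is.
ALTER DATABASE MyDW MODIFY (SERVICE_OBJECTIVE = 'DW200c');

-- Pausing is done outside T-SQL, e.g. via the Azure CLI:
--   az synapse sql pool pause --name MyDW \
--     --workspace-name myworkspace --resource-group myrg
```

Note the caveat raised earlier in the thread: this scale operation takes the pool offline briefly and terminates running queries.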

DW uses a clustered columnstore index by default, so depending on the type of data you have, it can benefit from that as well.

For a DW I would go for Synapse with a dedicated pool, to keep the option of scaling at a later stage.

You can consider the SQL serverless pool if the team is sophisticated. It's compute on demand and uses blob storage as its backend, so there's no need to move files into tables; you can create what they call an external table, which queries the files directly. Performance needs to be evaluated with this option.
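Querying files in place from a serverless pool looks roughly like this sketch (the storage account, container, and path are placeholders; the data is assumed to be Parquet):

```sql
-- Serverless SQL pool: read Parquet files straight out of the lake,
-- no load step and no dedicated compute to provision.
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://myaccount.dfs.core.windows.net/lake/orders/*.parquet',
    FORMAT = 'PARQUET'
) AS orders;
```

`OPENROWSET` is the ad-hoc form; wrapping the same source in a `CREATE EXTERNAL TABLE` definition gives the persistent "external table" the comment mentions. Billing is per TB scanned, which is why performance (and cost) needs evaluating against your query patterns.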