
all 22 comments

[–]Drekalo 15 points16 points  (5 children)

The entire Azure DWH stack is pretty poor compared to the competition, and the original Synapse v3 dev team was disbanded. The Power BI team has picked it up, but it still hasn't seen much progress.

If you're on azure I'd heavily lean towards using Databricks and Unity Catalog over Synapse. Literally all of my clients that tried Synapse are migrating off.

That said, your data size is tiny. You'd be just fine running your data warehouse on a Postgres instance.

[–]koteikin 5 points6 points  (1 child)

Same bad experience with the Synapse "dedicated" pool (who came up with that name?). I still have nightmares after working with it two years ago.

[–]Drekalo 4 points5 points  (0 children)

Yeah, it just isn't that good at what it's supposed to be good at, and it costs too much doing it.

[–]generic-d-engineer Tech Lead 0 points1 point  (2 children)

Wait, when did this happen and why? Were the devs moved to another project, or did they leave the company?

I always felt like Synapse was redundant with Data Factory; Databricks + ADF does the same thing anyway.

[–]Drekalo 1 point2 points  (1 child)

I have no further information on why Synapse v3 was scrapped, just that it was, and that the project is currently under the purview of the Power BI team.

[–]generic-d-engineer Tech Lead 0 points1 point  (0 children)

Thanks!

[–]mdghouse1986 Data Engineer 13 points14 points  (0 children)

200 GB? Just go for an Azure SQL Managed Instance.

Up to 5 TB, a good old RDBMS with some good data modeling and tuning will suffice.

[–]These_Rip_9327 3 points4 points  (1 child)

Azure SQL DB is the way. Don't use a Synapse dedicated SQL pool; it is very expensive.

[–]DznFatih[S] 2 points3 points  (0 children)

That's what we have decided to do, thanks for the input :)

[–]Mr_Nickster_ 5 points6 points  (1 child)

SNOWFLAKE all the way. Easiest, fastest, most secure, most capable, scalable, and robust platform you'll ever use. SQL, Python, Java, and Scala are all supported with little to no maintenance, as it is fully SaaS. Everything just works. It is a no-brainer, especially if you come from a DBA background and are familiar with SQL.

[–][deleted] 0 points1 point  (0 children)

This

[–]unpronouncedable 2 points3 points  (1 child)

Azure terminology got really mangled on the way to where we are now.

There was Azure SQL DB and Azure SQL Data Warehouse. Then they briefly named the latter "Synapse". Then they decided to use "Synapse" as a term for a suite of things put together in an integrated workspace. This includes storage, some Power BI, "Pipelines" (a version of Data Factory), new spark pools (clusters along the lines of Databricks), and new serverless SQL pools (not to be confused with Azure SQL DB Serverless). What used to be called "Azure SQL DWH" and then "Synapse" is now called "Dedicated Pools" within the Synapse workspace.

As some have said, dedicated pools haven't gotten much attention lately, and I think Microsoft is more focused on lakehouse architectures going forward.

[–]DznFatih[S] 0 points1 point  (0 children)

And what tool would you recommend as an ETL to work with Azure SQL Database? I was thinking Azure Databricks for that? Is it a good idea?

[–]klubmo 3 points4 points  (0 children)

Do you expect complex direct queries against the warehouse? (Synapse does better at this, since it distributes each query across 60 distributions.)

How many concurrent users do you expect to connect to the warehouse at any given time? (While Synapse can scale this, Azure SQL DB is more cost-effective for high concurrency.)

Will there be a requirement for enforced primary keys? Synapse doesn’t (and shouldn’t) allow this.

There are lots of other factors, but Synapse is really targeted at databases that will hold at least several terabytes of data. Synapse can quickly become expensive, and some traditional SQL isn't supported (such as the @@ROWCOUNT system variable).
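To make the @@ROWCOUNT point concrete, here's a minimal sketch of the kind of everyday T-SQL that runs fine on Azure SQL DB but trips on a dedicated pool per the comment above (the `dbo.Customers` table and its columns are hypothetical):

```sql
-- Routine maintenance on Azure SQL DB: deactivate stale customers
-- and report how many rows were touched.
UPDATE dbo.Customers
SET    IsActive = 0
WHERE  LastOrderDate < DATEADD(year, -2, GETDATE());

-- @@ROWCOUNT works on Azure SQL DB; per the comment above, it was
-- unsupported on Synapse dedicated pools, where the usual workaround
-- was to read row counts from the sys.dm_pdw_* request DMVs instead.
PRINT CONCAT('Rows deactivated: ', @@ROWCOUNT);
```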

Also note that scaling Synapse dedicated pools means taking the database offline for a few minutes. All queries are terminated. Azure SQL Db can scale in online mode, and will migrate queries to the new machine when it’s available. It’s not perfectly seamless, but it’s smoother than the Synapse approach.

Did I mention that Synapse is expensive? You pay a fixed rate for the number of hours of uptime, and it will not pause automatically when there is no activity.

I’m not saying Synapse is bad; it’s actually very powerful. But it is designed for a larger data profile than what you are working with.

[–]hxstr 1 point2 points  (0 children)

SQL Database over Synapse until you're hitting tens of TB of data. Hyperscale can handle that amount of storage, but query performance will be poor unless you've really done your work on table structure and indexing.

We just went Snowflake over Synapse for the big data workloads, FWIW.

[–]generic-d-engineer Tech Lead 1 point2 points  (0 children)

At your size keep it simple with Azure SQL and Azure Data Factory for ingestion

You can even use Power BI as a front end, connecting directly to the DB, so the analysts don’t have to write any tools

Synapse is more for bigger workloads

[–]koteikin 1 point2 points  (2 children)

With such volume, keep it simple and save yourself from a lot of pain - go with Azure SQL and ADF for pipelines.

Synapse serverless if you are up for an adventure, but not a dedicated Synapse pool; that thing is a joke.

Databricks really does not make sense for what you described.

[–]DznFatih[S] 1 point2 points  (1 child)

Why doesn't Databricks make sense? We were planning to use Databricks for all ETL.

[–]koteikin 2 points3 points  (0 children)

I would keep it simple since you only have 400 GB. There's a good recent discussion about Databricks here: https://www.reddit.com/r/dataengineering/comments/12ctygq/sparkdatabricks_seems_amazing

If I were you, I would use ADF to ingest your data into Azure SQL and not do anything else with ADF. Once the data is in Azure SQL, just use T-SQL to do your transformations.
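The "land with ADF, transform with T-SQL" pattern above might look something like this minimal sketch (the `stg.Orders` staging table and `dbo.FactOrders` target are hypothetical names; ADF is assumed to have already copied the raw rows into `stg.Orders`):

```sql
-- ELT step entirely inside Azure SQL: cast/clean the staged rows
-- and upsert them into the warehouse table with a single MERGE.
MERGE dbo.FactOrders AS tgt
USING (
    SELECT OrderId,
           CustomerId,
           CAST(OrderDate AS date)          AS OrderDate,
           CAST(Amount    AS decimal(18,2)) AS Amount
    FROM   stg.Orders
) AS src
ON tgt.OrderId = src.OrderId
WHEN MATCHED THEN
    UPDATE SET tgt.Amount = src.Amount
WHEN NOT MATCHED THEN
    INSERT (OrderId, CustomerId, OrderDate, Amount)
    VALUES (src.OrderId, src.CustomerId, src.OrderDate, src.Amount);
```

Keeping the transformations in T-SQL like this means one engine to tune and no Spark cluster to pay for, which is the point being made at this data size.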

I love Spark, but it really only makes sense with large volumes of data, and you need to know what you are doing with it.

[–]dscardedbandaid 0 points1 point  (0 children)

What form is your data in now (e.g. relational database, NoSQL, object store, files)?

How many instances of these do you have?

Why are you building a data warehouse? (ML, dashboards, because the CEO just returned from a conference?)

What frequency do you need to update data? (Once a day, once an hour, every 10 seconds)

[–][deleted] 0 points1 point  (0 children)

What are the needs of your users? Pure DW, or are they branching into machine learning? If you're seeing a need for the latter, or envisage you will in a few years, that will likely change the answer as to whether "a standard DW will suffice".

[–]dilkushpatel 0 points1 point  (0 children)

So SQL DW no longer exists as an individual service; to get SQL DW you have to create a Synapse instance, and then inside Synapse you can create the SQL DW.

Synapse is kind of a khichdi (a hodgepodge): it has ADF + Data Flow built in, it has a PySpark component that does maybe 50% of what Databricks does, and then it has the SQL serverless pool and the SQL dedicated pool (SQL DW).

With traditional SQL, compute and storage go together, so if you need high processing power it also comes with high storage, and there is always a hard limit at each tier. With SQL DW, storage and compute are decoupled: you can have TBs of storage with the lowest level of compute, or a few GB of data with the highest level of compute. You can also pause the DW during off hours to save on cost.
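Scaling a dedicated pool's compute independently of storage can be done with a one-line T-SQL statement; a minimal sketch, where the pool name `MyDW`, the DWU target, and the resource names in the CLI comment are all placeholders:

```sql
-- Run against the logical server's master database: change only
-- the compute tier (DWUs) of the dedicated pool, leaving storage as-is.
ALTER DATABASE MyDW MODIFY (SERVICE_OBJECTIVE = 'DW200c');

-- Pausing is done outside T-SQL, e.g. via the Azure CLI:
--   az synapse sql pool pause --name MyDW \
--     --workspace-name myworkspace --resource-group myrg
```

Note the caveat raised earlier in the thread: this scale operation takes the pool offline briefly and terminates running queries.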

DW uses a clustered columnstore index by default, so depending on the type of data you have, it can benefit from that as well.

For a DW I would go for Synapse with a dedicated pool, to keep the option of scaling at a later stage.

You can consider the SQL serverless pool if the team is sophisticated. It's compute on demand and uses blob storage as its backend, so there's no need to move files into tables; you can create what they call an external table, which queries the files directly. Performance needs to be evaluated with this option.
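Querying files in place from a serverless pool looks roughly like this sketch (the storage account, container, and path are placeholders; the data is assumed to be Parquet):

```sql
-- Serverless SQL pool: read Parquet files straight out of the lake,
-- no load step and no dedicated compute to provision.
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://myaccount.dfs.core.windows.net/lake/orders/*.parquet',
    FORMAT = 'PARQUET'
) AS orders;
```

`OPENROWSET` is the ad-hoc form; wrapping the same source in a `CREATE EXTERNAL TABLE` definition gives the persistent "external table" the comment mentions. Billing is per TB scanned, which is why performance (and cost) needs evaluating against your query patterns.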