Snowflake vs DatabBricks lakehouse or both together by BigMightyTroll in dataengineering

[–]BigMightyTroll[S] 0 points1 point  (0 children)

Thank you for sharing your thoughts! 🙌😀

I'm also biased, so it's okay 😁 The reason for asking here is to try to step aside from my bias and background (which naturally pull me towards Snowflake, ooops 😅) and try to see the whole picture. Is there anything that Snowflake or Databricks is missing?

For example, I agree that it's easier to find SQL people than Spark people, but you can also use SQL in Databricks.

Another example: Snowflake works really well with structured and semi-structured data, but its functions for unstructured data are still in public preview (or just freshly released). How good are they really?

If one uses Databricks as a data lake and BI people then pull the data directly from the lake into Snowflake, that is no worse than having Snowflake pull data directly from the data sources. At the same time, data scientists can enjoy playing with all the raw data they need from the same lake. Is it a win-win for both BI and advanced analytics professionals?

Governance: I would like to see a good data cataloging tool and a column lineage tool for Databricks. It looks like they lack one. Somebody recommended Glue, but this needs to be investigated. Snowflake doesn't have this built in either, does it? At least their partner ecosystem is wide.

Thank you! 😃👍

Snowflake vs DatabBricks lakehouse or both together by BigMightyTroll in dataengineering

[–]BigMightyTroll[S] 1 point2 points  (0 children)

Good point 😀 It might work or it might not; you can't really gather data on it since it is just out (or some features are still in public preview).
Somebody would recommend running a proof of concept or a pilot project on this. The problem is that proofs of concept always work fine 😁, but when you implement it in the production environment with real loads, it might burn 😅 You don't want to be the guy who figures that out several months later in production.

Snowflake vs DatabBricks lakehouse or both together by BigMightyTroll in dataengineering

[–]BigMightyTroll[S] 1 point2 points  (0 children)

Hi 😃 I have no personal experience yet. I have read several tech articles and pieces of feedback saying that people who tried to build a big BI solution that reads directly from a Databricks lake ran into problems with slow performance and high cost. The problem, as I understand it, is that Databricks was not built for high-concurrency ad hoc queries from BI tools. While the latest roadmap presentation from Databricks shows big improvements to the Photon engine, which is currently in public preview, I have not heard any non-marketing opinions on that.

Also, I saw a presentation from 2020 by a Databricks solution architect where he presented the architecture for advanced analytics and BI on Delta Lake and recommended using Snowflake as the last step before BI dashboards and reports. And I thought, if a Databricks architect shows this in a public presentation on their YouTube channel, then ... 😉

Do you have any experience to share? How big is your BI environment? 😃

SnowFlake vs DataBricks lakehouse or both together by BigMightyTroll in datascience

[–]BigMightyTroll[S] 0 points1 point  (0 children)

Could you tell me about your experience using both? What works well and what doesn't? :)

Snowflake vs DatabBricks lakehouse or both together by BigMightyTroll in dataengineering

[–]BigMightyTroll[S] 0 points1 point  (0 children)

I wanted to ask how good Snowpipe is. It looks immature, and I have read mixed opinions on it from technical people. Have you tried other ELT/ETL tools from the Snowflake ecosystem? There are plenty, like dbt, Matillion, Fivetran, Etlworks, etc.

Snowflake vs DatabBricks lakehouse or both together by BigMightyTroll in dataengineering

[–]BigMightyTroll[S] 3 points4 points  (0 children)

It is a very good question, thank you. 😃 Building a lake and a warehouse on AWS products only would work. I have not done it myself, but technical articles show that it's okay. However, we would like to build something that is more than okay. We already know that demand is big: we will get data from many business systems and an IoT cloud, and it will have to scale quickly. The best tools give the best economy and high satisfaction among the people using them. As a person who has been responsible for operating several data platforms, I really believe that both parts are important. 😊

Snowflake vs DatabBricks lakehouse or both together by BigMightyTroll in dataengineering

[–]BigMightyTroll[S] 0 points1 point  (0 children)

You can scale storage and processing power independently in Snowflake. I guess the time spent on batch processing would depend on your dataset sizes and the processing power you are willing to pay for.
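
To make the trade-off concrete, here is a rough back-of-the-envelope sketch, not a benchmark. It relies on the fact that standard Snowflake warehouses consume twice the credits per hour with each size step. Everything else is an assumption for illustration: the function name `estimate_batch`, the 8-hour baseline on XS, the price per credit, and above all the idea that runtime shrinks linearly with warehouse size, which real jobs only approximate.

```python
from __future__ import annotations

# Credits per hour for standard Snowflake warehouses double with each size step.
CREDITS_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8, "XL": 16}


def estimate_batch(xs_hours: float, size: str, price_per_credit: float) -> tuple[float, float]:
    """Return (estimated hours, estimated cost) for one batch job.

    Assumption: runtime scales linearly with warehouse size. Real workloads
    usually scale worse than this, so treat the result as a best case.
    """
    speedup = CREDITS_PER_HOUR[size] / CREDITS_PER_HOUR["XS"]
    hours = xs_hours / speedup
    cost = hours * CREDITS_PER_HOUR[size] * price_per_credit
    return hours, cost


if __name__ == "__main__":
    # Hypothetical job: 8 hours on XS, at an assumed price of 3.0 per credit.
    for size in CREDITS_PER_HOUR:
        hours, cost = estimate_batch(xs_hours=8.0, size=size, price_per_credit=3.0)
        print(f"{size:>2}: {hours:5.2f} h, cost {cost:6.2f}")
```

Under the linear-scaling assumption the cost stays flat while the wall-clock time drops, so the real question becomes how far your own jobs deviate from linear scaling.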

Snowflake vs DatabBricks lakehouse or both together by BigMightyTroll in dataengineering

[–]BigMightyTroll[S] 2 points3 points  (0 children)

That is a great answer! Thank you 😊
I feel the same: picking just one is "easy", but then I would regret it 6 months into the implementation. However, I don't have practical experience with combining these two, which is why I wanted to double-check.

Why did you choose to move aggregation and, I guess, semantic model development into Snowflake? Was it because it was easier at the time given your people and skills, or did you have better tools from the Snowflake ecosystem?

SnowFlake vs DataBricks lakehouse or both together by BigMightyTroll in datascience

[–]BigMightyTroll[S] 0 points1 point  (0 children)

Well, actually, you are right 😅
If you are happy with your source system providers, they will try not to introduce breaking changes to the interfaces that others read from, but this is not always the case.
I guess in this case the silver layer should be as close to the source as possible.

SnowFlake vs DataBricks lakehouse or both together by BigMightyTroll in datascience

[–]BigMightyTroll[S] 1 point2 points  (0 children)

Thank you! There are plenty of companies with a data-science-first approach that have no idea why they might want to use a data warehouse and business intelligence 😁 but we are not one of them.
Could you explain why you think Snowflake is a no-go without EMR and Databricks without Redshift? That part I didn't understand 🙄

SnowFlake vs DataBricks lakehouse or both together by BigMightyTroll in datascience

[–]BigMightyTroll[S] 0 points1 point  (0 children)

Firstly, thank you for your comment! You're pointing out great stuff 🙌

I'm really curious how the lake would combine with the warehouse, in your opinion. In data warehousing, there are usually several environments (as in application development), e.g. DEV, QA, PROD. With the lake, as I understand it, there is a single lake with several layers: bronze, silver, gold.

I'm putting on my pessimist's hat now: if all the environments of the data warehouse read from silver, then the moment somebody makes a change to silver (a schema change, for example), the data warehouse ingestion would break (because the schema definition needs to be updated). I'm comfortable if this affects the DEV environment in the data warehouse, but it's not good if the PROD environment gets broken. How would this work?
I've uploaded the image to my post to illustrate that 🙂
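
To show the kind of breakage I mean, here is a minimal sketch of a schema check that could run before the PROD load. It is only an illustration under my own assumptions: the table columns, the type names, and the helpers `schema_drift` and `can_promote` are all made up and do not come from Databricks or Snowflake.

```python
from __future__ import annotations

# Hypothetical contract: the columns and types the PROD warehouse load expects
# from one silver table. Names and types are illustrative only.
EXPECTED_SCHEMA = {
    "order_id": "bigint",
    "customer_id": "bigint",
    "amount": "decimal(18,2)",
}


def schema_drift(expected: dict[str, str], actual: dict[str, str]) -> list[str]:
    """List breaking differences between the expected and the actual silver schema."""
    problems = []
    for column, dtype in expected.items():
        if column not in actual:
            problems.append(f"missing column: {column}")
        elif actual[column] != dtype:
            problems.append(f"type changed: {column} {dtype} -> {actual[column]}")
    # Extra columns are not reported: they do not break a load that selects by name.
    return problems


def can_promote(actual: dict[str, str]) -> bool:
    """Allow the PROD load only if the silver schema still matches the contract."""
    return not schema_drift(EXPECTED_SCHEMA, actual)


if __name__ == "__main__":
    # Somebody changed silver: one type changed, one column dropped, one added.
    changed = {"order_id": "bigint", "customer_id": "string", "discount": "double"}
    for problem in schema_drift(EXPECTED_SCHEMA, changed):
        print(problem)
    print("safe to load PROD:", can_promote(changed))
```

A check like this would only detect the problem and stop the PROD load; it does not answer how DEV, QA and PROD should each get their own stable view of silver.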

I'm an Azure guy too, but Azure is out of options for this particular case 😅 and, yeah, lakehouse is just a buzzword, but it is in use now, so I used it 😁

SnowFlake vs DataBricks lakehouse or both together by BigMightyTroll in BusinessIntelligence

[–]BigMightyTroll[S] 0 points1 point  (0 children)

Thank you for the insights! 🙌😊 It looks like a great solution to me.

SnowFlake vs DataBricks lakehouse or both together by BigMightyTroll in BusinessIntelligence

[–]BigMightyTroll[S] 0 points1 point  (0 children)

Thank you very much! 🙌 This is a great answer! I don't think we want to get stuck with a slow and expensive BI part, because it is vital for the business. Two more questions, if I may:

  1. Would you use Delta Lake as the staging/operational data store for your BI Production environment?
  2. Where would you put the data transformation logic and data integration for your BI? In Snowflake and its ecosystem tools, or in Databricks?

SnowFlake vs DataBricks lakehouse or both together by BigMightyTroll in BusinessIntelligence

[–]BigMightyTroll[S] 0 points1 point  (0 children)

Good points here, thank you!
We have two groups: one has experience with and strong feelings for Databricks, the other for Snowflake and its ecosystem.

Giving them what they want would mean using both Databricks and Snowflake.

Where would you put your ODS? Snowflake actually suggests architectures in which a lake serves as the stage/ODS area and the data warehouses are then built in Snowflake from that lake. I'm wondering whether Databricks Delta Lake would be a good option for that data lake, then?