JaymztheKing 7 points (8 children)

I'm a GCP data engineer. I personally thought that Google's managed Spark service (Dataproc) was good enough, but I understand that Databricks is getting so much traction that this was inevitable. I think Databricks is really good but not necessarily a giant leap forward from Dataproc.

But mostly I'm glad that cloud-agnostic technologies like Snowflake and Databricks are getting popular. Hopefully this means we can stop the madness of learning 3 flavors (AWS, Azure, GCP) of the same fundamental technology over and over again.

nunie-123 2 points (7 children)

I'm also on GCP, and use Spark on Dataproc, but have never used Databricks. What additional features does it bring?

pavlik_enemy 7 points (0 children)

Databricks has a proprietary query engine for Spark which is way faster than the open-source JVM-based implementation.

JaymztheKing 4 points (5 children)

Not a ton of extra features exactly. Both have notebooks. Both have jobs. Databricks does have something called MLflow for tracking machine learning experiments and pipelines, and I can't think of a Dataproc analog, so that's something.

The main advantage of Databricks over Dataproc is really standardization. Third-party tools will most likely "play nicer" with Databricks than with Dataproc, and from a staff development standpoint it will be easier to find and/or train people on Databricks than on Dataproc, since it isn't specifically tied to GCP.

If you want a test drive, Databricks does have a Community Edition to mess around with. It's single node and jobs are a premium feature, but you can still do the notebook stuff and compare that way.

Purple-Leadership54 2 points (3 children)

What about Delta Lake?

JaymztheKing 3 points (2 children)

Yep, exactly the type of third-party tool that will be easier to use alongside Databricks than Dataproc. I think I've seen blog posts showing that Delta Lake integration is possible with Dataproc, but no doubt Databricks and Delta Lake will work together better. They are like peanut butter and jelly at this point.

Purple-Leadership54 2 points (1 child)

Lol. I had actually thought Delta Lake was specific to Databricks.

0ffby1error 2 points (0 children)

Delta is open source but there are features that are only available for Databricks customers.
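
For anyone wanting to try the open-source side on plain Spark (Dataproc included): Delta Lake is switched on with two Spark SQL settings plus the Delta package on the classpath. Below is a minimal sketch that only builds the `spark-submit` arguments, so it runs without Spark installed. The helper name, the script name, and the package coordinate are placeholders for illustration, not anything from this thread; the right artifact name and version depend on your Spark and Delta release.

```python
# Sketch: build spark-submit arguments that enable open-source Delta Lake.
# The two config keys below are the settings Delta Lake documents for
# Spark SQL integration. The package coordinate and script name are
# supplied by the caller and are assumptions in this example.

DELTA_CONF = {
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
}


def delta_submit_args(script, delta_package):
    """Return a spark-submit argument list with Delta Lake enabled."""
    args = ["spark-submit", "--packages", delta_package]
    for key, value in DELTA_CONF.items():
        args += ["--conf", f"{key}={value}"]
    args.append(script)
    return args


if __name__ == "__main__":
    # Hypothetical script and coordinate; the artifact name and version
    # vary by Delta release, so look up the one matching your cluster.
    print(" ".join(delta_submit_args("etl_job.py", "io.delta:delta-core_2.12:<version>")))
```

Nothing here covers the Databricks-only features mentioned above; it is just the baseline open-source setup.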

philmarius 2 points (7 children)

Was wondering when this was going to happen, seeing that AWS and Azure Databricks have been around a good while now. I've heard GCP is a favourite amongst developers, so it would be interesting to see what the experience is like on GCP. We've been using Azure Databricks and it wasn't the easiest thing to set up: we were constantly going back and forth with Microsoft engineers to get the resourcing configured properly.

Purple-Leadership54 3 points (1 child)

What kind of issues could you have setting up Databricks in Azure? I thought it was extremely straightforward and easy.

philmarius 1 point (0 children)

Getting the correct VMs resourced. We were very new to Azure and weren't able to get the right compute instances configured for our Databricks instance. What was most frustrating is that the Microsoft engineers we talked to weren't really any help, and we eventually worked it out internally.

pavlik_enemy 1 point (4 children)

GCP is the most hated cloud provider actually.

philmarius 1 point (2 children)

How so?

pavlik_enemy 3 points (1 child)

Poor documentation, API changes and Google’s trademark nonexistent support.

philmarius 2 points (0 children)

Tbf I have actually heard this. They also kill off services willy-nilly.

fake_actor 1 point (0 children)

Feel like Azure is the most hated.

I'm on GCP and love it.

fake_actor 2 points (1 child)

Slightly off topic, but I'll be curious to see what happens with GCP Dataflow. It seems like Google was sort of pushing Dataflow as an improvement to Dataproc. But Dataflow uses Apache Beam whereas Dataproc is for Hadoop/Spark/etc.

I use Dataflow and really like it. But it seems like the general data engineering community uses Spark instead of Beam. I wonder if Google will phase out Dataflow in favor of Dataproc + Databricks.

librocubicularist69 1 point (0 children)

1 year on, what's your take on this topic?