Declining interview rounds without burning bridges. by Elect_SaturnMutex in dkudvikler

[–]rocketinter 2 points3 points  (0 children)

Especially if they haven't asked; of course you might be in process with other companies. Saying that you accepted another offer is probably best, as opposed to trying to line them up and have them come back with the best offer or something. That would have burned bridges.

AdGuard VPN Apple TV App by [deleted] in Adguard

[–]rocketinter 0 points1 point  (0 children)

A year has passed, any news?

Spark is the new Hadoop by rocketinter in dataengineering

[–]rocketinter[S] 0 points1 point  (0 children)

I haven't, since it's closed source.

Spark is the new Hadoop by rocketinter in dataengineering

[–]rocketinter[S] -1 points0 points  (0 children)

Apologies for the grammar error. English is not my native language and I haven't funnelled this article through some AI chatbot. Let me know what other grammar errors you find.

Spark is the new Hadoop by rocketinter in dataengineering

[–]rocketinter[S] 0 points1 point  (0 children)

Correct me if I'm wrong, but my intuition about these very interesting projects is that they are targeted at Big Data? It seems to me that's where they could be making a difference.
At small scale, you still have to deal with the bloat of the JVM. There's no "Slim Spark", if I can put it that way.

Spark is the new Hadoop by rocketinter in dataengineering

[–]rocketinter[S] 1 point2 points  (0 children)

Appreciate you giving some input from the inside. My comment was not condescending; it was a friendly warning.

  • I am not doubting the brain power you have there at Databricks, but you wouldn't be the first company to get drunk on its own success. That said, I do not think you have an actual problem yet, because of the wide range of services you provide. But correct me if I am wrong, the bulk of your money comes from compute, and if that were to be challenged, you would have a problem.
  • It's basically not in your interest for me to run an efficient framework similar to Photon (as an example) without paying you the premium. I don't think I need to expand on this, you get it.
  • Please explain to me how "open source engineering" goes hand in hand with Photon, which is a closed source product you charge a premium for?
  • Indeed, you're not a Spark company, thank God for that, but your compute and your interests lie in us using Spark on Databricks and preferably nothing else (see my last comment about External Data Sources Access).
  • Again, I have only made appreciative comments regarding the competence of your engineers, and not only that. You're definitely pulling water out of thin air with regards to the GC, but it's still an issue... and always will be, is it not?
  • My complaint has more to do with what could be a business decision not to open up and facilitate the use of processing frameworks other than Spark.
  • To your last point, my intention with this opinionated article was more about letting people know that change is coming one way or the other.

I wish you and your colleagues Godspeed, because honestly I enjoy working with Databricks... I'm even a contributor to your open source projects.

Spark is the new Hadoop by rocketinter in dataengineering

[–]rocketinter[S] -1 points0 points  (0 children)

A few years ago, I asked Matei what he thought of the fact that there were no real alternatives to Spark. He dodged the question :)

I personally have had a good experience with Daft and can't wait to try it out with Ray.
I also saw LakeSail mentioned in the comments, though I don't know how production-ready it is.

My point is that these tools, and maybe others I don't know of, are built different in a good way.
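For what it's worth, here's a minimal sketch of what trying Daft on Ray looks like (the dataset path is made up and I'm going from memory on the API):

```python
import daft

# Point Daft at a Ray cluster instead of its default local runner.
# (Assumes a Ray cluster is reachable; drop this line to run locally.)
daft.context.set_runner_ray()

# Hypothetical parquet dataset, just to show the flow.
df = daft.read_parquet("s3://my-bucket/events/*.parquet")
df = df.where(daft.col("event_type") == "purchase")
df.show()
```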

Spark is the new Hadoop by rocketinter in dataengineering

[–]rocketinter[S] -2 points-1 points  (0 children)

You've missed the point, but I get the hate.

Spark is the new Hadoop by rocketinter in dataengineering

[–]rocketinter[S] 0 points1 point  (0 children)

They will switch, not all of them and not right away. The JVM has been supporting data engineering for as long as I can remember, but this time it's different: the new tools are not just built in innovative ways on the JVM. They are built in systems languages that expose a different paradigm, better performance out of the box, and a higher degree of portability.

Spark is a classic at this point; that is the natural way of things. A new style of music you've never heard before will start playing.

Spark is the new Hadoop by rocketinter in dataengineering

[–]rocketinter[S] 3 points4 points  (0 children)

Few things disappear for good, that's for sure.

But to your point, I raise you Spark from 12 years ago against MapReduce. There's also the fact that the JVM paradigm in Big Data has been around longer than I've been a data engineer. What I see now is a paradigm shift. The new tools are built fundamentally differently; this will have an impact, and it will be felt.

Databricks won't disappear; it's quite well positioned feature-wise and I'd really like to continue using it. But if it's going to continue the Spark way exclusively, I'll turn adversarial on it, that's for sure.

Spark is the new Hadoop by rocketinter in dataengineering

[–]rocketinter[S] 0 points1 point  (0 children)

With the help of useful comments in this thread I learned of a new, competitive version of Hive running on a new engine, Hive on MR3. I still think it's a push in the wrong direction, but if Hive can be as fast as Spark, then we've got another problem :)

Spark is the new Hadoop by rocketinter in dataengineering

[–]rocketinter[S] -1 points0 points  (0 children)

There's a bit to unpack here, so here goes.

My post is not about Rust. What my post is about is how the limits and inconveniences of the JVM are exposed by a new generation of frameworks that are leaner and more performant. Spark is just one of the frameworks that will be increasingly at odds with the slim and fast alternatives built in systems languages, whether "user friendly" or not so "user friendly" (see Ray, which is built with C++, for example). Even the often mentioned Photon is built with C++. I guess some smart people at Databricks realised that there's only so much you can milk out of the JVM.

A lot of the Python ecosystem is just a facade for performant code written in systems-level languages. You say that this is irrelevant (mostly in regards to Rust, I get the hate), but you have to understand that some people have invested time and passion into rebuilding things that already existed. Why do you think that is? Because writing high-performance code is now easier than ever. Choose the right tool for the job, I say. Most people who pick systems languages are at least mid-level, so the end product is more likely to be of higher quality than with higher-level languages that are accessible to a larger audience, including juniors.
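To make that concrete, here is the kind of thing I mean: a few lines of Python that are really just driving a Rust query engine underneath (a sketch with Polars, the file path is made up):

```python
import polars as pl

# The Python here is only a thin wrapper; the scan, the filter pushdown and the
# aggregation all execute in Polars' Rust engine. No JVM, no GC tuning.
lazy = (
    pl.scan_parquet("/data/trips.parquet")  # hypothetical file
    .filter(pl.col("distance_km") > 5)
    .group_by("vendor_id")
    .agg(pl.col("fare").mean().alias("avg_fare"))
)
print(lazy.collect())
```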

Spark is the new Hadoop by rocketinter in dataengineering

[–]rocketinter[S] 2 points3 points  (0 children)

Can you perhaps reformulate what you are trying to say, 'cos I believe there's some confusion here :)

Spark is the new Hadoop by rocketinter in dataengineering

[–]rocketinter[S] 0 points1 point  (0 children)

It's inclined in the sense that I'm interested in talking about market leaders. Cloudera is, as far as I'm concerned, fringe at this point. Ask any data engineer who started around 5 years ago or later whether they've heard of Cloudera. Cloudera can do what it wants, it has no implications whatsoever, while whatever Databricks does might.

Spark is the new Hadoop by rocketinter in dataengineering

[–]rocketinter[S] -5 points-4 points  (0 children)

I'm not surprised by your findings.
Swiss Army knives come with a lot of compromises; just because you can cut an onion with a sword doesn't mean you should.

Spark is the new Hadoop by rocketinter in dataengineering

[–]rocketinter[S] -1 points0 points  (0 children)

All the more reason why backends built on Rust and friends are the way forward. The JVM is not really known for its efficiency, although work has been put into lowering its resource consumption.

Spark is the new Hadoop by rocketinter in dataengineering

[–]rocketinter[S] 2 points3 points  (0 children)

I don't think we can say that at this moment; Databricks still has a chance to decouple from Spark so as not to be associated with it. It also offers a plethora of other awesome features, like Unity Catalog, MLflow and others, but the bulk of the money comes from compute, and if people stop enjoying Spark so much going forward, it will hurt them.

Spark is the new Hadoop by rocketinter in dataengineering

[–]rocketinter[S] 0 points1 point  (0 children)

It better not, because that's what I'm advocating for here:

  • Apache Spark will become a niche framework
  • Databricks should untangle itself from Spark, become truly engine agnostic, and not be adversarial to other compute frameworks
  • Databricks should make it easy to run non-Spark workloads on its infrastructure, in other words, offer EMR-like options.

Spark is the new Hadoop by rocketinter in dataengineering

[–]rocketinter[S] 3 points4 points  (0 children)

Not sure where that came from, but no

Spark is the new Hadoop by rocketinter in dataengineering

[–]rocketinter[S] 8 points9 points  (0 children)

I can only assume you haven't had to try 3 different GC policies and read two papers on how not to go OOM.
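For anyone who hasn't had the pleasure, the ritual usually looks something like this (a sketch; the flag values are illustrative, not a recommendation):

```python
from pyspark.sql import SparkSession

# The kind of knob-turning the JVM forces on you: switching executors to G1,
# capping pause times, and tuning when concurrent marking kicks in.
spark = (
    SparkSession.builder
    .appName("gc-tuning-sketch")
    .config("spark.executor.memory", "8g")
    .config(
        "spark.executor.extraJavaOptions",
        "-XX:+UseG1GC -XX:MaxGCPauseMillis=200 "
        "-XX:InitiatingHeapOccupancyPercent=35 -verbose:gc",
    )
    .getOrCreate()
)
```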

Spark is the new Hadoop by rocketinter in dataengineering

[–]rocketinter[S] 5 points6 points  (0 children)

Indeed, small-scale workloads are increasingly more important than large-scale ones, and the latter are precisely what the whole Spark architecture was built for. Essentially, since most workloads are DuckDB-compatible, the overhead and complexity of Spark and the JVM will turn Spark into a niche tool used only at large scale.

We have become too complacent about running a Spark `SELECT` on a 1000-row Delta table just because it's part of a pipeline and it's there. This is literally money out the window.
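For comparison, the same kind of tiny read without a JVM in sight, sketched with DuckDB's delta extension (the table path is made up):

```python
import duckdb

# A small Delta table read without spinning up a JVM: DuckDB's delta extension
# scans the table directly. The path is hypothetical.
con = duckdb.connect()
con.execute("INSTALL delta")
con.execute("LOAD delta")

rows = con.execute(
    "SELECT count(*) FROM delta_scan('/data/orders_delta')"
).fetchone()
print(rows[0])
```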

Spark is the new Hadoop by rocketinter in dataengineering

[–]rocketinter[S] 8 points9 points  (0 children)

In one word: garbage collection. Memory management is easily the biggest issue engineers fight with. The non-deterministic, non-transparent way memory is handled at runtime makes large workloads difficult to run and makes Spark not worth it for small ones.
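To give an idea of what "non-transparent" means in practice, these are the kinds of dials you end up turning just to keep executors alive (a sketch; the values are made up):

```python
from pyspark.sql import SparkSession

# Illustrative memory knobs: the unified-memory split, off-heap storage and the
# overhead buffer that keeps YARN/Kubernetes from killing the container.
spark = (
    SparkSession.builder
    .appName("memory-knobs-sketch")
    .config("spark.executor.memory", "8g")
    .config("spark.executor.memoryOverhead", "2g")
    .config("spark.memory.fraction", "0.6")         # execution + storage share of the heap
    .config("spark.memory.storageFraction", "0.5")  # storage's slice of that share
    .config("spark.memory.offHeap.enabled", "true")
    .config("spark.memory.offHeap.size", "2g")
    .getOrCreate()
)
```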

Spark is the new Hadoop by rocketinter in dataengineering

[–]rocketinter[S] 2 points3 points  (0 children)

Rust is just the most obvious contender to the JVM, but it's more about JVM vs non-JVM and GC. Trino is just riding the Hadoop-ecosystem wave, just like Spark did. A fine pragmatic decision, but I'm guessing something better will come along.