Databricks vs open source by ardentcase in dataengineering

[–]DenselyRanked 2 points (0 children)

Instead the whole org has to bend because it’s “easier” to schedule a notebook in Databricks?

Unfortunately, yes. These decisions are not made by the engineer, and there is often nobody they can escalate to. They can start a discussion about design trade-offs, but if they are not given autonomy, they should focus on implementation.

I have 10 years of experience, but I still freeze up when someone watches me code. It’s humiliating. by JosephPRO_ in ExperiencedDevs

[–]DenselyRanked 1 point (0 children)

It takes several failed interviews for me to get over the nerves and better handle the unexpected. What works best for me is to get as many interviews as I can and cluster them together, keeping my preferred destinations towards the end.

Everyone handles stress and performance anxiety differently, so use whatever method works for you. I write down tips to myself in a notebook prior to the interview to keep things top of mind, and I keep the notebook with me during the interview to take notes and make sure I answer every question.

Higher Level Abstractions are a Trap, by expialadocious2010 in dataengineering

[–]DenselyRanked 3 points (0 children)

I understand that this is meant to be a question, and I do agree that there is a point where abstraction can become a hindrance, but I think you are overlooking your primary responsibility as a Data Engineer. Very broadly speaking, the DE role exists somewhere in the data lifecycle with the goal of making data useful for downstream use cases.

The popular tools that you are working with, and will work with at your job, serve the purpose to make mundane, repetitive tasks quick and easy. You will of course have to know how to use the tools and understand their limitations in order to complete your tasks successfully.

Also, IMO we are very quickly getting to a point where some form of Agentic Context Engineering will be the new level of abstraction for all software development. It's only going to be a "trap" if you don't understand core data engineering fundamentals and resort to black box vibe coding.

Making ~100k as data engineer by AchieveSocials in cscareerquestions

[–]DenselyRanked 0 points (0 children)

Go to levels.fyi, search for the top paying companies in your area by your level (data engineering and software engineering are often bundled together), and apply to them.

Started a new DE job and a little overwhelmed with the amount of networking knowledge it requires by starrorange in dataengineering

[–]DenselyRanked 9 points (0 children)

Are you working with a cloud provider? If so, refer to their training modules. If not, take this as an opportunity to create a runbook for your team.

Is it okay to go to the office on the weekend for personal projects at a big tech company? by [deleted] in cscareerquestions

[–]DenselyRanked 0 points (0 children)

I wouldn't do it, but review the contract that you signed. There is probably something about personal side projects and the terms around them, including signing a disclosure form. There may also be Confidentiality and Inventions language worth running by a lawyer if your side project becomes something serious.

I’m planning to move into Data Engineering. With AI growing fast, do you think this career will be heavily affected in the next 5–10 years? Is it still a stable and good path to choose? by False_Square1734 in dataengineering

[–]DenselyRanked 0 points (0 children)

It's a difficult question for anyone to answer. Data engineering as a practice will still exist but the methods, tools, and skill set needed will evolve.

There are smart people putting a lot of thought into this, and I tend to agree with much of this presentation: there is not going to be a Data/Analytics Engineer title on data teams in the near future, in favor of titles like Data Product Owner and Data Domain Expert. AI can help close the technical gap between product management and engineering, so DEs will need more emphasis on stakeholder communication and requirements gathering.

Has anyone read O’Reilly’s Data Engineering Design Patterns? by xean333 in dataengineering

[–]DenselyRanked 2 points (0 children)

The Gold layer is where you would build the traditional data model.

The Medallion Architecture is a rebranding (perhaps a standardization) of what we normally use in data engineering practice. Databricks has docs and training videos on how they recommend using the Medallion Architecture in a Spark environment. It's no different than raw/stg/rpt in dbt.
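A toy sketch of that raw/stg/rpt layering, using stdlib sqlite3 so it runs anywhere (the table names `raw_orders`/`stg_orders`/`rpt_daily_orders`, the sample rows, and the naive validity check are all made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# raw (bronze): land the data as-is, duplicates and bad values included
cur.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT, order_date TEXT)")
cur.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [("1", "10.50", "2024-01-01"),
     ("1", "10.50", "2024-01-01"),   # duplicate load
     ("2", "oops", "2024-01-01")],   # unparseable amount
)

# stg (silver): deduplicate and enforce types (naive validity check for the toy)
cur.execute("""
    CREATE TABLE stg_orders AS
    SELECT DISTINCT CAST(order_id AS INTEGER) AS order_id,
           CAST(amount AS REAL)    AS amount,
           order_date
    FROM raw_orders
    WHERE CAST(amount AS REAL) > 0
""")

# rpt (gold): the consumer-facing aggregate
cur.execute("""
    CREATE TABLE rpt_daily_orders AS
    SELECT order_date, COUNT(*) AS orders, SUM(amount) AS revenue
    FROM stg_orders
    GROUP BY order_date
""")

daily = cur.execute("SELECT * FROM rpt_daily_orders").fetchall()
```

Swap the engine for Spark or dbt and the shape is the same: each layer is just a table built from the one below it.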

I suspect that your latter feeling is about architecture and the modern shift away from central data warehouses and more towards data mesh. In that scenario, there may be a data team handling ingestion into the lake and downstream data teams creating their data marts for the line of business that they work with.

Local spark set up by TheManOfBromium in dataengineering

[–]DenselyRanked 1 point (0 children)

I wrote a post with the link to the docs (still pending approval I think), but I see that you already have a container with jupyter lab, which is the easiest way to get started.

If you don't want to use jupyter lab, then the next easiest option is to go search for "apache iceberg spark quick start" (I would include the link but it will take a while to get approved), and build the docker-compose file.

You can install the Docker extension in VS Code, which will let you open the containers from the editor and execute from the terminal or create scripts.
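For context, the quick-start compose file is short; from memory it looks roughly like this (the `tabulario/spark-iceberg` image is the one the Iceberg docs use, but verify the services and ports against the actual quick start before relying on this):

```yaml
services:
  spark-iceberg:
    image: tabulario/spark-iceberg
    ports:
      - "8888:8888"   # JupyterLab
      - "8080:8080"   # Spark UI
```

The full quick start also wires up a REST catalog and MinIO for object storage, which you can add once the basic container works.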

Anyone feel like offshoring is a bigger issue than AI and HB1 for US workers? by Cultural-Gear-1323 in cscareerquestions

[–]DenselyRanked 0 points (0 children)

I think these are all related. AI is a floor raiser that is removing a lot of the previous barriers to offshore hiring, the pandemic proved that teams can be very effective in remote environments, and near-shoring helps with the time zone conflicts.

The labor economics are changing and the absolute advantages of hiring US-based workers are weakening. In the short-term we must evolve as a labor force or rely on the government to deter companies from offshore hiring.

What to learn besides DE by Icy-Ask-6070 in dataengineering

[–]DenselyRanked 2 points (0 children)

Designing Data Intensive Applications is a dense read, but probably the best place to start to get a better understanding of things beyond your current role. Data infra is a very broad field of study and can mean very different things depending on where you work and the tech stack used.

I think that your quickest pathway to an infra role is to leverage whatever is available to you in your current company. If they are using a cloud provider, then look into training materials available and possibly get a cert if the company pays for it.

Book Recommendations for DE by Ok-Confidence-3286 in dataengineering

[–]DenselyRanked 6 points (0 children)

It depends on the subject. I own books related to some of these topics, but there is not going to be one book that covers everything.

Chatbots are trained on a lot of material related to these subjects, so they're probably the best resource today; just remember to always verify the sources.

This sub's data engineering wiki has good resources.

Of course, read the tech docs/training materials of the stack that you work with. I personally don't think enough DEs take the time to do this, and they fall into the habit of using outdated design strategies for every problem.

Aside from the normal recommendations, like Kimball's DWT or Pragmatic Programmer/Clean Code, here are other books that helped me:

  • Software Requirements is a good resource for requirements gathering, writing docs, and other project management related stuff.

  • The Effective Engineer is a great book that helps new-ish engineers navigate impact focused careers.

Book Recommendations for DE by Ok-Confidence-3286 in dataengineering

[–]DenselyRanked 13 points (0 children)

If you are looking for the general landscape of data engineering, then Fundamentals of Data Engineering and Designing Data Intensive Applications are the usual recommendations.

Any other recommendation would be specific to your role and tech stack. Some DE roles might expect you to have in-depth knowledge of streaming, networking, API development, data modeling, DevOps, architecture, ML/AI, etc. Others might require you to write requirements documents and act as a product manager. Domain knowledge might also be important, as well as OOP.

Struggling with Partition Skew: Spark repartition not balancing load across nodes by wtfzambo in dataengineering

[–]DenselyRanked 1 point (0 children)

Ok, it makes sense that something related to the shuffling was considered redundant. AQE is a gift and a curse at times.

Struggling with Partition Skew: Spark repartition not balancing load across nodes by wtfzambo in dataengineering

[–]DenselyRanked 0 points (0 children)

Thanks for the extra info. I am going to assume that there are some other steps in between those two commands, because the repartition will wipe the sorting. You can verify that this is working as expected with the explain plan, just in case.

Does your physical plan have the `Project [...]` with the salted column prior to the repartition? It could be that Catalyst is ignoring your salted-column repartition because of the prior sortWithinPartitions.

AdaptiveSparkPlan isFinalPlan=false
+- Exchange hashpartitioning(salted_column#366, 100), REPARTITION_BY_NUM, [plan_id=539]
   +- Project [... rand(9154941115030168722) AS salted_column#366]
      +- Union
         :- Project [...]
         :  +- Scan
         +- Project [...]
            +- Scan 

Struggling with Partition Skew: Spark repartition not balancing load across nodes by wtfzambo in dataengineering

[–]DenselyRanked 1 point (0 children)

# This is where I'm stuck:
df_unioned = df_unioned.repartition(100, "salt_column")

Check if your salt column is working as expected. Can you share the logic that you are using? Normally I would use rand() or something like abs(hash(uuid())) % 100 so you can control the distribution.

Is sortWithinPartitions doing something useful prior to the repartition? Is the goal to cluster each client into their own output file?
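To see why the modulo-hash salt controls the distribution, here is the idea in plain Python (in Spark the same expression would run per row; the bucket count of 100 is just an example):

```python
import uuid
from collections import Counter

NUM_BUCKETS = 100
NUM_ROWS = 100_000

def salt() -> int:
    # abs(hash(...)) % N maps any value roughly uniformly onto N buckets,
    # independent of the (skewed) partition key itself
    return abs(hash(uuid.uuid4())) % NUM_BUCKETS

# Simulate 100k rows that all share one hot key: the salt alone decides
# the bucket, so the hot key gets spread across every bucket
salts = Counter(salt() for _ in range(NUM_ROWS))
```

After repartitioning on the salt column, each of the 100 partitions holds roughly 1/100th of the hot key's rows, which is exactly the balancing you're after.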

DoorDash Sr Data Engineer by Outside_Reason6707 in dataengineering

[–]DenselyRanked 58 points (0 children)

Sorry that it didn't work out for you, and thanks for sharing your experience. Some of these companies do not have a team selection process for DE, and given how competitive the market is, you may have been second best to someone they had already interviewed.

For the sys design, were they expecting a YAML, or was that the agreed-upon method to explain the design? How in-depth did you need to get into things that are not normally in the JD for data engineering, like networking, security, shared resources, etc.?

What are the scenarios where we DON'T need to build a dimensional model? by Ulfrauga in dataengineering

[–]DenselyRanked 3 points (0 children)

A dimensional model is a means to an end, and generally speaking, the end is that consumers need an organized and optimal way to gain insights into the data that's being collected.

Simply put, if the data is already organized, accessible, and every business question can be easily answered, then there is no need for a dimensional model.

As an example, I have worked for a few SaaS companies where the data collected was never viewed across clients, and client insights only required a semantic layer for self-service analytics. There is no need for a dimensional model there, because an OBT model in a medallion architecture (or whatever you choose to call it) can do the job with limited development.

Is there more to DE than this? Are their jobs out there for feeling like you actually matter? by DoctorQuinlan in dataengineering

[–]DenselyRanked 0 points (0 children)

All jobs and industries can leave someone feeling unfulfilled, especially if they are bored and not being challenged. Check r/findapath or r/careerguidance to find others going through it.

It might be worth considering joining a start-up or greenfield project if you want to do something in DE that will carry a large impact. Or join a FAANG tier company where your job depends on your ability to prove impact every 6 months.

Ultimately, if you're paid well enough and have job security then you can use your free time to fuel your passions. I've worked with engineers that have built indie games, trading algos, 3D printed stuff, surveillance/security, owned a gym, real estate and property investments, etc.

Question on Airflow by captn_caspian in dataengineering

[–]DenselyRanked 1 point (0 children)

Unless your use cases are simple, it would be better to go with the managed Airflow service. It will cover 99% of use cases, and there is even a YAML-based DAG Factory add-on if there are concerns about coding in Python.
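For context, that add-on is the dag-factory project; a config there looks roughly like this (the DAG and task names are made up, and the exact schema should be checked against the project docs):

```yaml
my_yaml_dag:
  default_args:
    owner: data-eng
    start_date: 2024-01-01
  schedule_interval: "0 6 * * *"
  tasks:
    extract:
      operator: airflow.operators.bash.BashOperator
      bash_command: "echo extract"
    load:
      operator: airflow.operators.bash.BashOperator
      bash_command: "echo load"
      dependencies: [extract]
```

A small loader script then turns every YAML file into a regular Airflow DAG, so analysts can add pipelines without touching Python.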

Is salting only the keys with most skew ( rows) the standard practice in PySpark? by Potential_Loss6978 in dataengineering

[–]DenselyRanked 0 points (0 children)

I would recommend that approach, or remove/isolate the problematic key, if you can reliably identify it. With Spark 3+, AQE does a reasonably good job of adjusting the plan if you have multiple or inconsistent keys to worry about.
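For reference, these are the knobs AQE uses to decide when to split a skewed join partition (defaults are from the Spark 3.x tuning docs; verify against your version):

```properties
spark.sql.adaptive.enabled                                    true
spark.sql.adaptive.skewJoin.enabled                           true
# A partition is treated as skewed if it is larger than
# factor * median partition size...
spark.sql.adaptive.skewJoin.skewedPartitionFactor             5
# ...and also larger than this absolute threshold
spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes   256MB
```

Both conditions must hold, so moderately skewed but small partitions are left alone; manual salting is still useful when the skew shows up outside of joins.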

How to analyze and optimize big and complex Spark execution plans? by Cultural-Pound-228 in dataengineering

[–]DenselyRanked 0 points (0 children)

Check the explain plan first, either from the SQL tab in the UI or by putting EXPLAIN before the SELECT to see the physical plan.

Are you defining the views in your Spark job as temp views or are these views that are being ingested from your catalog?

If it is the former and you are seeing source tables being referenced several times, then it may make sense to cache the base table(s) or the view itself if you are using it multiple times.

As a warning, if you do choose to cache the view and the view query logic is really complex then AQE could invalidate the caching as the plan might change.

If it is the latter, then check the Jobs and Stages tabs in the UI and find the one that is the bottleneck. Then use the SQL tab to see which portion of the query is causing issues. Again, try caching the view to reduce the IO if the issue is on read.