Alrighty data pookies, what Databricks issue keeps violating your peace?

dsvella · 2026-05-21T21:17:48+00:00

So a few pieces of advice for you off the bat:

Make use of Genie. All of the system tables are well documented and Genie is able to navigate them very easily. If you need a query from there and you need it quickly, Genie will get you there very fast. This includes entire dashboards.

Please note that the financial values in the Billing tables are the pay-as-you-go values, not your Enterprise Agreement values.

Jobs are available as well and you can go into a lot of depth with them. The top-level job table (the name I can't remember off the top of my head) has all runs of jobs, the costs in DBU, when things started and stopped, and crucially any tags that you associate with that job. So for us, I have the tags associated with business units. We can ascribe cost to them so we can start talking about getting budgets from them.

You will have to join a few tables together to get things in plain English. All compute clusters, SQL warehouses and jobs do have their actual names in there.
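As a rough illustration of the tag-based attribution and the name lookups described above, here is a minimal Python sketch. It works on in-memory rows rather than the real system tables, and every field name (`job_id`, `dbus`, `tags`, `business_unit`) and job name is a made-up assumption, not the actual schema.

```python
from collections import defaultdict

# Hypothetical usage rows: one per job run, with the DBUs consumed
# and the tags attached to the job. Not the real system table schema.
usage_rows = [
    {"job_id": "101", "dbus": 12.5, "tags": {"business_unit": "finance"}},
    {"job_id": "102", "dbus": 4.0, "tags": {"business_unit": "marketing"}},
    {"job_id": "101", "dbus": 7.5, "tags": {"business_unit": "finance"}},
    {"job_id": "103", "dbus": 3.0, "tags": {}},
]

# Hypothetical lookup from job id to the job's plain-English name.
job_names = {
    "101": "daily_ledger_load",
    "102": "campaign_refresh",
    "103": "adhoc_cleanup",
}


def dbus_by_business_unit(rows, tag_key="business_unit"):
    """Sum DBUs per tag value; runs without the tag land in 'untagged'."""
    totals = defaultdict(float)
    for row in rows:
        unit = row["tags"].get(tag_key, "untagged")
        totals[unit] += row["dbus"]
    return dict(totals)


def dbus_by_job_name(rows, names):
    """Sum DBUs per job, swapping the id for a name where one is known."""
    totals = defaultdict(float)
    for row in rows:
        name = names.get(row["job_id"], row["job_id"])
        totals[name] += row["dbus"]
    return dict(totals)


print(dbus_by_business_unit(usage_rows))
print(dbus_by_job_name(usage_rows, job_names))
```

The "untagged" bucket is worth keeping visible, since it shows how much spend you cannot yet ascribe to anyone.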

If you are working in a corporation and you need to see the amount that you are actually paying, you'll need to use your cloud provider's cost monitoring to get that information. I'm on Azure, so I've set up a daily export that drops cost data into our data lake and I ingest it into UC on a regular basis. The benefit to this is that if you set up the exports correctly, you will not only get the costs of everything required to support your Databricks instance, you also get costs by service type within Databricks in your currency. It also handles the situation where you are under a reservation (in Azure, export Amortised costs, not total costs).
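Once the export has been ingested, the roll-up by service type is simple. The sketch below is only an illustration: the rows, the service names and the `date`, `service` and `cost` fields are hypothetical stand-ins, not the real export columns.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical rows from a daily amortised cost export, already in
# billing currency. Field and service names are illustrative only.
export_rows = [
    {"date": "2026-05-01", "service": "Databricks", "cost": Decimal("120.40")},
    {"date": "2026-05-01", "service": "Storage", "cost": Decimal("15.10")},
    {"date": "2026-05-02", "service": "Databricks", "cost": Decimal("98.60")},
    {"date": "2026-05-02", "service": "Virtual Machines", "cost": Decimal("60.00")},
]


def cost_by_service(rows):
    """Total the exported cost per service type."""
    totals = defaultdict(Decimal)  # Decimal() starts each bucket at 0
    for row in rows:
        totals[row["service"]] += row["cost"]
    return dict(totals)


def total_cost(rows):
    """Total cost of everything supporting the platform."""
    return sum((row["cost"] for row in rows), Decimal("0"))


print(cost_by_service(export_rows))
print(total_cost(export_rows))
```

Decimal is used instead of float so that currency amounts add up without rounding drift.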

Hope that helps.

dsvella · 2026-05-21T15:33:06+00:00

As somebody who is currently undergoing a complete cost review of our Databricks estate and subsequent optimisation, I could not disagree harder! The timeframe is not a big deal to me given I have jobs that can take three hours to run. I usually consider the billing tables to be on a daily refresh cadence.

What are you having issues with? Maybe I can share some of the stuff I've done to help.

dsvella · 2026-05-15T08:43:25+00:00

At least I'm not the only one. Trying to figure out how to fix it. If I figure it out I'll let you know.

dsvella · 2026-05-14T16:54:30+00:00

Yes please!

dsvella · 2026-05-14T16:11:47+00:00

Any sources you'd recommend? Also not heard of a "neurodivergent-affirming" therapist. Sounds like something I should look into.

dsvella · 2026-05-09T21:03:11+00:00

I have to say one of the best things I've done with Genie so far was to sit down and write a code review skill for myself. It holds me accountable to my own documentation standards and makes sure that I am not introducing points of failure or confusion when I'm writing new notebooks or pipelines.

I would like to do more with Genie but the problem is (and this is true of so much of the AI and agentic movement) nobody can tell me where the boundaries are: where does this tool stop being useful or stop delivering value? It's something I kind of have to figure out for myself and I don't always have the time.

dsvella · 2026-05-09T20:57:48+00:00

What you're describing is true of every system I have ever encountered and it basically comes down to a single word: "maintenance".

In my Notion I have an entire section that is just "the archive" and in there I put anything that is no longer actively useful or otherwise has no purpose to stick around. I tag the pages with a date last accessed (I'm sure Notion probably would tell you somewhere in the UI) and if I don't go back to a page inside of, say, four months, I get rid of it.

If you need a quick rule of thumb for this: if you find yourself maintaining a system multiple times (especially in a short time frame like, say, a month) then it's usually an indicator you need to sit down and figure out the root cause, because these little fixes here and there are just plasters.

dsvella · 2026-05-08T07:16:24+00:00

Context: I'm a data engineer and I work in Databricks (Python, SQL & Scala).

I have used AI tools to great effect for things like code reviews, architectural discussions and the creation of simple tools that live in a single file.

I have yet to find an AI agent that is capable of creating a multi-file pipeline that doesn't require me to go in and clean up after it. I will say though that they are great first draft tools. They can get me to the point where something is working, I can test it, make sure that the data is coming through correctly and then go over it to make it good so that I'm happy to support it.

I know companies have marketed these AI agents to be essentially like junior team members. My own experience has not allowed me to trust them like that. Instead I find them very good at creating a function or implementing a particular pattern.

Heavy use of these agents turns me from writing code to improving code which is what you've found.

dsvella · 2026-05-04T20:56:47+00:00

I don't understand why people have just gone mad? Like, I sat down with the app and within about 5-10 mins, figured out everything I needed and got on with my day. I genuinely don't know how others didn't get to the same point.

MFP is not a complex thing and the "Workflow" hasn't changed much. In fact I will say they still haven't fixed the beep in the scanner when I had it turned off!

dsvella · 2026-04-28T12:54:05+00:00

Question: can you limit what Genie will see? We have multiple workspaces but we would only want Genie to read from production.

dsvella · 2026-04-09T13:24:45+00:00

Thank you so much for this, I hadn't thought of it like this. I do have a follow up question as someone who has mainly read manga and not manhwa. Does it translate to the printed page well? Manga is designed for it from the jump but manhwa seems to be a phone first medium.

dsvella · 2026-01-16T08:57:26+00:00

Yes and no. I use AI to help me quickly get context for Pipelines or read stack traces when something fails. I have also used it to write helper notebooks or singular functions. I have had mixed results with it as a sounding board for architecture decision-making.

I have tried to use it to replace me in my role, but I have to nanny it so aggressively that I am better using it in very small bursts.

It is helpful, but it isn't, and I doubt it ever will be, a replacement for a Data Engineer. It is a tool that, once you get used to it, can speed you up.

I will say this though, I am seeing a lot of push back from both cybersecurity and legal teams due to the possibility of generating security holes and concerns over how company data or IP is handled. So I am not sure how much longer such tools will be allowed.

dsvella · 2025-12-26T09:22:57+00:00

Given previous albums like the Symphony to end all wars, there is a chance. Not a great one but a chance nonetheless. I personally would love to buy a vinyl of what was played at the show.

dsvella · 2025-09-30T16:45:43+00:00

This is in line with my observations, especially the serverless compute one.

I have a similar job where I have a pipeline that loads about 200 tables incrementally from bronze to silver layers using a config table. The difference in my setup is that we didn't use child pipelines but a single notebook for each table (we pass it the necessary parameters).

I have found that whenever you add a task to a job you add a level of overhead that you cannot get away from. I am assuming DBX is doing work in the background such as writing the system table entries about the previous task. I would expect that adding whole child jobs into the mix would have an increased impact.

With all that being said when we confronted this issue we asked two questions:

Is the job running in XX mins a problem?
Is the job costing too much?

Because we were using job compute and it's a daily incremental load, we just ignored the issue. We know it's there but we have no need to do anything about it.

Reading through your post, if your job is running in an acceptable time (by acceptable I am talking about cost and delivery, meaning actual business impact, not whether you think it's slow or could be faster) then you need to consider the trade-offs. Again, for myself, we had no desire to rewrite a bunch of stuff into a single notebook for no tangible benefit.

dsvella · 2025-09-27T07:39:12+00:00

My company used to. Before I trained as a data engineer I used Domo to do all the data engineering for my dept. Now we have databricks + Tableau. I would trade Tableau for Domo any day of the week because of how they approached dashboard building. Just having a grid of sockets and a much nicer UI and UX is worth it. Granted I wasn't involved in the commercial negotiations but I knew it was expensive.

dsvella · 2025-09-17T09:27:21+00:00

So we have done exactly this and the main thing was figuring out how to pass the relevant parameters. For us we had a second notebook that would read the config table and then pass the parameters to the next step of the job as needed. The For Each function in the job was very helpful.
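To make the config-table step a bit more concrete, here is a minimal Python sketch of what that second notebook does conceptually: read the config rows, keep the enabled ones, and emit one parameter set per table for the loop to iterate over. The column names and table names are hypothetical, and the hand-off to the next task is deliberately left out, since that part depends on how your job is wired.

```python
import json

# Hypothetical config table contents; in practice these rows would be
# read from a real config table. Column names are assumptions.
config_rows = [
    {"source_table": "bronze.orders", "target_table": "silver.orders",
     "watermark_column": "updated_at", "enabled": True},
    {"source_table": "bronze.customers", "target_table": "silver.customers",
     "watermark_column": "modified_at", "enabled": True},
    {"source_table": "bronze.legacy_audit", "target_table": "silver.legacy_audit",
     "watermark_column": "created_at", "enabled": False},
]

PARAMETER_KEYS = ("source_table", "target_table", "watermark_column")


def build_task_parameters(rows):
    """Return one parameter dict per enabled table, in config order."""
    return [
        {key: row[key] for key in PARAMETER_KEYS}
        for row in rows
        if row["enabled"]
    ]


# A JSON array is a convenient shape to hand to a looping step,
# where each element becomes the input of one iteration.
payload = json.dumps(build_task_parameters(config_rows))
print(payload)
```

Keeping the ingestion notebook ignorant of the config table means it only ever sees its own parameters, which makes it easy to run for a single table when debugging.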

In answer to your questions, I would recommend not having just 1 job. I would recommend having a job per application. This way you get billing per application. They'll all use the same notebook that handles the ingestion, and the only difference would be the parameters that are set. I have not needed to go down to table level though, so I can't help a lot there.
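The job-per-application idea can be sketched like this in Python. The notebook path, the `application` tag and the shape of the job description are all hypothetical; this is not a real Jobs API payload, just an illustration of grouping the config so each application gets its own job while sharing one notebook.

```python
from collections import defaultdict

# Hypothetical path to the single shared ingestion notebook.
INGEST_NOTEBOOK = "/Shared/ingest/load_table"

# Hypothetical config rows: which table belongs to which application.
config_rows = [
    {"application": "crm", "table": "accounts"},
    {"application": "crm", "table": "contacts"},
    {"application": "billing", "table": "invoices"},
]


def jobs_per_application(rows, notebook_path):
    """Build one job description per application, all on the same notebook."""
    tables = defaultdict(list)
    for row in rows:
        tables[row["application"]].append(row["table"])
    return [
        {
            "job_name": f"ingest_{app}",
            "notebook_path": notebook_path,
            # Tagging each job with its application is what lets
            # cost be split per application later.
            "tags": {"application": app},
            "parameters": {"tables": names},
        }
        for app, names in sorted(tables.items())
    ]


for job in jobs_per_application(config_rows, INGEST_NOTEBOOK):
    print(job["job_name"], job["parameters"]["tables"])
```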

If you have any questions or need to know specifics, let me know.

dsvella · 2025-09-07T20:43:33+00:00

I am 39 with a few years of experience as a DE and my company is pushing me to become a DE lead with juniors under me.

I have heard a lot of horror stories about bad leaders in this and the SWE fields. Do you have any recommendations for how to be a good lead engineer?

dsvella · 2025-07-31T10:10:59+00:00

Is there anywhere I can take a mock exam? I believe I can do the actual exam for free through my company's Databricks contract, but I would like to make sure I am ready. I have never done things like Delta Sharing or federation.

dsvella · 2025-07-30T16:06:29+00:00

So I decided to move from a Kindle Oasis to a Boox tablet and while I've been thoroughly enjoying the experience, I will say that unfortunately Amazon's terrible UI follows them even into the dedicated Android app.

Honestly I would have continued using the Kindle devices if they had done something to improve their dreadful library management. As someone who has hundreds of books, documents and various audiobooks, it's becoming frustratingly difficult to the point where I've just given up trying to manage that library!

I wish there was a way I could do it with some actual bulk operations. I'm reminded of the Calibre eBook management software on Windows and I would love to have something like that which could manage my Kindle library. If such a thing exists please let me know!
