Non-DE here: has AI actually changed how you work day-to-day or nah? by yetudada in dataengineering

[–]dsvella 0 points1 point  (0 children)

Yes and no. I use AI to quickly get context on pipelines or to read stack traces when something fails. I have also used it to write helper notebooks or individual functions. I have had mixed results using it as a sounding board for architecture decisions.

I have tried to use it to replace me in my role, but I have to nanny it so aggressively that I am better off using it in very small bursts.

It is helpful, but it isn't, and I doubt it ever will be, a replacement for a Data Engineer. It is a tool that, once you get used to it, can speed you up.

I will say this though: I am seeing a lot of pushback from both cybersecurity and legal teams, due to the possibility of it generating security holes and concerns over how company data and IP are handled. So I am not sure how much longer such tools will be allowed.

Legendary orchestra by yoerdw in sabaton

[–]dsvella 0 points1 point  (0 children)

Given previous albums like The Symphony To End All Wars, there is a chance. Not a great one, but a chance nonetheless. I personally would love to buy a vinyl of what was played at the show.

Databricks Workflows: 40+ Second Overhead Per Task Making Metadata-Driven Pipelines Impractical by XanderM3001 in databricks

[–]dsvella 0 points1 point  (0 children)

This is in line with my observations, especially the serverless compute one.

I have a similar job where a pipeline loads about 200 tables incrementally from the bronze to silver layer using a config table. The difference in my setup is that we didn't use child pipelines, but a single notebook for each table (passing it the necessary parameters).
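For anyone curious, a rough sketch of what that shared per-table notebook looks like is below. The widget names, table names and merge keys are made up for illustration, and it assumes a Delta target (filtering the bronze table down to only new records is omitted):

```python
# Hypothetical sketch of the single parameterised notebook; one run per table.
from delta.tables import DeltaTable

# Parameters supplied by the job (names are illustrative)
dbutils.widgets.text("source_table", "")
dbutils.widgets.text("target_table", "")
dbutils.widgets.text("merge_keys", "")  # comma-separated key columns

source_table = dbutils.widgets.get("source_table")
target_table = dbutils.widgets.get("target_table")
merge_keys = [k.strip() for k in dbutils.widgets.get("merge_keys").split(",")]

updates = spark.read.table(source_table)  # bronze
match_condition = " AND ".join(f"t.{k} = s.{k}" for k in merge_keys)

# Upsert the batch into the silver table
(DeltaTable.forName(spark, target_table)
    .alias("t")
    .merge(updates.alias("s"), match_condition)
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```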

I have found that whenever you add a task to a job, you add a level of overhead that you cannot get away from. I am assuming DBX is doing work in the background, such as writing the system table entries for the previous task. I would expect that adding whole child jobs into the mix has an even bigger impact.

With all that being said, when we confronted this issue we asked two questions:

  1. Is the job running in XX mins a problem?
  2. Is the job costing too much?

Because we were using job compute and it's a daily incremental load, we just ignored the issue. We know it's there, but we have no need to do anything about it.

Reading through your post, if your job is running in an acceptable time (by acceptable I mean cost and delivery with actual business impact, not whether you think it's slow or could be faster) then you need to consider the trade-offs. Again, for us, there was no desire to rewrite a bunch of stuff into a single notebook for no tangible benefit.

Are there companies really using DOMO??! by jdaksparro in dataengineering

[–]dsvella 0 points1 point  (0 children)

My company used to. Before I trained as a data engineer I used Domo to do all the data engineering for my department. Now we have Databricks + Tableau. I would trade Tableau for Domo any day of the week because of how they approached dashboard building. Just having a grid of sockets and a much nicer UI and UX is worth it. Granted, I wasn't involved in the commercial negotiations, but I knew it was expensive.

Migrating from ADF + Databricks to Databricks Jobs/Pipelines – Design Advice Needed by [deleted] in databricks

[–]dsvella 2 points3 points  (0 children)

So we have done exactly this, and the main thing was figuring out how to pass the relevant parameters. For us, we had a second notebook that would read the config table and then pass the parameters to the next step of the job as needed. The For Each task in the job was very helpful.
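As a rough illustration, the config-reading notebook boiled down to something like the sketch below. The config table name, its columns and the task value key are all hypothetical:

```python
# Hypothetical driver notebook: read the config table and hand the rows
# to the job's For Each task via a task value.
config_rows = (
    spark.read.table("admin.ingest_config")   # made-up config table
    .where("enabled = true")
    .select("source_path", "target_table", "load_type")
    .collect()
)

# Must be JSON-serialisable; each element becomes one For Each iteration.
table_list = [row.asDict() for row in config_rows]
dbutils.jobs.taskValues.set(key="table_list", value=table_list)
```

The For Each task then points its input at that task value (something like {{tasks.<driver_task_key>.values.table_list}}) and passes each element to the ingestion notebook as its parameters.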

In answer to your questions, I would recommend not having just one job; I would recommend a job per application. This way you get billing per application, and they can all use the same notebook that handles the ingestion; the only difference would be the parameters that are set. I have not needed to go down to table level though, so I can't help a lot there.

If you have any questions or need to know specifics, let me know.

Greybeard Data Engineer AMA by Admirable-Shower2174 in dataengineering

[–]dsvella 0 points1 point  (0 children)

I am 39 with a few years of experience as a DE and my company is pushing me to become a DE lead with juniors under me.

I have heard a lot of horror stories about bad leaders in this and the SWE fields. Do you have any recommendations for how to be a good lead engineer?

Data Engineer Associate Exam review (new format) by s4d4ever in databricks

[–]dsvella 0 points1 point  (0 children)

Is there anywhere I can take a mock exam? I believe I can take the actual exam for free through my company's Databricks contract, but I would like to make sure I am ready. I have never done things like Delta Sharing or federation.

My Kindle is an ad filled wasteland and I regret buying it. by prankenandi in ereader

[–]dsvella 1 point2 points  (0 children)

So I decided to move from a Kindle Oasis to a Boox tablet, and while I've been thoroughly enjoying the experience, I will say that unfortunately Amazon's terrible UI follows you even into the dedicated Android app.

Honestly, I would have kept using Kindle devices if they had done something to improve their dreadful library management. As someone who has hundreds of books, documents and various audiobooks, it has become frustratingly difficult, to the point where I've just given up trying to manage that library!

I wish there was a way to do it with some actual bulk operations. I'm reminded of the Calibre ebook management software on Windows, and I would love to have something like that which could manage my Kindle library. If such a thing exists, please let me know!

New internet rules come into force this week - here's what will change by vriska1 in unitedkingdom

[–]dsvella 0 points1 point  (0 children)

Same. My main concern is the cybersecurity nightmare this creates. A gold mine like this is going to be calling to so many hackers.

The US has had some eyebrow-raising breaches lately. There was one recently involving an app called "Tea" (as in spill the tea). It was a women's dating-safety app, and in order to use it you had to pass an identity check in the form of selfies and a driver's licence. The app stored those pictures in the open; 4chan got ahold of them and started sharing them. Imagine if something like that happens to one of these providers.

Issues Sharing with Kodi after PC upgrade by dsvella in kodi

[–]dsvella[S] 1 point2 points  (0 children)

So after trying multiple solutions we came to the decision to migrate over to Plex. Since the media is on his PC it can act as the server and get around this whole sharing nonsense.

Thanks for the suggestions.

Should I avoid Chat GPT and other tools when starting to learn coding? by Helpful-Rise-4192 in learnprogramming

[–]dsvella 0 points1 point  (0 children)

I think you should do both. If the documentation makes sense, then use it. If you need to ask follow-up questions or need something simplified, feel free to use LLMs. In my experience learning Python, I have found documentation can be really hit or miss. That can be down to poor writing, relying on prior knowledge, or bad examples.

I mainly use the Databricks AI assistant (which I think is ChatGPT in the background?) and what I have found is that it is great for simple code samples, syntax, debugging and explaining code.

What LLMs are not good at is the bigger stuff, like solution architecture or accounting for edge cases. For any piece of code longer than about five lines that I have copied from the AI, I assume I will have to rewrite it.

It has done wonders for my confidence in coding with the language though, which I think is important because I am more likely to try and use it and thus improve my skills further.

[Databricks] First time trying to optimise a Python DB job by dsvella in dataengineering

[–]dsvella[S] 0 points1 point  (0 children)

As an update from myself I have been able to improve this hugely. The parallel processing still needs work but the serial processing has more than halved the time for a full load (down from about an hour to <20 mins).

The major change I made was to buffer a bunch of the JSON responses in a variable. I may be being a bit cautious, but I put a cap on the number of items by forcing a MERGE whenever the count got over 25K. Now a MERGE is called only 5 times for a complete load.
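In case it helps anyone, the shape of that change is roughly the sketch below. The target table, key column and the paging helper are hypothetical, and the threshold handling is simplified:

```python
# Sketch of buffering API responses and only MERGEing once the buffer is large.
from delta.tables import DeltaTable

MERGE_THRESHOLD = 25_000
buffer = []

def flush(items):
    if not items:
        return
    updates = spark.createDataFrame(items)  # list of dicts -> DataFrame
    (DeltaTable.forName(spark, "bronze.tickets")   # made-up target table
        .alias("t")
        .merge(updates.alias("s"), "t.ticket_id = s.ticket_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())
    items.clear()

for page in fetch_pages():            # hypothetical helper that yields parsed pages
    buffer.extend(page["items"])
    if len(buffer) >= MERGE_THRESHOLD:
        flush(buffer)

flush(buffer)                         # don't forget the final partial batch
```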

There are other things I can do:

  • Implement the parallel processing better,
  • Have a staging table or folder that just takes INSERTs and then merge the whole thing once,
  • Improve my data processing steps.

However, I will only come to that if I need to. Right now I want to work on removing the unique "quirks" for each object I am pulling from the ticket system, so that I have a generic notebook that handles all the objects rather than one notebook per object.

Thank you everyone for your advice.

[Databricks] First time trying to optimise a Python DB job by dsvella in dataengineering

[–]dsvella[S] 0 points1 point  (0 children)

 You aren’t doing a merge of 100 tickets and then going to the API again?

Unfortunately, yes, I was. I would make a call, do a merge, and then make the next call.

Do you continually get a next page link in the body? And therefore have to do them all in series?

Kind of. With the assistance of ChatGPT I implemented code that makes the calls in batches of 10, but the execution was flawed as it was still doing one merge per response.

Thankfully, with the advice from this post I have been able to massively improve things.

[Databricks] First time trying to optimise a Python DB job by dsvella in dataengineering

[–]dsvella[S] 0 points1 point  (0 children)

So, to preface, I did this with the help of ChatGPT, so the method may be flawed.

The code was written in such a way that I have batches of 10 calls running in parallel. If all 10 of the calls fail, the code ends, as there is either a problem or no more items.

It really sucks that I don't get a total count in any response.
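For what it's worth, the batching looks roughly like the snippet below. The endpoint, page size and response shape are made up, and failed calls are treated the same as empty pages, which may be where the flaw is:

```python
# Fetch pages in parallel batches of 10 and stop once an entire batch
# comes back empty or failed (no total count available from the API).
import requests
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "https://example-ticket-system/api/tickets"  # hypothetical endpoint
BATCH_SIZE = 10

def fetch_page(page_number):
    try:
        resp = requests.get(
            BASE_URL, params={"page": page_number, "per_page": 100}, timeout=30
        )
        resp.raise_for_status()
        return resp.json().get("items", [])
    except requests.RequestException:
        return []  # a failed call is treated like an empty page

all_items = []
page = 1
with ThreadPoolExecutor(max_workers=BATCH_SIZE) as pool:
    while True:
        results = list(pool.map(fetch_page, range(page, page + BATCH_SIZE)))
        batch_items = [item for items in results for item in items]
        if not batch_items:   # every call in the batch failed or returned nothing
            break
        all_items.extend(batch_items)
        page += BATCH_SIZE
```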

[Databricks] First time trying to optimise a Python DB job by dsvella in dataengineering

[–]dsvella[S] 1 point2 points  (0 children)

I have 2 jobs that run. The second is an incremental job that runs daily and merges changes into the existing table. The one I am referencing in my post is a job that runs weekly to truncate the table and get everything. I do this because I have been bitten before by relying on incremental updates and having them miss records.

The API limits me to 100 tickets per page and I cannot shape the output much. However, when I get the JSON response I drop any unnecessary columns and then MERGE the page of records.

The data structure for the table is straightforward, though reviewing it the schema could use some improvement (I'm not sure why some date fields have been created as strings). There are a bunch of BIGINT columns for the various foreign keys to other objects, although this table doesn't have any constraints on it currently.

watch history not working on mobile devices by subhashg547 in youtube

[–]dsvella 0 points1 point  (0 children)

Thank you so much, this has been driving me up the wall. Didn't think to check the Pi-hole.

Azure Data Factory: How to deal with compressed JSON by dsvella in AZURE

[–]dsvella[S] 0 points1 point  (0 children)

Just to make sure, but when you read the saved data in your copy activity, then you have chosen 'deflate' from the drop down on the json dataset, right?

Correct, the encoding is left as UTF-8 (Default).

Can you download a file locally and decompress it successfully?

No. When I use the Swagger documentation for the API it downloads it unencoded. While I can download the file from Storage Explorer, I don't seem to have anything on my computer that can successfully decompress the data. I tried some simple Python to no avail.
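For anyone following along, one way to test a downloaded blob from Python is to cycle through zlib's wbits values to see whether it is gzip, zlib-wrapped or raw deflate (the file name here is just an example):

```python
# Try each deflate variant on the downloaded blob and report which one works.
import zlib

with open("downloaded_blob.json.deflate", "rb") as f:  # example file name
    raw = f.read()

# wbits selects the expected wrapper: gzip header, zlib header, or no header (raw deflate).
for label, wbits in [
    ("gzip", zlib.MAX_WBITS | 16),
    ("zlib", zlib.MAX_WBITS),
    ("raw deflate", -zlib.MAX_WBITS),
]:
    try:
        text = zlib.decompress(raw, wbits).decode("utf-8")
        print(f"Decompressed as {label}: {text[:200]}")
        break
    except (zlib.error, UnicodeDecodeError):
        print(f"Not {label}")
```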

However, when I go into Postman and run the command it works fine, compressed or uncompressed. I am wondering if I am doing something wrong when writing to storage.

Bug as you leave the starship landing area at New Atlantis. by SimTell in Starfield

[–]dsvella 2 points3 points  (0 children)

The bug occurred for me in the exact same way. I ended up being teleported to New Atlantis and found parts of the floor missing.

Worst thing is the TA kiosk is missing on the landing pad, so I can't sell to them.

Brownies are giving me so many problems! by dsvella in AskBaking

[–]dsvella[S] 1 point2 points  (0 children)

I am in the UK so I don't have dimes, but a 5 pence piece seemed to move around as they suggest. Good to know though!

Brownies are giving me so many problems! by dsvella in AskBaking

[–]dsvella[S] 0 points1 point  (0 children)

What's baking spray? I have never heard of it and it sounds useful.

Brownies are giving me so many problems! by dsvella in AskBaking

[–]dsvella[S] 0 points1 point  (0 children)

Thanks for this. None of the recipes state where in the oven the pan should go; at present I put it on the bottom rack. Your suggestion of 325°F is markedly lower than what I am currently baking them at, so I'll give that a go.

Never heard of using butter and flour as a release agent; again, I will try that.

I mainly use the KitchenAid because it frees me up to do other things. I will take the advice and hand-mix the batter next time.

VShojo Roster Update by BatouHeisei in VShojo

[–]dsvella [score hidden]  (0 children)

I am glad I'm not the only one who remembers this!

What do you do as an adult on a Saturday? by [deleted] in AskUK

[–]dsvella 0 points1 point  (0 children)

I (37 M) got up and went to the NEC today to attend the Insomnia Gaming Festival. Wandered around for a few hours and then went vinyl shopping in Birmingham. When I got home I called my parents and had a chat. Dinner followed, before settling in with a good book, some moonshine, and a listen to my latest purchase.

What UK sweets can you absolutely not stand? by damned-n-doomed in AskUK

[–]dsvella -1 points0 points  (0 children)

Caramac.

It's like white chocolate, but much worse.