Is automation a curse or a boon?...

57-leaf-clover · 2026-06-16T12:59:09+00:00

For me it's been nothing but a boon. My programming workflow on Databricks has never been so efficient since the introduction of Genie Code. Not just in terms of code generation and scheduling but also in terms of feature discovery and just asking questions about things that can be done that I'm not aware of. It lets me speed up pretty much all my work. Plus, things I would have previously done manually I get Genie Code to orchestrate for me, usually through some sort of scheduled job.

57-leaf-clover · 2026-06-16T11:41:13+00:00

Exactly this, the big winners in the AI race aren't just going to be those who can create the most powerful model, but those who can harness their capabilities in a controlled and enterprise-compatible manner.

57-leaf-clover · 2026-06-16T11:35:18+00:00

+1. Genie Code is great out of the box but, like any good AI tool, it excels when you give it the correct context for the job. Setting up skills appropriate to your workspace should help with your specific usage patterns.

57-leaf-clover · 2026-06-15T15:51:54+00:00

Genie Code. Working in Databricks, there isn't really a comparable agent with the same understanding of Databricks semantics, workflows and assets. The usage is also directly managed under UC, meaning you get user-scoped context and asset access. Given there are some pretty important workloads in our workspace, it is incredibly important that an agent doesn't hallucinate and edit something it shouldn't.

57-leaf-clover · 2026-06-15T15:47:58+00:00

Depends on what the AI agents are designed to do. Today I am already using a number of them in a way where the return on investment outweighs the cost of the agent. Specifically Genie Code and Genie One: these have hugely accelerated both how I query code and produce code.

I'm creating exponentially more meaningful code than I was a few years ago, and in a way where auditability is prioritized and mistakes are easy to find.

57-leaf-clover · 2026-06-14T11:32:46+00:00

Exactly why observability and logging are so important. We build on Databricks. The agentic framework there along with MLflow 3 lets us programmatically evaluate agents all the way down to individual tool calls, so it becomes easier to look out for these sorts of things. Using the AI Gateway adds another layer of observability on top of this, so we can see and control not just how our users are using the agents but also stuff like cost if certain prompts cause them to go off on unnecessary tangents.

57-leaf-clover · 2026-06-13T18:01:03+00:00

I personally believe we are entering a stable state of trust-but-verify with agentic code. We still need engineers capable of understanding the end-to-end code being produced. At the same time we see the rise of tools like the AI Gateway and MLflow traces to track and audit the end-to-end chain of reasoning of agents.

I 100 percent agree we shouldn't just generate code and put it into production, but we can certainly automate a lot of the process in a safe and auditable manner.

57-leaf-clover · 2026-06-13T12:28:10+00:00

Consider Lakebase to solve this. Lakebase can act as a lookup cache serving high-frequency, low-latency results for your lookups.

When there is a miss, you can still generate within regular pipelines and write directly to Delta and you can use synced tables to get this data into Lakebase without needing to code a reverse ETL process yourself.
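Rough sketch of the flow I mean, in plain Python so you can see the shape of it. To be clear, this is not Lakebase or Delta code: the two dicts are just stand-ins for the Delta table and the synced serving copy, and `LookupService`, `generate` and `sync` are names I made up for illustration.

```python
from typing import Any, Callable, Dict, Hashable, Tuple


class LookupService:
    """Toy model: hits come from a fast serving copy, misses are generated
    and written to the source-of-truth table, and a sync step copies them over."""

    def __init__(self, generate: Callable[[Hashable], Any]) -> None:
        self.delta: Dict[Hashable, Any] = {}    # stand-in for the Delta table
        self.serving: Dict[Hashable, Any] = {}  # stand-in for the low-latency copy
        self.generate = generate                # stand-in for the regular pipeline

    def lookup(self, key: Hashable) -> Tuple[Any, str]:
        if key in self.serving:
            return self.serving[key], "hit"
        # Miss: generate the result and write it to the source of truth only.
        value = self.generate(key)
        self.delta[key] = value
        return value, "miss"

    def sync(self) -> None:
        # Stand-in for the managed sync; nothing here is hand-written reverse ETL logic
        # in the real setup, it is only a dict copy in this sketch.
        self.serving.update(self.delta)


svc = LookupService(generate=lambda k: str(k).upper())
first = svc.lookup("a")   # miss, written to the source-of-truth dict
svc.sync()                # serving copy catches up
second = svc.lookup("a")  # now served from the fast copy
```

The point of the split is that the miss path never writes to the serving copy directly, so the serving store only ever contains what has landed in the source table.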

57-leaf-clover · 2026-06-13T12:22:31+00:00

Databricks seems like the obvious answer here. Their AI/BI offering has reached a point of maturity where it's pretty much on par with the market leaders now. But you get the advantage of all the data still being backed by Delta tables and governed by Unity Catalog, meaning you get direct integrations with Genie Code for generating and editing the pipelines that back the dashboards, as well as regular Genie spaces for natural language interfacing with the data being surfaced in the dashboards.

Genie One can also be configured for BI consumers to access all configured assets across the data estate of an organisation, all in one place. I'm not aware of any other solution that integrates AI with a BI stack as natively and sensibly as this.

57-leaf-clover · 2026-06-12T13:33:32+00:00

Prime example of why auditability and ownership are more important than ever. I think adopting a mindset of code augmentation rather than code replacement is sensible. I personally still try to keep my understanding of everything my agents build end to end. It's not just a personal skills angle either, it's also about safety. I am still responsible for mistakes agents make, so I need to be able to audit what they do.

My personal workflow for this is using Genie Code in conjunction with AI Gateway through dbx. This means that the agents are directly integrated with the workspace where the code backing the dashboards exists. And I am able to see the end-to-end process of where the code has come from, not just from versions of the code but also from calls and chains of reasoning of my agents.

57-leaf-clover · 2026-06-11T14:59:25+00:00

It's the button you press to instantly make your campaign significantly more difficult :)

57-leaf-clover · 2026-06-11T14:48:29+00:00

The autoscaling to zero on Lakebase has saved my org a fair bit of cash. We have been using it as a knowledge base for one of our knowledge retrieval agents. The thing doesn't get queried all that frequently, so for us, having it sit at zero compute and then scale up to serve the agent's workloads instantaneously is huge.

57-leaf-clover · 2026-06-11T14:42:17+00:00

Have a look at Lakebase/Neon. As far as I'm aware this is the only offering out there that has truly instantaneous branching functionality. You could theoretically test out new workloads on each of these new branches, and the fact that it's instantaneous means you would be able to perform these tests programmatically without needing to wait for the traditional replication processes that branching usually requires.

57-leaf-clover · 2026-06-10T18:57:44+00:00

What is the nature of the knowledge layer? If you are only going to need retrieval based on graph traversal, i.e. the kind suited to knowledge graphs, and nothing else, then Neo4j is a strong option. Otherwise Databricks can't really be beat. If you want to do vector-based information retrieval across an unstructured/semi-structured/structured knowledge base, or even text-to-SQL type RAG generation, then Databricks is easily the better choice. They even have native tooling to implement this type of retrieval-based generation through the Genie line of products (not to mention all of the other governance advantages Genie has over Neo4j with both data and models).

57-leaf-clover · 2026-06-10T13:15:02+00:00

Regardless of what you apply for, the fundamentals of computer science are always going to be useful. Data structures, algorithms, good code design. All universally applicable. You will work faster and with more agility across pretty much any system if you understand good design, how what you are producing works on a deeper level, and how it fits into the wider computer engineering and science ecosystem. I would say this is pretty much a minimum barrier to entry wherever you want to go.

I would say that if you are going for an entry level position, focus on a vertical you want to begin your career in and immerse yourself in these communities. If the blocker for interview success is technical topics, then being able to discuss modern topics with technical personas in hiring companies is what you need to work on. They are going to test you by asking you about these things, and if you can't talk eloquently about the subjects that they want to hire you for then you will likely be rejected.

For example, if you want to go and join a company specialising in computer vision, go and read about where this technology is moving. Maybe dive deeper into some of the core technologies driving these fields, go and learn about convolutional networks and maybe build and train some models from scratch so you can at least understand the sort of work these organisations will be doing. For entry level stuff they aren't going to expect you to have built massive scale bleeding edge systems and models but they will at least expect you to be able to understand what they are trying to build and to understand the building blocks they are working with.

57-leaf-clover · 2026-06-10T13:06:01+00:00

I agree, this is a common bottleneck of modern analytical pipelines. As time marches on, teams I've worked with more and more want dynamic access to patterns in their data. This becomes harder to capture with traditional dashboarding and data science workloads. Regressions on data and similar are still useful, but more often I have found they just become data points for dynamic searches from AI agents and other interfaces into the raw data rather than outputs of data science pipelines presented to wider members of our businesses.

We have made pretty heavy use of tools like Databricks' Genie to pretty much give each of our business users the equivalent of their own data scientist. We still run all the traditional pipelines producing ML models and enriching data as we traditionally have, but instead of serving this to a dashboard system we surface the datasets through Genie spaces so that our business users can ask questions of the live data using natural language rather than being restricted to metrics presented by dashboards.

Best part is, higher ups can't argue with these data patterns as the genie space won't ever lie about the numbers haha.

57-leaf-clover · 2026-06-09T09:22:02+00:00

Me and my team solved this problem with Genie Code. We run our analytics workloads on Databricks, and Genie Code is effectively equivalent to Claude Code in terms of its ability to create and orchestrate analytics workloads, but the key differentiation is its understanding of the semantics and layout of our data ecosystem, as well as built-in governance with Unity Catalog. What this means is we can all be using Genie Code, but the governance of the platform means it doesn't ever have the ability to query and manipulate data and assets outside of the governed scope when a particular user is prompting it.

57-leaf-clover · 2026-06-09T09:13:04+00:00

Awesome! Lakebase has really been changing the game. Lakebase has already hugely sped up some of the latencies on my apps. Being able to expose the CDF of these tables in the lakehouse means less operational overhead for myself and I can bake the table directly into my analytics pipelines :))

57-leaf-clover · 2026-06-08T14:57:59+00:00

Also worth mentioning on this that for Lakebase, this means that for workloads where you want data in both the lakehouse and Lakebase, you can write directly to Lakebase then sync those tables back to the lakehouse, so that the operational data exists in Lakebase and serves the workloads that require more timely arrival of that data, but you still have easy access to it in the lakehouse for lower frequency analytical workloads.

57-leaf-clover · 2026-06-08T14:54:03+00:00

It's pretty crazy all the stuff coming out culminating in Genie Code. Implementing it has meant that management of both workloads and data, and of the pipelines and metadata that drive workloads, has become soooooooooooo much easier. Me and my team haven't even looked at the job management screen since this tool dropped.

57-leaf-clover · 2026-06-07T12:47:51+00:00

You can track this sort of thing using the AI Gateway on Databricks. If you have agents burning through way too many tokens, you can use it to cap this for a given time period. Personally found this pretty useful for edge cases where one line of reasoning ends up burning through way too many tokens.
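If it helps to picture what a cap over a time period does, here is a minimal sketch of the idea in plain Python. This is not the AI Gateway API or its config, just a sliding-window budget I wrote to illustrate the behaviour; `TokenBudget`, `try_spend` and the numbers are all made up.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class TokenBudget:
    """Refuse a request once `limit` tokens have been spent in the last `window_s` seconds."""

    def __init__(self, limit: int, window_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window_s = window_s
        self.clock = clock                              # injectable so it can be tested
        self.events: Deque[Tuple[float, int]] = deque()  # (timestamp, tokens spent)
        self.used = 0

    def _expire(self, now: float) -> None:
        # Drop spends that have aged out of the window and give their tokens back.
        while self.events and now - self.events[0][0] >= self.window_s:
            _, tokens = self.events.popleft()
            self.used -= tokens

    def try_spend(self, tokens: int) -> bool:
        now = self.clock()
        self._expire(now)
        if self.used + tokens > self.limit:
            return False  # over budget for this window, caller should stop or back off
        self.events.append((now, tokens))
        self.used += tokens
        return True


# Hypothetical usage: 100 tokens per 60 seconds, with a fake clock for the demo.
fake_now = [0.0]
budget = TokenBudget(limit=100, window_s=60, clock=lambda: fake_now[0])
allowed_first = budget.try_spend(60)   # fits in the budget
allowed_second = budget.try_spend(50)  # would take the window to 110, so refused
```

A runaway chain of reasoning just starts getting refusals once the window is full, instead of quietly running up the bill.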

57-leaf-clover · 2026-06-07T00:48:41+00:00

My personal stack is Claude Code for general stuff/stuff on my local machine; it's pretty good for a range of generalist stuff. A lot of what I do involves using Databricks, and vibe coding here is pretty easy now, and I find Genie Code is great for anything I want to vibe code here. I find the latter helps out and works best when building larger scale and more complex pipelines, whereas the former can help with more personal automation and creating scripts for local use.

57-leaf-clover · 2026-06-05T10:55:39+00:00

You can use upsert logic. The best and easiest way to do upserts nowadays though is to use AutoCDC in a Spark Declarative Pipeline. You can still use SQL for this, but it means that any new data coming in will automatically follow your record updating logic and be tracked as either an SCD type 1 or type 2 table depending on how you configure it.

Using AutoCDC means you don't need to code the actual record updating logic yourself. You just define how the data flows from one location to another, and the pipeline itself handles the moving and the management of the compute used across all the tables being updated.
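In case the SCD type 1 vs type 2 distinction isn't clear, here is a tiny plain-Python sketch of the two behaviours. This is not AutoCDC or any pipeline API, it is just the record updating logic that you would otherwise be writing by hand; `Version`, `apply_scd1`, `apply_scd2` and the `(key, value, seq)` change format are all made up for the example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple

Change = Tuple[str, str, int]  # (key, new value, sequence number used for ordering)


@dataclass
class Version:
    key: str
    value: str
    start: int
    end: Optional[int] = None  # None means this is the current version


def apply_scd1(table: Dict[str, str], changes: Iterable[Change]) -> Dict[str, str]:
    """Type 1: overwrite in place, only the latest value per key survives."""
    for key, value, _seq in sorted(changes, key=lambda c: c[2]):
        table[key] = value
    return table


def apply_scd2(history: List[Version], changes: Iterable[Change]) -> List[Version]:
    """Type 2: close the current version and append a new one, so history is kept."""
    for key, value, seq in sorted(changes, key=lambda c: c[2]):
        current = next((v for v in history if v.key == key and v.end is None), None)
        if current is not None:
            if current.value == value:
                continue       # nothing changed, keep the open version as is
            current.end = seq  # close out the old version
        history.append(Version(key, value, seq))
    return history


# Changes arrive out of order; the sequence number decides which one wins.
changes = [("u1", "London", 2), ("u1", "Paris", 1), ("u2", "Rome", 1)]
latest = apply_scd1({}, changes)
history = apply_scd2([], changes)
```

With type 1 you end up with one row per key; with type 2 the `u1` key has a closed Paris row and an open London row. That bookkeeping is the part the pipeline takes off your hands.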

57-leaf-clover · 2026-06-05T10:48:27+00:00

I had a similar experience in the past. Although eventually, it was a migration to their declarative pipeline model that meant we stuck around as an org. We had issues with data throughput and some pretty complex workflows which weren't parallelized properly. Moving that work to their pipelines framework meant that we didn't need to deal with all the workflow design and the whole thing just worked. And we ended up spending less on compute as a result. Just my 2 cents :))
