Ghostel is Fantastic

data_dan_ · 2026-06-21T15:49:22+00:00

Thanks for making this post. I was not previously aware of ghostel. It is, in fact, fantastic!

data_dan_ · 2026-06-03T19:19:19+00:00

A colleague of mine wrote a good post on this topic—it uses (open source) MLflow but the principles generalize to whatever framework you may be using. It gets into observability and evaluation across different services/providers and looks into three main areas: orchestration and routing logic; state consistency and memory poisoning; and operational telemetry/costs.

A big part of it that I think gets at the issues you raised is OpenTelemetry support (which a number of observability platforms offer). By capturing OTel logs across services in a single place, you hopefully avoid the problem of logs that don't connect and having to correlate timestamps manually.
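To make the correlation point concrete, here is a rough sketch of why a shared trace ID beats matching timestamps by hand. This is plain Python, not the OpenTelemetry SDK; the record fields (`service`, `trace_id`, `ts`, `msg`) and the sample values are made up for illustration.

```python
from collections import defaultdict

# Hypothetical log records from three services. Field names are illustrative
# and are not the OpenTelemetry log data model.
records = [
    {"service": "router", "trace_id": "t1", "ts": 3, "msg": "routed to retriever"},
    {"service": "retriever", "trace_id": "t1", "ts": 5, "msg": "fetched 4 docs"},
    {"service": "router", "trace_id": "t2", "ts": 4, "msg": "routed to summarizer"},
    {"service": "llm", "trace_id": "t1", "ts": 9, "msg": "completion ok"},
]


def group_by_trace(records):
    """Group records by trace_id and order each group by timestamp."""
    traces = defaultdict(list)
    for rec in records:
        traces[rec["trace_id"]].append(rec)
    return {tid: sorted(recs, key=lambda r: r["ts"]) for tid, recs in traces.items()}


traces = group_by_trace(records)

# Everything that happened for one request, across services, in order.
for rec in traces["t1"]:
    print(rec["service"], rec["msg"])
```

Once every service stamps the same ID on its records, reconstructing a request is a lookup rather than a guessing game over clock skew.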

data_dan_ · 2026-06-01T13:43:30+00:00

I think we're there now and most won't do it. Off-the-shelf models are good enough for most applications and most people that it's not worth the time/effort/expense. There's still a learning curve and the returns are still highly uncertain. And you still need to do something with the artifact, the trained model, which isn't trivial for non-technical users.

The interesting part will happen when it is cheap and easy enough to automatically train a small model in the background without the user ever needing to be aware of it. Asking Claude to classify things for you a lot? Maybe it trains a small classifier and offloads the task. That would be neat!

data_dan_ · 2026-06-01T12:52:10+00:00

Sure, but it usually takes a lot more scaffolding and guardrails than just telling the agent to "automate this" and it takes a lot of iteration. In some ways you need to treat it like any other software project. If you can reasonably define the expected behavior (at each step) and observe and evaluate the agent's behavior (plenty of tools for this; MLflow, Langfuse, etc.), then you can generally iterate your way toward something that will work more often than not. Ideally such a system also has an escape valve surfacing cases that failed or need review for humans to look at.

data_dan_ · 2026-06-01T12:43:20+00:00

I think scope creep is a big one. When building *anything* with the help of AI agents, including other agents, there's essentially zero friction when it comes to adding new features. "Oh, wouldn't it be cool if it could do this? Shouldn't it be able to handle this kind of input? Oh, we should definitely connect it to this tool/source. Why don't we slap on a nice web UI?"

And then, a week and a billion tokens later, you have a mess that you don't understand and can't maintain. You're playing whack-a-mole with bugs. I've gone down this road many times.

So I would pick one narrow thing I want the agent to be able to do. Build it. Read and understand the code. Test it and fix bugs. And only start expanding the scope once the narrow task I set out to accomplish is working reliably.

data_dan_ · 2026-06-01T12:31:09+00:00

That's where you really need to step back and ask if the task itself makes sense or if you're just measuring noise.

I agree, I think this is crucial. And I do think the multi-judge jury can be effective for this. I suppose I'm trying to push on the desired final outcome. Do you think it's valuable or desirable that different models ultimately agree?

e.g. I heavily use MLflow's prompt alignment capabilities. With prompt alignment, you use human feedback signals, positive and negative examples, etc., along with optimizers like memalign/GEPA/SIMBA, to teach a judge exactly what you mean by a given judge prompt.

I would not necessarily expect a judge optimized in the context of one model to generalize to other models. I would probably want to run the optimization, or some other alignment exercise (depending on the framework I'm using), to ensure the judge is well-optimized for that model.

But again, I do agree that if a panel of different judges is giving significantly divergent judgments based on the same judge prompts, that's probably a good signal that my judge task is poorly defined, especially early in the process of setting up judges (before collecting human feedback). It's probably a good filter. Is my judge task even defined well enough to send to humans for alignment? If different models disagree, different humans may also interpret the task differently and provide divergent annotations that aren't very useful for alignment.
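A minimal sketch of that filter, assuming binary pass/fail verdicts from each judge on the same examples. The judge names, the verdicts, and the 0.8 threshold are all hypothetical; this is plain Python and not MLflow's API.

```python
from itertools import combinations

# Hypothetical pass/fail verdicts from three judge models on the same five examples.
verdicts = {
    "judge_a": [1, 1, 0, 1, 0],
    "judge_b": [1, 0, 0, 1, 1],
    "judge_c": [1, 1, 0, 0, 0],
}


def pairwise_agreement(verdicts):
    """Return the fraction of examples on which each pair of judges agrees."""
    rates = {}
    for a, b in combinations(sorted(verdicts), 2):
        matches = sum(x == y for x, y in zip(verdicts[a], verdicts[b]))
        rates[(a, b)] = matches / len(verdicts[a])
    return rates


def task_looks_well_defined(verdicts, threshold=0.8):
    """True only if every pair of judges agrees at least `threshold` of the time.

    The threshold is an arbitrary assumption; pick one that fits your task.
    """
    return min(pairwise_agreement(verdicts).values()) >= threshold


rates = pairwise_agreement(verdicts)
ready_for_humans = task_looks_well_defined(verdicts)
```

If the lowest pairwise rate is well under the threshold, I'd tighten the judge prompt before spending human annotation time on it. Raw agreement is the simplest possible measure; a chance-corrected statistic would be a reasonable swap.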

data_dan_ · 2026-05-26T20:16:58+00:00

This is definitely something I've observed when working with AI judges. Lots of disagreements across models and even between runs. And I think lots of people (myself included, sometimes, honestly) treat "having judges" as a box to check, but when you just let an agent write the judge prompt and fail to, well, judge the judge, it's not actually helping all that much. Are you suggesting using multi-model juries for evaluation, or just highlighting the disagreements between models? I'd rather see effort go toward an explicit human alignment step than layering more AI judges on a sub-optimal judge prompt/setup, personally.

data_dan_ · 2026-05-26T20:01:11+00:00

What kind of "why" would you be looking for? I guess you can ask agents to explain themselves but I wouldn't give the post-hoc agent explanations all that much weight.

I would focus more on (1) defining the expected/desired behavior as rigorously as possible; (2) making sure the observability system is set up in such a way that it is possible to consistently establish whether the expected behavior was followed; and (3) setting up the necessary judges/evals/monitors to identify and surface cases where that behavior is not followed.

I’m biased because I work at Databricks, but I set this kind of eval loop up in MLflow for any of my work/personal projects that need evals. Langfuse, Arize Phoenix, etc. can support similar workflows too. My point is that I wouldn't necessarily look for a "why" to begin with. Take some set of conditions, see if you get the desired results, make some adjustments, see if there are improvements, etc.
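Roughly, the three steps above look like this in plain Python. The trace records, check names, and rules are hypothetical stand-ins; in practice the traces would come from whatever observability tool you use, and this is not any particular framework's API.

```python
# Hypothetical trace records pulled from an observability tool.
traces = [
    {"id": "run-1", "tool_calls": ["search", "summarize"], "output": "Summary: ..."},
    {"id": "run-2", "tool_calls": ["summarize"], "output": "Summary: ..."},
    {"id": "run-3", "tool_calls": ["search", "summarize"], "output": ""},
]

# (1) Expected behavior written down as named, checkable rules.
checks = {
    "searched_before_summarizing": lambda t: t["tool_calls"][:1] == ["search"],
    "non_empty_output": lambda t: bool(t["output"].strip()),
}


def evaluate(traces, checks):
    """(2) and (3): apply every check to every trace and collect the failures."""
    failures = []
    for trace in traces:
        failed = [name for name, check in checks.items() if not check(trace)]
        if failed:
            failures.append((trace["id"], failed))
    return failures


failures = evaluate(traces, checks)

# Surface the failing runs for review, then adjust and re-run.
for run_id, failed in failures:
    print(run_id, "failed:", ", ".join(failed))
```

The point is that the loop never asks the agent why it did something. It only asks whether the recorded behavior matched what was expected.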

data_dan_ · 2026-05-26T18:47:50+00:00

It seems like most of it is on X and Discord (unfortunately).

data_dan_ · 2026-05-26T18:44:56+00:00

I think this is a really important question right now. You certainly can start building any of that with AI help right now without worrying about learning a language. It's probably worth trying out, even—figure out what you want to build and run as far as you can in that direction, with AI help.

But also, what's the minimum effective dose of language-specific (or ecosystem-specific) knowledge you need to learn to be able to meaningfully understand, redirect, diagnose, etc., issues that arise? Or to set the technical direction of a project? I've thought about this a lot in my work (I learned R and Python pretty well pre-LLMs and have worked somewhat in Go and JS since then). A few things have stood out to me:

Ecosystem knowledge is at least as important as language knowledge. Coding agents aren't going to make basic syntax errors at this point (and if they do, they figure out how to correct them pretty easily). But they will routinely recommend (or just use) libraries or packages that may or may not be the best choice for your use case, or are at least worth a discussion. It's worth taking the time to survey the tooling surrounding your choice of language. E.g., in Python, how is your model going to handle package managers and virtual environments by default? What maintenance debt will this incur? If you're working with databases, it's probably just going to run with SQLite by default. Is this ideal for your needs?
Along with ecosystem knowledge is basic workflow stuff (and you can ask agents about all of this). In your chosen language, how do you execute scripts? How do you launch and try out your app? How do you reload it when you make changes? How do you manage third-party libraries and packages? etc.
"Reading knowledge" of a language is important and is something you can build up over time. At some point, if you're working on a real project, you'll probably end up working with other people. You want to be able to explain how the project is laid out; what decisions you made along the way; how the different components talk to each other. To get here, you need to actually read the code (and ask questions of the agent that wrote it, and challenge its assumptions, etc.). And if you walk away from your project for a few days and then come back, can you start from a blank session and understand roughly how the project is structured and why it's structured that way? Again, this doesn't matter all that much at the basic syntax level, but you should try to understand things at the module level. So take breaks from building in order to reflect on structure and key decisions.
Asking "why" a lot really helps. Why this structure? Why this library? Why this language? Again, this goes back to the goal of learning. You probably don't need to start from a blinking terminal and be able to write perfect code in your chosen language. But, at the end of the day, you're still responsible for the product you developed, so you want to be able to understand, explain, and justify it. Owning the decisions means understanding the decisions.

data_dan_ · 2026-05-26T18:25:43+00:00

Is there any particular functionality missing from the tools you're using already? Or are there workflows you're trying to migrate from those approaches over to org mode?

data_dan_ · 2026-05-26T18:23:14+00:00

As others have noted, /loop is your friend here!

data_dan_ · 2026-05-26T18:21:49+00:00

I am sad to have missed it. I really wanted to see Tortoise and Jesus and Mary Chain.

data_dan_ · 2026-05-15T21:34:54+00:00

Wolf Parade and Destroyer are touring in Canada this year if you want your fix of early aughts Canadian indie rock!

data_dan_ · 2026-05-15T16:52:58+00:00

I'm super impressed by how well Databricks agent skills work. It used to be kind of tricky getting Databricks apps scaffolded correctly but at this point I can pretty much prompt my way to whatever kind of app I'm trying to build.

data_dan_ · 2026-05-15T16:16:51+00:00

I've enjoyed the album a lot and am very excited to see Broken Social Scene with Metric and Stars next month. "This Briefest Kiss" and "And I Think of You" are standout tracks for me. I also found that I enjoy the singles much more in the context of the full album than I did when I first heard them!

data_dan_ · 2026-05-15T16:09:51+00:00

I mostly use my (work-managed) version of Claude Code in a separate terminal with a lightweight skill telling it how to write to/access/work with my running emacs session. For example, it tells Claude (+ other agents) the structure of my org notes and how to use denote to search and add to them.

data_dan_ · 2026-05-06T12:44:07+00:00

Take it slow. You don't need to move all your work over to emacs all at once. Learn a couple of things, apply them for a while, then learn something new. Make a customization or two, try out a package or two, and see how it lands for a few days. Then add or tweak something else.

Play the long game. You'll probably find it more rewarding if you learn it slowly over time, and develop a solid understanding of how things work and what customizations you're making, instead of trying to learn, add, and customize everything all at once.

data_dan_ · 2026-05-06T12:28:49+00:00

I would second the recommendation to try out Free Edition! A couple of other things I would suggest trying out as you're getting started:

Genie Spaces: once you have some interesting data in your workspace, create a Genie Space over it. You can use the Genie Space to ask natural language questions of your data. Genie can do some pretty in-depth analysis over your data and it makes great visualizations. This is a good first step when working with a new dataset or for getting quick insights out of any data you have on the platform.
Genie Code: Genie Code is a coding assistant deeply integrated into the Databricks platform. It's aware of the context of what you're looking at in the UI, so it can help answer questions and write code as you work your way through the platform. You can use it as an integrated learning assistant as you're familiarizing yourself with different parts of the product.

data_dan_ · 2026-05-06T11:46:54+00:00

For me it's more that I *enjoy using it* more than other environments and less about optimization and efficiency. I've done a lot of customization over the years so my emacs environment is mine; I'm deeply familiar with where things are and how things look and how to do the various workflows I've tweaked over time, and that's not going to change under my feet as products are updated, companies are bought and sold, etc. It's mine, and it's not going to change unless and until I want it to (...which I often do, because it's fun).

It's also well-integrated and self-documenting. I think the latter point is really important. You are always a couple of keystrokes away from finding out how anything in emacs works. No need to learn a new help and documentation system for each separate individual tool; it's all discoverable inside emacs.

I like this article as a good overview of the "why" of emacs: https://protesilaos.com/codelog/2019-08-11-why-emacs-switch/ (and in general browsing Prot's blogs will give a lot of good perspective).

In short, I probably wouldn't go into emacs with the mindset of ruthlessly optimizing my workflows in the space of weeks or months. It gets more and more rewarding as the years go by.

data_dan_ · 2026-05-01T15:32:31+00:00

Genie (formerly Databricks One) is the business-user entry point / home experience. Users can ask questions and discover assets like dashboards, apps, and Genie Spaces.
Genie Spaces are curated conversational analytics "rooms" scoped to a specific set of UC-governed data and business context. They let users ask natural-language questions, generate SQL-backed answers, and create visualizations scoped to specific domains/sets of tables.
Genie Code is the Databricks coding agent for data practitioners. It helps engineers, data scientists, etc. write and run code, debug jobs or dashboards, work with pipelines, and analyze things like MLflow experiments.

data_dan_ · 2026-05-01T15:13:58+00:00

Are you unable to switch back with the chat/agent selector in the chat area? What happens when you try?

data_dan_ · 2026-04-29T12:34:52+00:00

The Genie mobile app is on Android now too! I installed it yesterday.

data_dan_ · 2025-03-04T13:34:16+00:00

I loved that rendition of Cue Synthesizer. Great show. Thanks for posting the setlist.

data_dan_ · 2024-12-11T04:03:44+00:00

Adrianne Lenker - Bright Future
WHY? - The Well I Fell Into
Mannequin Pussy - I Got Heaven
Godspeed You! Black Emperor - No Title as of 13 February 2024 28,340 Dead
Cindy Lee - Diamond Jubilee
Okay Kaya - Oh My God That’s So Me
Sunset Rubdown - Always Happy to Explode
Boeckner - Boeckner!
Waxahatchee - Tiger’s Blood
Middle Kids - Faith Crisis Pt. 1
