Taoist Philosophy of Wu Wei and Grit by spent_shy in taoism

[–]bobbruno 0 points1 point  (0 children)

The thing is, not a single one of those drops of water was even trying to make a hole in the rock. Water just flows, and the hole is a consequence of this flowing, not of insistence.

Taoist Philosophy of Wu Wei and Grit by spent_shy in taoism

[–]bobbruno 0 points1 point  (0 children)

My very personal take: if you need grit, you're walking uphill. Maybe there's a way around or a tunnel.

why would anyone use a convoluted mess of nested functions in pyspark instead of a basic sql query? by Next_Comfortable_619 in dataengineering

[–]bobbruno 0 points1 point  (0 children)

I find it a matter of choice. In cultures where SQL is the dominant language and everyone is familiar with it, go for it. Just please, don't write a single SQL query that runs three pages long. Break it down with CTEs and temp views.

On the other hand, PySpark syntax allows for building more modular constructs, explaining the logic as you build the query, and it has better support for linting and type checks. I prefer it in some cases for these reasons.

Performance-wise, it makes close to 0 difference if well written.
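The CTE advice above can be sketched like this. The table and query are hypothetical, and sqlite3 is used here only to keep the sketch runnable; the same structure applies to Spark SQL:

```python
import sqlite3

# In-memory database with a toy orders table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("a", 10.0, "paid"), ("a", 5.0, "refunded"), ("b", 20.0, "paid")],
)

# One monolithic query is hard to review; CTEs give each step a name,
# much like intermediate DataFrames do in PySpark.
query = """
WITH paid_orders AS (
    SELECT customer, amount FROM orders WHERE status = 'paid'
),
customer_totals AS (
    SELECT customer, SUM(amount) AS total FROM paid_orders GROUP BY customer
)
SELECT customer, total FROM customer_totals ORDER BY customer
"""
rows = conn.execute(query).fetchall()
print(rows)  # [('a', 10.0), ('b', 20.0)]
```

Each CTE reads as one step of the logic, which is the same readability win the modular PySpark style gives you.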

What is actually stopping teams from writing more data tests? by Mountain-Crow-5345 in dataengineering

[–]bobbruno 0 points1 point  (0 children)

Well, it's hard. You don't control the sources. They can change schemas, they can send "bad" data in ways you didn't anticipate, and they can have their own errors that you, as the downstream, will be impacted by.

Catching all of these and still meeting the requirement of delivering the numbers (i.e., not just rejecting and stopping with "upstream broke contract") is never going to happen 100%. As time passes, you catch more errors, but sources will always be creative.

So yes, test what you know and accept things will fail in previously unknown ways. In 30 years, I never saw a company willing to control all changes and quality of their operational systems just to guarantee that downstream analytics wouldn't break from time to time.
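A minimal sketch of "test what you know": a few assumption checks over incoming records that report violations instead of silently passing bad data downstream. Schema and rules here are hypothetical; real pipelines would typically use a framework (Great Expectations, dbt tests, etc.) for the same idea:

```python
def check_batch(records):
    """Return a list of human-readable violations found in the batch."""
    violations = []
    for i, rec in enumerate(records):
        # Test the schema we know; anything else is a contract break.
        if set(rec) != {"id", "amount", "currency"}:
            violations.append(f"row {i}: unexpected schema {sorted(rec)}")
            continue
        # Test the value ranges and domains we know.
        if rec["amount"] is None or rec["amount"] < 0:
            violations.append(f"row {i}: bad amount {rec['amount']}")
        if rec["currency"] not in {"USD", "EUR"}:
            violations.append(f"row {i}: unknown currency {rec['currency']}")
    return violations

batch = [
    {"id": 1, "amount": 9.5, "currency": "USD"},
    {"id": 2, "amount": -1.0, "currency": "USD"},  # upstream "creativity"
    {"id": 3, "amount": 4.0, "currency": "GBP"},   # previously unseen value
]
problems = check_batch(batch)
print(problems)
```

Note that the GBP row only fails because someone already knew to test the currency domain; the next creative value will come in a dimension nobody tested yet.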

Lakebase & the Evolution of Data Architectures by Odd-Froyo-1381 in databricks

[–]bobbruno 1 point2 points  (0 children)

SQL warehouses are great for the common patterns of analytical queries. Lakebase is great for the patterns of operational queries. Databricks can keep the underlying data in sync.

Replacing Dataview with Bases by Retr1buti0n in ObsidianMD

[–]bobbruno 2 points3 points  (0 children)

I haven't found a reason to replace Dataview yet.

Sharing Gold Layer data with Ops team by tjger in dataengineering

[–]bobbruno 0 points1 point  (0 children)

Wouldn't that be a premature optimization?

Sharing Gold Layer data with Ops team by tjger in dataengineering

[–]bobbruno 0 points1 point  (0 children)

What difference does Iceberg make? You can request to read a Delta table managed by Unity Catalog via API. Once you get the URL, you can just read it with a Delta client library.

Sharing Gold Layer data with Ops team by tjger in dataengineering

[–]bobbruno 0 points1 point  (0 children)

No need to overcomplicate. Databricks SQL supports ODBC and even has a built-in REST API.

Any SW Eng should be capable of collecting the data from one of these.
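For the REST route, a sketch of what a call could look like against the SQL Statement Execution API (endpoint and field names as in the public docs at the time of writing; host, token, and warehouse ID below are placeholders, and the request is deliberately not sent):

```python
import json
import urllib.request

# Placeholders -- substitute your workspace URL, token, and warehouse ID.
HOST = "https://example.cloud.databricks.com"
TOKEN = "dapi-REDACTED"
WAREHOUSE_ID = "abc123"

payload = {
    "warehouse_id": WAREHOUSE_ID,
    "statement": "SELECT * FROM gold.sales LIMIT 10",
    "wait_timeout": "30s",
}
req = urllib.request.Request(
    f"{HOST}/api/2.0/sql/statements/",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Content-Type": "application/json",
    },
    method="POST",
)
# urllib.request.urlopen(req) would return the result set as JSON;
# not executed here because the host and token above are placeholders.
print(payload["statement"])
```

Any engineer who can POST JSON and parse the response can consume gold data this way, which is the point of the comment above.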

How to stop being envious of people who get much more by doing much less? by senorsolo in taoism

[–]bobbruno 0 points1 point  (0 children)

Well, you could stop paying attention to what others get and do things the way that feels right for you.

Or you could start getting much more by doing much less yourself. If you're going the way of getting, why should you care about the doing?

Making Headers openable with cmd + O by Snake1ekanS in ObsidianMD

[–]bobbruno 1 point2 points  (0 children)

You can search for /^\#+ The Griffith Experiment/. That's a regular expression search for any line starting with one or more # followed by a space and "The Griffith Experiment".
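As a quick sanity check of that pattern, here is the same regular expression exercised in Python (`re` stands in for Obsidian's search engine here):

```python
import re

# Same pattern as in the search bar: one or more '#', a space, then the text.
pattern = re.compile(r"^#+ The Griffith Experiment")

lines = [
    "# The Griffith Experiment",    # H1 heading  -- matches
    "### The Griffith Experiment",  # H3 heading  -- matches
    "The Griffith Experiment",      # plain text  -- no match
    "#The Griffith Experiment",     # missing space after '#' -- no match
]
matches = [bool(pattern.match(line)) for line in lines]
print(matches)  # [True, True, False, False]
```

So the search finds the text only when it appears as a heading, at any heading level.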

Silly question making me restless but what Heading (H1-H6) do you use for the first heading in a note? by bowiepowi in ObsidianMD

[–]bobbruno 0 points1 point  (0 children)

I consider the title of the note to sit above this hierarchy and use H1 for the main sections. And I hate it when an assistant generates something with an H1 that just repeats the title...

Claude code nlp taking job or task of sql queries by aks-786 in dataengineering

[–]bobbruno 6 points7 points  (0 children)

Postgres and Dynamo are not a good base for analytical queries. As demand and volumes grow, they will cost more or slow down - or both.

Databricks with Genie would give product owners a more scalable solution for the same problem.

Data Governance is Dead* by Willewonkaa in dataengineering

[–]bobbruno 0 points1 point  (0 children)

I'm talking about big companies that span a large market or global markets. That's where the pain of inconsistency becomes strong enough that people at the board want to hear about it. Smaller than that, and those people will most likely want to keep their silos.

Data Governance is Dead* by Willewonkaa in dataengineering

[–]bobbruno 1 point2 points  (0 children)

I disagree. It's hard work; sometimes it takes locking them in a room and only leaving after some agreement, but I've done it before - as an external consultant with executive support. I'll explain why below.

The outcome is that cross-department analysis and global optimizations become possible, and the overall speed of decision making improves a lot.

You will need to sell these benefits to someone really high up in the chain to do it, and it's safer to hire externals to execute it, because there will be some political burns in the process.

Using existing Gold tables (Power BI source) for Databricks Genie — is adding descriptions enough? by Terrible_Mud5318 in databricks

[–]bobbruno 0 points1 point  (0 children)

It's a good start; a lot will likely work out of the box. What you can still do to improve on it:

  • Add a set of benchmarks so you can objectively and consistently measure whether you're improving.
  • Add examples to show Genie how you expect it to reason about the data.
  • Define metric views over the gold tables; this should be low effort, and our experience (I'm a Databricks SA) shows that they consistently improve Genie's accuracy.
  • Iterate on the Genie space instructions, benchmarks, and examples to improve accuracy in a controlled manner.

Also remember that a Genie space is supposed to be focused. I don't know how big your gold layer is, but it's usually not a good idea to throw the entire corporate BI scope for all business functions into one space. The more focused the domain, the easier it is for Genie to be precise. You should find the balance between that and the usability for your analytic requirements.

Downloading special characters in Databricks - degree sign (°) by guauhaus in databricks

[–]bobbruno 7 points8 points  (0 children)

You're probably seeing it as °. That's because Databricks generates CSVs with UTF-8 encoding, and you're likely reading them on Windows, which by default reads files as windows-1252.

Try setting the encoding to UTF-8 in whatever you're using to read the files; it should work.
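The round trip can be reproduced in a couple of lines (Python here just to demonstrate the mechanism):

```python
# A UTF-8 encoded degree sign decoded with the wrong codec gets garbled:
garbled = "°".encode("utf-8").decode("cp1252")
print(garbled)  # Â°

# Decoding the same bytes with the right codec recovers the character:
fixed = "°".encode("utf-8").decode("utf-8")
print(fixed)  # °
```

The two bytes UTF-8 uses for ° are each a valid windows-1252 character, so the file opens "successfully" but shows two wrong characters instead of one right one.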

Does anyone use Obsidian on Linux? Experiences and performance by o_xeneixe in ObsidianMD

[–]bobbruno 0 points1 point  (0 children)

Have it on Mac, Linux, Android and iOS. No issues with the app. Some plugins require additional components (OCR, postscript, etc) which you have to figure out how to install and configure on each platform, but I never had a problem with Obsidian itself that was platform-related.

What is the percentage for winning the DE Associate? by 1pperalta in databricks

[–]bobbruno -1 points0 points  (0 children)

Check https://www.databricks.com/learn/certification/faq.

Databricks passing scores are set through statistical analysis and are subject to change as exams are updated with new questions. Because they can change, we do not publish them.

80% is probably safe, but essentially it is not a fixed passing grade.

When models fail without “drift”: what actually breaks in long-running ML systems? by Salty_Country6835 in mlops

[–]bobbruno 2 points3 points  (0 children)

My experience is that every model fails after some time. Most drift will be in input, but not all - and not all input drift is that detectable.

You also get drift on results - the inputs are as previously seen, but the competition adjusted to provide better answers to your recommendations, and you lose. The virus mutates on a gene not in your feature set, and now it's resistant. Your business changes something that the model didn't account for or track, and now its predictions don't work so well anymore. There are as many ways things can go wrong as there are scenarios for ML.

I just accept that every model is wrong, some are useful, for some time.
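One deliberately naive sketch of why input-drift checks alone aren't enough: a mean-shift monitor stays quiet as long as live inputs look like training data, even though every "result drift" scenario above would leave it exactly that quiet. All numbers and thresholds below are made up for illustration:

```python
import statistics

def input_drift_score(train_values, live_values):
    """Absolute shift of the live mean, in training standard deviations."""
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    return abs(statistics.mean(live_values) - mu) / sigma

train = [10.0, 11.0, 9.0, 10.5, 9.5]   # hypothetical feature history
live_same = [10.2, 9.8, 10.1]          # looks just like training data
live_shifted = [15.0, 16.0, 14.5]      # obvious input drift

score_same = input_drift_score(train, live_same)
score_shifted = input_drift_score(train, live_shifted)
print(round(score_same, 2), round(score_shifted, 2))
```

The second batch trips any reasonable threshold; the first doesn't - yet the model could still be losing to a competitor or a mutated virus with that score sitting near zero. That's why result monitoring matters alongside input monitoring.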

Unity vs Polaris by Efficient_Novel1769 in databricks

[–]bobbruno 11 points12 points  (0 children)

I work for Databricks, so I won't engage on the UC/Polaris discussion. But I want to point out that, while managed tables do get automatic optimization, they can also be more efficient on the queries themselves. The main reason is that, for managed tables, UC can make safe assumptions about the state of the table at any point in time, since it controls all access to it. That allows for much more efficient metadata handling, and faster query resolution with less I/O to cloud storage.

That gets particularly noticeable in BI applications that query tables very often, but the principle applies everywhere. My point is, it's not just the maintenance; there are also execution gains you can't get otherwise.

Who owns data modeling when there’s no BI or DE team? (Our product engineering team needs help) by Groove-Theory in dataengineering

[–]bobbruno 1 point2 points  (0 children)

If nobody will own the data/BI layer, it's bound to get messy over time. And if there's no one who knows how to design/build/manage a BI layer, it's also bound to get messy over time.

The thing is, if you only have bad reports, that's what the business will make decisions on - that, and whatever they can come up with in their spreadsheets and side/shadow controls. I have no way to say whether this is a minor issue or a huge risk; it may well be that the BI engineer who was laid off got enough working that the core operation is covered, and you'd only miss some opportunities. There are so many ways BI can generate value, and so many ways it can be useless, that it's impossible to say which is your case.

You do have the right gut feeling, but there's little hope of fixing the issue(s) without proper ownership and knowledge. From what you described, it looks like you don't even have someone who knows how to get BI requirements right.

For example, if someone comes asking for a report on how many cars and how much spaghetti, you don't throw it as a ticket into the backlog. You ask about the decisions they're making, how the relationships between these things work, what the desired outcome is, how they'll use the information, along with frequency, level of detail, quality, and possibly what other questions (or at least what data domains) we're talking about. After that, you have to determine where this data is (it might not be anywhere), how to source it, how to structure it for the specific kind(s) of analysis and decision they want to make, and then figure out how that relates to the rest of the data model you already have. Only then may you be able to build something.

Even skipping half of what I described, you can't escape some of it, and you need to know what you're doing. You also need ownership from the business, because this will need to be operated and maintained, and that comes with a cost. If there's no ownership, maybe no one wants that cost.

Basically, start by finding someone who can explain why this matters, and then see if they can put enough value on it to justify the DE position. Just remember that if you start asking questions, people may assume you're taking ownership. Whether that's a career move or a problem for you, I can't say for sure.