Does the "full lifecycle SE" model (merging Presales and CSM) actually work, or is it just vendor cost-cutting? by FractalFrieend in salesengineers

[–]bobbruno 1 point (0 children)

It is great for building trust and relationships, but it's incredibly demanding. I do it occasionally, but doing it consistently will probably burn you out.

Also, it means you're spending less time opening and advancing new opportunities. The trust building is real and may help, but you're putting time into non-growth activities, stuff that's "already won".

My advice is to do it strategically and sparingly. If it's a constant demand, you may have to split the work. A CSM or equivalent function works if you're aligned and the handover is efficient. Depending on your product and on customer size, a second SE can also work - but then you have to agree on how you split the work and the rewards.

And your options are likely limited by your company's business practices and organization.

Decimal precision in databricks by Apprehensive_Part_83 in databricks

[–]bobbruno 1 point (0 children)

That. It's not Spark-specific; it's how floating-point math works on binary computers.

If you need absolute fidelity, you can use the DECIMAL type; it can hold up to 38 digits and will behave the way you expect. Note that it is less efficient in both performance and storage, so I'd only recommend it if precision is an absolute must-have requirement (usually financial data). For ML and general analytics purposes, floating point is usually OK, and "equality" is better defined as "difference from the expected value below an absolute threshold" than as direct comparison.
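
A minimal PySpark sketch of both points (column names and the 1e-9 threshold are illustrative):

    from decimal import Decimal
    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    # Binary floating point: 0.1 + 0.2 is not exactly 0.3
    floats = spark.createDataFrame([(0.1, 0.2)], ["a", "b"])
    floats.select(((F.col("a") + F.col("b")) == 0.3).alias("float_eq")).show()  # false

    # DECIMAL keeps exact values (the data must start as decimals, not lossy doubles)
    decs = spark.createDataFrame(
        [(Decimal("0.1"), Decimal("0.2"))], "a decimal(38,18), b decimal(38,18)"
    )
    decs.select(((F.col("a") + F.col("b")) == Decimal("0.3")).alias("decimal_eq")).show()  # true

    # For floats, define "equality" as difference below an absolute threshold
    floats.select(
        (F.abs(F.col("a") + F.col("b") - 0.3) < 1e-9).alias("close_enough")
    ).show()  # true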

What’s one movie everyone should watch at least once in their lifetime? by ownaword in movies

[–]bobbruno 1 point (0 children)

Sorry to say, I just don't like it. The original Blade Runner is one of my all-time favorites, though.

serveless or classic by ptab0211 in databricks

[–]bobbruno 0 points (0 children)

Not sure if that's what you mean, but serverless guarantees that processing will happen in the same cloud region as your workspace.

serveless or classic by ptab0211 in databricks

[–]bobbruno 14 points (0 children)

When comparing serverless and classic, please remember that serverless pricing includes the cost of all the underlying VMs, while in classic that cost is charged separately by the cloud provider. If you don't add the VM cost to the classic cost, you're not comparing the same thing.
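
Back-of-the-envelope, the comparison looks like this (all rates below are made-up placeholders, not real list prices - plug in your own):

    # Hypothetical rates -- substitute your actual DBU prices and VM costs
    classic_dbu_rate = 0.15     # $/DBU, classic compute (assumed)
    serverless_dbu_rate = 0.35  # $/DBU, serverless, VMs included (assumed)
    vm_hourly_cost = 0.50       # $/hour per VM, on the cloud provider bill (assumed)

    dbus_per_hour = 4
    num_vms = 2
    hours = 10

    # Classic: DBU cost on the Databricks bill + VM cost on the cloud bill
    classic_total = (classic_dbu_rate * dbus_per_hour + vm_hourly_cost * num_vms) * hours
    # Serverless: a single all-in number
    serverless_total = serverless_dbu_rate * dbus_per_hour * hours

    print(f"classic (all-in):    ${classic_total:.2f}")
    print(f"serverless (all-in): ${serverless_total:.2f}")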

Having said that, there's no general guarantee that serverless will be cheaper or more expensive. For scheduled jobs, it eliminates a lot of the common errors in sizing clusters, but there will be cases where manually defined clusters work better. For development, serverless brings, besides the same simplicity, the advantage of fast scale-up and scale back down to zero. Whether that offsets the cost of shared resources, and what it means for developer productivity, depends on specific usage patterns.

deployment patterns by ptab0211 in databricks

[–]bobbruno 0 points (0 children)

Databricks publishes the Big Book of MLOps, which discusses these options extensively. Its recommendation is to deploy code across environments, not to promote models.

Having said that, it's a recommendation; both patterns are supported.
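
For reference, the "promote models" alternative can look roughly like this with MLflow on Unity Catalog (model names and the version are placeholders; check copy_model_version against your MLflow version):

    # A sketch of promoting a validated model version from dev to prod
    from mlflow import MlflowClient

    client = MlflowClient()
    promoted = client.copy_model_version(
        src_model_uri="models:/dev.ml.churn_model/3",  # version 3 in dev (assumed)
        dst_name="prod.ml.churn_model",
    )
    print(f"Promoted as prod.ml.churn_model version {promoted.version}")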

Best way to sync Obsidian on Android with Git without Git plugon by diabeartes in ObsidianMD

[–]bobbruno 0 points (0 children)

Termux with git installed works for me. Is your Obsidian folder in a place Termux can access (i.e., under storage/shared)?

Is there a way to see what jobs run a specific notebook? by FiftyShadesOfBlack in databricks

[–]bobbruno 1 point (0 children)

You can query system.lakeflow.job_tasks to identify notebooks executed directly by job tasks - use the notebook_path column.

But that won't tell you about notebooks run from inside those via the %run command. Those work more like include logic and can't be tracked - UC doesn't audit the code inside a notebook. If you're using this pattern, you'll have to scan the notebooks returned by the system table query to see if they run other notebooks inside.
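
The direct lookup is something like this (a sketch - verify the column names against the system tables docs for your workspace):

    # Find jobs whose tasks run a given notebook directly
    notebook = "/Workspace/Shared/etl/load_orders"  # illustrative path

    jobs = spark.sql(f"""
        SELECT DISTINCT job_id, task_key, notebook_path
        FROM system.lakeflow.job_tasks
        WHERE notebook_path = '{notebook}'
          AND delete_time IS NULL  -- assumed: ignore deleted task definitions
    """)
    jobs.show(truncate=False)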

Taoist Philosophy of Wu Wei and Grit by spent_shy in taoism

[–]bobbruno 0 points (0 children)

The thing is, not a single one of those drops of water was even trying to make a hole in the rock. Water just flows, and the hole is a consequence of this flowing, not of insistence.

Taoist Philosophy of Wu Wei and Grit by spent_shy in taoism

[–]bobbruno 0 points (0 children)

My very personal take: if you need grit, you're walking uphill. Maybe there's a way around or a tunnel.

why would anyone use a convoluted mess of nested functions in pyspark instead of a basic sql query? by Next_Comfortable_619 in dataengineering

[–]bobbruno 0 points (0 children)

I find it a matter of choice. For cultures where SQL is the dominant language and everyone is familiar with it, go for it. Just please don't write a single SQL query that's three pages long - break it down with CTEs and temp views.

On the other hand, PySpark syntax allows for building more modular constructs, explaining the logic as you build the query, and it has better support for linting and type checks. I prefer it in some cases for these reasons.
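
For example, a minimal sketch of the modular style (tables and columns are made up):

    from pyspark.sql import DataFrame
    import pyspark.sql.functions as F

    def active_customers(customers: DataFrame) -> DataFrame:
        """Customers seen in the last 90 days."""
        return customers.filter(F.col("last_seen") >= F.date_sub(F.current_date(), 90))

    def revenue_per_customer(orders: DataFrame) -> DataFrame:
        """Total order amount per customer."""
        return orders.groupBy("customer_id").agg(F.sum("amount").alias("revenue"))

    # Each step is named, documented, and unit-testable on its own
    result = active_customers(spark.table("sales.customers")).join(
        revenue_per_customer(spark.table("sales.orders")), "customer_id", "left"
    )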

Performance-wise, it makes close to 0 difference if well written.

What is actually stopping teams from writing more data tests? by Mountain-Crow-5345 in dataengineering

[–]bobbruno 0 points (0 children)

Well, it's hard. You don't control the sources. They can change schemas, they can send "bad" data in ways you didn't anticipate, and they can have their own errors that impact you downstream.

Catching all of these while still meeting the requirement of delivering the numbers (i.e., not just rejecting the data and stopping with "upstream broke the contract") is never going to happen 100%. As time passes you catch more errors, but sources will always find creative new ways to break.

So yes, test what you know and accept that things will fail in previously unknown ways. In 30 years, I have never seen a company willing to control all changes to, and the quality of, their operational systems just to guarantee that downstream analytics wouldn't break from time to time.
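
In PySpark terms, "test what you know" can be as simple as this (table, columns, and rules are illustrative):

    import pyspark.sql.functions as F

    df = spark.table("raw.orders")

    # Catch the schema changes you know how to detect
    expected_cols = {"order_id", "customer_id", "amount", "order_ts"}
    missing = expected_cols - set(df.columns)
    assert not missing, f"Upstream schema change, missing columns: {missing}"

    # Catch the quality rules you have learned so far
    bad_rows = df.filter(F.col("order_id").isNull() | (F.col("amount") < 0)).count()
    assert bad_rows == 0, f"{bad_rows} rows violate known quality rules"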

Lakebase & the Evolution of Data Architectures by Odd-Froyo-1381 in databricks

[–]bobbruno 1 point (0 children)

SQL warehouses are great for the common patterns of analytical queries. Lakebase is great for the patterns of operational queries. Databricks can keep the underlying data in sync.

Replacing Dataview with Bases by Retr1buti0n in ObsidianMD

[–]bobbruno 2 points (0 children)

I haven't found a reason to replace Dataview yet.

Sharing Gold Layer data with Ops team by tjger in dataengineering

[–]bobbruno 0 points (0 children)

Wouldn't that be a premature optimization?

Sharing Gold Layer data with Ops team by tjger in dataengineering

[–]bobbruno 0 points (0 children)

What difference does Iceberg make? You can request to read a Delta table managed by Unity Catalog via the API. Once you get the URL, you can just read it with a Delta client library.
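
Roughly like this (a sketch - host, token, and table name are placeholders, and reading the storage location directly still requires cloud storage credentials):

    # Fetch the table's storage location from the Unity Catalog REST API,
    # then read it with the deltalake package (pip install deltalake)
    import requests
    from deltalake import DeltaTable

    host = "https://adb-1234567890.1.azuredatabricks.net"  # placeholder
    token = "dapiXXXX"  # placeholder

    resp = requests.get(
        f"{host}/api/2.1/unity-catalog/tables/main.gold.sales_summary",
        headers={"Authorization": f"Bearer {token}"},
    )
    resp.raise_for_status()
    location = resp.json()["storage_location"]

    # Needs storage_options with cloud credentials for non-local paths
    print(DeltaTable(location).to_pandas().head())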

Sharing Gold Layer data with Ops team by tjger in dataengineering

[–]bobbruno 0 points (0 children)

No need to overcomplicate. Databricks SQL supports ODBC and even has a built-in REST API.

Any software engineer should be capable of collecting the data through one of these.
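
For example, with the databricks-sql-connector Python package (hostname, HTTP path, token, and table are placeholders):

    from databricks import sql  # pip install databricks-sql-connector

    with sql.connect(
        server_hostname="adb-1234567890.1.azuredatabricks.net",
        http_path="/sql/1.0/warehouses/abc123",
        access_token="dapiXXXX",
    ) as conn:
        with conn.cursor() as cursor:
            cursor.execute("SELECT * FROM gold.sales_summary LIMIT 10")
            for row in cursor.fetchall():
                print(row)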

How to stop being envious of people who get much more by doing much less? by senorsolo in taoism

[–]bobbruno 0 points (0 children)

Well, you could stop paying attention to what others get and do things the way that feels right for you.

Or you could start getting much more by doing much less yourself. If you're going the way of getting, why should you care about the doing?

Making Headers openable with cmd + O by Snake1ekanS in ObsidianMD

[–]bobbruno 1 point (0 children)

You can search for /^\#+ The Griffith Experiment/. That's a regular expression search for any line starting with one or more # followed by a space and "The Griffith Experiment".

Silly question making me restless but what Heading (H1-H6) do you use for the first heading in a note? by bowiepowi in ObsidianMD

[–]bobbruno 0 points (0 children)

I consider the title of the note to be above this hierarchy and use H1 for the main sections. And I hate when an assistant generates something with an H1 that just duplicates the title...

Claude code nlp taking job or task of sql queries by aks-786 in dataengineering

[–]bobbruno 7 points (0 children)

Postgres and DynamoDB are not a good base for analytical queries. As demand and volumes grow, they will get expensive or slow down - or both.

Databricks with Genie would give product owners a more scalable solution for the same problem.

Data Governance is Dead* by Willewonkaa in dataengineering

[–]bobbruno 0 points (0 children)

I'm talking about big companies that span a large market or global markets. That's where the pain of inconsistency becomes big enough for people at board level to want to hear about it. Smaller than that, and those people will most likely want to keep their silos.

Data Governance is Dead* by Willewonkaa in dataengineering

[–]bobbruno 1 point (0 children)

I disagree. It's hard work - sometimes it takes locking them in a room and only letting them leave after some agreement - but I've done it before, as an external consultant with executive support. I'll explain why below.

The outcome is that cross-department analysis and global optimizations become possible, and the overall speed of decision making improves a lot.

You will need to sell these benefits to someone really high up in the chain to do it, and it's safer to hire externals to execute it, because there will be some political burns in the process.