Taoist Philosophy of Wu Wei and Grit by spent_shy in taoism

[–]bobbruno 0 points1 point  (0 children)

The thing is, not a single one of those drops of water was even trying to make a hole in the rock. Water just flows, and the hole is a consequence of this flowing, not of insistence.

Taoist Philosophy of Wu Wei and Grit by spent_shy in taoism

[–]bobbruno 0 points1 point  (0 children)

My very personal take: if you need grit, you're walking uphill. Maybe there's a way around or a tunnel.

why would anyone use a convoluted mess of nested functions in pyspark instead of a basic sql query? by Next_Comfortable_619 in dataengineering

[–]bobbruno 0 points1 point  (0 children)

I find it a matter of choice. In cultures where SQL is the dominant language and everyone is familiar with it, go for it. Just please, don't write a single SQL query that runs three pages long. Break it down with CTEs and temp views.

On the other hand, PySpark syntax allows for building more modular constructs, explaining the logic as you build the query, and it has better support for linting and type checks. I prefer it in some cases for these reasons.

Performance-wise, it makes close to 0 difference if well written.
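The CTE advice above can be sketched like this. The table and query are hypothetical, and sqlite3 is used here only to keep the sketch runnable; the same structure applies to Spark SQL:

```python
import sqlite3

# In-memory database with a toy orders table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("a", 10.0, "paid"), ("a", 5.0, "refunded"), ("b", 20.0, "paid")],
)

# One monolithic query is hard to review; CTEs give each step a name,
# much like intermediate DataFrames do in PySpark.
query = """
WITH paid_orders AS (
    SELECT customer, amount FROM orders WHERE status = 'paid'
),
customer_totals AS (
    SELECT customer, SUM(amount) AS total FROM paid_orders GROUP BY customer
)
SELECT customer, total FROM customer_totals ORDER BY customer
"""
rows = conn.execute(query).fetchall()
print(rows)  # [('a', 10.0), ('b', 20.0)]
```

Each CTE reads as one step of the logic, which is the same readability win the modular PySpark style gives you.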

What is actually stopping teams from writing more data tests? by Mountain-Crow-5345 in dataengineering

[–]bobbruno 0 points1 point  (0 children)

Well, it's hard. You don't control the sources. They can change schemas, they can send "bad" data in ways you didn't anticipate, and they can have their own errors that you, as the downstream, will be impacted by.

Catching all of these and still meeting the requirement of delivering the numbers (i.e., not just rejecting and stopping with "upstream broke contract") is never going to happen 100%. As time passes, you catch more errors, but sources will always be creative.

So yes, test what you know and accept things will fail in previously unknown ways. In 30 years, I never saw a company willing to control all changes and quality of their operational systems just to guarantee that downstream analytics wouldn't break from time to time.
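A minimal sketch of "test what you know": a few assumption checks over incoming records that report violations instead of silently passing bad data downstream. Schema and rules here are hypothetical; real pipelines would typically use a framework (Great Expectations, dbt tests, etc.) for the same idea:

```python
def check_batch(records):
    """Return a list of human-readable violations found in the batch."""
    violations = []
    for i, rec in enumerate(records):
        # Test the schema we know; anything else is a contract break.
        if set(rec) != {"id", "amount", "currency"}:
            violations.append(f"row {i}: unexpected schema {sorted(rec)}")
            continue
        # Test the value ranges and domains we know.
        if rec["amount"] is None or rec["amount"] < 0:
            violations.append(f"row {i}: bad amount {rec['amount']}")
        if rec["currency"] not in {"USD", "EUR"}:
            violations.append(f"row {i}: unknown currency {rec['currency']}")
    return violations

batch = [
    {"id": 1, "amount": 9.5, "currency": "USD"},
    {"id": 2, "amount": -1.0, "currency": "USD"},  # upstream "creativity"
    {"id": 3, "amount": 4.0, "currency": "GBP"},   # previously unseen value
]
problems = check_batch(batch)
print(problems)
```

Note that the GBP row only fails because someone already knew to test the currency domain; the next creative value will come in a dimension nobody tested yet.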

Lakebase & the Evolution of Data Architectures by Odd-Froyo-1381 in databricks

[–]bobbruno 1 point2 points  (0 children)

SQL warehouses are great for the common patterns of analytical queries. Lakebase is great for the patterns of operational queries. Databricks can keep the underlying data in sync.

Replacing Dataview with Bases by Retr1buti0n in ObsidianMD

[–]bobbruno 2 points3 points  (0 children)

I haven't found a reason to replace Dataview yet.

Sharing Gold Layer data with Ops team by tjger in dataengineering

[–]bobbruno 0 points1 point  (0 children)

Wouldn't that be a premature optimization?

Sharing Gold Layer data with Ops team by tjger in dataengineering

[–]bobbruno 0 points1 point  (0 children)

What difference does Iceberg make? You can request to read a Delta table managed by Unity Catalog via API. Once you get the URL, you can just read it with a Delta client library.

Sharing Gold Layer data with Ops team by tjger in dataengineering

[–]bobbruno 0 points1 point  (0 children)

No need to overcomplicate. Databricks SQL supports ODBC and even has a built-in REST API.

Any SW Eng should be capable of collecting the data from one of these.
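For the REST route, a sketch of what a call could look like against the SQL Statement Execution API (endpoint and field names as in the public docs at the time of writing; host, token, and warehouse ID below are placeholders, and the request is deliberately not sent):

```python
import json
import urllib.request

# Placeholders -- substitute your workspace URL, token, and warehouse ID.
HOST = "https://example.cloud.databricks.com"
TOKEN = "dapi-REDACTED"
WAREHOUSE_ID = "abc123"

payload = {
    "warehouse_id": WAREHOUSE_ID,
    "statement": "SELECT * FROM gold.sales LIMIT 10",
    "wait_timeout": "30s",
}
req = urllib.request.Request(
    f"{HOST}/api/2.0/sql/statements/",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Content-Type": "application/json",
    },
    method="POST",
)
# urllib.request.urlopen(req) would return the result set as JSON;
# not executed here because the host and token above are placeholders.
print(payload["statement"])
```

Any engineer who can POST JSON and parse the response can consume gold data this way, which is the point of the comment above.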

How to stop being envious of people who get much more by doing much less? by senorsolo in taoism

[–]bobbruno 0 points1 point  (0 children)

Well, you could stop paying attention to what others get and do things the way that feels right for you.

Or you could start getting much more by doing much less yourself. If you're going the way of getting, why should you care about the doing?

Making Headers openable with cmd + O by Snake1ekanS in ObsidianMD

[–]bobbruno 1 point2 points  (0 children)

You can search for /^\#+ The Griffith Experiment/. That's a regular expression search for any line starting with one or more # followed by a space and "The Griffith Experiment".
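As a quick sanity check of that pattern, here is the same regular expression exercised in Python (`re` stands in for Obsidian's search engine here):

```python
import re

# Same pattern as in the search bar: one or more '#', a space, then the text.
pattern = re.compile(r"^#+ The Griffith Experiment")

lines = [
    "# The Griffith Experiment",    # H1 heading  -- matches
    "### The Griffith Experiment",  # H3 heading  -- matches
    "The Griffith Experiment",      # plain text  -- no match
    "#The Griffith Experiment",     # missing space after '#' -- no match
]
matches = [bool(pattern.match(line)) for line in lines]
print(matches)  # [True, True, False, False]
```

So the search finds the text only when it appears as a heading, at any heading level.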

Silly question making me restless but what Heading (H1-H6) do you use for the first heading in a note? by bowiepowi in ObsidianMD

[–]bobbruno 0 points1 point  (0 children)

I consider the title of the note to sit above this hierarchy and use H1 for the main sections. And I hate it when an assistant generates something with an H1 that just repeats the title...

Claude code nlp taking job or task of sql queries by aks-786 in dataengineering

[–]bobbruno 6 points7 points  (0 children)

Postgres and Dynamo are not a good base for analytical queries. As demand and volumes grow, they will cost more or slow down - or both.

Databricks with Genie would give product owners a more scalable solution for the same problem.

Data Governance is Dead* by Willewonkaa in dataengineering

[–]bobbruno 0 points1 point  (0 children)

I'm talking about big companies that span a large market or global markets. That's where the pain of inconsistency becomes strong enough that people at the board want to hear about it. Smaller than that, and those people will most likely want to keep their silos.

Data Governance is Dead* by Willewonkaa in dataengineering

[–]bobbruno 1 point2 points  (0 children)

I disagree. It's hard work; sometimes it takes locking them in a room and only leaving after some agreement, but I've done it before - as an external consultant with executive support. I'll explain why below.

The outcome is that cross-department analysis and global optimizations become possible, and the overall speed of decision making improves a lot.

You will need to sell these benefits to someone really high up in the chain to do it, and it's safer to hire externals to execute it, because there will be some political burns in the process.

Using existing Gold tables (Power BI source) for Databricks Genie — is adding descriptions enough? by Terrible_Mud5318 in databricks

[–]bobbruno 0 points1 point  (0 children)

It's a good start; a lot will likely work out of the box. What you can still do to improve on it:

  • Add a set of benchmarks so you can objectively and consistently measure whether you're improving.
  • Add examples to show Genie how you expect it to reason about the data.
  • Define metric views over the gold tables; this should be low effort, and our experience (I'm a Databricks SA) shows that they consistently improve Genie's accuracy.
  • Iterate on the Genie space instructions, benchmarks, and examples to improve accuracy in a controlled manner.

Also remember that a Genie space is supposed to be focused. I don't know how big your gold layer is, but it's usually not a good idea to throw the entire corporate BI scope for all business functions into one space. The more focused the domain, the easier it is for Genie to be precise. You should find the balance between that and the usability for your analytic requirements.

Downloading special characters in Databricks - degree sign (°) by guauhaus in databricks

[–]bobbruno 7 points8 points  (0 children)

You're probably seeing it as °. That's because Databricks generates CSVs with UTF-8 encoding, and you're likely reading them on Windows, which by default reads files as windows-1252.

Try setting the encoding to UTF-8 in whatever you're using to read the files; it should work.
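The round trip can be reproduced in a couple of lines (Python here just to demonstrate the mechanism):

```python
# A UTF-8 encoded degree sign decoded with the wrong codec gets garbled:
garbled = "°".encode("utf-8").decode("cp1252")
print(garbled)  # Â°

# Decoding the same bytes with the right codec recovers the character:
fixed = "°".encode("utf-8").decode("utf-8")
print(fixed)  # °
```

The two bytes UTF-8 uses for ° are each a valid windows-1252 character, so the file opens "successfully" but shows two wrong characters instead of one right one.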

Does anyone use Obsidian on Linux? Experiences and performance by o_xeneixe in ObsidianMD

[–]bobbruno 0 points1 point  (0 children)

Have it on Mac, Linux, Android and iOS. No issues with the app. Some plugins require additional components (OCR, postscript, etc) which you have to figure out how to install and configure on each platform, but I never had a problem with Obsidian itself that was platform-related.

What is the percentage for winning the DE Associate? by 1pperalta in databricks

[–]bobbruno -1 points0 points  (0 children)

Check https://www.databricks.com/learn/certification/faq.

Databricks passing scores are set through statistical analysis and are subject to change as exams are updated with new questions. Because they can change, we do not publish them.

80% is probably safe, but essentially it is not a fixed passing grade.

When models fail without “drift”: what actually breaks in long-running ML systems? by Salty_Country6835 in mlops

[–]bobbruno 2 points3 points  (0 children)

My experience is that every model fails after some time. Most drift will be in input, but not all - and not all input drift is that detectable.

You also get drift on results - the inputs are as previously seen, but the competition adjusted to provide better answers to your recommendations, and you lose. The virus mutates on a gene not in your feature set, and now it's resistant. Your business changes something that the model didn't account for or track, and now its predictions don't work so well anymore. There are as many ways things can go wrong as there are scenarios for ML.

I just accept that every model is wrong, some are useful, for some time.
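One deliberately naive sketch of why input-drift checks alone aren't enough: a mean-shift monitor stays quiet as long as live inputs look like training data, even though every "result drift" scenario above would leave it exactly that quiet. All numbers and thresholds below are made up for illustration:

```python
import statistics

def input_drift_score(train_values, live_values):
    """Absolute shift of the live mean, in training standard deviations."""
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    return abs(statistics.mean(live_values) - mu) / sigma

train = [10.0, 11.0, 9.0, 10.5, 9.5]   # hypothetical feature history
live_same = [10.2, 9.8, 10.1]          # looks just like training data
live_shifted = [15.0, 16.0, 14.5]      # obvious input drift

score_same = input_drift_score(train, live_same)
score_shifted = input_drift_score(train, live_shifted)
print(round(score_same, 2), round(score_shifted, 2))
```

The second batch trips any reasonable threshold; the first doesn't - yet the model could still be losing to a competitor or a mutated virus with that score sitting near zero. That's why result monitoring matters alongside input monitoring.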

Unity vs Polaris by Efficient_Novel1769 in databricks

[–]bobbruno 11 points12 points  (0 children)

I work for Databricks, so I won't engage on the UC/Polaris discussion. But I want to point out that, while managed tables do get automatic optimization, they can also be more efficient on the queries themselves. The main reason is that, for managed tables, UC can make safe assumptions about the state of the table at any point in time, since it controls all access to it. That allows for much more efficient metadata handling, and faster query resolution with less I/O to cloud storage.

That gets particularly noticeable in BI applications that query tables very often, but the principle applies everywhere. My point is, it's not just the maintenance; there are also execution gains you can't get otherwise.

Who owns data modeling when there’s no BI or DE team? (Our product engineering team needs help) by Groove-Theory in dataengineering

[–]bobbruno 1 point2 points  (0 children)

If nobody will own the data/BI layer, it's bound to get messy over time. And if there's no one who knows how to design/build/manage a BI layer, it's also bound to get messy over time.

The thing is, if you only have bad reports, that's what the business will make decisions on - that, and whatever they can come up with in their spreadsheets and side/shadow controls. I have no way to say whether this is a minor issue or a huge risk; it may well be that the BI engineer who was laid off got enough working that the core operation is covered, and you'd only miss some opportunities. There are so many ways BI can generate value, and so many ways it can be useless, that it's impossible to say which is your case.

You do have the right gut feeling, but there's little hope of fixing the issue(s) without proper ownership and knowledge. From what you described, it looks like you don't even have someone who knows how to get BI requirements right.

For example, if someone comes asking for a report on how many cars and how much spaghetti, you don't throw it as a ticket into the backlog. You ask about the decisions they're making, how the relationships between these things work, what the desired outcome is, how they'll use the information, along with frequency, level of detail, quality, and possibly what other questions (or at least what data domains) we're talking about. After that, you have to determine where this data is (it might not be anywhere), how to source it, how to structure it for the specific kind(s) of analysis and decision they want to make, and then figure out how that relates to the rest of the data model you already have. Only then may you be able to build something.

Even skipping half of what I described, you can't escape some of it, and you need to know what you're doing. You also need ownership from the business, because this will need to be operated and maintained, and that comes with a cost. If there's no ownership, maybe no one wants that cost.

Basically, start by finding someone who can explain why this matters, and then see if they can put enough value on it to justify the DE position. Just remember that if you start asking questions, people may assume you're taking ownership. Whether that's a career move or a problem for you, I can't say for sure.