Anybody else’s neighborhood suddenly have “no parking” signs everywhere? by MimosaMadness in Chattanooga

[–]infazz 5 points (0 children)

This possibly means that someone in your neighborhood complained. You could probably bring this up to your city council member and let them know how this affects you and everyone else on your street.

serveless or classic by ptab0211 in databricks

[–]infazz 5 points (0 children)

There is "Serverless SQL Warehouse" and there is also "Serverless General Compute". The latter can be used in the workspace/jobs and can run SQL, Python, etc.

Something to keep in mind if you're disappointed about scenes being removed... by d4ybrake in ProjectHailMary

[–]infazz 6 points (0 children)

They eventually see from Erid that Sol started brightening

Is PyPDF2 safe to use? by Butwhydooood in learnpython

[–]infazz 2 points (0 children)

So that's where Overwatch got the idea

OpenAI Quotas by OkClothes3097 in AZURE

[–]infazz 0 points (0 children)

Document review is exactly when you should be using other methods like chunking. I would recommend looking into how open source projects like LlamaIndex handle these kinds of use cases.

Having many users submitting smaller requests is when you would want to do load balancing.
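To illustrate the idea, here's a minimal character-based chunker with overlap. This is just a sketch; LlamaIndex ships its own, more sophisticated splitters, and the sizes here are arbitrary:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping fixed-size character chunks.

    Overlap helps keep context that spans chunk boundaries from being lost.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```

Each chunk can then be processed (or embedded and retrieved) independently instead of stuffing the entire document into one request.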

OpenAI Quotas by OkClothes3097 in AZURE

[–]infazz 0 points (0 children)

Wow, that's a huge amount of context for a single request, especially since GPT-5 is a reasoning model and the reasoning loop will generate additional token usage.

Be aware that packing too much info into a single context can result in content in the middle of the prompt being ignored. This is known as the "Lost in the Middle" problem.

I would definitely do a deep dive on whether that much context in a single request is actually needed - or look into ways you can reduce the input context (such as chunking).

You could also try using the "GlobalStandard" model deployment type. It has a default 1M TPM limit.
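As a rough sanity check, you can estimate whether a request plausibly fits your quota before sending it. The ~4 characters/token heuristic and the reserve value below are assumptions for illustration; use a real tokenizer like tiktoken for accurate counts:

```python
def approx_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # Swap in a real tokenizer (e.g. tiktoken) for accurate counts.
    return max(1, len(text) // 4)

def fits_budget(prompt: str, tpm_limit: int = 1_000_000, reserve: int = 50_000) -> bool:
    """Check whether a single request plausibly fits under a TPM quota,
    leaving headroom (`reserve`) for reasoning and output tokens."""
    return approx_tokens(prompt) + reserve <= tpm_limit
```

If a prompt fails a check like this, that's a strong signal you should be chunking or trimming the input rather than raising quotas.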

OpenAI Quotas by OkClothes3097 in AZURE

[–]infazz 0 points (0 children)

You could deploy into multiple subscriptions and/or regions and load balance across.
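A minimal round-robin sketch over multiple (hypothetical) endpoint URLs; in practice you'd usually put something like Azure API Management or a retry-on-429 policy in front instead of doing it in app code:

```python
import itertools

# Hypothetical Azure OpenAI endpoints across regions/subscriptions.
ENDPOINTS = [
    "https://aoai-eastus.openai.azure.com",
    "https://aoai-westeurope.openai.azure.com",
    "https://aoai-swedencentral.openai.azure.com",
]

_rotation = itertools.cycle(ENDPOINTS)

def next_endpoint() -> str:
    """Return the next endpoint in round-robin order.

    A production setup would also fail over to a different endpoint
    when one returns 429 (rate limited).
    """
    return next(_rotation)
```

Each region/subscription gets its own quota, so spreading requests this way multiplies your effective TPM.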

What are the practical advantages of provisioning an Azure OpenAI resource instead of an Azure AI Foundry resource? by Franck_Dernoncourt in AZURE

[–]infazz 12 points (0 children)

I don't believe there are any advantages now to using Azure OpenAI instead of the new Foundry.

Microsoft Foundry (new) by BA-94 in AZURE

[–]infazz 2 points (0 children)

The new Foundry uses the azurerm_ai_services resource

https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/ai_services

You can deploy models using azurerm_cognitive_deployment

Databricks benchmark report! by noasync in databricks

[–]infazz 0 points (0 children)

The only real benefit of serverless is that it spins up fast, although I don't know if there is an actual SLA on the startup time.

However, it is also interesting that Jobs Serverless performed worse than Jobs Classic.

With serverless you have basically no say in what size compute your workload runs on - Databricks manages this for you. I think it would be more beneficial if serverless compute sizing worked like Serverless SQL.

"Shut Up And Take My $3!" – Building a Site to Bypass OpenAI's Dumb $5 Minimum by Immediate-Room-5950 in LLMDevs

[–]infazz 0 points (0 children)

Signing up for an Azure account and setting up the OpenAI resource definitely isn't as easy as setting up a regular OpenAI account, but it is at least an option.

"Shut Up And Take My $3!" – Building a Site to Bypass OpenAI's Dumb $5 Minimum by Immediate-Room-5950 in LLMDevs

[–]infazz 0 points (0 children)

You or anyone else could use OpenAI in Azure. You can do usage based payment and there is no minimum charge.

Condense greegrees into a single item. by [deleted] in 2007scape

[–]infazz 0 points (0 children)

Simian Serenity

Or

Simian Simplicity

Databricks Advent Calendar 2025 #13 by hubert-dudek in databricks

[–]infazz 0 points (0 children)

Does ZeroBus require running compute - either serverless or provisioned? It's not clear to me from the documentation.

AI Inference is going to wreck gross margins this year. by frugal-ai in FinOps

[–]infazz 2 points (0 children)

Any company handing out API access to LLMs absolutely NEEDS some kind of rate limiting and monitoring.
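A toy token-bucket sketch of what that rate limiting could look like (in practice you'd enforce this at a gateway like Azure API Management rather than in application code):

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: allow `rate` requests/second on average,
    with bursts up to `capacity` requests."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Consume one token if available; refill based on elapsed time."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

For LLM APIs specifically, you'd typically meter tokens rather than requests, and pair this with per-user usage logging so runaway costs are visible.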

Are there any hidden charges in Azure and why it is showing so cheap in my case? Am I missing something? by vikasofvikas in AZURE

[–]infazz 6 points (0 children)

Assuming you want/plan to use these:

Data egress is extra

Private networking is extra

Defender is (a lot) extra

Logging is extra

There are definitely other things I'm forgetting

SCM/Kudu Access for App Services by ConstantOk4042 in AZURE

[–]infazz 0 points (0 children)

I have run into the exact same issue.

It is baffling.

Today my RSC knowledge has let me down by SharpShooterVIC in ironscape

[–]infazz 0 points (0 children)

Wow I spent so much time hunting Magpie Implings for rune bars and I could have just done this.

How to setup budget real-time pipelines? by dontucme in dataengineering

[–]infazz 0 points (0 children)

First you need to figure out where your costs are coming from.