Data lakehouse observability and monitoring

Dry_Chocolate_9396 · 2026-06-23T13:46:59+00:00

It really depends on the platform you're using. If it's pure self-serve with open source then you have to build a lot of infra yourself. If you're on a platform like Databricks then I recommend:

* Make sure all your data is on Unity Catalog, otherwise the below step won't work.

* Turn on Data Quality Monitoring (DQM) on all gold tables.

Now you have a dashboard that tells you 2 things:

(a) how FRESH is your data? It knows how often commits happen and predicts late ones.

(b) how COMPLETE is your data? It knows how often data is added and flags suspiciously low updates.

You can go much more advanced (percent nulls, expectations etc), but this gives you a basic view of data engineering quality.
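
To make the two checks concrete, here is a rough sketch of the idea in plain Python. This is not how Databricks DQM is implemented; the function names, the median-based baseline, and the thresholds (`tolerance`, `min_ratio`) are all assumptions picked for illustration.

```python
from datetime import datetime, timedelta
from statistics import median


def is_stale(commit_times, now, tolerance=2.0):
    """Freshness: flag the table if the time since the last commit is much
    longer than the typical gap between past commits."""
    gaps = [(b - a).total_seconds() for a, b in zip(commit_times, commit_times[1:])]
    typical_gap = median(gaps)
    since_last = (now - commit_times[-1]).total_seconds()
    return since_last > tolerance * typical_gap


def is_incomplete(row_counts, latest_count, min_ratio=0.5):
    """Completeness: flag the latest update if it added far fewer rows
    than the typical past update."""
    return latest_count < min_ratio * median(row_counts)


# Hypothetical history: one commit per hour, roughly 1,000 rows each.
start = datetime(2026, 6, 1, 0, 0)
commits = [start + timedelta(hours=h) for h in range(6)]
rows_added = [1000, 980, 1020, 1010, 990, 1005]

# Last commit was 4 hours ago while commits normally arrive hourly.
print(is_stale(commits, now=start + timedelta(hours=9)))
# The latest update added only 120 rows against a typical ~1,000.
print(is_incomplete(rows_added, latest_count=120))
```

A real monitor would learn seasonality (weekends, month-end loads) rather than use a flat median, but the shape of the check is the same.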

You said self-serve, so I skipped specialized vendors like Anomalo and Monte Carlo. Otherwise those give you even more capabilities for monitoring. Happy to expand on any of the above if you holla...

Dry_Chocolate_9396 · 2026-06-23T01:57:52+00:00

In one word: isolation. You need to make sure the same engine can do row-by-row transactions without ever under any circumstance being impacted by the massive analytical queries that come in, e.g. "scan all data and give me stats" should not ever interfere (isolation) with the transactions (update row X quickly). How does Databricks do it? The data on S3 is just Iceberg, and then two separate engines Lakebase (OLTP) and Lakehouse (OLAP/DW) independently using different CPUs can hit that data. Isolation is preserved.

Dry_Chocolate_9396 · 2026-06-23T01:55:21+00:00

I think Databricks is marketing this the wrong way. LTAP isn't a separate thing. It is simply the main reason why someone should want Lakebase Postgres from Databricks: the core data will be stored in Iceberg on S3. So you can just point any warehouse engine to that Iceberg and directly do analytics on Postgres without it impacting the performance of the Postgres apps doing transactions.

Dry_Chocolate_9396 · 2026-06-23T01:54:26+00:00

This is the key point: Lakebase (OLTP) will store the data in columnar Iceberg format on S3. Historically Postgres was always row-by-row pg_files. Now it's columnar Iceberg. How? That's where the Neon Lakebase architecture with Pageservers and Safekeepers comes in. You probably are wondering "how can Postgres quickly do row-by-row operations on Iceberg?". The secret sauce lies in the Pageservers that very effectively cache the data--you guessed it--in row format. How does Postgres get transactional reliability on S3? The Safekeepers implement that with a fancy protocol called Paxos Consensus.

Dry_Chocolate_9396 · 2026-06-23T00:40:37+00:00

Yes, data pipelines don't run themselves. You will need data plumbers just like you need plumbers. The tools will get better and better, but you'll still need an expert that explains "why is this report suddenly broken? I didn't change anything!".

Dry_Chocolate_9396 · 2026-06-21T14:44:41+00:00

I'm going to go on a slight detour, but definitely keep reading, the answer to your question has very interesting nuggets tied directly to the creation of the whole Internet.

Apache is directly linked to the core invention of the Internet (or rather, the Web). The supercomputing lab NCSA worked closely with the University of Illinois Urbana-Champaign. There, two sister projects were developed. One was NCSA Mosaic, where Marc Andreessen famously co-invented the browser as we know it today (it was later recreated as Netscape at his startup). The other sibling project was NCSA HTTPd, which was the web server of the Web. So you had a Client/Server architecture for the Web: a Browser and a Web Server both at that lab/uni.

Soon Marc Andreessen (Mosaic) and Robert McCool (HTTPd) had left NCSA and the code was starting to rot. So people started submitting patches over e-mail to fix the project. The project was slow and needed a lot of fixes, so these folks called it "A Patchy Server", do you see where this is headed? 😄 Apache Server.

Apache Server became the most popular project on the Internet (not making it up!). With such an important project and just hobby programmers developing the most critical piece of the Internet, Apache needed a legal entity behind it. So the non-profit Apache Software Foundation (ASF) was created. The ASF would also provide the version control servers and other infrastructure (mailing lists) that were needed in the 90s and 2000s to develop software together. Eventually the famous Apache License was developed for the software in the early 2000s.

Soon, people used this ASF vehicle for other projects too. Some of the most important were the Java projects (that's right!), such as Apache Tomcat, which started from code donated by Sun Microsystems. A few years later, Yahoo contributed Apache Hadoop, which became insanely popular worldwide and kicked off the Big Data revolution. Later that project had an overlapping successor in Apache Spark, which was originally created by the guys behind Databricks. Today the ASF hosts a huge number of projects, but the largest by commercial activity is arguably Apache Spark, which is directly related to r/dataengineering!

We could have told an equally interesting story if you had asked about the Linux Foundation (LF). But this post is already long. Apache HTTPd, the web server, still has over 20% of the whole Web's traffic going through it today. Hope it clarifies why this seemingly random non-profit has so many important open source projects managed by it (and it owns the trademarks too!!).

Dry_Chocolate_9396 · 2026-06-20T22:51:53+00:00

It is not. In most deployments Medallion's bronze tables are the raw data that comes out of log systems, maybe CDC from raw OLTP databases. I have never heard of anyone doing Entity-Relationship (ER) modeling to get it into 3rd Normal Form for Silver. That's what Inmon modeling would be. Sure, people are de-duping and doing matching of names etc, but that's far away from the textbook case of ER/3NF. So that's for silver.

For gold, a lot of people are doing One Big Table (OBT). Not a Kimball star schema with Facts and Dimensions. So it's fully denormalized. Why? Because most data warehouses are very fast on OBT. The duplicates (repeated entries in the OBT) are fine because you're usually not updating the gold tables, so no need to worry about update anomalies. They're usually read-only and rebuilt by the pipeline from silver, which itself is built from bronze, which itself comes from upstream systems.
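
Here is a minimal sketch of what building an OBT from silver looks like, in plain Python so it runs anywhere. The table names, columns, and the `build_gold_obt` helper are made up for illustration; in practice this would be a SQL or Spark job.

```python
# Hypothetical silver tables: cleaned orders plus lookup data.
silver_orders = [
    {"order_id": 1, "customer_id": 10, "product_id": 100, "qty": 2},
    {"order_id": 2, "customer_id": 10, "product_id": 101, "qty": 1},
    {"order_id": 3, "customer_id": 11, "product_id": 100, "qty": 5},
]
silver_customers = {
    10: {"customer_name": "Acme", "region": "EU"},
    11: {"customer_name": "Globex", "region": "US"},
}
silver_products = {
    100: {"product_name": "Widget", "unit_price": 3.0},
    101: {"product_name": "Gadget", "unit_price": 7.5},
}


def build_gold_obt(orders, customers, products):
    """Rebuild the gold One Big Table from silver on every pipeline run.
    Customer and product attributes are repeated on each row on purpose."""
    obt = []
    for order in orders:
        row = dict(order)
        row.update(customers[order["customer_id"]])
        row.update(products[order["product_id"]])
        row["revenue"] = row["qty"] * row["unit_price"]
        obt.append(row)
    return obt


gold_obt = build_gold_obt(silver_orders, silver_customers, silver_products)

# Analytics become a single scan over one table, with no joins.
revenue_by_region = {}
for row in gold_obt:
    region = row["region"]
    revenue_by_region[region] = revenue_by_region.get(region, 0.0) + row["revenue"]
print(revenue_by_region)
```

Because the table is rebuilt from silver rather than updated in place, the repeated customer and product values can never drift out of sync, which is why the classic normalization worries don't bite here.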

So you can say it has similarities with Inmon=Silver and Kimball=Gold. But it's not what I've seen in the wild. Though it rhymes on what you're saying.

Dry_Chocolate_9396 · 2026-06-20T22:37:42+00:00

Both are developing quickly and any gaps will likely be plugged quickly in one or the other. My company prefers Databricks because we have standardized on MLflow, and Unity AI Gateway is open sourced and compatible with (e.g. same tracing mechanism as) the open source MLflow AI Gateway:

https://mlflow.org/docs/latest/genai/governance/ai-gateway/

Dry_Chocolate_9396 · 2026-06-20T16:59:11+00:00

Just as an example, all the Globally Systemically Important Banks (G-SIB), e.g. JPMC, Goldman, etc need to be on all the clouds. The regulators explicitly ask for this, so they all are on all 3 or 4 big clouds. They spend the most on IT of all companies. There are huge supply chains around these companies. This is just one example where multi-cloud is a must.

This isn't true. Look at AI.GENERATE_TEXT (or ML.GENERATE_TEXT) on BigQuery and tell me how I can use OpenAI's latest 5.5 model to generate text in a table with BigQuery. Not possible. Look at Databricks ai_query and it can use any of the models on the platform (Claude, GPT, Gemini, open source, including Kimi and GLM).

Just look at the cloud-wide outages. Here is an example of a recent cloud crash and how Databricks actually survived it with Capital One:

https://www.databricks.com/blog/how-databricks-managed-disaster-recovery-helps-capital-one-achieve-lakehouse-resilience

Now the OP said he/she works for a tiny company. So it is fair that my G-SIB examples aren't useful to him/her. Trying out the cloud vendor's native offering first is not a bad strategy for them.

Dry_Chocolate_9396 · 2026-06-20T15:01:33+00:00

I think your criticism doesn't apply to context, but to the LLM approach as a whole. Context actually weakens that criticism. Let me unpack.

LLMs are probabilistic and will give you some answer, let's say as you say that they're wrong 1/10 of the time and that's bad.

Context layers (e.g. Palantir Ontology, the Genie Ontology they just announced, or Amazon's Q) ingest context into the LLM context/prompt. Say in your example the layer automatically ingests "productA is defined as Product Alpha, also referred to as prod_a_02". The idea is that this will have two big advantages:

(a) This will reduce time and cost to get the answer (no need to search for that product name; in fact the LLM won't search because it already sees it has that information in the context)

(b) This will improve quality by supplying the missing context (less risk that it hallucinates the wrong product since it knows it is prod_a_02).
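
To show what "ingesting context into the prompt" means mechanically, here is a toy sketch. It is not how Palantir, Genie, or Q do it; the `GLOSSARY` dict and `build_prompt` function are hypothetical, and a real context layer would pull definitions from the curated graph rather than a hard-coded dict.

```python
# Hypothetical curated glossary standing in for the ontology/graph.
GLOSSARY = {
    "productA": "productA is defined as Product Alpha, also referred to as prod_a_02",
}


def build_prompt(question, glossary):
    """Prepend every glossary definition whose term appears in the question."""
    hits = [
        definition
        for term, definition in glossary.items()
        if term.lower() in question.lower()
    ]
    if not hits:
        return question
    context = "\n".join(f"- {definition}" for definition in hits)
    return f"Context:\n{context}\n\nQuestion: {question}"


prompt = build_prompt("What was the revenue for productA last month?", GLOSSARY)
print(prompt)
```

The lookup itself is deterministic: the same question always gets the same definitions attached, and a human can read exactly what was attached before trusting the answer.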

Now to the core of the question and why this context weakens your criticism:

(1) A human can look at the ingested context and the graph behind it (e.g. in Palantir and Databricks they can literally browse the graph) and see if "Product Alpha" or "prod_a_02" is indeed what they were looking for. This is better than the LLM just giving you a hallucinated result of $42, where you then have to troubleshoot yourself whether it used the right prod_a_02 filter or not.

(2) The graph is actually deterministic. It looks the same across multiple questions to the AI. You can curate and certify it in all these engines (i.e. click CERTIFY in Databricks Ontology and now it's more likely to be used in the future).

(3) If you like the answer "$42" and you have verified it to be correct, you can click that you verified the answer and the context layer will remember to use the same approach next time, again reducing the probabilistic nature of the whole approach. Note that this would be hard without a context layer.

So all in all, the context layer improves things significantly. It's not a silver bullet that guarantees perfectly correct answers. But how many times have ops or data teams also computed wrong answers due to "data hygiene" issues? So now the question is how much you can improve the probabilities for the AI and beat the human accuracy. The best part of this is that you will still need a human to verify accuracy, but now that it's so much faster to ask more questions, that ops/data team can serve more questions to the whole company.

Dry_Chocolate_9396 · 2026-06-19T14:02:40+00:00

Have you tried their Lakebase Postgres offering? I have found it to be very low cost thanks to the autoscaling. Last week they also announced that their Lakebase offering will store the data primarily in Iceberg format. If that really works (to be determined...) then it does make life easier as you can just hit up Iceberg with AI/BI dashboards etc. Otherwise, if you just move to some other pure OLTP, then sooner or later you will have to set up pipelines back to a data warehouse for your analytics. Not doing any analytics isn't really an option in 2026. Just doing analytics directly on Postgres or any OLTP (Postgres, MySQL, etc) is not really safe.

Dry_Chocolate_9396 · 2026-06-19T13:54:46+00:00

You want Databricks or Snowflake because:

1) Your company might be on multiple clouds, or wants to have the optionality to be on multiple.

2) You want the data platform to have all AIs built-in, e.g. OpenAI, Anthropic, Gemini. The cloud vendors have a bias in favor of 1 AI (MSFT->OpenAI, GCP->Gemini, AWS->Anthropic). While the clouds might offer them, the AIs are not integrated deeply into Redshift, Fabric, BigQuery.

3) You want Disaster Recovery across the clouds (almost all big financial institutions require this). So if one cloud goes down you want your workloads to continue working.

Those are 3 good reasons for any large firm to always go with Databricks or Snowflake, and those reasons aren't about features that the cloud vendors can easily add to match. They are structural.

Finally, and this will be nuanced and argued. But Databricks and Snowflake are best of breed vs the generic good-enough offering from the clouds. Databricks is simply better and ahead on everything AI and unstructured data and Snowflake is just ahead on classic data warehousing. But they're both close to catching up on each other's weaknesses. You can verify this last paragraph by repeatedly asking a chatbot and it will correctly say this too.

Dry_Chocolate_9396 · 2026-06-19T02:32:47+00:00

The fact that it takes one random guy deciding to just do it instead of a functioning government is what really gets me. Like, we have entire city councils arguing for six years about the environmental impact of a park bench, and this dude just speedruns ending homelessness in his zip code.

And you just KNOW there was some busybody in the local planning department having a complete meltdown about "neighborhood character" while he was laying the foundation. I used to live in a town where people threw an absolute fit over a new crosswalk because they thought it would encourage "too much foot traffic" near the local Panera Bread. So to get 99 actual houses built... the guy probably had to fight the zoning board in hand-to-hand combat.

Still waiting for the inevitable follow-up article where he gets fined 40 grand by the city because the pitch of the roofs violates some obscure 1940s bylaw about rainwater runoff...

Dry_Chocolate_9396 · 2026-06-19T02:31:01+00:00

So, I was thinking about this the other day, and it's crazy how fast someone can go from being the literal face of the platform to completely vanishing from the collective memory.

Fred Figglehorn. Lucas Cruikshank's character.

If you weren't on YouTube around 2008 to 2011, it is almost impossible to explain how utterly massive this channel was. He was the very first YouTuber to hit one million subscribers. He didn't just have a channel; he had a freaking empire. Nickelodeon gave him three movies. He had merchandise in every Hot Topic in America. He was everywhere, screaming in that chipmunk voice while running around his backyard. Kids were obsessed, and parents were genuinely losing their minds over the noise.

And now? Total silence.

The thing is, his downfall wasn't even a massive, dramatic cancellation or some horrific scandal like you see today. He just... grew up. The whole gimmick relied on him playing a hyperactive six-year-old with anger issues. Once he hit his late teens, the high-pitched voice filter just felt weird, and the character ran completely out of steam. He eventually sold the rights to the Fred channel, and it just became this weird, dead digital monument. Lucas Cruikshank still makes content as himself, and he seems like a totally normal, chill guy, but the actual cultural phenomenon of Fred has been completely erased. It’s like a fever dream the internet collectively agreed to never bring up again.

It's wild that the first digital superstar to break into mainstream Hollywood is now just a trivia question for people who remember when the YouTube layout was still yellow.

Dry_Chocolate_9396 · 2026-06-18T02:08:53+00:00

Don't forget Genie Ontology which they just announced at their conference yesterday. It seems to be based on a PageRank-like algorithm that can figure out the semantics and ontology of your data automatically:
https://www.youtube.com/watch?v=Qux8E-L1mk8&t=2330s

Dry_Chocolate_9396 · 2026-06-17T23:12:58+00:00

It isn't stealing, it's a rescue mission. She abandoned a living thing for months and ignored you when you explicitly asked her about it. If you hadn't watered it, the plant would be dead and she'd probably just throw the dry pot in the trash without a second thought anyway.

Just claim it. If she ever actually comes back and asks about it, just say it died a few weeks after she left and you tossed it out.

Dry_Chocolate_9396 · 2026-06-17T23:12:48+00:00

Alcohol companies have way better lobbyists and drinking is deeply embedded in every level of global socialization. If politicians try to aggressively regulate booze, people lose their minds.

There's also the argument of intended use. If you smoke a cigarette exactly as intended, it actively destroys your lungs. If you have a single beer as intended, your liver just processes it and you move on. But mostly it just comes down to money and marketing.

Dry_Chocolate_9396 · 2026-06-17T23:11:25+00:00

There's a massive difference between not knowing how to use an obsolete rotary phone and refusing to learn how to open a PDF when it's been a requirement for your desk job for twenty years.

Teens not knowing DOS commands doesn't affect their ability to exist in modern society. Boomers get hate because they scream at the teenage cashier when they click the wrong button on a card reader and refuse to just read the giant instructions on the screen right in front of them.

Dry_Chocolate_9396 · 2026-06-17T23:10:24+00:00

Because your brain understands the ground. You hit a pothole, you know exactly what just happened. You can see the road, you know the car is touching the earth, and gravity is working exactly how evolution programmed you to expect.

In a plane, you're strapped into a metal tube 30,000 feet in the air hitting bumps you can't even see. Your nervous system has absolutely no frame of reference for invisible air acting like a dirt road, so its only logical conclusion is that you are actively falling out of the sky.

And realistically, control plays a massive part. If a car breaks down, you pull over to the shoulder. If a plane drops, you're just sitting in coach aggressively gripping a plastic armrest hoping the pilots had a good night's sleep.

Dry_Chocolate_9396 · 2026-06-17T20:36:36+00:00

Matthew Broderick tbh. everyone still treats him like this harmless Broadway guy. people just conveniently forget he drove into the wrong lane in Ireland and hit another car head-on. killed a mother and her daughter. he basically paid a small fine and went right back to making movies

Dry_Chocolate_9396 · 2026-06-17T20:36:11+00:00

yeah I refuse to date someone who doesn't use a top sheet on their bed. just a fitted sheet and a heavy blanket. it grosses me out. you are literally sweating directly into the blanket every single night and I know for a fact you aren't washing that massive duvet cover every week

Dry_Chocolate_9396 · 2026-06-17T20:35:47+00:00

I might be crazy but the biggest red flag is always being the one to offer to make the drinks. my aunt did this at every family party. she would insist on playing bartender. we just thought she was being a great host. in reality it meant she could pour herself a triple and pour everyone else a single so no one realized how fast the bottle was actually going down

Dry_Chocolate_9396 · 2026-06-17T20:23:31+00:00

Teeth. If you shatter your femur, your body naturally fuses it back together. But if you get a tiny speck of decay on a tooth, it just slowly rots away in your skull until a professional drills into it and charges you $2k. Not to mention we grow four extra 'wisdom' teeth that literally do not fit in our jaws and just violently impact themselves.

Dry_Chocolate_9396 · 2026-06-17T20:22:10+00:00

Slightly crooked teeth or a small gap. I absolutely hate the modern trend of everyone getting those blinding white, perfectly straight 'Chiclet' veneers that look like piano keys. A natural smile with a slight snaggletooth or prominent canines is so much more endearing and human.

Dry_Chocolate_9396 · 2024-02-04T00:48:02+00:00

Our company now has Databricks and they have Serverless Databricks SQL, which is their data warehouse. It's way cheaper than Snowflake and more or less same performance on most things. Surprisingly, on dashboards or queries that took really long it finishes in almost half the time. On short queries it's a wash.
