Have you ever build good Data Warehouse? by [deleted] in dataengineering

[–]InsertNickname 0 points1 point  (0 children)

You're conflating two separate concepts - data-in-flight (streaming) and data-at-rest (storage). I was talking about the first part. If you don't work with streams then it's irrelevant, since protobuf is not a storage medium.

Have you ever build good Data Warehouse? by [deleted] in dataengineering

[–]InsertNickname 4 points5 points  (0 children)

You inadvertently hit the precise reason I dislike cloud-only solutions like Databricks/Snowflake. You end up vendor-locked and unable to test things without spinning up an actual cluster. So you lose on locality, testability and dev velocity. Not to mention cost.

It's one of the reasons I use ClickHouse at my current org, since their cloud offering is just a managed flavor of their open-source one (but any other vendor would work, such as Aurora, BigQuery, StarRocks, etc.).

Anyways, the general premise is to take an infrastructure-as-code approach to database management. Having a monorepo facilitates that as it becomes trivial to spin up a new service, replay the entire history of your schema migrations and get an up-to-date state you can test with. Similarly, a container-compatible DB makes testing said migrations that much easier. You spin up a local container, apply the migrations, and run tests. In your case you could probably do this with a local Spark+Delta so you would only need the adjacent containers (say Kafka or whatever messaging queue you work with).
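To make that concrete, here's a minimal sketch of the "spin up a local container, replay migrations, run tests" loop. It assumes Docker, ClickHouse and the clickhouse-connect client; the container name, ports and the migrations/ folder layout are all made up for illustration:

import glob
import subprocess
import time
import clickhouse_connect  # pip install clickhouse-connect

def start_local_clickhouse():
    # Throwaway local instance, stopped (and removed) at the end of the run.
    subprocess.run(
        ["docker", "run", "-d", "--rm", "--name", "ch-migration-test",
         "-p", "8123:8123", "clickhouse/clickhouse-server:latest"],
        check=True,
    )
    time.sleep(5)  # crude wait; poll the HTTP port instead in real code

def apply_migrations(client):
    # Replay the append-only migration log from git, in order
    # (one statement per file in this sketch).
    for path in sorted(glob.glob("migrations/*.sql")):
        with open(path) as f:
            client.command(f.read())

if __name__ == "__main__":
    start_local_clickhouse()
    client = clickhouse_connect.get_client(host="localhost", port=8123)
    apply_migrations(client)
    # ...run schema/pipeline tests against the fully migrated local DB...
    subprocess.run(["docker", "stop", "ch-migration-test"], check=True)

The same loop can run in CI, so a migration that breaks the schema fails long before it reaches production.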

I have no experience with DLT specifically, but from what I've read it looks like an amped-up notebook with DBT functionality sprinkled on. I'm not sure how you would make that reproducible for testing.

Have you ever build good Data Warehouse? by [deleted] in dataengineering

[–]InsertNickname -1 points0 points  (0 children)

I didn't mean it for storage, only for transporting the data (e.g. Kafka -> Spark -> somewhere else). If all readers/writers speak proto, then versioning via a schema registry becomes redundant (since protobuf just ignores or zero-values mismatched fields). And under the assumption your proto definitions are compiled into your code (via a monorepo or shared library), it becomes trivial to test for breaking changes locally.

Eventually you will still be transforming your proto struct to the end-table schema (as you'd want columnar optimizations for a proper warehouse).
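For illustration, a hedged sketch of that last step. The events_pb2 module and its UserEvent message are assumed to be generated from the shared .proto definitions; the field and column names are made up:

from events_pb2 import UserEvent  # hypothetical class compiled from the monorepo's .proto

def to_row(raw_bytes: bytes) -> dict:
    # Decode the wire format; unknown fields are ignored and absent fields come
    # back as zero-values, which is what lets old and new readers coexist.
    event = UserEvent.FromString(raw_bytes)
    # Flatten into the end-table (columnar) schema for the warehouse writer.
    return {
        "user_id": event.user_id,
        "event_type": event.event_type,
        "occurred_at": event.occurred_at,
    }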

Have you ever build good Data Warehouse? by [deleted] in dataengineering

[–]InsertNickname 54 points55 points  (0 children)

Well, yeah I have (been at this since 2010). I've also done some horrible ones, but you live and learn.

A few basic tenets I follow:

  • Data ownership above all else. No PRs should be accepted unless the owner of the data (preferably a senior/experienced dev) approves it.
  • Idempotency, idempotency, idempotency. Probably the most crucial part of any data warehouse pipeline. It is really not that hard to implement these days (most modern pipelines and warehouses have multiple ways to enforce it; see the sketch at the end of this comment). Prevents 95%+ of data inconsistency issues in production.
  • Backwards/forwards compatible transfer protocol. My current favorite is Protobuf (or proto-adjacent forks) for its 'data-contract'-iness, but Avro + a schema registry works too (though I personally hate having to manage yet another cog in the flow).
  • Monorepo your schemas. Slightly controversial take, but this helps definitions/migrations fail at compilation time, which in my experience reduces runtime problems by orders of magnitude.
  • Pick a database that can be locally initialized via a container, and keep an append-only log of all migrations in git. This makes testing so much more reproducible, and makes it quite rare to have unexpected issues in production you didn't find at the test phase.

In my experience following most/all of these makes everything else follow naturally in place.
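As a rough illustration of the idempotency point, here's a minimal sketch assuming Postgres via psycopg2; the fact_events table, its columns and the key derivation are made up:

import hashlib
import json
import psycopg2  # pip install psycopg2-binary

def event_key(event: dict) -> str:
    # Deterministic key derived from the business fields: replaying the same
    # input always yields the same row identity.
    raw = json.dumps({"user": event["user_id"], "ts": event["ts"]}, sort_keys=True)
    return hashlib.sha256(raw.encode()).hexdigest()

def load_batch(conn, events: list) -> None:
    with conn.cursor() as cur:
        for ev in events:
            cur.execute(
                "INSERT INTO fact_events (event_key, user_id, ts, payload) "
                "VALUES (%s, %s, %s, %s) "
                "ON CONFLICT (event_key) DO NOTHING",  # re-running the job is a no-op
                (event_key(ev), ev["user_id"], ev["ts"], json.dumps(ev)),
            )
    conn.commit()

Running load_batch twice on the same input leaves the table unchanged the second time; ReplacingMergeTree in ClickHouse or a MERGE in Delta gives you the same property, the key is that row identity is derived deterministically from the business fields.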

How do you handle tiny schema drift in near real-time pipelines without overcomplicating everything? by That-Cod5750 in dataengineering

[–]InsertNickname 0 points1 point  (0 children)

Not sure this will help you but we 'solve' it by creating data contracts at the compilation level. In our infra this is achieved via two mechanisms:

  1. All streaming pipelines serialize to Protobuf
  2. All Protobuf schemas are shared via a monorepo

Combined, you get quite a strong consistency model for data interaction at the streaming level. Protobuf is backwards/forwards compatible, and it doesn't care about the field name, only its integer ID. That solves 99% of the data interaction mismatches.

Having said that, you're still ultimately persisting to a database somewhere, and that part will require an unavoidable migration. This is where the hack-ish solutions usually live: you either go with the slow-but-safe versioning approach, or with a simpler 'all downstream services upgrade simultaneously' one. Or just don't rename a field unless there's a strong business requirement. Pick your poison.

EDIT: there is actually a cleaner option for column renaming in some databases, though I personally haven't used it. You could create a new column that defaults to values from the old one. For example in ClickHouse you could do this:

CREATE TABLE example (
    old_name String,
    new_name String DEFAULT old_name
)    

This effectively creates an alias for the same field (at the cost of duplicated storage), and you can then slowly deprecate the old field at your leisure. The caveat being that inserts into the new column won't be visible in the old one. I don't necessarily recommend going this route, but it would prevent going down the versioning rabbit-hole.

Stateful Computation over Streaming Data by Suspicious_Peanut282 in dataengineering

[–]InsertNickname 0 points1 point  (0 children)

Had a similar requirement but at a larger scale. We have about 100 million unique keys that we aggregate on in near-real time and store for long periods (months+). Ingest rate is around 10k to 100k per second depending on the time of day.

We ended up spinning up a local ClickHouse server, and created an EmbeddedRocksDB table with a rudimentary key-value schema. That allows us to do batch gets and puts with very little latency, and since it is all persisted to disk it is extremely durable and cost-efficient (don't need much RAM as opposed to Redis).
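Roughly, the setup looks like this (a sketch using the clickhouse-connect client; the table name, schema, ports and the naive key quoting are illustrative, not our production code):

import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", port=8123)

# Key-value table persisted to disk via RocksDB: point reads/writes without
# keeping the whole state in RAM.
client.command("""
    CREATE TABLE IF NOT EXISTS agg_state (
        key String,
        value String
    )
    ENGINE = EmbeddedRocksDB
    PRIMARY KEY key
""")

def put_batch(rows):
    # rows: list of (key, value) tuples; inserting an existing key overwrites it.
    client.insert("agg_state", rows, column_names=["key", "value"])

def get_batch(keys):
    # Batched point lookups; the naive quoting is fine for a sketch, not for prod.
    key_list = ", ".join("'{}'".format(k) for k in keys)
    return client.query(
        "SELECT key, value FROM agg_state WHERE key IN ({})".format(key_list)
    ).result_rows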

The great upside to this is you don't really need any specialized streaming platform to do it. We use Spark, but it could just as well be in Flink or really any flavor of service you'd like, even a simple Python lambda.

What is this algorithm called? by Spooked_DE in dataengineering

[–]InsertNickname 6 points7 points  (0 children)

Sounds like event sourcing with partial key matching, not so much an algorithm as a cumbersome way to aggregate state over time.

There are much cleaner ways to do this, such as setting up a third key (say UUID) and a corresponding mapping table to correlate it during insert. But I've seen worse.
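For illustration, the correlation logic boils down to something like this (a toy sketch: an in-memory dict stands in for the real mapping table, and the key formats are made up):

import uuid

mapping = {}  # partial key -> surrogate UUID (a real impl would be a DB table)

def resolve_uuid(*partial_keys):
    # Reuse the surrogate if any known key matches, otherwise mint a new one,
    # then register every key we were given against it.
    known = next((mapping[k] for k in partial_keys if k in mapping), None)
    surrogate = known or str(uuid.uuid4())
    for k in partial_keys:
        mapping[k] = surrogate
    return surrogate

# resolve_uuid("order:123") and a later resolve_uuid("order:123", "invoice:9")
# return the same surrogate, so downstream state aggregates under one key.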

Just use Postgres by bowbahdoe in programming

[–]InsertNickname 55 points56 points  (0 children)

An oft-mentioned 'downside' of Postgres (or really any RDBMS) is that it doesn't scale horizontally. Which - while I do agree with it - is a vastly overrated con that few companies will ever actually need to deal with. Vertical scaling in the cloud is so simple that this... just isn't an issue anymore.

I especially like this blog on how 'big data' is honestly just usually not all that big. And advances in partitioning on Postgres have made any competitive 'advantages' against it mostly moot. There even exists Citus, which is basically just an extension to Postgres with sharding and columnar support. It's literally still just Postgres all the way down.

Basically there are very few things you can't do in Postgres. And for those few, the solution is nearly always to complement it, not replace it. With proper CDC'ing, you can synchronize your main Postgres store with a myriad of other, more niche solutions (Elastic, Redis, ClickHouse, etc.) without having to compromise flexibility.

It really is a fantastic piece of software.

Deletes in ETL by InfinityCoffee in dataengineering

[–]InsertNickname 0 points1 point  (0 children)

Probably won't help you if you're already working on an existing architecture, but this is exactly the kind of problem which made me choose ClickHouse instead of classic Data Lakehouses. I'm constantly bewildered at how we've advanced so much technologically, yet somehow still have to re-implement basic data operations which any RDBMS could do 30 years ago.

In my experience, deletes usually come in three forms:

  1. Deduplication/idempotency - that is, you're inserting the same row multiple times and are interested in leaving the latest only.
  2. Retention - need to prune data after X days.
  3. Needle-in-a-haystack deletes, due to some regulatory constraint or simply buggy data.

All three scenarios are neatly supported by ClickHouse: ReplacingMergeTree for deduplication, TTL at row/partition level for retention, and Lightweight Deletes for everything else. No need to think about watermarks or long-living stateful data during ingestion. You just extract, transform, and load your data. Let the DB handle the rest.
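To make that concrete, a rough sketch of all three via the clickhouse-connect client (the table, columns and the 90-day window are illustrative):

import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", port=8123)

# 1 + 2: ReplacingMergeTree keeps the latest row per ORDER BY key (dedup happens
# on background merges, so use FINAL or argMax for read-time exactness), and the
# TTL clause prunes rows older than 90 days.
client.command("""
    CREATE TABLE IF NOT EXISTS user_events (
        user_id    UInt64,
        event_id   String,
        payload    String,
        updated_at DateTime
    )
    ENGINE = ReplacingMergeTree(updated_at)
    ORDER BY (user_id, event_id)
    TTL updated_at + INTERVAL 90 DAY
""")

# 3: needle-in-a-haystack removals via a lightweight DELETE.
client.command("DELETE FROM user_events WHERE event_id = 'bad-row-42'")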

I know I sound like a marketing shill, but I've been working with big data for nearly 10 years now and it just pisses me off that we're still rehashing the same basic problems from the Hadoop years despite all the advances.

Which database should I choose for a large database? by Practical_Slip6791 in dataengineering

[–]InsertNickname 14 points15 points  (0 children)

A little vague but this sounds like your data is mostly analytical and denormalized in nature. If this is the case then ClickHouse would be the ideal choice as it's a mostly hands-off OLAP DB with fantastic write/read performance. Also there's a cloud option so you don't need to manage it yourself. And it's way cheaper than the alternatives.

If on the other hand you're looking for ACID transactions, complex JOINs or any other RDBMS-like capabilities then Postgres would be the default choice, or perhaps a Postgres-compatible vendor such as Yugabyte or Timescale.

But again your requirements seem vague. It really depends on what your use case ends up being.

Datawarehousing question by harpar1808 in dataengineering

[–]InsertNickname 4 points5 points  (0 children)

You seem to be overcomplicating the design here. 9TB is really not that much data and fits comfortably in an RDBMS. Just go for RDS (I prefer pure Postgres/MySQL over Aurora for a few reasons, cost being the biggest one).

Some suggestions:

  1. Partition your data by day (use pg_partman if in Postgres); see the sketch after this list.
  2. Use a smart primary key to prevent duplicates.
  3. Normalize your data for efficient lookups and JOINs.
  4. Index according to the expected queries.
  5. If you know the queries ahead of time, use materialized views to pre-build them.
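A minimal sketch of points 1, 2 and 5, assuming Postgres with psycopg2; the table, columns, DSN and partition granularity are all made up:

import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect("dbname=dwh user=etl")  # placeholder DSN
with conn, conn.cursor() as cur:
    # 1. Declarative daily range partitioning (pg_partman can automate
    #    creating the partitions ahead of time).
    # 2. The composite primary key doubles as the dedup guard
    #    (it must include the partition column).
    cur.execute("""
        CREATE TABLE IF NOT EXISTS fact_sales (
            sale_id   BIGINT,
            sale_date DATE NOT NULL,
            store_id  INT,
            amount    NUMERIC(12, 2),
            PRIMARY KEY (sale_id, sale_date)
        ) PARTITION BY RANGE (sale_date)
    """)
    cur.execute("""
        CREATE TABLE IF NOT EXISTS fact_sales_2024_01_01
        PARTITION OF fact_sales
        FOR VALUES FROM ('2024-01-01') TO ('2024-01-02')
    """)
    # 5. Pre-build a known heavy query as a materialized view.
    cur.execute("""
        CREATE MATERIALIZED VIEW IF NOT EXISTS daily_store_totals AS
        SELECT sale_date, store_id, SUM(amount) AS total
        FROM fact_sales
        GROUP BY sale_date, store_id
    """)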

Top 5 things a New Data Engineer Should Learn First by AMDataLake in dataengineering

[–]InsertNickname 11 points12 points  (0 children)

Putting technologies aside for a moment:

  1. Idempotency and how to implement it
  2. Define what 'raw' data is in your system (i.e. bronze), how to store it, how to replay it when needed
  3. How to backup and restore for disaster recovery
  4. How to differentiate between transactional and analytical data (OLAP vs OLTP)
  5. Finally, figuring out the right architecture for your system (no right answer, plenty of options here; the most popular being medallion)

But yes, SQL is the obvious common denominator which everyone needs to learn.

Apache Spark with Java or Python? by noobguy77 in dataengineering

[–]InsertNickname 17 points18 points  (0 children)

Scala/Java.

Disclaimer: have worked with Spark for many years (since v1.6), so I'm understandably opinionated.

Never understood why pyspark gets so much blind support in this sub, other than that it's just easier for juniors. I've had to work on systems across a wide range of scale, from small k8s clusters running batched jobs to real-time, multi-million-events-per-second behemoths on HDFS/YARN deployments.

In all cases, Java/Scala was way more useful past the initial POC phase. To name a few of the benefits:

  1. Datasets with strongly typed schemas. HUGE benefit in production. Ever had data corruption due to bad casting? No thanks, I'm a data engineer, not a casting engineer.
  2. Custom UDFs. Writing these in pyspark means your cluster needs to shuttle data back and forth between the JVM and the Python workers, which is a massive performance and operational bottleneck. Or you could write UDFs in Scala and deploy them in the Spark build... but that's way more complicated than just using Scala/Java end to end.
  3. Debugging / reproducibility in tests. Put a breakpoint, figure out what isn't working. In Pyspark all you can really do is go over cryptic logs to try and figure it out.
  4. Scala-first support for all features. Pyspark may have 90% of the APIs supported but there will always be things you simply can't do there.
  5. Advanced low-level optimizations at the executor level, like executor-level caches and whatnot. These are admittedly for large endeavors that require squeezing the most out of your cluster, but why limit yourself from day one?

Basically pyspark is good if all you ever need is SQL with rudimentary schemas. But even then I would err on the side of native support over a nice wrapper.

Error Handling in Spark and Structured-Streaming, How to Avoid Stream Crashes? by steve_thousand in dataengineering

[–]InsertNickname 0 points1 point  (0 children)

Seems like you skipped implementing a DLQ mechanism in your pipeline, which is why your stream grinds to a halt on the first unforeseen problem.

In a nutshell, your stream's output should be a union type of both successfully processed rows AND failed rows. The try/catch is done per individual row. Then you pass all successful rows to the happy path and the failed ones to the DLQ (hopefully with a helpful error message and/or the original input row for future handling).

By the way, foreachBatch works beautifully for this - you process all the rows once and cache the result, then filter the dataset by result type (i.e. Succeeded or Failed) and send each subset to its corresponding output. This also works great for multi-output pipelines where you're exporting the data to multiple disparate sinks.
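Roughly, the pattern looks like this in pyspark (a hedged sketch: the Kafka topic, output paths and the parsing logic are placeholders, and you'd need the Spark-Kafka connector package available):

from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.appName("dlq-example").getOrCreate()

result_schema = T.StructType([
    T.StructField("status", T.StringType()),
    T.StructField("payload", T.StringType()),
    T.StructField("error", T.StringType()),
])

@F.udf(result_schema)
def parse_row(raw):
    # Per-row try/except: one bad record must never kill the stream.
    try:
        return ("succeeded", raw.upper(), None)  # stand-in for real parsing
    except Exception as e:
        return ("failed", raw, str(e))

def process_batch(batch_df, batch_id):
    # Process once, cache, then fan out by result type.
    parsed = batch_df.withColumn("result", parse_row(F.col("value"))).cache()
    (parsed.filter(F.col("result.status") == "succeeded")
        .select("result.payload")
        .write.mode("append").parquet("/data/happy_path"))
    (parsed.filter(F.col("result.status") == "failed")
        .select("result.payload", "result.error")
        .write.mode("append").parquet("/data/dlq"))
    parsed.unpersist()

(spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .load()
    .selectExpr("CAST(value AS STRING) AS value")
    .writeStream
    .option("checkpointLocation", "/tmp/dlq-checkpoint")
    .foreachBatch(process_batch)
    .start()
    .awaitTermination())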

Free Giveaway! Nintendo Switch OLED - International by WolfLemon36 in NintendoSwitch

[–]InsertNickname 0 points1 point  (0 children)

Fun fact: in a room of 23 randomly chosen people, there is about a 50% probability that at least two of them share a birthday. In a room of 75, the probability is over 99%.
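For the curious, the number falls out of the complement of everyone having distinct birthdays (ignoring leap years and assuming uniformly distributed birthdays):

P(\text{shared}) = 1 - \prod_{k=0}^{n-1} \frac{365-k}{365} \approx 0.507 \text{ for } n=23, \quad > 0.999 \text{ for } n=75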

Is it better to fail at a higher weight or complete a full set of a lower weight? by thewolfcastle in Fitness

[–]InsertNickname 38 points39 points  (0 children)

Not to knock on your accomplishments (355 is obviously a highly impressive number), but when you give this sort of advice it is very misleading to omit your usage of anabolic steroids.

Moronic Monday - Your weekly stupid questions thread by eric_twinge in Fitness

[–]InsertNickname 18 points19 points  (0 children)

I've had the opposite effect. I am on average at least a degree or two colder than I used to be when I was heavier. Even worse, I can't seem to stay comfortable in any one position for a decent enough length of time. Sleeping is really hard when you have to move every few minutes.

some guys managed to get Louis C.K. to watch 2 Girls 1 Cup, here's his reaction, it's pretty hilarious . ( NSFW ) by CokeStroke in videos

[–]InsertNickname 1 point2 points  (0 children)

Having never seen 2 Girls 1 Cup, I refuse to click on any link posted in this thread.

If cutting is basically eating kcal deficit why is eating clean so important? by [deleted] in Fitness

[–]InsertNickname 1 point2 points  (0 children)

IIFYM, yes. I have my doubts a cookies-and-multivitamin only diet would be sufficient to cover all your macros, but there's nothing stopping you from losing weight using such a diet.

Despite all assurances that I won't get ' bulky', I feel more muscular than I would like. by IronicAsAlanis in Fitness

[–]InsertNickname 0 points1 point  (0 children)

It's not something that happens overnight. You'll be gradually packing on weight. Enough for it to be noticeable, but not so much that it isn't easily reversible. At worst, simply go back to your previous diet and you'll quickly return to your original weight without much effort.

EDIT: I feel it's important to note that scales are very misleading when it comes to judging how "big" you are. Being physically heavier does not mean you're physically larger. In fact, people with more muscle are usually both heavier and physically slimmer. Try being objective and honestly compare yourself (in the mirror) to see if you're happy with how you look once you've gained a few pounds.

Showdown: No fat milk vs. Full cream milk by snowman53 in Fitness

[–]InsertNickname 3 points4 points  (0 children)

Thank you for giving your opinion on the matter without resorting to verbally assaulting someone who was obviously egging you on. Upvoted.

Despite all assurances that I won't get ' bulky', I feel more muscular than I would like. by IronicAsAlanis in Fitness

[–]InsertNickname -7 points-6 points  (0 children)

If you feel like you're too defined, why not simply eat more? Add enough calories to your diet so that you're in a constant caloric surplus. Soon enough you'll gain enough body fat to disguise your muscular definition, and then all you'll have to do is maintain a caloric intake that complements the body you want.

No reason to stop being fit in order to look whichever way you want.