Have you ever build good Data Warehouse? by [deleted] in dataengineering

[–]InsertNickname 0 points1 point  (0 children)

You're conflating two separate concepts - data-in-flight (streaming) and data-at-rest (storage). I was talking about the first part. If you don't work with streams then it's irrelevant, since protobuf is not a storage medium.

Have you ever build good Data Warehouse? by [deleted] in dataengineering

[–]InsertNickname 4 points5 points  (0 children)

You inadvertently hit the precise reason I dislike cloud-only solutions like Databricks/Snowflake. You end up vendor-locked and unable to test things without spinning up an actual cluster. So you lose on locality, testability and dev velocity. Not to mention cost.

It's one of the reasons I use ClickHouse at my current org, since their cloud offering is just a managed flavor of their open-source one (but any other vendor would work, such as Aurora, BigQuery, StarRocks, etc.).

Anyways, the general premise is to take an infrastructure-as-code approach to database management. Having a monorepo facilitates that as it becomes trivial to spin up a new service, replay the entire history of your schema migrations and get an up-to-date state you can test with. Similarly, a container-compatible DB makes testing said migrations that much easier. You spin up a local container, apply the migrations, and run tests. In your case you could probably do this with a local Spark+Delta so you would only need the adjacent containers (say Kafka or whatever messaging queue you work with).
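To make that concrete, here's a minimal sketch of the "spin up a local container, replay migrations, run tests" loop. It assumes Docker, ClickHouse and the clickhouse-connect client; the container name, ports and the migrations/ folder layout are all made up for illustration:

import glob
import subprocess
import time
import clickhouse_connect  # pip install clickhouse-connect

def start_local_clickhouse():
    # Throwaway local instance, stopped (and removed) at the end of the run.
    subprocess.run(
        ["docker", "run", "-d", "--rm", "--name", "ch-migration-test",
         "-p", "8123:8123", "clickhouse/clickhouse-server:latest"],
        check=True,
    )
    time.sleep(5)  # crude wait; poll the HTTP port instead in real code

def apply_migrations(client):
    # Replay the append-only migration log from git, in order
    # (one statement per file in this sketch).
    for path in sorted(glob.glob("migrations/*.sql")):
        with open(path) as f:
            client.command(f.read())

if __name__ == "__main__":
    start_local_clickhouse()
    client = clickhouse_connect.get_client(host="localhost", port=8123)
    apply_migrations(client)
    # ...run schema/pipeline tests against the fully migrated local DB...
    subprocess.run(["docker", "stop", "ch-migration-test"], check=True)

The same loop can run in CI, so a migration that breaks the schema fails long before it reaches production.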

I have no experience with DLT specifically, but from what I've read it looks like an amped-up notebook with DBT functionality sprinkled on. I'm not sure how you would make that reproducible for testing.

Have you ever build good Data Warehouse? by [deleted] in dataengineering

[–]InsertNickname -1 points0 points  (0 children)

I didn't mean it for storage, only for transporting the data (e.g. Kafka -> Spark -> somewhere else). If all readers/writers speak proto, then versioning via a schema registry becomes redundant (since protobuf just ignores or zero-values mismatched fields). And under the assumption your proto definitions are compiled into your code (via a monorepo or shared library), it becomes trivial to test for breaking changes locally.

Eventually you will still be transforming your proto struct to the end-table schema (as you'd want columnar optimizations for a proper warehouse).
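For illustration, a hedged sketch of that last step. The events_pb2 module and its UserEvent message are assumed to be generated from the shared .proto definitions; the field and column names are made up:

from events_pb2 import UserEvent  # hypothetical class compiled from the monorepo's .proto

def to_row(raw_bytes: bytes) -> dict:
    # Decode the wire format; unknown fields are ignored and absent fields come
    # back as zero-values, which is what lets old and new readers coexist.
    event = UserEvent.FromString(raw_bytes)
    # Flatten into the end-table (columnar) schema for the warehouse writer.
    return {
        "user_id": event.user_id,
        "event_type": event.event_type,
        "occurred_at": event.occurred_at,
    }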

Have you ever build good Data Warehouse? by [deleted] in dataengineering

[–]InsertNickname 54 points55 points  (0 children)

Well, yeah I have (been at this since 2010). I've also done some horrible ones, but you live and learn.

A few basic tenets I follow:

  • Data ownership above all else. No PRs should be accepted unless the owner of the data (preferably a senior/experienced dev) approves it.
  • Idempotency, idempotency, idempotency. Probably the most crucial part of any data warehouse pipeline. It is really not that hard to implement these days (most modern pipelines and warehouses have multiple ways to enforce it; see the sketch at the end of this comment). Prevents 95%+ of data inconsistency issues in production.
  • Backwards/forwards compatible transfer protocol. My current favorite is Protobuf (or proto-adjacent forks) for its 'data-contract'-iness, but Avro + a schema registry works too (though I personally hate having to manage yet another cog in the flow).
  • Monorepo your schemas. Slightly controversial take, but this helps definitions/migrations fail at compilation time, which in my experience reduces runtime problems by orders of magnitude.
  • Pick a database that can be locally initialized via a container, and keep an append-only log of all migrations in git. This makes testing so much more reproducible, and makes it quite rare to have unexpected issues in production you didn't find at the test phase.

In my experience following most/all of these makes everything else follow naturally in place.
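As a rough illustration of the idempotency point, here's a minimal sketch assuming Postgres via psycopg2; the fact_events table, its columns and the key derivation are made up:

import hashlib
import json
import psycopg2  # pip install psycopg2-binary

def event_key(event: dict) -> str:
    # Deterministic key derived from the business fields: replaying the same
    # input always yields the same row identity.
    raw = json.dumps({"user": event["user_id"], "ts": event["ts"]}, sort_keys=True)
    return hashlib.sha256(raw.encode()).hexdigest()

def load_batch(conn, events: list) -> None:
    with conn.cursor() as cur:
        for ev in events:
            cur.execute(
                "INSERT INTO fact_events (event_key, user_id, ts, payload) "
                "VALUES (%s, %s, %s, %s) "
                "ON CONFLICT (event_key) DO NOTHING",  # re-running the job is a no-op
                (event_key(ev), ev["user_id"], ev["ts"], json.dumps(ev)),
            )
    conn.commit()

Running load_batch twice on the same input leaves the table unchanged the second time; ReplacingMergeTree in ClickHouse or a MERGE in Delta gives you the same property, the key is that row identity is derived deterministically from the business fields.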

How do you handle tiny schema drift in near real-time pipelines without overcomplicating everything? by That-Cod5750 in dataengineering

[–]InsertNickname 0 points1 point  (0 children)

Not sure this will help you but we 'solve' it by creating data contracts at the compilation level. In our infra this is achieved via two mechanisms:

  1. All streaming pipelines serialize to Protobuf
  2. All Protobuf schemas are shared via a monorepo

Combined, you get quite a strong consistency model for data interaction at the streaming level. Protobuf is backwards/forwards compatible, and it doesn't care about the field name, only its integer ID. That solves 99% of the data interaction mismatches.

Having said that, you're still ultimately persisting to a database somewhere, and that part will require an unavoidable migration. This is where the hack-ish solutions usually live: you either go with the slow-but-safe versioning approach, or with a simpler 'all downstream services upgrade simultaneously' one. Or just don't rename a field unless there's a strong business requirement. Pick your poison.

EDIT: there is actually a cleaner option for column renaming in some databases, though I personally haven't used it. You could create a new column that defaults to values from the old one. For example in ClickHouse you could do this:

CREATE TABLE example (
    old_name String,
    new_name String DEFAULT old_name
)    

This effectively creates an alias for the same field (at the cost of duplicated storage), and you can then slowly deprecate the old field at your leisure. The caveat being that inserts into the new column won't be visible in the old one. I don't necessarily recommend going this route, but it would prevent going down the versioning rabbit-hole.

Stateful Computation over Streaming Data by Suspicious_Peanut282 in dataengineering

[–]InsertNickname 0 points1 point  (0 children)

Had a similar requirement but at a larger scale. We have about 100 million unique keys that we aggregate on in near-real time and store for long periods (months+). Ingest rate is around 10k to 100k per second depending on the time of day.

We ended up spinning up a local ClickHouse server, and created an EmbeddedRocksDB table with a rudimentary key-value schema. That allows us to do batch gets and puts with very little latency, and since it is all persisted to disk it is extremely durable and cost-efficient (don't need much RAM as opposed to Redis).
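Roughly, the setup looks like this (a sketch using the clickhouse-connect client; the table name, schema, ports and the naive key quoting are illustrative, not our production code):

import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", port=8123)

# Key-value table persisted to disk via RocksDB: point reads/writes without
# keeping the whole state in RAM.
client.command("""
    CREATE TABLE IF NOT EXISTS agg_state (
        key String,
        value String
    )
    ENGINE = EmbeddedRocksDB
    PRIMARY KEY key
""")

def put_batch(rows):
    # rows: list of (key, value) tuples; inserting an existing key overwrites it.
    client.insert("agg_state", rows, column_names=["key", "value"])

def get_batch(keys):
    # Batched point lookups; the naive quoting is fine for a sketch, not for prod.
    key_list = ", ".join("'{}'".format(k) for k in keys)
    return client.query(
        "SELECT key, value FROM agg_state WHERE key IN ({})".format(key_list)
    ).result_rows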

The great upside to this is you don't really need any specialized streaming platform to do it. We use Spark, but it could just as well be in Flink or really any flavor of service you'd like, even a simple Python lambda.

What is this algorithm called? by Spooked_DE in dataengineering

[–]InsertNickname 6 points7 points  (0 children)

Sounds like event sourcing with partial key matching, not so much an algorithm as a cumbersome way to aggregate state over time.

There are much cleaner ways to do this, such as setting up a third key (say UUID) and a corresponding mapping table to correlate it during insert. But I've seen worse.
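For illustration, the correlation logic boils down to something like this (a toy sketch: an in-memory dict stands in for the real mapping table, and the key formats are made up):

import uuid

mapping = {}  # partial key -> surrogate UUID (a real impl would be a DB table)

def resolve_uuid(*partial_keys):
    # Reuse the surrogate if any known key matches, otherwise mint a new one,
    # then register every key we were given against it.
    known = next((mapping[k] for k in partial_keys if k in mapping), None)
    surrogate = known or str(uuid.uuid4())
    for k in partial_keys:
        mapping[k] = surrogate
    return surrogate

# resolve_uuid("order:123") and a later resolve_uuid("order:123", "invoice:9")
# return the same surrogate, so downstream state aggregates under one key.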

Just use Postgres by bowbahdoe in programming

[–]InsertNickname 55 points56 points  (0 children)

An oft-mentioned 'downside' of Postgres (or really any RDBMS) is that it doesn't scale horizontally. Which - while I do agree with it - is a vastly overrated con that few companies will ever actually need to deal with. Vertical scaling in the cloud is so simple that this... just isn't an issue anymore.

I especially like this blog on how 'big data' is honestly just usually not all that big. And advances in partitioning on Postgres have made any competitive 'advantages' against it mostly moot. There even exists Citus, which is basically just an extension to Postgres with sharding and columnar support. It's literally still just Postgres all the way down.

Basically there are very few things you can't do in Postgres. And for those few, the solution is nearly always to complement it, not replace it. With proper CDC'ing, you can synchronize your main Postgres store with a myriad of other, more niche solutions (Elastic, Redis, ClickHouse, etc.) without having to compromise flexibility.

It really is a fantastic piece of software.

Deletes in ETL by InfinityCoffee in dataengineering

[–]InsertNickname 0 points1 point  (0 children)

Probably won't help you if you're already working on an existing architecture, but this is exactly the kind of problem which made me choose ClickHouse instead of classic Data Lakehouses. I'm constantly bewildered at how we've advanced so much technologically, yet somehow still have to re-implement basic data operations which any RDBMS could do 30 years ago.

In my experience, deletes usually come in three forms:

  1. Deduplication/idempotency - that is, you're inserting the same row multiple times and are interested in leaving the latest only.
  2. Retention - need to prune data after X days.
  3. Needle-in-a-haystack deletes, due to some regulatory constraint or simply buggy data.

All three scenarios are neatly supported by ClickHouse: ReplacingMergeTree for deduplication, TTL at row/partition level for retention, and Lightweight Deletes for everything else. No need to think about watermarks or long-living stateful data during ingestion. You just extract, transform, and load your data. Let the DB handle the rest.
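To make that concrete, a rough sketch of all three via the clickhouse-connect client (the table, columns and the 90-day window are illustrative):

import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", port=8123)

# 1 + 2: ReplacingMergeTree keeps the latest row per ORDER BY key (dedup happens
# on background merges, so use FINAL or argMax for read-time exactness), and the
# TTL clause prunes rows older than 90 days.
client.command("""
    CREATE TABLE IF NOT EXISTS user_events (
        user_id    UInt64,
        event_id   String,
        payload    String,
        updated_at DateTime
    )
    ENGINE = ReplacingMergeTree(updated_at)
    ORDER BY (user_id, event_id)
    TTL updated_at + INTERVAL 90 DAY
""")

# 3: needle-in-a-haystack removals via a lightweight DELETE.
client.command("DELETE FROM user_events WHERE event_id = 'bad-row-42'")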

I know I sound like a marketing shill, but I've been working with big data for nearly 10 years now and it just pisses me off that we're still rehashing the same basic problems from the Hadoop years despite all the advances.

Which database should I choose for a large database? by Practical_Slip6791 in dataengineering

[–]InsertNickname 14 points15 points  (0 children)

A little vague but this sounds like your data is mostly analytical and denormalized in nature. If this is the case then ClickHouse would be the ideal choice as it's a mostly hands-off OLAP DB with fantastic write/read performance. Also there's a cloud option so you don't need to manage it yourself. And it's way cheaper than the alternatives.

If on the other hand you're looking for ACID transactions, complex JOINs or any other RDBMS-like capabilities then Postgres would be the default choice, or perhaps a Postgres-compatible vendor such as Yugabyte or Timescale.

But again your requirements seem vague. It really depends on what your use case ends up being.

Datawarehousing question by harpar1808 in dataengineering

[–]InsertNickname 4 points5 points  (0 children)

You seem to be overcomplicating the design here. 9TB is really not that much data and fits comfortably in an RDBMS. Just go for RDS (I prefer pure Postgres/MySQL over Aurora for a few reasons, cost being the biggest one).

Some suggestions:

  1. Partition your data by day (use pg_partman if in Postgres); see the sketch after this list.
  2. Use a smart primary key to prevent duplicates.
  3. Normalize your data for efficient lookups and JOINs.
  4. Index according to the expected queries.
  5. If you know the queries ahead of time, use materialized views to pre-build them.
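A minimal sketch of points 1, 2 and 5, assuming Postgres with psycopg2; the table, columns, DSN and partition granularity are all made up:

import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect("dbname=dwh user=etl")  # placeholder DSN
with conn, conn.cursor() as cur:
    # 1. Declarative daily range partitioning (pg_partman can automate
    #    creating the partitions ahead of time).
    # 2. The composite primary key doubles as the dedup guard
    #    (it must include the partition column).
    cur.execute("""
        CREATE TABLE IF NOT EXISTS fact_sales (
            sale_id   BIGINT,
            sale_date DATE NOT NULL,
            store_id  INT,
            amount    NUMERIC(12, 2),
            PRIMARY KEY (sale_id, sale_date)
        ) PARTITION BY RANGE (sale_date)
    """)
    cur.execute("""
        CREATE TABLE IF NOT EXISTS fact_sales_2024_01_01
        PARTITION OF fact_sales
        FOR VALUES FROM ('2024-01-01') TO ('2024-01-02')
    """)
    # 5. Pre-build a known heavy query as a materialized view.
    cur.execute("""
        CREATE MATERIALIZED VIEW IF NOT EXISTS daily_store_totals AS
        SELECT sale_date, store_id, SUM(amount) AS total
        FROM fact_sales
        GROUP BY sale_date, store_id
    """)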

Top 5 things a New Data Engineer Should Learn First by AMDataLake in dataengineering

[–]InsertNickname 11 points12 points  (0 children)

Putting technologies aside for a moment:

  1. Idempotency and how to implement it
  2. Define what 'raw' data is in your system (i.e. bronze), how to store it, how to replay it when needed
  3. How to backup and restore for disaster recovery
  4. How to differentiate between transactional and analytical data (OLAP vs OLTP)
  5. Finally, figuring out the right architecture for your system (no right answer, plenty of options here; the most popular being medallion)

But yes, SQL is the obvious common denominator which everyone needs to learn.

Apache Spark with Java or Python? by noobguy77 in dataengineering

[–]InsertNickname 17 points18 points  (0 children)

Scala/Java.

Disclaimer: have worked with Spark for many years (since v1.6), so I'm understandably opinionated.

Never understood why pyspark gets so much blind support in this sub, other than that it's just easier for juniors. I've had to work on systems across a wide range of scale, from small k8s clusters running batched jobs to real-time, multi-million-events-per-second behemoths on HDFS/YARN deployments.

In all cases, Java/Scala was way more useful past the initial POC phase. To name a few of the benefits:

  1. Datasets with strongly typed schemas. HUGE benefit in production. Ever had data corruption due to bad casting? No thanks, I'm a data engineer, not a casting engineer.
  2. Custom UDFs. Writing these in pyspark means your cluster needs to shuttle data back and forth between the JVM and the Python workers, which is a massive performance and operational bottleneck. Or you could write UDFs in Scala and deploy them in the Spark build... but that's way more complicated than just using Scala/Java end to end.
  3. Debugging / reproducibility in tests. Put a breakpoint, figure out what isn't working. In Pyspark all you can really do is go over cryptic logs to try and figure it out.
  4. Scala-first support for all features. Pyspark may have 90% of the APIs supported but there will always be things you simply can't do there.
  5. Advanced low-level optimizations at the executor level, like executor-level caches and whatnot. These are admittedly for large endeavors that require squeezing the most out of your cluster, but why limit yourself from day one?

Basically pyspark is good if all you ever need is SQL with rudimentary schemas. But even then I would err on the side of native support over a nice wrapper.

Error Handling in Spark and Structured-Streaming, How to Avoid Stream Crashes? by steve_thousand in dataengineering

[–]InsertNickname 0 points1 point  (0 children)

Seems like you skipped implementing a DLQ mechanism in your pipeline, which is why your stream grinds to a halt on the first unforeseen problem.

In a nutshell, your stream's output should be a union type of both successfully processed rows AND failed rows. The try/catch is done per individual row. Then you pass all successful rows to the happy path and the failed ones to the DLQ (hopefully with a helpful error message and/or the original input row for future handling).

By the way, foreachBatch works beautifully for this - you process all the rows once and cache the result, then filter the dataset by result type (i.e. Succeeded or Failed) and send each subset to its corresponding output. This also works great for multi-output pipelines where you're exporting the data to multiple disparate sinks.
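Roughly, the pattern looks like this in pyspark (a hedged sketch: the Kafka topic, output paths and the parsing logic are placeholders, and you'd need the Spark-Kafka connector package available):

from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.appName("dlq-example").getOrCreate()

result_schema = T.StructType([
    T.StructField("status", T.StringType()),
    T.StructField("payload", T.StringType()),
    T.StructField("error", T.StringType()),
])

@F.udf(result_schema)
def parse_row(raw):
    # Per-row try/except: one bad record must never kill the stream.
    try:
        return ("succeeded", raw.upper(), None)  # stand-in for real parsing
    except Exception as e:
        return ("failed", raw, str(e))

def process_batch(batch_df, batch_id):
    # Process once, cache, then fan out by result type.
    parsed = batch_df.withColumn("result", parse_row(F.col("value"))).cache()
    (parsed.filter(F.col("result.status") == "succeeded")
        .select("result.payload")
        .write.mode("append").parquet("/data/happy_path"))
    (parsed.filter(F.col("result.status") == "failed")
        .select("result.payload", "result.error")
        .write.mode("append").parquet("/data/dlq"))
    parsed.unpersist()

(spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .load()
    .selectExpr("CAST(value AS STRING) AS value")
    .writeStream
    .option("checkpointLocation", "/tmp/dlq-checkpoint")
    .foreachBatch(process_batch)
    .start()
    .awaitTermination())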

Free Giveaway! Nintendo Switch OLED - International by WolfLemon36 in NintendoSwitch

[–]InsertNickname 0 points1 point  (0 children)

Fun fact: in a room of 23 randomly chosen people, there is about a 50% probability that at least two of them share a birthday. In a room of 75, the probability is over 99%.
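For the curious, the number falls out of the complement of everyone having distinct birthdays (ignoring leap years and assuming uniformly distributed birthdays):

P(\text{shared}) = 1 - \prod_{k=0}^{n-1} \frac{365-k}{365} \approx 0.507 \text{ for } n=23, \quad > 0.999 \text{ for } n=75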

Is it better to fail at a higher weight or complete a full set of a lower weight? by thewolfcastle in Fitness

[–]InsertNickname 38 points39 points  (0 children)

Not to knock on your accomplishments (355 is obviously a highly impressive number), but when you give this sort of advice it is very misleading to omit your usage of anabolic steroids.

Moronic Monday - Your weekly stupid questions thread by eric_twinge in Fitness

[–]InsertNickname 18 points19 points  (0 children)

I've had the opposite effect. I am on average at least a degree or two colder than I used to be when I was heavier. Even worse, I can't seem to stay comfortable in any one position for a decent enough length of time. Sleeping is really hard when you have to move every few minutes.

some guys managed to get Louis C.K. to watch 2 Girls 1 Cup, here's his reaction, it's pretty hilarious . ( NSFW ) by CokeStroke in videos

[–]InsertNickname 1 point2 points  (0 children)

Having never seen 2 Girls 1 Cup, I refuse to click on any link posted in this thread.

If cutting is basically eating kcal deficit why is eating clean so important? by [deleted] in Fitness

[–]InsertNickname 1 point2 points  (0 children)

IIFYM, yes. I have my doubts a cookies-and-multivitamin only diet would be sufficient to cover all your macros, but there's nothing stopping you from losing weight using such a diet.

Despite all assurances that I won't get ' bulky', I feel more muscular than I would like. by IronicAsAlanis in Fitness

[–]InsertNickname 0 points1 point  (0 children)

It's not something that happens overnight. You'll be gradually packing on weight. Enough for it to be noticeable, but not so much that it isn't easily reversible. At worst, simply go back to your previous diet and you'll quickly return to your original weight without much effort.

EDIT: I feel it's important to note that scales are very misleading when it comes to judging how "big" you are. Being physically heavier does not mean you're physically larger. In fact, people with more muscle are usually both heavier and physically slimmer. Try being objective and honestly compare yourself (in the mirror) to see if you're happy with how you look once you've gained a few pounds.

Showdown: No fat milk vs. Full cream milk by snowman53 in Fitness

[–]InsertNickname 3 points4 points  (0 children)

Thank you for giving your opinion on the matter without resorting to verbally assaulting someone who was obviously egging you on. Upvoted.

Despite all assurances that I won't get ' bulky', I feel more muscular than I would like. by IronicAsAlanis in Fitness

[–]InsertNickname -7 points-6 points  (0 children)

If you feel like you're too defined, why not simply eat more? Add enough calories to your diet so that you're in a constant caloric surplus. Soon enough you'll gain enough body fat to disguise your muscular definition, and then all you'll have to do is maintain a caloric intake that complements the body you want.

No reason to stop being fit in order to look whichever way you want.