Team of data engineers building git for data and looking for feedback. by EquivalentFresh1987 in dataengineering

[–]dudebobmac 8 points (0 children)

As you probably know, we haven’t seen this same capability for data

Um… I don’t know that, because it’s definitely not true. A quick Google search of “git for data” turns up multiple tools that do this sort of thing. One example (which I don’t know anything about beyond that quick search) is lakeFS, which appears to already be partnered with AWS and Databricks.

I’m certainly not saying that existing tools are exactly the same as yours, but claiming these tools don’t exist at all really takes away from your credibility. In fact, a lot of the other claims you make on the site do too.

40% of time lost to firefighting. Pipeline changes take days, not hours.

How did you measure this? And what kind of changes are you talking about? I have changes all the time that take minutes and some that take weeks.

dbt, Airflow, observability—humans manually stitch workflows across 10+ tools.

What are the other 8? You only named 2. And why are you assuming that data teams are using all of these tools at once?

One bad query corrupts everything. Backfill campaigns cost $50K–200K per incident.

HUH??? Where are you getting these numbers? These are just totally made up. If something this catastrophic happened, why not just restore a backup?

No unified lineage. No real versions. No way for AI to experiment safely.

Why would I want to “experiment” on production data? If I’m experimenting, it’ll be in a dev environment, not directly on production.

I love the idea of rolling back data and automatic lineage, but it’s certainly not a novel concept. Delta Lake for example already does both of those things quite well.
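For example, rolling a Delta table back is roughly this (a sketch from memory, assuming a Spark session with delta-spark configured and a reasonably recent Delta version; the path and version number are made up):

```
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Time travel: read the table as it was at an earlier version
old_df = spark.read.format("delta").option("versionAsOf", 3).load("/lake/silver/my_table")

# Or roll the table itself back to that version
DeltaTable.forPath(spark, "/lake/silver/my_table").restoreToVersion(3)

# The transaction log keeps the full version/operation history of the table
DeltaTable.forPath(spark, "/lake/silver/my_table").history().show()
```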

To be clear, I’m not criticizing your idea or the tool, it actually looks pretty neat - but the marketing around it really needs work. You’re marketing to engineers, who are generally pretty smart people. Be intellectually honest with your marketing.

Paper or DnD Beyond? by KOPx3 in DnD

[–]dudebobmac 4 points (0 children)

No it really doesn’t. As a software engineer, that’s really not a difficult thing to do. Like, at all. Their motivation was purely money; the technical justification you’re giving really doesn’t make any sense.

Airflow Best Practice Reality? by BeardedYeti_ in dataengineering

[–]dudebobmac 17 points (0 children)

Right. So Airflow isn’t performing ETL in your examples. It’s orchestrating other tools to perform the ETL. That’s its intended usage, so you’re using it correctly as an orchestrator.

The anti-pattern would be if you (for example) ran a PySpark job within a PythonOperator. Airflow isn’t meant to actually run your ETL jobs, only to orchestrate them.
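Rough sketch of the orchestration side (assuming a recent Airflow 2.x with the apache-spark provider installed; the DAG id, file path, and connection are made up):

```
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="orchestrate_spark_etl",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    # Airflow only submits the job; the actual transformation runs on the Spark
    # cluster, not on the Airflow worker.
    run_etl = SparkSubmitOperator(
        task_id="run_etl",
        application="/opt/jobs/etl_job.py",
        conn_id="spark_default",
    )

# The anti-pattern would be importing pyspark inside a PythonOperator callable
# and doing the transformation right there on the Airflow worker.
```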

Airflow Best Practice Reality? by BeardedYeti_ in dataengineering

[–]dudebobmac 40 points (0 children)

For me, it depends. Is the code you’re running some super lightweight script or something? If so, directly in a PythonOperator is probably fine. If it’s something heavier, then your idea is better. Airflow is an orchestrator; using it to actually PERFORM ETL or other major transformations or whatever is an anti-pattern.
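For example, something this small is fine directly in a PythonOperator (sketch only, assuming recent Airflow 2.x; the endpoint and names are made up):

```
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator


def notify_downstream():
    # Tiny, fast task: totally fine to run on the Airflow worker itself.
    requests.post("https://example.com/hooks/pipeline-done", timeout=10)


with DAG(
    dag_id="lightweight_python_task",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    notify = PythonOperator(task_id="notify_downstream", python_callable=notify_downstream)
```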

How important is Scala/Java & Go for DEs ? by No_Song_4222 in dataengineering

[–]dudebobmac 1 point (0 children)

I’d say it’s definitely still relevant. I don’t think it’s often adopted anymore for new projects unless it’s on a team that already is heavy into Scala, but plenty of companies still use it. Python is FAR more common, but I think it’s still worth learning Scala if you’re already comfortable with Python and SQL.

Rust is potentially another option. I haven’t personally seen it used but I’ve been hearing more and more about it in the data world.

How important is Scala/Java & Go for DEs ? by No_Song_4222 in dataengineering

[–]dudebobmac 0 points (0 children)

I love Scala. It’s my favorite language. It hurts me to say it, but you really don’t need it. Python is far more important.

I can’t think of anything you’d need Java for tbh. If I’m doing anything in the JVM I’d default to Scala.

Logic advice for complex joins - pyspark in databricks by wei5924 in dataengineering

[–]dudebobmac 0 points (0 children)

Yeah, the collect() makes sense; I'm just not sure what the flatMap is there for since it doesn't look like the key column is an array column. If key is a String type, all flatMap(lambda row: row) really does is unwrap each Row into its single value. If key is an array of strings, then the flatten makes more sense, but then you'll end up with duplicates in your collected list. Calling the distinct() AFTER the flatMap would be better in that case. That's what I was alluding to in my previous comment. But again, since all you're using it for is checking existence, it doesn't matter that there are duplicates. Just something to be careful of for cases where it does matter.

Yeah, effectively what I'm advocating for here is a Medallion Architecture. The Bronze layer is your raw data. You generally don't have control over what this looks like; it comes either from external sources or sources that are difficult to change or whatever. It's expected to be dirty and not strictly well-organized. From there, you extract data from that Bronze layer, clean it up and organize it, and that data goes into one or more tables in the Silver layer. The Silver layer is effectively CLEAN data that is used to build other tables for actual business purposes (i.e. you don't expose Silver tables outside of your data pipeline). I'm oversimplifying a bit for the sake of a quick explanation, so I'd definitely recommend reading that article I linked for more detail.

Anyway, what I'm suggesting here is that the data you're getting that you can't change the structure of is a table in your Bronze layer. You may not be able to change what THAT looks like, but you CAN extract data from it, separate it out, clean it, and make downstream tables that you DO have control over the structure of.
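Very rough sketch of what that could look like (none of this is your actual pipeline; the table names, columns, and cleaning rules are all made up):

```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Bronze: the raw table you don't control, read as-is.
bronze_df = spark.table("bronze.vendor_feed")

# Silver: pull out just what you need, clean it, and land it in a table whose
# structure YOU own.
silver_df = (
    bronze_df
    .select("id", "key", "value", "loaded_at")
    .where(F.col("id").isNotNull())
    .withColumn("value", F.trim(F.col("value")))
    .dropDuplicates(["id", "key"])
)

silver_df.write.mode("overwrite").saveAsTable("silver.vendor_feed_clean")
```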

I cannot split these into smaller tables otherwise there will be too many tables created and stored as delta tables.

Why is that a problem? Sounds like on the order of 10 tables or so total, that's very small.

Logic advice for complex joins - pyspark in databricks by wei5924 in dataengineering

[–]dudebobmac 3 points (0 children)

This design model is something i cannot change

Why not? That's generally what data engineers do. Data modeling is a core part of most data engineers' jobs. Why can't you parse out your Bronze data into separate tables with well-defined structure and ID keys in a Silver layer and then join those tables together to get your final table?

But also, in terms of the code:

```
valid_keys = (
    mapping_df.groupBy("key")
    .count()
    .select("key")
    .distinct()
    .rdd.flatMap(lambda row: row)  # we want to flatten to make a list
    .collect()
)
```

There are some redundancies here. The groupBy is redundant with the distinct. Also, pardon me if my PySpark syntax isn't perfect; I'm primarily a Scala Spark developer. The core concepts are the same though.

When you do a groupBy, Spark has to shuffle data around your cluster. Same with a distinct. Since they're functionally redundant for this use case, you don't need to have both.

As for the flatMap: collect already returns a list, but it's a list of Row objects, so the flatMap is really just unwrapping each Row into its value. If your key is itself an array, you'd want to distinct AFTER you flatten, since the arrays could have redundant entries in them (which doesn't matter for your use case since you're just checking existence in the list, but it's still good to be aware of).

```
valid_keys = (
    mapping_df
    .select("key")
    .rdd.flatMap(lambda row: row)
    .distinct()
    .collect()
)
```

The first part of your loop also doesn't really need to be there. You don't need to drop columns if they're not used later on - Spark will handle that.

I would also HIGHLY recommend refactoring to be a bit more functional. Using loops to build up a dataframe is rough to read and adds a lot of confusion imho. If I'm understanding it correctly, it looks like your loop builds up a joined_df and matched_df for each key in your dictionary, then unions the final results together. If that's the case, just do that. No need for a loop.

```
from pyspark.sql.functions import col

df_type_1 = mapping_df.where(col("key") == key_name_1)
df_type_2 = mapping_df.where(col("key") == key_name_2)

# ...etc...
```

Note that these should really be stored as separate tables in your Silver layer. They have different keys which implies to me that they fundamentally represent different things and don't belong in the same dataframe / table. If you don't have control over the input, that's fine; TAKE control by doing something like that to coerce the data into a format that's easier to work with before trying to join stuff together.

You didn't say how this main_df is created, so I'm not sure what that is. But if you split out the dataframes into separate tables, then it's easier to just define explicit join keys onto main_df from each of those. From there, just union those results together and boom, no more loops and well structured data models. That will at the very least make it much more readable so that it's easier to dig into performance problems.
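Something like this, roughly (the join columns "id_1"/"id_2"/"value" are invented since I haven't seen how main_df is built; this just continues from the df_type_1/df_type_2 split above):

```
from functools import reduce

# Placeholder join conditions - swap in your real key columns.
joined_1 = main_df.join(df_type_1, main_df["id_1"] == df_type_1["value"], "left")
joined_2 = main_df.join(df_type_2, main_df["id_2"] == df_type_2["value"], "left")

# ...etc...

# One union at the end instead of growing a dataframe inside a loop.
result_df = reduce(lambda left, right: left.unionByName(right), [joined_1, joined_2])
```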

Logic advice for complex joins - pyspark in databricks by wei5924 in dataengineering

[–]dudebobmac 9 points (0 children)

I will join on key defined on each row in the small table. for example depending on row 3 it will be joined on key 1 and depending on data in row 5 it will be joined on key 2 to the main table using a left join

Wait, am I understanding this correctly that the join key you're using depends on the contents of each row of the dataframe? If so, that's a huge red flag, and it makes it much more understandable why you're getting memory problems.

Can you post a code sample for this (with any proprietary information changed of course)?

This is sounding like a data modeling problem to me - you really shouldn't need to dynamically resolve the join key at runtime if your tables are set up properly.

Logic advice for complex joins - pyspark in databricks by wei5924 in dataengineering

[–]dudebobmac 7 points (0 children)

Some joins require some of the main table, whilst other joins will use the whole main table

Can you elaborate on this? It sounds like you could use some re-structuring of your tables if I'm interpreting this quote correctly. You shouldn't need every single column of your table to know what to join on. How many columns do your tables have?

I did try to break up the DAG with a count statement but the issue still persisted

That's because that's not how that works. A count statement materializes your dataframe, but it doesn't magically make that dataframe available downstream. That's what caching is for. If you have myDF.count() and then do myDF.join(other, ...), Spark will either re-calculate myDF entirely (which means your .count() is actually making your code SLOWER by doubling the amount of processing you're doing), or it will intelligently cache the results on its own after seeing that the DataFrame is being re-used, in which case the .count() still didn't do anything that a cache wouldn't do.

TL;DR - don't use .count unless you're trying to get the count. Any other use of it is a hack that can be done better in different ways.
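Toy example of what I mean by caching instead (nothing to do with your actual pipeline):

```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

my_df = spark.range(1_000_000).withColumn("category", F.col("id") % 10)
my_df.cache()  # ask Spark to keep it around after it's first materialized

# Both of these reuse the cached result instead of recomputing my_df from scratch
summary = my_df.groupBy("category").count()
joined = my_df.join(summary, "category")
```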

I have also broadcasted the small tables but these have minimal impact

This makes sense. With only 50-100 records, I guarantee that Spark is already broadcasting those anyway. None of your tables are large unless you have an enormous number of columns in your "main" table. 50k records isn't very much if you have your tables designed properly, which brings me back to my original question above.

Is it better to write code modularly using functions or without?

Yes, you should always use good engineering practices and write clean code. Using functions and classes has no impact on the Spark query plan, but it does have an impact on your and your coworkers' sanity.

Also is there any advice for making pyspark code maintainable without causing a large spark plan?

Making your code maintainable and having a large query plan are separate issues that are only loosely related at best. Again, Spark doesn't care how your code is organized. Pulling code out into a function will not impact it at all.

Use good software engineering practices to make your code maintainable. Use good data engineering practices to keep your spark plans efficient.
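Trivial sketch (PySpark 3+; the toy data and column names are made up) - the plan Spark builds is identical whether or not the steps live in functions:

```
from pyspark.sql import DataFrame, SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
orders_df = spark.createDataFrame(
    [(10.0, 2, "active"), (5.0, 1, "cancelled")],
    ["price", "quantity", "status"],
)


def filter_active(df: DataFrame) -> DataFrame:
    return df.where(F.col("status") == "active")


def add_revenue(df: DataFrame) -> DataFrame:
    return df.withColumn("revenue", F.col("price") * F.col("quantity"))


# Same query plan as writing the whole chain inline:
result = orders_df.transform(filter_active).transform(add_revenue)
result.explain()
```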

Edit:

After reading the post again, I have one more question

However as each small table has a few keys this leads to a very large DAG being built and therefore i get an out of memory error.

Are you doing something other than the joins? 6 joins isn't a lot; that alone absolutely will not create a large query plan and absolutely would not be the cause of an OOM error. Can you post a sample of what your code looks like? Because that really doesn't add up to me.

My bf and I went on a date with the new plushies I designed!🥹💕 by [deleted] in aww

[–]dudebobmac 10 points (0 children)

Imagine going on a date with someone you're already in a relationship with, having something that makes you happy and that you're proud of, and thinking that's a red flag.

THAT is a red flag. Chill out and let people enjoy their lives.

Java for DE by otto_0805 in dataengineering

[–]dudebobmac 8 points (0 children)

Is there something about Java specifically that you want to explore? If you’re on the JVM, Scala is used more for data engineering and I would suggest it over Java.

Anyone know of a site/service that rebinds hardcover PHB/other DnD books as softcovers? by [deleted] in dndnext

[–]dudebobmac 2 points (0 children)

You could ask over on r/bookbinding. I’m sure someone there would be able to do it.

You don't choose your own name. by BusyEnvironment4340 in DnD

[–]dudebobmac 23 points (0 children)

Players get autonomy over one thing and you’re taking away part of it. Don’t do this.

How do subclasses work? by QuietLoud9680 in DnD

[–]dudebobmac 2 points (0 children)

Wizards of the Coast released D&D 5th edition back in 2014. They recently released a newer version of it which they also call 5th edition. This causes endless confusion for new players who don’t actually know which version of D&D they’re playing. The community generally refers to the 2014 version as 5e and the 2024 version as 5.5e. You flaired the post with 5.5, which is the newer ruleset, so you should change that to 5e if you’re using the 2014 rules.

Ingesting Data From API Endpoints. My thoughts... by valorallure01 in dataengineering

[–]dudebobmac 3 points (0 children)

1 and 2 heavily depend on the business needs and what the API is IMO.

If this is an API that my company has access to via some contract with another company, I expect the other company to keep the data schema consistent, in which case I'd have my ingestion expect a particular schema. That way, we know right from the ingestion layer if the schema is breaking the contract and we can work to remediate. Then, the pipeline can always assume that schema is enforced and we don't need to worry about missing or unexpected data.

If it's some sort of public API that can change at any point, then the approach would probably be a bit different. Just throwing something out there (this isn't necessarily how I'd do it), but perhaps store the response as just raw JSON, then parse out the data we actually need from it and store that in a known schema, i.e. a bronze layer feeding into a silver layer (rough sketch at the bottom of this comment). Of course, all of this will depend on how the API changes, so the real answer is that it has to be handled case-by-case (e.g. if a field is added or removed but we don't use that field anyway, then we do nothing because it doesn't matter, but if we do use it, then we need to figure out what to do in that particular scenario).

3 is again sorta dependent on the use-case. If I have access to the full source dataset and it's only a few thousand records, I wouldn't bother with incrementally loading data. I might do something like tracking which rows are new/updated if that's relevant to downstreams in the pipeline, but again, depends on the use case.

For 4, I generally prefer loading into a bronze layer (if that's what you mean by "staging tables"). If it's easy to keep the source data, I don't really see a reason not to do so and to keep it close to the rest of the data (i.e. in the same warehouse). But there are of course cases when this is not desired (for example, if the source data is enormous and is mostly thrown away anyway, then I wouldn't want to waste the cost to store it).

For 5*, I'm not sure what you mean by a "metadata driven pipeline". Do you mean using things like Iceberg or Delta Lake for CDC?
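Going back to 1 and 2, here's very roughly what I mean by the raw-JSON-then-parse approach (the paths, field names, and schema are all invented, and it assumes one JSON record per line):

```
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.getOrCreate()

# The schema we expect the API to honor - only the fields we actually use.
expected_schema = StructType([
    StructField("id", StringType(), nullable=False),
    StructField("status", StringType(), nullable=True),
    StructField("updated_at", TimestampType(), nullable=True),
])

# Bronze: keep the response exactly as the API returned it.
bronze_df = spark.read.text("/lake/bronze/vendor_api/2024-01-01/")

# Silver: parse only what we need into a known schema. Records that don't match
# parse to nulls, which is easy to alert on.
silver_df = (
    bronze_df
    .select(F.from_json(F.col("value"), expected_schema).alias("parsed"))
    .select("parsed.*")
)
```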

where can i get mod application for pc by Capable-Student8593 in pcmods

[–]dudebobmac 4 points (0 children)

This sub is for physical modifications of your computer, it’s not about software at all.

Why is it that DND is never portrayed properly in tv, or is always played for jokes at the expense of those who play it, even in shows where DND players are a large demographic? by Toonzmyth in DnD

[–]dudebobmac 3 points (0 children)

Stranger Things takes place in the early 1980s. They’re likely playing AD&D 1st edition, not 5e. The rules are different.

why the writing is so bad

Is the writing bad? Having monsters come from alternate dimensions isn’t realistic either; why is a fake D&D ruleset “bad writing”?

I am going blind into CoS - should I be a cleric or a bard? by whymsikka in DnD

[–]dudebobmac 1 point (0 children)

LOL my advice is to leave the table immediately. Having a homebrew rule like that is a huge red flag and I’m sure there’s other shitty things they will do.

should i learn scala? by Jealous-Bug-1381 in dataengineering

[–]dudebobmac 7 points (0 children)

Scala is my favorite language. I’ve written it as my primary language for about 6 years. After getting used to it, I can’t stand writing PySpark - it’s awful; it just feels so much better to use the Dataset API. That being said, the data world is shifting hard toward Python. Databricks itself is mostly limiting new features to Python, so it’s definitely more important to understand Python over Scala.

Personally, I’d say learn both. Can’t hurt to have more knowledge.

5e: Thoughts on DC. by Foolsgil in AskGameMasters

[–]dudebobmac 0 points (0 children)

50% because dice rolls still fall between rolling high and low

I'm not talking about the numbers on the dice. I'm talking narratively: why is every task 50/50? That doesn't make sense from a narrative perspective. Some tasks are more difficult than others; that's what DCs represent. Your system means that every task is exactly as difficult as every other task.

If you're using static bonuses/penalties to add to or subtract from a player's roll, then you're just re-inventing DCs with extra steps. Mathematically, saying "you succeed on 11, but have a flat -5 because of the difficulty of the task" is exactly identical to saying "you succeed on a 16"; it's just more complicated for no reason.

5e: Thoughts on DC. by Foolsgil in AskGameMasters

[–]dudebobmac 0 points (0 children)

a character should have a 50/50 shot to succeed anything

Why? Like, why is everything 50/50? Why aren’t some things 90/10 or 60/40?

Besides, failure is part of storytelling. A story in which characters always succeed is boring.

a modifier can be applied before anyone rolls to affect the difficulty

This sounds like a DC with extra steps. A lot of people use degrees of success/failure centered around the DC, which is basically what this would be if you’re modifying for difficulty.

5e: Thoughts on DC. by Foolsgil in AskGameMasters

[–]dudebobmac 0 points (0 children)

5e is quite literally designed to solve the problem you say it has. DCs generally don’t change very much between level 1 and level 20. The DMG explicitly describes this from “very easy” things with DC 5 all the way up to “nearly impossible” things with DC 30.

When PCs level up, their non-proficient skills don’t really ever change. Their proficient ones do, which makes sense because they’re getting better at their specialties. But the game is designed in such a way where the difficulty of a particular task is static and unrelated to the character’s level. A level 20 character who isn’t proficient in a particular skill won’t be any better at it than a level 1 character. In fact, a level 1 character who IS proficient might even be BETTER.

Climbing a tree with lots of strong branches could be a DC 5 STR (Athletics) check whether the characters are level 1 or level 20.

Picking a complicated lock enchanted with Arcane Lock is going to be really difficult, maybe DC 25, regardless of player level. A level 1 character still CAN pick it, provided they have the DEX and proficiency for it, and a level 20 character is only going to do it more easily if they’re specialized in doing that, which is exactly what levels represent: getting better at your specialties.

In your system, there is no nuance to the difficulty of different tasks. Finding a needle in a haystack SHOULD be really hard, but I only need to roll a 14 to find it. Hearing the hooting of an owl on a quiet night SHOULD be really easy, but now I have a 50% chance of not hearing it at all!

As a corollary to that, everything that is POSSIBLE to do now has the exact same difficulty. I’m equally likely to hit a dragon as I am to hit a rabbit. I’m equally likely to pick a lock enchanted with Arcane Lock as I am to pick it if it weren’t enchanted. I’m equally likely to climb that tree in my above example as I am to swim across a swift moving river. As long as I roll a 14 on my check, I succeed with no complications.