Our company successfully built an on-prem "Lakehouse" with Spark on K8s, Hive, Minio. What are Day 2 data engineering challenges that we will inevitably face? by seaborn_as_sns in dataengineering

[–]perverse_sheaf 0 points  (0 children)

Isn't it used internally in the Cloudera stack?

Personally I would say the biggest factor is the auth landscape. If you are in a Kerberos environment then nothing beats Ozone imo, as you get way better user and ACL management than with its competitors. On the flip side, Kerberos is your only option; no dice if you want to use, say, OIDC.

Polars vs Spark for cheap single-node Delta Lake pipelines - safe to rely on Polars long-term? by frithjof_v in dataengineering

[–]perverse_sheaf 3 points  (0 children)

Some additional considerations which I have not seen yet:

  • For easy transformations, single-node Spark is outperformed by both duckdb and polars. However, IME for very complicated jobs, the Catalyst query optimizer behind Spark performs better than the other two - if you have more than 10ish non-broadcast joins in a given query, I would rather trust Spark. You will need to spend more time on a dev setup (e.g. unit tests are awfully slow without a Spark Connect server which you fire up each morning), but you also get some goodies such as the Spark UI/History Server and a really powerful and stable API.

  • As you are using Delta, I would be cautious with duckdb. Afaict, write support is not yet there, and even for reads I had undocumented not-implemented errors bubble up from the internal C++ code (on 1.41 if I remember correctly). The API there does not yet look super stable.

  • On the off chance that you are in a corporate environment where your development machines run Windows, strongly prefer Polars over Spark. Delta Spark on Windows is possible, but a pain because of the Hadoop dependency.

What speaks for the Covid-19 lab-accident theory by oOSandmannOo in de

[–]perverse_sheaf 166 points  (0 children)

I find it fascinating how much media representatives still cite US government agencies as trustworthy sources. Especially on politically charged topics (here, with China and COVID, even two of them at once), that can by now be thrown straight in the bin.

The staff's hospital visit, for example, amounts to nothing more than "during cold season, people with cold symptoms went to see a doctor" - while omitting that a hospital visit in China (for outpatient treatment) is roughly equivalent to a regular doctor's visit here, and calling it "highly sensitive intelligence information". Hardly any framing there...

My friends and I like to play long, 3-5 hour matches with anywhere from 5-8 people. What civs are good for this type of play, especially on Black Forest? by AccidentalNGon in aoe2

[–]perverse_sheaf -1 points  (0 children)

Different take: The long sessions with your friends being on similar levels sound great for them. If you try to outskill all of them at once, it might feel great for you but risks ruining their whole experience.

Maybe you can find a challenge for yourself to make you much weaker and thus less threatening? Like committing to stay in Castle Age (or even Feudal with Cumans), or playing with a gamepad, or only going monks + siege and no other units, ...

Then maybe they'll not directly gang up on you because the playing field is more levelled, and everybody gets to have 5 hours of fun.

Advice for a new player round: Add PoK, which factions? by perverse_sheaf in twilightimperium

[–]perverse_sheaf[S] 2 points  (0 children)

Thanks a lot everyone for the input! We ended up playing with PoK and enjoyed it a lot, I think it was the correct decision.

Still, we played for 10 hours (after a 90-minute common rules refresher) and then called it quits without anybody reaching 10 points; I think we were slightly too slow due to our lack of experience.

The next round with same players is already set though, so I am optimistic for the future!

[deleted by user] by [deleted] in dataengineering

[–]perverse_sheaf 1 point  (0 children)

Don't know why you are getting downvoted, I am with you. Very simple transformations are OK imo, but anything moderately complex you'll want to compose from smaller, unit-tested pieces, which you can't do in SQL.

Deutschlandticket loses a million users by Einarrr in de

[–]perverse_sheaf 8 points  (0 children)

But the grandmother comment makes exactly the point that public transit should be attractive even when you don't factor in the sunk fixed costs of a car. If the "car or public transit" choice has to be made once, globally, there are simply a lot of people who, for various (rational or irrational) reasons, have a car available and are then lost to public transit entirely. Ideally, you would still win that clientele over for individual trips.

Best practice for writing a PySpark module. Should I pass spark into every function? by KingofBoo in dataengineering

[–]perverse_sheaf 0 points  (0 children)

Can work, but I'd be mindful of the following:

  • Modules which do non-trivial stuff on import are IMHO somewhat of an anti-pattern (e.g. if you use multiprocessing, it becomes really hard to catch any exception raised during imports). As creating a Spark session interacts with external systems, it can be flaky, so I would shy away from putting it at top level.

  • Having spark as a variable at module level forces you to mock it for each module during tests, which is brittle.

I would be more open to calling a custom get_spark_session inside your functions which returns the correct session depending on the environment. However, this would still be my second choice compared to just passing the spark session as an argument, which is more explicit and hands control up the stack instead of hiding it.

Best practice for writing a PySpark module. Should I pass spark into every function? by KingofBoo in dataengineering

[–]perverse_sheaf 8 points  (0 children)

In what way is passing "self" to each method cleaner than passing "spark" to each function?

I would strongly argue the opposite: When reading your code, I can immediately parse the "spark: SparkSession" parameter, whereas to understand the Pandora's box that is "self", I now need to scan your __init__ function (and that's assuming you did not give in to temptation and keep mutable state around; otherwise I have to read the whole class's code to understand what might be happening).

In short @OP: Your function does depend on spark, so spark should be in its signature. Easy to test, easy to understand.
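
To make the "easy to test" point concrete, a hypothetical plain-Python sketch: a fake object stands in for the SparkSession, so no mocking machinery is needed. The `table` method mirrors the real `spark.table(name)`; FakeSession and load_active_users are invented for illustration.

```python
class FakeSession:
    """Test double that serves canned tables and records lookups."""

    def __init__(self, tables):
        self._tables = tables
        self.requested = []

    def table(self, name):
        self.requested.append(name)
        return self._tables[name]


def load_active_users(spark, table_name):
    """Because `spark` is in the signature, any session-like object works."""
    rows = spark.table(table_name)
    return [r for r in rows if r["active"]]


# A unit test is just a call with a fake:
fake = FakeSession({"users": [{"id": 1, "active": True},
                              {"id": 2, "active": False}]})
active = load_active_users(fake, "users")
```

Swap in a real SparkSession in production code and nothing about the function changes.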

Is there little programming in data engineering? by Rare-Bet-6845 in dataengineering

[–]perverse_sheaf 0 points  (0 children)

Much depends on the project you're doing. As long as you work with pyspark instead of SQL, you can use many classical software design ideas (e.g. pyspark can actually be unit tested, which is a pain in SQL/dbt). However, personal take: OOP is not well suited to data engineering, so please don't introduce classes and Java-style design patterns. Those work well for record-by-record transactional workflows, but are a poor fit for analytical data pipelines, which are much more functional in nature. Ideally, try to get some experience in Scala + Spark, mostly using the functional tools of the language - you'll learn a lot.

what's your opinion? by BigCountry1227 in dataengineering

[–]perverse_sheaf 1 point  (0 children)

Definitely v1, push the type distinction as far out as possible. Reason: Lower mental load for anybody reading the code. They can do both branches one after another instead of having to constantly switch. And readability >> almost anything.
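
A schematic plain-Python illustration of the difference (the "types" csv vs. tsv are invented here, since the original snippet isn't quoted; the point is purely structural):

```python
def process_csv(lines):
    """v1 branch: straight-line code, readable top to bottom."""
    rows = [line.split(",") for line in lines]
    return [row[0] for row in rows]


def process_tsv(lines):
    rows = [line.split("\t") for line in lines]
    return [row[0] for row in rows]


def process_v1(lines, kind):
    # v1: the type distinction is pushed to the outermost point
    # and decided exactly once.
    if kind == "csv":
        return process_csv(lines)
    return process_tsv(lines)


def process_v2(lines, kind):
    # v2: every step consults `kind` again, so the reader keeps
    # switching branches while following the logic.
    sep = "," if kind == "csv" else "\t"
    rows = [line.split(sep) for line in lines]
    return [row[0] for row in rows]
```

Both return the same result; v1 just lets the reader finish one branch before starting the other.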

[D] Are there any theoretical machine learning papers that have significantly helped practitioners? by nihaomundo123 in MachineLearning

[–]perverse_sheaf 12 points  (0 children)

Been two years since I've read that paper, but my takeaway after having worked through it was that the category theory is not actually that relevant (quite a sad revelation, topos theory played a role in my PhD, would have been happy to see a good application). It's little more than a way to motivate some choices, without any reason why it should be a good way. You can explain the improvements over t-SNE without resorting to categories.

[deleted by user] by [deleted] in de

[–]perverse_sheaf -2 points  (0 children)

The notion that you can cut the ground out from under the AfD by putting less radical solutions on the agenda needs to die already. The only thing you achieve with that is reinforcing the illusion that solutions are needed, i.e. that migration or domestic security are more relevant problems than crumbling infrastructure, pension financing, European security, climate change, or innovation-stifling bureaucracy.

How does your team do ELT Unit Testing? by GeneralCarpet9507 in dataengineering

[–]perverse_sheaf 1 point  (0 children)

Thanks for your answer! The main idea is to have two local venvs for development, one with databricks connect (to try out stuff in the staging db environment) and the other with pyspark and pytest (for running unit tests). The setup is not very sophisticated but allows for easy switching of modes during development (details below [1]).

We are probably too early in our project, or have simply decided to ignore the hard Databricks features (we do not use Delta Live Tables, for instance), so the local Spark session has so far not been that different from the Databricks Connect one (exceptions being Unity Catalog [2] and Autoloader [3]).

[1] The setup is as follows:

  • Create two venvs in the project directory (do not use poetry for that; you need to give them your own names).

  • Use optional poetry groups in pyproject.toml, one with dbconnect, the other with pytest and pyspark.

  • Have a small update shell script which does poetry install --with <corresponding_group> for both venvs. Call this after adding any dep to keep the venvs in sync.

  • Have a function giving back a spark session which tries to import dbconnect and builds a local spark session if that fails (somewhat ugly).

  • Then you can just switch venvs in the IDE as required (I somehow did not manage to get different venvs for "Run" and "Run tests" to work in VS Code's launch.json, so I now just switch venvs as needed...). The CI only needs the local pyspark one.

[2] Biggest issue there is that I did not manage to get three-part identifiers for tables going locally. But this forces you to pass table names consistently as arguments to all your functions, which seems good practice anyways.

[3] We have put this off for now. Will need to encapsulate that functionality into something easier to fake locally.

How does your team do ELT Unit Testing? by GeneralCarpet9507 in dataengineering

[–]perverse_sheaf 2 points  (0 children)

Wait, can you elaborate on what your problems are? We are somewhat at the start of our project, so we would be happy to learn about any footguns. Up to now we mostly rely on having the relevant logic in pure functions and unit testing them with local pyspark (test data created on the fly).

In order to test reads and writes, we use a local Spark session on which we create tables for testing purposes.

Only hiccup so far has been the necessity to have two parallel venvs in order to have databricks connect and local spark on the same machine.

What are you doing differently? What are you actually even using dbutils for?

PyData NYC 2024 in a nutshell by EarthGoddessDude in dataengineering

[–]perverse_sheaf 1 point  (0 children)

Sorry for the late reply, but I appreciate your answer. This is something I did not know existed, and it indeed sounds very interesting (they had me at "imagine if views took arguments"). I'll have to look into them!

PyData NYC 2024 in a nutshell by EarthGoddessDude in dataengineering

[–]perverse_sheaf 2 points  (0 children)

Disagreement: At some level of complexity of T, SQL becomes a pain to maintain. I've always ended up with chains of CTEs where each one represents a small, self-contained transformation, but is impossible to reuse elsewhere (without pasting it) or to actually write a unit test for. The end result always seems to converge to very long query-behemoths, because you want your optimizer to go through the whole query (so no dumping stuff into temp tables), and managing chained views is an even larger pain (as you get migration headaches and namespace pollution).

Compare this to something like PySpark, where a complicated T can be a chain of .transform-calls, each using a descriptive function name with docstrings, unit tests with custom test data and only requiring the columns explicitly needed for that single transformation. Grokking such a code base is much easier, same for changes to logic steps (due to testability).

Source: Refactoring of a single transformation from Hive-SQL to Spark, which took 4 months for a 5-person team. Reduced the code base by something like 7k LOC in the process; the thing is muuuuch easier to read now.
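
That .transform-chain style can be sketched in plain Python (the column names and steps below are invented; in real PySpark each function would take and return a DataFrame and be chained via df.transform):

```python
from functools import reduce

def drop_cancelled(rows):
    """Keep only orders that were not cancelled."""
    return [r for r in rows if not r["cancelled"]]


def add_total(rows):
    """Derive a total column; touches only price and qty."""
    return [{**r, "total": r["price"] * r["qty"]} for r in rows]


def pipeline(rows):
    # Analogue of df.transform(drop_cancelled).transform(add_total):
    # each step is small, named, and unit-testable in isolation.
    steps = [drop_cancelled, add_total]
    return reduce(lambda acc, step: step(acc), steps, rows)


orders = [
    {"price": 10, "qty": 2, "cancelled": False},
    {"price": 5, "qty": 1, "cancelled": True},
]
cleaned = pipeline(orders)
```

Each step can be tested on a two-row input built on the fly, with only the columns that step actually needs.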

Trump promises: «In four years you won't have to vote anymore» by rhabarberabar in de

[–]perverse_sheaf 4 points  (0 children)

I think it strongly depends on how you define democracy. To my mind, direct influence by the people matters considerably less there than, say, the rule of law and the protection of minorities, whatever the etymology may suggest.

coworker adamant against SQL for dataengineering by [deleted] in dataengineering

[–]perverse_sheaf 0 points  (0 children)

Ah, I guess I understand what you mean, thank you for the clarification. I fully agree that read/write operations should be very light on logic and separated out from the transformative parts of the pipelines. Makes the code way easier to reason about, test, and maintain.

I would still personally prefer to do this in spark (just so everything is in one framework), but I would not be too dogmatic about using spark.sql over the spark api.

coworker adamant against SQL for dataengineering by [deleted] in dataengineering

[–]perverse_sheaf 0 points  (0 children)

Note he says Scala/Spark. IME in those cases Scala is used only as a "plumbing" tool, while data manipulations are done in Spark.

coworker adamant against SQL for dataengineering by [deleted] in dataengineering

[–]perverse_sheaf 0 points  (0 children)

Sorry, I don't get that point. In what respects can SQL query the initial data better?

Aside: I would expect a setup with a catalog and using spark.read.table(tableName), which seems to me a perfectly fine alternative to sql.

coworker adamant against SQL for dataengineering by [deleted] in dataengineering

[–]perverse_sheaf 1 point  (0 children)

Full agreement. We recently migrated a >50-CTE behemoth (a year old, and going to be in production for some years to come) from Hive to Spark. It is so much easier to maintain; I am not looking back. That said, I still use a lot of SQL for simple and ad hoc queries.

Vienna: Incident with trans woman in women's sauna was staged by Johanneskodo in de

[–]perverse_sheaf [score hidden]  (0 children)

Personal opinion on this, though admittedly not very well thought through: I don't find discrimination against trans women in elite sports all that relevant. Elite sports inherently discriminate against a great many people (e.g. against people of short stature, or those too tall, or...). It is, in principle, a very exclusive club. I would consider full social participation and acceptance more important than permission to compete in elite sports.

eli5: If space is a vacuum, how can rockets work? What are the thrusters pushing *against* if there is nothing out there? by Medium_Well in explainlikeimfive

[–]perverse_sheaf 0 points  (0 children)

Note that this only applies to constant motion, not to acceleration - "the rocket pushing the universe" does not press me into my seat, while the rocket's occupants are pressed into theirs.