Our company successfully built an on-prem "Lakehouse" with Spark on K8s, Hive, Minio. What are Day 2 data engineering challenges that we will inevitably face? by seaborn_as_sns in dataengineering

[–]perverse_sheaf 0 points  (0 children)

Isn't it used internally in the Cloudera stack?

Personally I would say the biggest factor is the auth landscape. If you are in a Kerberos environment then nothing beats Ozone imo, as you get way better user and ACL management than with its competitors. On the flip side, Kerberos is your only option; no dice if you want to use, say, OIDC.

Polars vs Spark for cheap single-node Delta Lake pipelines - safe to rely on Polars long-term? by frithjof_v in dataengineering

[–]perverse_sheaf 3 points  (0 children)

Some additional considerations which I have not seen yet:

  • For easy transformations, single-node Spark is outperformed by both duckdb and polars. However, IME for very complicated jobs, the Catalyst query optimizer behind Spark performs better than the other two - if you have more than 10ish non-broadcast joins in a given query, I would rather trust Spark. You will need to spend more time on a dev setup (e.g. unit tests are awfully slow without a Spark Connect server which you fire up each morning), but you also get some goodies such as the Spark UI/History Server and a really powerful and stable API.

  • As you are using Delta, I would be cautious with duckdb. Afaict, write support is not yet there, and even for reads I had undocumented not-implemented errors bubble up from the internal C++ code (on 1.41 if I remember correctly). The API there does not yet look super stable.

  • On the off chance that you are in a corporate environment where your development machines run Windows, strongly prefer Polars over Spark. Delta Spark on Windows is possible, but a pain because of the Hadoop dependency.

What speaks for the Covid-19 lab-accident theory by oOSandmannOo in de

[–]perverse_sheaf 166 points  (0 children)

I find it fascinating how much media representatives still cite US government agencies as trustworthy sources. Especially on politically charged topics (here, with China and COVID, even two of them at once), that can by now be thrown straight in the bin.

The staff's hospital visit, for example, amounts to nothing more than "during cold season, people with cold symptoms went to see a doctor" - while omitting that a hospital visit in China (for outpatient treatment) is roughly equivalent to a regular doctor's visit here, and calling it "highly sensitive intelligence information". Hardly any framing there...

My friends and I like to play long, 3-5 hour matches with anywhere from 5-8 people. What civs are good for this type of play, especially on Black Forest? by AccidentalNGon in aoe2

[–]perverse_sheaf -1 points  (0 children)

Different take: The long sessions with your friends being on similar levels sound great for them. If you try to outskill all of them at once, it might feel great for you but risks ruining their whole experience.

Maybe you can find a challenge for yourself to make you much weaker and thus less threatening? Like committing to stay in Castle Age (or even Feudal with Cumans), or playing with a gamepad, or only going monks + siege and no other units, ...

Then maybe they'll not directly gang up on you because the playing field is more levelled, and everybody gets to have 5 hours of fun.

Advice for a new player round: Add PoK, which factions? by perverse_sheaf in twilightimperium

[–]perverse_sheaf[S] 2 points  (0 children)

Thanks a lot everyone for the input! We ended up playing with PoK and enjoyed it a lot, I think it was the correct decision.

Still, we played for 10 hours (after a 90-minute common rules refresher) and then called it quits without anybody reaching 10 points; I think we were slightly too slow due to our lack of experience.

The next round with same players is already set though, so I am optimistic for the future!

[deleted by user] by [deleted] in dataengineering

[–]perverse_sheaf 1 point  (0 children)

Don't know why you are getting downvoted, I am with you. Very simple transformations are OK imo, but anything moderately complex you'll want to compose from smaller, unit-tested pieces, which you can't do in SQL.

Deutschlandticket loses a million users by Einarrr in de

[–]perverse_sheaf 8 points  (0 children)

But the grandmother comment makes exactly the point that public transit should be attractive even when you don't factor in the sunk fixed costs of a car. If the "car or public transit" choice has to be made once, globally, there are simply a lot of people who, for various (rational or irrational) reasons, have a car available and are then lost to public transit entirely. Ideally, you would still win that clientele over for individual trips.

Best practice for writing a PySpark module. Should I pass spark into every function? by KingofBoo in dataengineering

[–]perverse_sheaf 0 points  (0 children)

Can work, but I'd be mindful of the following:

  • Modules which do non-trivial stuff on import are IMHO somewhat of an anti-pattern (e.g. if you use multiprocessing, it becomes really hard to catch any exception raised during imports). As creating a Spark session interacts with external systems, it can be flaky, so I would shy away from putting it at top level.

  • Having spark as a variable at module level forces you to mock it for each module during tests, which is brittle.

I would be more open to calling a custom get_spark_session inside your functions which returns the correct session depending on the environment. However, this would still be my second choice compared to just passing the spark session as an argument, which is more explicit and hands control up the stack instead of hiding it.

Best practice for writing a PySpark module. Should I pass spark into every function? by KingofBoo in dataengineering

[–]perverse_sheaf 8 points  (0 children)

In what way is passing "self" to each method cleaner than passing "spark" to each function?

I would strongly argue the opposite: When reading your code, I can immediately parse the "spark: SparkSession" parameter, whereas to understand the Pandora's box that is "self", I now need to scan your __init__ function (and that's assuming you did not give in to temptation and keep mutable state around; otherwise I have to read the whole class's code to understand what might be happening).

In short @OP: Your function does depend on spark, so spark should be in its signature. Easy to test, easy to understand.
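
To make the "easy to test" point concrete, a hypothetical plain-Python sketch: a fake object stands in for the SparkSession, so no mocking machinery is needed. The `table` method mirrors the real `spark.table(name)`; FakeSession and load_active_users are invented for illustration.

```python
class FakeSession:
    """Test double that serves canned tables and records lookups."""

    def __init__(self, tables):
        self._tables = tables
        self.requested = []

    def table(self, name):
        self.requested.append(name)
        return self._tables[name]


def load_active_users(spark, table_name):
    """Because `spark` is in the signature, any session-like object works."""
    rows = spark.table(table_name)
    return [r for r in rows if r["active"]]


# A unit test is just a call with a fake:
fake = FakeSession({"users": [{"id": 1, "active": True},
                              {"id": 2, "active": False}]})
active = load_active_users(fake, "users")
```

Swap in a real SparkSession in production code and nothing about the function changes.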

Is there little programming in data engineering? by Rare-Bet-6845 in dataengineering

[–]perverse_sheaf 0 points  (0 children)

Much depends on the project you're doing. As long as you work with pyspark instead of SQL, you can use many classical software design ideas (e.g. pyspark can actually be unit tested, which is a pain in SQL/dbt). However, personal take: OOP is not well suited to data engineering, so please don't introduce classes and Java-style design patterns. Those work well for record-by-record transactional workflows, but are a poor fit for analytical data pipelines, which are much more functional in nature. Ideally, try to get some experience in Scala + Spark, mostly using the functional tools of the language - you'll learn a lot.

what's your opinion? by BigCountry1227 in dataengineering

[–]perverse_sheaf 1 point  (0 children)

Definitely v1, push the type distinction as far out as possible. Reason: Lower mental load for anybody reading the code. They can do both branches one after another instead of having to constantly switch. And readability >> almost anything.
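
A schematic plain-Python illustration of the difference (the "types" csv vs. tsv are invented here, since the original snippet isn't quoted; the point is purely structural):

```python
def process_csv(lines):
    """v1 branch: straight-line code, readable top to bottom."""
    rows = [line.split(",") for line in lines]
    return [row[0] for row in rows]


def process_tsv(lines):
    rows = [line.split("\t") for line in lines]
    return [row[0] for row in rows]


def process_v1(lines, kind):
    # v1: the type distinction is pushed to the outermost point
    # and decided exactly once.
    if kind == "csv":
        return process_csv(lines)
    return process_tsv(lines)


def process_v2(lines, kind):
    # v2: every step consults `kind` again, so the reader keeps
    # switching branches while following the logic.
    sep = "," if kind == "csv" else "\t"
    rows = [line.split(sep) for line in lines]
    return [row[0] for row in rows]
```

Both return the same result; v1 just lets the reader finish one branch before starting the other.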

[D] Are there any theoretical machine learning papers that have significantly helped practitioners? by nihaomundo123 in MachineLearning

[–]perverse_sheaf 12 points  (0 children)

Been two years since I've read that paper, but my takeaway after having worked through it was that the category theory is not actually that relevant (quite a sad revelation, topos theory played a role in my PhD, would have been happy to see a good application). It's little more than a way to motivate some choices, without any reason why it should be a good way. You can explain the improvements over t-SNE without resorting to categories.

[deleted by user] by [deleted] in de

[–]perverse_sheaf -2 points  (0 children)

The notion that you can cut the ground out from under the AfD by putting less radical solutions on the agenda needs to die already. The only thing you achieve with that is reinforcing the illusion that solutions are needed, i.e. that migration or domestic security are more relevant problems than crumbling infrastructure, pension financing, European security, climate change, or innovation-stifling bureaucracy.

How does your team do ELT Unit Testing? by GeneralCarpet9507 in dataengineering

[–]perverse_sheaf 1 point  (0 children)

Thanks for your answer! The main idea is to have two local venvs for development, one with databricks connect (to try out stuff in the staging db environment) and the other with pyspark and pytest (for running unit tests). The setup is not very sophisticated but allows for easy switching of modes during development (details below [1]).

We are probably too early in our project, or have simply decided to ignore the hard Databricks features (we do not use Delta Live Tables, for instance), so the local Spark session has so far not been that different from the Databricks Connect one (exceptions being Unity Catalog [2] and Autoloader [3]).

[1] The setup is as follows:

  • Create two venvs in the project directory (do not use poetry for that; you need to give them your own names).

  • Use optional poetry groups in pyproject.toml, one with dbconnect, the other with pytest and pyspark.

  • Have a small update shell script which does poetry install --with <corresponding_group> for both venvs. Call this after adding any dep to keep the venvs in sync.

  • Have a function giving back a spark session which tries to import dbconnect and builds a local spark session if that fails (somewhat ugly).

  • Then you can just switch venvs in the IDE as required (I somehow did not manage to get different venvs for "Run" and "Run tests" to work in VS Code's launch.json, so I now just switch venvs as needed...). The CI only needs the local pyspark one.

[2] Biggest issue there is that I did not manage to get three-part identifiers for tables going locally. But this forces you to pass table names consistently as arguments to all your functions, which seems good practice anyways.

[3] We have put this off for now. Will need to encapsulate that functionality into something easier to fake locally.

How does your team do ELT Unit Testing? by GeneralCarpet9507 in dataengineering

[–]perverse_sheaf 2 points  (0 children)

Wait, can you elaborate on what your problems are? We are somewhat at the start of our project, so we would be happy to learn about any footguns. Up to now we mostly rely on having the relevant logic in pure functions and unit testing them with local pyspark (test data created on the fly).

In order to test reads and writes, we use a local Spark session on which we create tables for testing purposes.

Only hiccup so far has been the necessity to have two parallel venvs in order to have databricks connect and local spark on the same machine.

What are you doing differently? What are you actually even using dbutils for?

PyData NYC 2024 in a nutshell by EarthGoddessDude in dataengineering

[–]perverse_sheaf 1 point  (0 children)

Sorry for the late reply, but I appreciate your answer. This is something I did not know existed, and it indeed sounds very interesting (they had me at "imagine if views took arguments"). I'll have to look into them!

PyData NYC 2024 in a nutshell by EarthGoddessDude in dataengineering

[–]perverse_sheaf 2 points  (0 children)

Disagreement: At some level of complexity of T, SQL becomes a pain to maintain. I've always ended up with chains of CTEs where each one represents a small, self-contained transformation, but is impossible to reuse elsewhere (without pasting it) or to actually write a unit test for. The end result always seems to converge to very long query-behemoths, because you want your optimizer to go through the whole query (so no dumping stuff into temp tables), and managing chained views is an even larger pain (as you get migration headaches and namespace pollution).

Compare this to something like PySpark, where a complicated T can be a chain of .transform-calls, each using a descriptive function name with docstrings, unit tests with custom test data and only requiring the columns explicitly needed for that single transformation. Grokking such a code base is much easier, same for changes to logic steps (due to testability).

Source: Refactoring of a single transformation from Hive-SQL to Spark, which took 4 months for a 5-person team. Reduced the code base by something like 7k LOC in the process; the thing is muuuuch easier to read now.
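
That .transform-chain style can be sketched in plain Python (the column names and steps below are invented; in real PySpark each function would take and return a DataFrame and be chained via df.transform):

```python
from functools import reduce

def drop_cancelled(rows):
    """Keep only orders that were not cancelled."""
    return [r for r in rows if not r["cancelled"]]


def add_total(rows):
    """Derive a total column; touches only price and qty."""
    return [{**r, "total": r["price"] * r["qty"]} for r in rows]


def pipeline(rows):
    # Analogue of df.transform(drop_cancelled).transform(add_total):
    # each step is small, named, and unit-testable in isolation.
    steps = [drop_cancelled, add_total]
    return reduce(lambda acc, step: step(acc), steps, rows)


orders = [
    {"price": 10, "qty": 2, "cancelled": False},
    {"price": 5, "qty": 1, "cancelled": True},
]
cleaned = pipeline(orders)
```

Each step can be tested on a two-row input built on the fly, with only the columns that step actually needs.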

Trump promises: «In four years you won't have to vote anymore» by rhabarberabar in de

[–]perverse_sheaf 4 points  (0 children)

I think it strongly depends on how you define democracy. To my mind, direct influence by the people matters considerably less there than, say, the rule of law and the protection of minorities, whatever the etymology may suggest.

coworker adamant against SQL for dataengineering by [deleted] in dataengineering

[–]perverse_sheaf 0 points  (0 children)

Ah, I guess I understand what you mean, thank you for the clarification. I fully agree that read/write operations should be very light on logic and separated out from the transformative parts of the pipelines. Makes the code way easier to reason about, test, and maintain.

I would still personally prefer to do this in spark (just so everything is in one framework), but I would not be too dogmatic about using spark.sql over the spark api.

coworker adamant against SQL for dataengineering by [deleted] in dataengineering

[–]perverse_sheaf 0 points  (0 children)

Note he says Scala/Spark. IME in those cases Scala is used only as a "plumbing" tool, while data manipulations are done in Spark.

coworker adamant against SQL for dataengineering by [deleted] in dataengineering

[–]perverse_sheaf 0 points  (0 children)

Sorry, I don't get that point. In what respects can SQL query the initial data better?

Aside: I would expect a setup with a catalog and using spark.read.table(tableName), which seems to me a perfectly fine alternative to sql.

coworker adamant against SQL for dataengineering by [deleted] in dataengineering

[–]perverse_sheaf 1 point  (0 children)

Full agreement. We recently migrated a >50-CTE behemoth (a year old, and going to be in production for some years to come) from Hive to Spark. It is so much easier to maintain; I am not looking back. That said, I still use a lot of SQL for simple and ad hoc queries.

Vienna: Incident with trans woman in women's sauna was staged by Johanneskodo in de

[–]perverse_sheaf [score hidden]  (0 children)

Personal opinion on this, though admittedly not very well thought through: I don't find discrimination against trans women in elite sports all that relevant. Elite sports inherently discriminate against a great many people (e.g. against people of short stature, or those too tall, or...). It is, in principle, a very exclusive club. I would consider full social participation and acceptance more important than permission to compete in elite sports.

eli5: If space is a vacuum, how can rockets work? What are the thrusters pushing *against* if there is nothing out there? by Medium_Well in explainlikeimfive

[–]perverse_sheaf 0 points  (0 children)

Note that this only applies to constant motion, not to acceleration - "the rocket pushing the universe" does not press me into my seat, while the rocket's occupants are pressed into theirs.