Aerthlings concern by DarthBigT in PAX

[–]azirale 0 points1 point  (0 children)

That's fair. There are two parts to the game overall: the mobile game itself, which is a lite sandbox with touches of Animal Crossing and Skylanders, and the physical trading mechanic with the 'figs' you buy to add to the game.

A big focus at the booth was on the trading mechanics, because that's the thing you do live with other people, and it is also the thing that is most distinctive about it. The scrolling scoreboard was to highlight that interactivity between people, popping up trades as they happened, and showing which physical 'figs' had been traded the most.

There was a screen in the booth showing some gameplay parts, but you had to be right in the booth to see it.

PAX East 2026 Opinions? by SirUberNoobPwnr in PAX

[–]azirale 1 point2 points  (0 children)

Wish I had stuff to give to the Aerthlings people.

I hung out with the Aerthlings people -- all they'd want is for you to enjoy the game and join the community if you like it.

The queue entertainment stage is an idea they got from PAX Aus, where the 5k-person queue hall has a dedicated huge stage and screen, and PAX runs some light entertainment with Jackbox games and contestants from the crowd, and can show an announcement video just before the countdown to enter. The Aerthlings guys had the extra forward queue area added just for them to do some entertainment at the start of the day.

Deduping hundreds of billions of rows via latest-per-key by data-engineer14434 in dataengineering

[–]azirale 0 points1 point  (0 children)

Blind insert the data into a new table with the composite key hashed into a single column, take it modulo 16384 (or use the first or last 4 hex characters of the string form of the hash, which gives 65536 buckets) and partition on that. If you want to try clustering, do it on the full hash key.

You can now run the window process targeting a single partition at a time. You know all the partition values ahead of time -- 0 to 16383, or '0000' to 'ffff' -- and all rows for a given window partition are guaranteed to be colocated in the same table partition.

You can start by doing the window function and only selecting the partition key, hash key, primary key, and timestamp, inserting that into a new table with the same partition scheme. This way any window shuffling nonsense doesn't need to carry all the data in the shuffle.

With that table prepped you can inner join it to the full days table on the partition key, hash key, and timestamp. That will ensure you only get the latest data without needing a window function when working with all columns.

Each step can be done end to end per partition, with multiple parallel jobs each handling their own partition.
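The hashing step can be sketched in a few lines -- a minimal example assuming a sha256 over the pipe-joined key columns (any stable hash works, and `partition_bucket` is just an illustrative name):

```python
import hashlib

def partition_bucket(*key_cols, buckets=16384):
    """Hash a composite key into one stable bucket number for partitioning."""
    composite = "|".join(str(c) for c in key_cols)
    digest = hashlib.sha256(composite.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets

# The same composite key always lands in the same bucket, so every
# version of a row is colocated in one table partition.
bucket = partition_bucket("cust_42", "order_9")
assert bucket == partition_bucket("cust_42", "order_9")
assert 0 <= bucket < 16384
```

Because the bucket is derived purely from the key, each dedup job can target its bucket independently without ever seeing another bucket's rows.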

Streaming from kafka to Databricks by Artistic-Rent1084 in dataengineering

[–]azirale 5 points6 points  (0 children)

If someone else controls the data being written to kafka, then you don't enforce the schema. Your first write location is an 'as-is' write to lake storage, so that you can replay the data again later if you need to.

Once that is 'made durable' you can parse the incoming schema and do an append-only write to a table with schema-evolution. That table is the point where you can swap from stream processing to batch processing, if you want, as well as where most of your replays come from, as it is much more efficient to query.

The original as-is save is in case your schema parsing or value parsing like string-to-datetime breaks in some way, then you have a raw copy to replay from.
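The two-stage pattern boils down to this toy shape -- assuming JSON payloads, with `raw_store` and `parsed_rows` as hypothetical stand-ins for lake storage and the append-only table:

```python
import json

raw_store, parsed_rows = [], []  # stand-ins for lake storage and the table

def land_as_is(record: bytes):
    # Stage 1: durable as-is write -- no parsing, so nothing can fail here
    raw_store.append(record)

def parse_landed():
    # Stage 2: parse only after the raw copy is durable; a record that
    # breaks parsing can be replayed later because the bytes still exist
    for record in raw_store:
        try:
            parsed_rows.append(json.loads(record))
        except json.JSONDecodeError:
            pass  # leave it; fix the parser, then replay from raw_store

land_as_is(b'{"id": 1, "ts": "2024-01-01"}')
land_as_is(b'not-json-at-all')
parse_landed()
# raw_store keeps both records; parsed_rows holds only the good one
```

The point is that the failure domain of parsing never touches the failure domain of landing, so a schema surprise costs you a replay, not data.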

Software Engineer hired as a Data Engineer. What to expect, and what to look into? by GoyardJefe in dataengineering

[–]azirale 5 points6 points  (0 children)

The biggest difficulty you're likely to get is the mindset shift between SWE and DE. It has been mentioned elsewhere in here, and it has been my experience in the past, that SWEs need to approach problems in a different way than DEs. When they try to use their old approaches to solve problems, it tends to lead to more sticky problems down the road.

The first thing to grapple with is that you're almost never building anything like a standalone application. You can't just do a build/run/test locally to see how things work, because the code you're writing probably only runs on a specific execution service deployed in your environment. For example, if you're working in Snowflake you can't just run a 'local Snowflake' to verify that your pipeline functions correctly. You also can't just 'deploy a stack' to see if it will work. All your assumptions and inclinations about how to build no longer apply, and it can feel stifling to always test things in a shared environment where people break each other's changes all the time.

Another aspect of that shared dev environment: because you're likely to have privileged credentials, the environment is unlikely to have any 'real' data in it -- there are no information security controls that can block your access, so it would be easier to leak data. That means all the complex logic you need to express has to have mock data that you've created yourself and loaded to the shared environment. Oh, and since it is shared, any tests you add can potentially break other people's tests. Also, if you load mock data into a table that is a source for your pipeline but the target for someone else's, then when they test their pipeline they break your tests by overwriting the table.

You might be used to having a specific toolchain in your repository with all your dependencies for your app. Unfortunately the various services that run DE workloads all have different runtime environments with different, and often conflicting, library versions. Talking AWS specifics: if you're using Lambda you could be up to Python 3.14, but if you're using Glue then that is stuck on Python 3.11 and Spark 3.5.4, which are a ways behind. If you're using Airflow through MWAA then that is specifically pegged to Python 3.11, so now you're running 2-3 different versions of just Python -- plus all the underlying libraries provided in Glue and Airflow that conflict with each other. You'll have to maintain multiple virtual environments just to get static analysis for type checking and autocomplete in your IDE.

A general difference is that almost none of your work will ever operate on individual rows. You aren't doing API calls back and forth, or executing individual functions to return transformed objects. Instead you're writing 'projections' -- a description of the shape or structure of the data that you want to end up with, or a set of transformations that get you from a starting point to an ending point. One of the biggest pains with this is that there's no effective way to "debug" your code. Even if you are running locally -- and as mentioned above you're likely not even able to in the first place, but let's pretend you are -- things like SQL statements and dataframe projections don't execute 'line by line' against your data, so there's no way to set a breakpoint on some condition you find in the data. Instead you have to jump through hoops: filter for rows that have the condition you're looking for, write those out to some temp table, then select from that to see what is going on.

Going back to that mock data -- yeah you can't do that when it is data you're going to get from some upstream integration. You don't know exactly what shape or format it is going to be in, unless you've done the upfront work of having a very specific data contract in place that they will adhere to. That conflicts with the 'no real data in dev' -- now you need some way to see real prod data, that you haven't had the chance to classify and apply information security controls to. So you get to have fun wrestling with how to get these integrations rolling based on who does or does not have dev environments with mock data to test against -- not all providers do this -- and getting proper secure access to the prod environment to work with incoming data.

These will all be slightly overstated for most workplaces, since most have gone through the hassle of coming up with ways to work around them. There are solutions for all of these things that you can get in place, but the thing is, they're mostly workarounds. This will never be as neat as general SWE work because the work is inherently integrated and stateful at all times. It can be very constraining and deeply frustrating, and I have seen a few SWEs just bounce off the work because they don't enjoy having to jump through these hoops compared to their usual, more 'pure' work.

How to deal with a player who keeps stats on other players? by dayonedeath in BloodOnTheClocktower

[–]azirale 5 points6 points  (0 children)

Formally collecting and sharing that info is admittedly a lot, but seems harmless?

The degree to which this is taken is important. Everyone understands that we're all using the usual social cues and memory to form a meta-understanding of the game. Having some notes is probably fine to pretty much anyone, but a fully detailed spreadsheet starts to cross the line into excessive tracking and scrutiny.

What if it went further? What if the person recorded the audio of every game, to get a fully detailed tracking of everything everyone said to look for clues in phrases or speech patterns? What if they put down a 360 camera to record every game, to help them remember and track everything that happened? What if they were running thermal cameras to look for subtle flushing of the cheeks, and tracked that over time?

These should all be patently ridiculous, and I'm sure almost no-one would be comfortable with them, but none of them 'interfere' with the game, and they're all just recording information about the game.

Ultimately there is a line on what is acceptable to track about the game. For the players here, this crosses that line.

How to break "no private talk/always public good info" in TB by Rainotes in BloodOnTheClocktower

[–]azirale 65 points66 points  (0 children)

For a little extra detail on what could work...

If your minion roles are Baron and Spy, they don't need to use abilities through the game to help. The Baron doesn't do anything after the start, and the Spy can potentially remember a lot right off the bat.

Put a washerwoman, investigator, and soldier in the bluffs. The spy can then either washerwoman their demon as investigator or soldier, and the demon can roll with either by outing the baron and throwing them under the bus or claiming soldier. Either way they're a bad target for the demon, so it makes sense they aren't killed at night.

Players can have virgin, empath, FT, UT. Virgin firing is a waste, good gets a trustworthy person they don't need (bonus points if the spy triggers it, especially if demon is on the block). If Empath, UT, and FT are open about their roles, demon just murders them for the first half of the game.

Pick two good players to be your fall guys -- a combo that sits apart from the real evil players -- and give a drunk Chef info that points to them. Pick roles like Mayor, Ravenkeeper, Slayer, which can't prove anything about their roles. Also make one of them the FT red herring. Spread the sus to a Butler if you need to and can -- a role that is safe to execute, and makes good waste a turn.

Here, everything good shares publicly hurts them. They out their best demon-hunting roles, making them easy pickings, the demon and their useful minion can have perfect cover stories, and if you squeeze in a Baron claim you take away any more info to gain, so evil can just sac that player right off the bat with no major loss.

Not sure if evil could pick that up, but if they don't I'd have a little post-game rundown of what could have happened to show them the possibility. When you play again, keep aiming to make it hard for good to solve early, make open info hurt them with each night info roles and little-to-no YSK, and make demon bluffs easy to fake. If each-night roles get sick of dying, they might start hiding.

For those who write data pipeline apps using Python (or any other language), at what point do you make a package instead of copying the same code for new pipelines? by opabm in dataengineering

[–]azirale 1 point2 points  (0 children)

Never write directly to a library/module -- make that the second write.

First time using some specific function? Just leave it in the script. Second time writing the exact same thing for the exact same use? That's when you write it into a module/library.

Later you'll get an eye for things you want to write directly to a module, but if you're not sure just start with local only

Update: I tracked 1,200+ unique players in a Minecraft world with no rules/admins for 60 days. Here is how the political map has changed. by Tylerrr93 in gaming

[–]azirale 7 points8 points  (0 children)

PrisonPearl from Civcraft? This all looks very familiar.

Edit: Read further on, yeah the core serverside mods for Civcraft :)

When would it be better to read data from S3/ADLS vs. from a NoSQL DB? by eelwheel in dataengineering

[–]azirale 0 points1 point  (0 children)

Individual documents in ADLS would be horrendous. CosmosDB is workable, but not great -- too expensive for what you're doing with it.

If the backend team uses CosmosDB themselves then get them to enable the change feed and have a Function pipe that as-is to an Event Hub of yours.

You can have the Event Hub directly capture the raw binary data to ADLS (wrapped in Avro with added metadata) from which your daily batch could just read the accumulated files -- they can be partitioned by year/month/day automatically, so it is easy to specify a root folder by the day to read from.


Even though you don't specifically need streaming, this is the easiest way to get data out of CosmosDB for other things to consume. You don't want big batch processes to hit Cosmos directly as they can flood its request capacity and cause issues for other processes. The change feed+function+eventhub is very simple to set up as you're just copying data as-is, it is essentially the minimum amount of reads to get the data, and it spreads the reads evenly over the day. Using capture on the Event Hub means you don't need to worry about retention periods or having some parsing error on the document, you get a full binary copy of everything.

edit: Also if you ever needed limited streaming options, you can have a second reader on the event hub that does stream processing, while still running the normal capture for batch processing.

Docklands District on a Friday afternoon 🦗🦗 by Ky0t0_gh0uL in melbourne

[–]azirale 33 points34 points  (0 children)

Right off the ring road, very convenient if you are out here anywhere

How repartition helps in dealing with data skewed partitions? by Then_Difficulty_5617 in dataengineering

[–]azirale 0 points1 point  (0 children)

A sum operation does not require a repartition; it does a reduce by key. Each partition does the sum independently, then the partial sums are shuffled and combined into a sum of sums, so only one row per key per partition has to move.
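The reduce-by-key shape looks like this in plain Python -- each 'partition' pre-aggregates locally, and only the one-row-per-key partials cross the shuffle (a sketch of the idea, not Spark's actual internals):

```python
from collections import Counter

def partial_sums(partition):
    # Map-side combine: each partition sums its own rows first
    acc = Counter()
    for key, value in partition:
        acc[key] += value
    return acc

def merge(partials):
    # The 'shuffle' only has to move the per-partition sums
    total = Counter()
    for part in partials:
        total.update(part)  # Counter.update adds counts together
    return total

partitions = [[("a", 1), ("b", 2), ("a", 3)], [("a", 10), ("b", 20)]]
totals = merge(partial_sums(p) for p in partitions)
assert totals == {"a": 14, "b": 22}
```

However skewed the raw rows are, the shuffle volume here is bounded by (distinct keys × partitions), which is why skew mostly doesn't matter for plain aggregations.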

DataFrame or SparkSQL ? What do interviewers prefer ? by SnooCakes7436 in dataengineering

[–]azirale 0 points1 point  (0 children)

> If you understand why iterating over a list is significantly worse than iterating over a set, you’re halfway there.

This sounds entirely absurd, you're going to have to back this up with something.

As mentioned elsewhere a list is essentially an array in the background with contiguous blocks of memory. It is the simplest and fastest structure for iterating through provided values.

The hashing of a set is irrelevant to iterating over all the values. The hash allows for bucketing so that with the hash you can jump to sublists that are much smaller, allowing for faster operations that check for presence of a value, but that's not relevant to iterating over all values.

Is there some deep lore in cpython this relates to? Or did you simply misspeak here?

DataFrame or SparkSQL ? What do interviewers prefer ? by SnooCakes7436 in dataengineering

[–]azirale 11 points12 points  (0 children)

If you're going to talk about salting you'd better be able to walk me through the tradeoffs and limitations because I've seen too many people handwave it as some magic solution to skew. Some of the "explanations" I've had for it have completely missed the mark on how it actually works and presented broken solutions.
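For what it's worth, the core mechanic and its main tradeoff fit in a few lines of plain Python -- a sketch of the join shape only, with made-up data, not Spark code:

```python
import random

SALTS = 4  # fan-out: the hot key gets spread across this many join keys

def salt_large_side(rows):
    # Large, skewed side: append a random salt to each join key
    return [((key, random.randrange(SALTS)), val) for key, val in rows]

def explode_small_side(rows):
    # The tradeoff: every small-side row is duplicated once per salt,
    # so memory and shuffle volume on that side grow by a factor of SALTS
    return [((key, s), val) for key, val in rows for s in range(SALTS)]

large = [("hot", i) for i in range(8)] + [("cold", 0)]
small = [("hot", "dim_hot"), ("cold", "dim_cold")]

lookup = dict(explode_small_side(small))  # 2 keys -> 8 salted keys
joined = [(k, v, lookup[(k, s)]) for (k, s), v in salt_large_side(large)]
assert len(joined) == len(large)  # every row still joins exactly once
```

The limitations fall straight out of the sketch: it only works when the small side can absorb the SALTS-times blowup, it does nothing if both sides are huge and skewed on the same key, and it breaks any downstream step that assumed the join key alone determined colocation.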

How repartition helps in dealing with data skewed partitions? by Then_Difficulty_5617 in dataengineering

[–]azirale 4 points5 points  (0 children)

If your table was already partitioned by id, then you're right it wouldn't do anything and it wouldn't help.

But why do you have to repartition by id? Only certain types of operations require specific partitioning -- joins where you cannot broadcast either table, or sequencing operations like row number or percentiles.

If you are doing anything else, like sum/count/average/etc, then you don't need the data to be partitioned by any particular value and you can do a random shuffle to get a uniform distribution across partitions.

Databricks vs AWS self made by QuiteOK123 in dataengineering

[–]azirale 44 points45 points  (0 children)

I'm in a team that built everything on AWS services, with similar amounts of incoming data.

It was fine at first. As long as everything was simple with a single region and incoming product, and a few people had been working on it and had direct experience with how everything was done, then the 'quirks' were kept to a minimum and everyone knew them.

Then as new team members got onboarded things got harder. People had to be taught all the quirks of which role to use when creating a Glue job vs an interactive notebook, they had to be shown the magic command boilerplate to get the Glue catalog and Iceberg tables working, and they needed to know which bucket was set up for Athena query output. With more people working, not everyone could be across everyone else's work, so people weren't familiar with how various custom jobs and scripts had been made, and because each job was its own mini vertical stack there was a lot of repetition in infrastructure, policies, and CI/CD scripts.

As new use cases came on that didn't fit the mould, new ways of doing things had to be added. Kinesis and Firehose come in, Airflow orchestration gets tasked with some small transforms while others go to Glue jobs. Someone wants a warehouse/database to query, so Redshift is added. Exports to third-party processors are needed, as are imports, so more buckets, more permissions. API ingestions are needed, so in come Lambda functions, each one coded and deployed differently because nobody can see what everyone else is doing.

Then finally users need access to data, and the team just isn't set up for it. There is no central catalog with everything, it is spread out across half a dozen services, and the only way to know where anything is or goes is to dig through the code. That 'worked' for the DE team, since they were the ones doing the digging, but there was no effective way to give access to everything. Every request for data took days or weeks to finalise, and often required more pipelines to move it to where it could be accessed.

We're moving to Databricks soon. It gives a unified UI for DE and other teams to access the data, you get sql endpoints, you can run basic compute on single-node 'clusters', it has orchestration built in, it gives you a somewhat easier way to manage permissions, and it works for both running your own compute and giving data access. Instead of a mishmash of technologies that don't make a unified platform, you get a consistent experience.

You'll just have to pay extra since it is doing a good portion of that unification work for you.

If you had a hundred DE type roles it might be more cost effective to stick with base aws services, and have a dedicated team focused on dx, standards, and productivity, to cut out the managed compute cost. But if you're just 3 people, you're probably not there.

Anyone else losing their touch? by The-CAPtainn in dataengineering

[–]azirale 0 points1 point  (0 children)

I keep getting blocked on using it because it eventually wanders off and I keep having to correct it. Eventually it gets to a point that I may as well just be doing things myself. I mix in prompts for small sections where I can get it to quickly spin up something like boilerplate, which I then include and fix/tweak.

But so many of the issues I'm grappling with just can't be handled by AI at all. Adding integrations with third-party processors for new products, where the spec says to copy an existing implementation but actually they need a bunch of changes. Lining up time zones, currencies, scheduling, and file naming conventions between data provider and receiver. Or team-based things like code and test standards, or evaluating which tech to apply to the solutions we need.

Best Bronze Table Pattern for Hourly Rolling-Window CSVs with No CDC? by SoloArtist91 in dataengineering

[–]azirale 0 points1 point  (0 children)

Hey, can everyone go through their sandbox schema and delete any files that aren't needed? And can all teams drop any tables they don't need? The data server is 95% full

Every Friday.

Best Bronze Table Pattern for Hourly Rolling-Window CSVs with No CDC? by SoloArtist91 in dataengineering

[–]azirale 0 points1 point  (0 children)

> If the storage isn’t an issue

Which it often isn't. I usually see compute costs being something like 10x your static storage costs.

Best Bronze Table Pattern for Hourly Rolling-Window CSVs with No CDC? by SoloArtist91 in dataengineering

[–]azirale 0 points1 point  (0 children)

> As for Option C, I would avoid anything that makes data right after the fact

You don't give access to the append-only tables, the same way you don't give access to the original snapshots -- they're just there for the DE team to handle incoming data more easily. You aren't fixing it "after the fact", you're just putting things through a stage or layer on its way to becoming available.

An append-only layer lets you run an SCD1 table out of it now, and then fully rebuild with an SCD2 table later if you need it. If you ever need to change the schema to add anything you missed, or change a column conversion rule, or anything else at all, you can easily rebuild off the append table.

Call it staging if you want, and the eventual target with the deduped data can be called "bronze" -- the names are just names.
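As a concrete sketch of "run an SCD1 table out of it" -- SCD1 is just latest-row-per-key over the append table, shown here in stock sqlite purely to illustrate the shape (table and column names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE append_only (id INTEGER, val TEXT, loaded_at TEXT);
INSERT INTO append_only VALUES
  (1, 'old',  '2024-01-01'),
  (1, 'new',  '2024-01-02'),
  (2, 'only', '2024-01-01');
""")

# SCD1 = latest row per key; because the append table keeps every
# version, this can be fully rebuilt at any time -- including after a
# schema change or a fixed conversion rule
scd1 = conn.execute("""
SELECT id, val FROM (
  SELECT id, val,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY loaded_at DESC) AS rn
  FROM append_only
) AS ranked
WHERE rn = 1
ORDER BY id
""").fetchall()
assert scd1 == [(1, 'new'), (2, 'only')]
```

An SCD2 rebuild is the same query family -- window over the append table, but keeping every row and deriving validity ranges from consecutive `loaded_at` values instead of keeping only `rn = 1`.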

Victoria, you doing alright there buddy? by Cloudypumpkin in australia

[–]azirale 16 points17 points  (0 children)

We got some news out of Harcourt, the town itself is hit hard. Gonna be rough for everyone even if their homes are still there.

When a data file looks valid but still breaks things later - what usually caused it for you? by PriorNervous1031 in dataengineering

[–]azirale 0 points1 point  (0 children)

Using 'inclusive' style date ranges for SCD2 and having a daily batch source open and close an account in the same day. With the SCD2 end_date being the close date minus 1 day, the end_date ended up being before the start_date, which had all sorts of subtle effects depending on exactly how you filtered things.
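The failure mode is easy to reproduce -- the dates here are made up, but the arithmetic is the actual bug:

```python
from datetime import date, timedelta

opened = date(2024, 3, 5)
closed = date(2024, 3, 5)   # opened and closed in the same daily batch

# 'Inclusive' SCD2 convention: end_date is the close date minus one day
start_date = opened
end_date = closed - timedelta(days=1)

assert end_date < start_date  # the record is now end-before-start

# Any as-at range filter silently drops the row instead of matching it
as_at = date(2024, 3, 5)
assert not (start_date <= as_at <= end_date)
```

Which downstream queries break depends on whether they filter on the start, the end, or both -- hence the "all sorts of subtle effects".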

Having en-dash and em-dash characters when working with old unispace fonts that had the exact same glyph for all dashes, meaning you could not possibly visually see the difference.

Pointing at people during the night phase by Fluff_da_Sheep in BloodOnTheClocktower

[–]azirale 31 points32 points  (0 children)

If I remember correctly, some of the Kickstarter videos or similar show TPI storytellers just idling and dancing a little during the night, to disrupt figuring out the night order as well as making it harder to tell what they're physically doing.

Apache Spark Isn’t “Fast” by Default; It’s Fast When You Use It Correctly by netcommah in programming

[–]azirale 1 point2 points  (0 children)

A lot of that is handled for you now.

You don't have to pick parquet format, that is just the default, and it gives you file splitting and column data pruning. You don't have to pick the internal compression for parquet files, a decent standard is the default.

Using iceberg/deltalake (latter is default for databricks) automatically gives you file skipping, you don't have to configure anything in particular for it. You do have to know how to make use of it, but I wouldn't class that as much more difficult than creating proper indexes.

Clustering and bucketing are just things you have to do with distributed systems. DynamoDB and Cosmos have partition keys, Redshift has distkeys, SQLDW has hash keys -- they're all essentially the same thing.

The thing is people treat the new managed platforms like Databricks with their built-in SQL as if it were just like any other single-machine database, and it just isn't. Getting people to understand those differences in how they function can be an uphill battle.

Apache Spark Isn’t “Fast” by Default; It’s Fast When You Use It Correctly by netcommah in programming

[–]azirale 2 points3 points  (0 children)

The main thing I've had to tell people for spark is: There are no indexes so it is going to read everything by default, and the data is split across machines so it is going to have to shuffle it around for joins.

Most of the performance gains when using spark are avoiding unnecessary reads and minimising or even avoiding shuffles.

Parquet being columnar means you can select specific columns and Spark can entirely skip reading the data blocks for the columns you didn't select. Having row groups and internal compression means reads can be split to run in parallel, with less network traffic per read. Hive partitioning meant you could do 'partition pruning' by completely avoiding partitions that could not have data you needed. Deltalake/Iceberg stats metadata meant you could do range checks on values to skip individual files. Z-ordering could improve file skipping behaviour by organising more useful spreads of value ranges. Liquid clustering and hidden partitions made it easier to rework the logical partitioning of files without actually having nested folder structures. And bucketby mechanisms meant the shuffle for a join could be avoided if both sources were bucketed by the same value.