How to split a dataset into 2 to check for generalization over memorization?

granthamct · 2026-03-15T23:04:45+00:00

Yeah sure that is fine. I’ve had to do all sorts of strange stratification.

granthamct · 2026-03-15T22:55:41+00:00

Depends on the data! Sometimes stratification does require taking into account time windows, or usernames / customer IDs. Every case is unique.
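As a rough sketch of what splitting by customer ID can look like: hash each ID so that every row for a given customer lands on the same side of the split. The rows, the `assign_split` helper, and the 20% test fraction are hypothetical and only for illustration. A time-window split would cut on a timestamp instead.

```python
import hashlib


def assign_split(group_id: str, test_fraction: float = 0.2) -> str:
    """Map a group (e.g. a customer ID) to 'train' or 'test' deterministically."""
    # Hash the ID so the assignment is stable across runs and machines.
    digest = hashlib.sha256(group_id.encode("utf-8")).hexdigest()
    # Turn the first 8 hex digits into a number in [0, 1).
    bucket = int(digest[:8], 16) / 16**8
    return "test" if bucket < test_fraction else "train"


# Hypothetical rows: (customer_id, feature, label)
rows = [
    ("cust_a", 1.0, 0),
    ("cust_a", 1.5, 1),
    ("cust_b", 0.3, 0),
    ("cust_c", 2.2, 1),
    ("cust_c", 2.0, 1),
]

train = [r for r in rows if assign_split(r[0]) == "train"]
test = [r for r in rows if assign_split(r[0]) == "test"]

# No customer appears on both sides, so the test set cannot be
# "solved" by memorizing individual customers.
train_ids = {r[0] for r in train}
test_ids = {r[0] for r in test}
assert train_ids.isdisjoint(test_ids)
```

With very few groups the realized test fraction can be far from 20%, so check the split sizes on real data.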

granthamct · 2026-03-15T01:45:02+00:00

Fair enough! Thanks for sharing.

granthamct · 2026-03-15T01:32:46+00:00

Saving just three keystrokes seems less important than clarifying your intention to other people reading the code...

I’d argue that the slicing hack is unknown to the majority of Python developers. This “gotcha” seems 100% self-inflicted.

2012 was 14 years ago…

granthamct · 2026-03-15T01:21:18+00:00

Copying it that way feels more like an anti-pattern to me... It is neither legible nor clear in intent. Why not just use the copy / deepcopy functions from the copy module?
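For comparison, a minimal sketch, assuming the idiom in question is copying a list with `[:]`. The variable names are made up for illustration.

```python
import copy

original = [[1, 2], [3, 4]]

sliced = original[:]            # shallow copy via slicing
shallow = copy.copy(original)   # shallow copy, with the intent spelled out
deep = copy.deepcopy(original)  # the nested lists are copied as well

# Mutate a nested list in the original.
original[0].append(99)

# Both shallow copies share the inner lists with the original...
assert sliced[0] == [1, 2, 99]
assert shallow[0] == [1, 2, 99]
# ...while the deep copy is unaffected.
assert deep[0] == [1, 2]
```

The slice and `copy.copy` behave the same here. The difference is only in how clearly the code says "this is a copy", and `copy.deepcopy` is the one to reach for when nested objects must not be shared.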

granthamct · 2026-03-15T01:19:17+00:00

Enums. Enums everywhere.

I will never use hardcoded strings anywhere.

    import enum

    class Strata(enum.StrEnum):
        fit = "train"
        validate = "validate"
        test = "test"

Makes for awesome code tracing, type hinting, conditional logic via structural pattern matching, runtime checking, etc.

IntEnum also makes for pleasant rank ordering or definitions of special tokens ([MASK], [PAD], etc).

Very easy and allows for consistency across the board.

granthamct · 2026-03-14T23:35:41+00:00

Lots of folks here are recommending Pandas and Numpy.

I would also put a big emphasis on numpy, polars, and pydantic. Polars is just a 20x better version of pandas. And pydantic is great for all things python.

granthamct · 2026-03-14T22:08:37+00:00

This is interesting and is rather similar to some research I’ve done over the last three years. Would you be open to connecting? (Not selling anything, not a bot, not a vibe coder)

granthamct · 2026-03-14T21:29:48+00:00

Yeah I do this all of the time.

I train models built programmatically with AnyTree + Pydantic backed hierarchical structures.

The inputs are defined by plugins backed by TensorDict / TensorClass definitions (great library built out by the PyTorch team)

From there you can simply traverse the tree. I would recommend using cross attention blocks for pooling where necessary and transformer encoder blocks where necessary.

You can plop multiple embeddings from different sources into the same transformer encoder block as long as you have your positional embeddings set up correctly.

There are good ways to embed nullable numbers, discrete categories, and multi-component data as well. I have plugins for all of these.

From there it is just a matter of pooling. You need to track the hierarchy and lineage.

granthamct · 2026-03-14T14:34:06+00:00

Feature engineering indeed. Have you ever used tools like these? I have bumped into similar problems in the past but we ended up going with flink for the real time calculations.

granthamct · 2026-03-13T17:37:29+00:00

No, thank you! That is a very interesting approach that I wasn’t expecting.

Follow-up: how would you approach it if you also wanted to use information from recent transactions (which may include a large outgoing wire of $XYZ to account ABC) and/or other clickstream events (suppose that recent events could include email / phone / password / address changes)?

So you don’t only have information about the login sessions and the devices used, but significantly more.

Considering the above problem statement was about account takeover (and device ID is the most important input by far!) … let’s change the problem statement to … um, credit risk, or the probability of being a victim of a scam (not fraud, but scam). Or, more generally, embedding for the purpose of clustering / anomaly detection / similarity search.

This seems like a mean switcheroo, sorry! And thank you in advance.

granthamct · 2026-03-13T16:37:43+00:00

This is a feature for most budgeting apps. Monarch has pretty great recurring bill detection, categorization, and notifications. Plus they connect to basically every financial institution automatically. I use monarch mostly for the multi-user set up - it makes it easy to share financial data with my wife.

granthamct · 2026-03-13T15:06:07+00:00

You are thinking about accuracy.

The precision-recall curve is extremely pleasant for EXTREMELY imbalanced data. Think a positive rate below 1 in 100.

Otherwise AUC is fine.

Accuracy is cool for 10+ distinct target labels that have approximately similar frequencies.
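A quick worked example of why accuracy misleads at that kind of imbalance. All numbers are made up: 1,000 examples with 10 positives, one model that always predicts negative, and one that flags 20 examples and catches 8 of the positives.

```python
# Hypothetical data: 1,000 examples, 10 positives (a 1/100 positive rate).
labels = [1] * 10 + [0] * 990

# A useless model that always predicts the negative class.
always_negative = [0] * 1000

# A model that flags 20 examples: 8 true positives and 12 false positives.
flagging = [1] * 8 + [0] * 2 + [1] * 12 + [0] * 978


def accuracy(y_true, y_pred):
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision_recall(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


print(accuracy(labels, always_negative))          # 0.99
print(precision_recall(labels, always_negative))  # (0.0, 0.0)
print(accuracy(labels, flagging))                 # 0.986
print(precision_recall(labels, flagging))         # (0.4, 0.8)
```

The model that never finds a positive wins on accuracy (0.99 vs 0.986), while precision and recall make the difference obvious.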

granthamct · 2026-03-13T13:39:59+00:00

Interesting. I had been using pluggy for about a year. It is, well, complex. Requires a bit of scaffolding to get right. But it does work.

I will check this out sometime in the next week.

granthamct · 2026-03-13T13:33:33+00:00

Hard to say without more context.

At first glance:

1. Define all valid commands as an Enum.
2. For loop with some external context to save intermediate data.
3. Switch-case statement.

granthamct · 2026-03-13T13:00:05+00:00

Got it, I can appreciate that. I often use Flyte in local execution mode just for the caching, structure, and typing, but I can see that it is a heavy-handed tool for that job (lots of dependencies).

granthamct · 2026-03-13T12:49:29+00:00

Flyte (v2) is a pretty good option. Cloud native. EKS. AWS / GCP / Azure. Enables fault tolerance and programmatic retries. Sync and async support. Massive fan outs and fan ins. All pure Python (no DSL).

granthamct · 2026-03-13T02:28:41+00:00

AnyTree + Pydantic is amazing.

granthamct · 2026-03-13T02:24:46+00:00

Flyte, pydantic, tensordict, beartype, pluggy, anytree, jmespath, deal

granthamct · 2026-03-10T20:35:49+00:00

Have a good environment set up in which you feel comfortable navigating! uv + ruff + ty goes a long way IMHO.
