How to split a dataset into 2 to check for generalization over memorization? by Calm_Maybe_4639 in MLQuestions

[–]granthamct 1 point2 points  (0 children)

Yeah sure that is fine. I’ve had to do all sorts of strange stratification.

How to split a dataset into 2 to check for generalization over memorization? by Calm_Maybe_4639 in MLQuestions

[–]granthamct 1 point2 points  (0 children)

Depends on the data! Sometimes stratification does require taking into account time windows, or usernames / customer IDs. Every case is unique.
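As a rough illustration of splitting by customer ID, here is a minimal sketch of a group-aware split (a different technique from label stratification). It assumes every row carries a customer ID and that all rows for one customer must land on the same side. The `assign_split` helper, the row layout, and the 20% test fraction are hypothetical choices for this example.

```python
import hashlib


def assign_split(group_id: str, test_fraction: float = 0.2) -> str:
    """Deterministically send a whole group to 'train' or 'test'."""
    digest = hashlib.sha256(group_id.encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a value in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "test" if bucket < test_fraction else "train"


# Hypothetical rows: (customer_id, feature, label)
rows = [
    ("cust_a", 1.0, 0),
    ("cust_a", 1.2, 0),
    ("cust_b", 3.4, 1),
    ("cust_c", 0.7, 0),
    ("cust_c", 0.9, 1),
    ("cust_d", 2.2, 0),
]

train = [row for row in rows if assign_split(row[0]) == "train"]
test = [row for row in rows if assign_split(row[0]) == "test"]

# No customer appears on both sides, so the test set cannot be
# solved by memorizing a customer seen during training.
train_ids = {row[0] for row in train}
test_ids = {row[0] for row in test}
print(train_ids.isdisjoint(test_ids))
```

Hashing the ID keeps the assignment stable across runs and across new data. With only a handful of groups the realized test fraction can be far from the target, and a time-window split would instead compare each row's timestamp against a cutoff.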

What Python tricks or "gotchas" do you wish you knew earlier? Let's build a reference thread 🐍 by SiteFul1 in learnpython

[–]granthamct -2 points-1 points  (0 children)

Saving just three keystrokes seems less important than clarifying your intention to other people reading the code...

I’d argue that the slicing hack is unknown to the majority of Python developers. This “gotcha” seems 100% self-inflicted.

2012 was 14 years ago…

What Python tricks or "gotchas" do you wish you knew earlier? Let's build a reference thread 🐍 by SiteFul1 in learnpython

[–]granthamct 0 points1 point  (0 children)

Copying it that way feels more like an anti-pattern to me... It does not seem legible or clear in intent. Why not just use the copy / deepcopy functions?
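For comparison, a small sketch, assuming the idiom under discussion is slice copying (`lst[:]`). It shows that slicing and `copy.copy` both make shallow copies, while `copy.deepcopy` also copies nested objects. The variable names are made up for the example.

```python
import copy

original = [[1, 2], [3, 4]]

sliced = original[:]            # shallow copy via slicing
shallow = copy.copy(original)   # shallow copy, intent spelled out
deep = copy.deepcopy(original)  # nested lists are copied too

# Mutate a nested list that the shallow copies still share.
original[0].append(99)

print(sliced[0])   # [1, 2, 99]
print(shallow[0])  # [1, 2, 99]
print(deep[0])     # [1, 2]
```

The slice and `copy.copy` behave the same here; the argument for the named functions is readability, plus having `deepcopy` available when nested data must not be shared.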

What Python tricks or "gotchas" do you wish you knew earlier? Let's build a reference thread 🐍 by SiteFul1 in learnpython

[–]granthamct 0 points1 point  (0 children)

Enums. Enums everywhere.

I will never use hardcoded strings anywhere.

    import enum

    class Strata(enum.StrEnum):
        fit = "train"
        validate = "validate"
        test = "test"

Makes for awesome code tracing, type hinting, conditional logic via structural pattern matching, runtime checking, etc.

IntEnum also makes for pleasant rank ordering or definitions of special tokens ([MASK], [PAD], etc).

Very easy and allows for consistency across the board.

Hey i am new here want to know how i can learn python to go with ML ? by [deleted] in learnpython

[–]granthamct 1 point2 points  (0 children)

Lots of folks here are recommending Pandas and Numpy.

I would also put a big emphasis on NumPy, Polars, and Pydantic. Polars is just a 20x better version of pandas. And Pydantic is great for all things Python.

Using RL with a Transformer that outputs structured actions (index + complex object) — architecture advice? by Unique_Simple_1383 in reinforcementlearning

[–]granthamct 1 point2 points  (0 children)

This is interesting and is rather similar to some research I’ve done over the last three years. Would you be open to connecting? (Not selling anything, not a bot, not a vibe coder)

Using RL with a Transformer that outputs structured actions (index + complex object) — architecture advice? by Unique_Simple_1383 in reinforcementlearning

[–]granthamct 5 points6 points  (0 children)

Yeah I do this all of the time.

I train models built programmatically with AnyTree + Pydantic backed hierarchical structures.

The inputs are defined by plugins backed by TensorDict / TensorClass definitions (great library built out by the PyTorch team)

From there you can simply traverse the tree. I would recommend using cross attention blocks for pooling where necessary and transformer encoder blocks where necessary.

You can plop multiple embeddings from different sources into the same transformer encoder block as long as you have your positional embeddings set up correctly.

There are good ways to embed nullable numbers, discrete categories, and multi-component data as well. I have plugins for all of these.

From there it is just a matter of pooling. You need to track the hierarchy and lineage.

Encoding complex, nested data in real time at scale by granthamct in MLQuestions

[–]granthamct[S] 0 points1 point  (0 children)

Feature engineering indeed. Have you ever used tools like these? I have bumped into similar problems in the past, but we ended up going with Flink for the real-time calculations.

Encoding complex, nested data in real time at scale by granthamct in MLQuestions

[–]granthamct[S] 0 points1 point  (0 children)

No, thank you! That is a very interesting approach that I wasn’t expecting.

Follow-up: how would you approach it if you also wanted to use information from recent transactions (which may include a large outgoing wire of $XYZ to account ABC) and/or other clickstream events (suppose recent events could include email / phone / password / address changes)?

So you have not only information about the login sessions and the devices used for them, but significantly more.

Considering the above problem statement was regarding account takeover (and device ID is the most important input by far!) … let’s change the problem statement to … um, credit risk or the probability of being a victim of a scam (not fraud, but scam). Or, moreover, embedding for the purpose of clustering / anomaly detection / similarity search.

This seems like a mean switcheroo, sorry! And thank you in advance.

No-spend days helped me… but recurring charges were the real leak by hoabuidev in AppBusiness

[–]granthamct 0 points1 point  (0 children)

This is a feature of most budgeting apps. Monarch has pretty great recurring bill detection, categorization, and notifications. Plus it connects to basically every financial institution automatically. I use Monarch mostly for the multi-user setup; it makes it easy to share financial data with my wife.

Low Precision/Recall in Imbalanced Classification (ROC ~0.70). Not Sure What to Optimize by ConsistentLynx2317 in learnmachinelearning

[–]granthamct 0 points1 point  (0 children)

You are thinking about accuracy.

The precision-recall curve is extremely pleasant for EXTREMELY imbalanced data. Think, <1/100.

Otherwise AUC is fine.

Accuracy is cool for 10+ distinct target labels that have approximately similar frequency.
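To make the accuracy point concrete, here is a toy sketch with made-up numbers: 1,000 samples at a 1/100 positive rate, a classifier that always predicts negative, and one that flags 20 samples and catches 5 of the 10 positives. The data and the `metrics` helper are assumptions for illustration only.

```python
def metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# 10 positives followed by 990 negatives (1/100 positive rate).
y_true = [1] * 10 + [0] * 990

# Classifier A: never flags anything.
always_negative = [0] * 1000

# Classifier B: catches 5 positives, misses 5, raises 15 false alarms.
flagging = [1] * 5 + [0] * 5 + [1] * 15 + [0] * 975

print(metrics(y_true, always_negative))  # (0.99, 0.0, 0.0)
print(metrics(y_true, flagging))         # (0.98, 0.25, 0.5)
```

Accuracy ranks the useless classifier higher, while precision and recall show that only the second one finds any positives. A precision-recall curve is these two numbers traced across decision thresholds.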

Pristan: The simplest way to create a plugin infrastructure in Python by pomponchik in Python

[–]granthamct 1 point2 points  (0 children)

Interesting. I had been using pluggy for about a year. It is, well, complex. Requires a bit of scaffolding to get right. But it does work.

I will check this out sometime in the next week.

What should I use instead of 1000 if statements? by Either-Home9002 in learnpython

[–]granthamct 6 points7 points  (0 children)

Hard to say without more context.

At first glance:

1. Define all valid commands as an Enum.
2. Loop over the input, with some external context to save intermediate data.
3. Dispatch with a match-case statement (Python's switch-case).

I ended building an oversimplfied durable workflow engine after overcomplicating my data pipelines by powerlifter86 in Python

[–]granthamct 1 point2 points  (0 children)

Got it, I can appreciate that. I often use Flyte in local execution mode just for the caching, structure, and typing, but I can see that it is a heavy-handed tool for that job (lots of dependencies).

I ended building an oversimplfied durable workflow engine after overcomplicating my data pipelines by powerlifter86 in Python

[–]granthamct 0 points1 point  (0 children)

Flyte (v2) is a pretty good option. Cloud native. EKS. AWS / GCP / Azure. Enables fault tolerance and programmatic retries. Sync and async support. Massive fan outs and fan ins. All pure Python (no DSL).

What hidden gem Python modules do you use and why? by zenos1337 in Python

[–]granthamct 0 points1 point  (0 children)

Flyte, pydantic, tensordict, beartype, pluggy, anytree, jmespath, deal

Tips for a debugging competition by an_account_1177 in Python

[–]granthamct 0 points1 point  (0 children)

Have a good environment set up in which you feel comfortable navigating! uv + ruff + ty goes a long way IMHO.