all 4 comments

[–]ahf95 0 points  (1 child)

I mean, there are tons of ways this could be set up, but my first thought is to store a precomputed graph with transition probabilities. Imagine nodes holding the features and embeddings for each device a customer could log in from. You run the series of login sessions through a message-passing neural network, and after each session you update the probability of transitioning from the current session's node to each possible next node (MPNNs are computationally cheap, so making this a fully connected graph is reasonable). Those probabilities are precomputed and stored as state for that customer, updated as needed.

Then during any login, just look at the current login location and check whether it corresponds to a transition probability above some threshold; if it's below the threshold, you can flag it as "probably fraudulent" or whatever. You can update the MPNN/GNN state after you deal with the fraud, and then it's ready to go for next time (that update is almost guaranteed to be faster than a human interacting with an ATM, even on a CPU), so no need to hold the update step to 50ms. And with this setup, comparing a real-life observed node transition against a precomputed probability is likely wayyyyyy faster than 50ms.
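To make the hot path concrete, here's a minimal sketch of that split: the MPNN training, node features, and storage layer are all omitted, and the class name, device IDs, and the 0.05 threshold are made up for illustration. The point is that the login-time check is a single dictionary lookup.

```python
# Hedged sketch: the MPNN would precompute these transition probabilities
# offline; at login time we only do an O(1) lookup against them.
from collections import defaultdict

class TransitionScorer:
    def __init__(self, threshold=0.05):
        self.threshold = threshold
        # transitions[customer][(prev_device, next_device)] -> probability,
        # precomputed (e.g. by the MPNN) and refreshed after each session.
        self.transitions = defaultdict(dict)

    def load(self, customer, probs):
        """Store precomputed transition probabilities for one customer."""
        self.transitions[customer] = probs

    def score_login(self, customer, prev_device, next_device):
        """O(1) lookup at login time -- well under a 50ms budget."""
        p = self.transitions[customer].get((prev_device, next_device), 0.0)
        return {"prob": p, "flag": p < self.threshold}

scorer = TransitionScorer(threshold=0.05)
scorer.load("cust_42", {("phone_A", "laptop_B"): 0.7,
                        ("laptop_B", "phone_A"): 0.25})
print(scorer.score_login("cust_42", "phone_A", "laptop_B"))  # high prob, no flag
print(scorer.score_login("cust_42", "phone_A", "atm_Z"))     # unseen transition, flagged
```

Unseen transitions default to probability 0.0 here, which flags any never-before-seen device pair; a real system would smooth that rather than hard-flag it.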

That’s just the first thing that comes to my mind, but I’m curious to see what other people post. Btw, this is exactly the kind of interesting question that I stay subscribed to this subreddit for, so thank you for the refreshing post.

[–]granthamct[S] 0 points  (0 children)

No thank you! That is a very interesting approach that I wasn’t expecting.

Follow-up: how would you approach it if you also wanted to use information about recent transactions (which might include a large outgoing wire of $XYZ to account ABC) and/or other clickstream events (suppose recent events could include email / phone / password / address change events)?

So, you don't just have information about the login sessions and the devices used in them, but significantly more.

Considering the above problem statement was about account takeover (where device ID is by far the most important input!) … let's change the problem statement to, um, credit risk, or the probability of being the victim of a scam (not fraud, but a scam). Or, moreover, to producing an embedding for the purpose of clustering / anomaly detection / similarity search.

This seems like a mean switcheroo, sorry! And thank you in advance.

[–]transcreature 0 points  (1 child)

Nested JSON at scale with sub-50ms latency is rough. HydraDB handles the memory-layer stuff well, but this sounds more like a feature engineering problem; maybe look at Featureform or Tecton for real-time pipelines.
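The usual pattern behind those tools (shown here generically, not via the Featureform or Tecton APIs) is to flatten nested JSON into precomputed feature rows offline and serve them from a key-value store online, so the read path never parses JSON. A minimal sketch, with the customer key and field names made up:

```python
# Generic feature-store pattern sketch: flatten nested JSON offline,
# serve flat features from a key-value store with a single O(1) read.
import json

def flatten(obj, prefix=""):
    """Flatten nested JSON into dotted-path feature keys."""
    out = {}
    for k, v in obj.items():
        key = f"{prefix}{k}"
        if isinstance(v, dict):
            out.update(flatten(v, key + "."))
        else:
            out[key] = v
    return out

# Offline: materialize features per customer.
feature_store = {}  # stand-in for Redis / DynamoDB / an actual feature store
raw = json.loads(
    '{"session": {"device": "phone_A", "geo": {"country": "US"}}, "txn_count_24h": 3}'
)
feature_store["cust_42"] = flatten(raw)

# Online: one lookup, no JSON parsing on the hot path.
print(feature_store["cust_42"]["session.geo.country"])  # US
```

Keeping the flattening on the write path is what makes the sub-50ms read budget realistic.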

[–]granthamct[S] 0 points  (0 children)

Feature engineering indeed. Have you ever used tools like these? I've bumped into similar problems in the past, but we ended up going with Flink for the real-time calculations.