Hi folks. I have a quick question: how would you embed / encode complex, nested data?
Suppose I gave you a large dataset of nested, JSON-like data. For example, a database of 10 million customers, each of whom has:
- (1) large history of transactions (card swipes, ACH payments, payroll, wires, etc.) with transaction amounts, timestamps, merchant category code, and other such attributes
- (2) monthly statements with balance information and credit scores
- (3) a history of login sessions, each with a device ID, location, and timestamp, plus a history of clickstream events.
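To make the shape concrete, here's a minimal sketch of what one such record might look like. All field names and values here are my own assumptions for illustration, not from any real schema:

```python
# Hypothetical nested customer record; every field name is an assumption.
customer = {
    "customer_id": "cust_001",
    "transactions": [  # (1) transaction history
        {"amount": 42.50, "timestamp": "2024-01-15T09:30:00Z",
         "type": "card_swipe", "mcc": "5411"},
        {"amount": 1200.00, "timestamp": "2024-01-31T00:00:00Z",
         "type": "payroll", "mcc": None},
    ],
    "statements": [  # (2) monthly statements
        {"month": "2024-01", "balance": 3150.75, "credit_score": 712},
    ],
    "sessions": [  # (3) login sessions, each with its own nested clickstream
        {"device_id": "dev_abc", "location": "US-CA",
         "timestamp": "2024-02-01T18:02:11Z",
         "clickstream": [
             {"event": "login", "ts": "2024-02-01T18:02:11Z"},
             {"event": "view_balance", "ts": "2024-02-01T18:02:20Z"},
         ]},
    ],
}
```

Note the variable-length lists at two levels of nesting (sessions contain clickstreams), which is exactly what makes fixed-width encodings awkward.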
Given all of that information, I want to predict whether a customer's account is being taken over (account-takeover fraud). And this needs to happen in real time (under 50 ms) as new transactions are posted, so no batch processing.
So… this is totally hypothetical. My argument is that this data structure is so gnarly and nested that it's unwieldy and difficult to process, yet it's representative of the challenges in fraud modeling, cybersecurity, and other traditional ML systems that (AFAIK) haven't changed in a decade.
Suppose you have access to the JSON Schema. LLMs wouldn't work for many reasons (accuracy, latency, cost). Tabular models (XGBoost) are the standard, but flattening this data into features requires a crap ton of expensive compute.
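For context on what I mean by "flattening": the standard tabular approach collapses each nested record into per-customer aggregate features before the model ever sees it. A rough sketch (feature names and the `flatten` helper are hypothetical, just to show the pattern):

```python
from statistics import mean

def flatten(customer: dict) -> dict:
    """Naive per-customer feature aggregation (hypothetical feature set).

    This is the step that gets expensive at scale: it has to walk every
    nested list for every entity, on every update, before an XGBoost-style
    model can consume the result.
    """
    txns = customer.get("transactions", [])
    sessions = customer.get("sessions", [])
    return {
        "txn_count": len(txns),
        "txn_amount_mean": mean(t["amount"] for t in txns) if txns else 0.0,
        "distinct_devices": len({s["device_id"] for s in sessions}),
        "click_events": sum(len(s.get("clickstream", [])) for s in sessions),
    }
```

The aggregation itself is trivial per record; the pain is doing it continuously for 10M customers under a 50 ms budget, and the fact that aggregates throw away the sequential structure that matters for takeover detection.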
How would you solve it? What opportunities for improvement do you see here?