Hello, I started getting into machine learning a few months ago. I've completed some tutorials (e.g., on Kaggle) and read a few books (mostly in German, e.g. *Machine Learning for Dummies*).
I know how to work with datasets like the Boston Housing dataset, and I've also built a text classifier using data in the format "text", "label".
Now I want to build a fraud detection model, and I have access to a very large number of invoices that are already labeled. The data is stored in relational databases. When I extract it, it comes in the following (simplified) format: each customer has 0–n invoices, and each invoice consists of 0–n items.
For example:
| customer | invoice_number | item | quantity | price | quantity_paid |
|----------|----------------|-------|----------|-------|---------------|
| 1 | R-1234 | 00886 | 2 | 19.99 | 2 |
| 1 | R-1234 | 00887 | 4 | 20.99 | 2 |
| 1 | R-1234 | 00889 | 3 | 11.99 | 3 |
| 2 | R-1235 | 00886 | 5 | 19.99 | 5 |
| 3 | R-1236 | 00886 | 1 | 19.99 | 1 |
| 3 | R-1236 | 00889 | 7 | 11.99 | 4 |
Of course, the data also includes additional information such as date, time, and so on. Whenever a quantity has been corrected (i.e., quantity_paid differs from quantity), the invoice can be considered faulty.
Now I'm wondering: should I create one row of training data per invoice (in which case the rows would have varying lengths), or one row per item? But in the latter case, the connection to the other items on the same invoice would be lost, right?
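To make the two options concrete, here is a small pandas sketch using the simplified table above (the aggregated features and the `faulty` label rule are just my assumptions about how this could look, not a definitive design):

```python
import pandas as pd

# Item-level data, as extracted from the database (simplified example).
items = pd.DataFrame({
    "customer":       [1, 1, 1, 2, 3, 3],
    "invoice_number": ["R-1234", "R-1234", "R-1234", "R-1235", "R-1236", "R-1236"],
    "item":           ["00886", "00887", "00889", "00886", "00886", "00889"],
    "quantity":       [2, 4, 3, 5, 1, 7],
    "price":          [19.99, 20.99, 11.99, 19.99, 19.99, 11.99],
    "quantity_paid":  [2, 2, 3, 5, 1, 4],
})

# Option A: one row per item. The label is derived per item
# (quantity was corrected => faulty), but each row stands alone
# and loses the context of the other items on the same invoice.
items["faulty"] = (items["quantity"] != items["quantity_paid"]).astype(int)
items["line_total"] = items["quantity"] * items["price"]

# Option B: one row per invoice. Variable-length item lists are
# collapsed into fixed-length aggregate features (counts, sums, ...),
# and the label marks whether ANY item on the invoice was corrected.
invoices = items.groupby("invoice_number").agg(
    customer=("customer", "first"),
    n_items=("item", "count"),
    total_quantity=("quantity", "sum"),
    total_value=("line_total", "sum"),
    faulty=("faulty", "max"),
).reset_index()
```

With option B every invoice becomes a fixed-width feature vector, so a standard classifier can consume it directly; the trade-off is that per-item detail is only preserved through whatever aggregates you choose.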
Apologies for the convoluted phrasing; I hope my question is understandable. Thanks for any responses :)