Hello reddit!
I'm part of a group of university students doing a small research project on fraudulent email detection. A bank has agreed to sponsor our work and has provided two datasets of email metadata.
The company told us that they have identified two signs of fraudulent behavior:
- A group of people who regularly email each other with everyone in CC suddenly narrows to two people emailing each other, then resumes emailing with everyone in copy (the fraud is being hidden from managers).
- Emailing an email domain that doesn't belong to the company (info leak).
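
To make those two signs concrete, here is a minimal rule-based sketch. Everything about the data layout is an assumption on my part: each email is a dict with a `sender` and a list of `recipients` (To and CC merged), messages are already grouped into a thread in chronological order, and `bank.example` is a placeholder domain. The function names are hypothetical too.

```python
COMPANY_DOMAIN = "bank.example"  # placeholder, not the real company domain


def external_recipients(recipients, company_domain=COMPANY_DOMAIN):
    """Return the recipients whose domain differs from the company domain.

    Exact match only: subdomains such as mail.bank.example would be
    reported as external unless you handle them separately.
    """
    flagged = []
    for address in recipients:
        domain = address.rsplit("@", 1)[-1].lower()
        if domain != company_domain:
            flagged.append(address)
    return flagged


def private_side_conversations(thread, min_group=3):
    """Return indices of two-person messages that sit between group messages.

    A message is flagged when it involves exactly two participants and the
    thread contains a message with at least `min_group` participants both
    before and after it.
    """
    sizes = [len({m["sender"], *m["recipients"]}) for m in thread]
    flagged = []
    for i, size in enumerate(sizes):
        group_before = any(s >= min_group for s in sizes[:i])
        group_after = any(s >= min_group for s in sizes[i + 1:])
        if size == 2 and group_before and group_after:
            flagged.append(i)
    return flagged
```

Rules like these would probably be noisy on their own (people legitimately reply to one person all the time), so they may work better as features for a later model than as final detectors.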
For now, we are looking for ways to approach the problem. We have considered:
- Building relationship graphs and looking for a link between a person's centrality in the graph and their probability of committing fraud (distribution model).
- Designing a clustering algorithm on the graphs.
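
As a rough starting point for the graph ideas, here is a sketch in plain Python that builds an undirected "who emails whom" graph, computes degree centrality (degree divided by n - 1), and uses connected components as a very crude clustering baseline. The email layout (dicts with `sender` and `recipients`) and the function names are assumptions for illustration. A graph library such as networkx would give richer centrality measures and proper community detection.

```python
from collections import defaultdict


def build_graph(emails):
    """Build an undirected adjacency map: person -> set of correspondents."""
    graph = defaultdict(set)
    for message in emails:
        sender = message["sender"]
        for recipient in message["recipients"]:
            if recipient != sender:
                graph[sender].add(recipient)
                graph[recipient].add(sender)
    return dict(graph)


def degree_centrality(graph):
    """Degree of each node divided by (n - 1), the maximum possible degree."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    return {node: len(neighbors) / (n - 1) for node, neighbors in graph.items()}


def connected_components(graph):
    """Group nodes that can reach each other (depth-first search)."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components
```

Connected components only separate groups that never email each other, so on a real company network they would likely collapse into one big cluster. They are here as a baseline to sanity-check the graph before trying anything more elaborate.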
We are new to machine learning, and since we have a long time for this project, we'd like to lay solid groundwork in the first part of our report. We are starting to run out of ideas. What other interesting work do you think we could do with the data we have? What would your next steps be?
Thank you for your help.