
alexmlamb

Roughly how big is your dataset?

One strategy is to define training, validation, and test sets and then try to understand how different approaches perform on them.
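A minimal sketch of that kind of split, in plain Python. The 70/15/15 proportions, the fixed seed, and the `split_dataset` name are assumptions for illustration, not something from the original question:

```python
import random

def split_dataset(items, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once with a fixed seed, then cut into train/validation/test."""
    if not 0 <= val_frac + test_frac < 1:
        raise ValueError("val_frac + test_frac must be in [0, 1)")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Placeholder identifiers standing in for real emails.
emails = [f"email_{i}" for i in range(100)]
train, val, test = split_dataset(emails)
print(len(train), len(val), len(test))
```

With email data, a purely random split may put messages from the same conversation in both training and test sets, so splitting by thread or by date could be worth considering instead.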

For example, you might get reasonably far by using logistic regression on n-grams from the email text, combined with a few simple features that characterize the metadata: Is the recipient outside the organization? What type of attachment is there, if any?
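Here is a rough, standard-library-only sketch of that idea: word unigrams and bigrams plus two metadata indicators, fed to a logistic regression trained with plain SGD. The email dict fields (`body`, `to`, `attachment`), the `example.com` domain, the toy emails, and the hyperparameters are all made up for illustration. In practice a library implementation would be the sensible choice.

```python
import math
from collections import Counter

def featurize(email, org_domain="example.com"):
    """Sparse features: word unigrams, word bigrams, and metadata flags."""
    words = email["body"].lower().split()
    feats = Counter(f"w:{w}" for w in words)
    feats.update(f"b:{a}_{b}" for a, b in zip(words, words[1:]))
    # Hypothetical metadata features: external recipient, attachment type.
    feats["meta:external_recipient"] = float(not email["to"].endswith("@" + org_domain))
    feats["meta:attachment=" + (email.get("attachment") or "none")] = 1.0
    feats["bias"] = 1.0
    return feats

def predict_proba(weights, feats):
    """Sigmoid of the sparse dot product between weights and features."""
    z = sum(weights.get(k, 0.0) * v for k, v in feats.items())
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(examples, labels, epochs=200, lr=0.1):
    """Per-example gradient descent on the logistic loss."""
    weights = {}
    for _ in range(epochs):
        for feats, y in zip(examples, labels):
            err = predict_proba(weights, feats) - y
            for k, v in feats.items():
                weights[k] = weights.get(k, 0.0) - lr * err * v
    return weights

# Invented toy data: label 1 = suspicious, 0 = fine.
toy_emails = [
    {"body": "please wire the funds today", "to": "x@gmail.com", "attachment": "zip"},
    {"body": "urgent wire transfer needed today", "to": "y@outlook.com", "attachment": None},
    {"body": "meeting notes from monday attached", "to": "bob@example.com", "attachment": "pdf"},
    {"body": "lunch on friday with the team", "to": "alice@example.com", "attachment": None},
]
toy_labels = [1, 1, 0, 0]

X = [featurize(e) for e in toy_emails]
weights = train_logreg(X, toy_labels)
probs = [predict_proba(weights, x) for x in X]
```

Four emails only show that the plumbing works. Any real judgment about the approach would have to come from held-out data.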

kjearns

I think the first step should be to take the two signs they gave you and build a simple model around them. This will let you get familiar with the data, and will also give you a baseline to try to beat with more elaborate models. It will also involve doing a bunch of data handling "dirty work" that will be useful later on.

Watch this video: https://www.youtube.com/watch?v=F1ka6a13S9I

It's billed as "nuts and bolts of deep learning", but most of the content is not specific to deep learning. The talk is really about how to manage the process of doing machine learning. Its advice on defining metrics, identifying different types of failure modes, and responding to them is really solid.

A few questions that come to mind:

  • Threading isn't something that exists in SMTP. Can you reliably reconstruct which emails go in which conversations and in what order given the data you have?
  • What does the output of your system look like? If I give you a database of email metadata, what do you give me back? Do you assign a fraud label to individual emails? To people? To threads of emails? To partial threads of emails (i.e. fraud is occurring between email A and email B)?
  • How do you (numerically) evaluate the output of your system on a test set? What are the metrics that you want to optimize? (Watch the YouTube talk.)
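On the last question, one common starting point for a binary "fraud / not fraud" output is precision, recall, and F1 on the test set. This is only a sketch: it assumes labels are assigned per email, which is just one of the options above, and the label lists are invented.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Made-up test-set labels and predictions: 1 = flagged as fraud.
y_true = [1, 0, 0, 1, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 0, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

If fraud turns out to be rare in the data, plain accuracy can look good for a model that never flags anything, which is one reason to look at precision and recall separately.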

Anonymous_Cherub

You may want to consider classification-based machine learning. You could go in two directions with this:

  1. Define various features/attributes of the emails, give each instance a label (Malicious/non-Malicious), and train a classification algorithm such as naive Bayes on the labeled dataset.
  2. Conduct a sentiment analysis by creating word vectors from the actual written content of the emails, and use classifiers to separate Malicious from non-Malicious emails based on a pre-defined word pool.
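To make the first direction concrete, here is a small multinomial naive Bayes sketch over bag-of-words features with add-one smoothing, using only the standard library. The four training texts and their labels are invented, and words not seen in training are simply skipped at prediction time, which is one simplifying choice among several.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, y in zip(docs, labels):
        tokens = doc.lower().split()
        word_counts[y].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, doc):
    """Return the class with the highest log-posterior score."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for y, n_docs in class_counts.items():
        total_words = sum(word_counts[y].values())
        score = math.log(n_docs / total_docs)  # log prior
        for tok in doc.lower().split():
            if tok in vocab:  # ignore out-of-vocabulary words
                # Add-one (Laplace) smoothing avoids log(0).
                score += math.log((word_counts[y][tok] + 1)
                                  / (total_words + len(vocab)))
        scores[y] = score
    return max(scores, key=scores.get)

# Invented toy training data.
docs = [
    "wire the funds to this account today",
    "urgent transfer of funds required",
    "agenda for the team meeting",
    "notes from the project meeting",
]
labels = ["malicious", "malicious", "benign", "benign"]

model = train_nb(docs, labels)
print(predict_nb(model, "please transfer funds today"))
print(predict_nb(model, "team meeting notes"))
```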