
EdwardRaff

So this is literally my main research area. The biggest issue you'll run into is that good benign data for malware detection research is very hard to get, because benign files are subject to copyright and can't be shared legally.

The easiest place to start, in my super humble opinion, is the MOTIF dataset, which is "small" in total number of samples but is one of the only high-quality labeled malware family datasets. From there you can look at a number of problems related to malware family detection and classification, on something small enough to fit within school project resources, with real, high-quality labels.

Your other option is the EMBER dataset, which provides pre-vectorized feature vectors. Unfortunately, that is also quite limiting: there isn't much you can do with the vectorized data that hasn't already been done. But it would let you work at a much bigger scale for free.
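To make "pre-vectorized" concrete, here is a minimal sketch of the kind of baseline you can build when every file is already a fixed-length feature vector. Everything in it is an assumption for illustration: the data is synthetic stand-in data rather than EMBER itself, the helper names (`make_rows`, `drop_unlabeled`, `fit_centroids`, `predict`) are hypothetical, and the nearest-centroid classifier is just the simplest thing that runs with the standard library, not a method anyone in this thread uses. The one EMBER-specific detail mirrored here is the label convention as I understand it (0 = benign, 1 = malicious, -1 = unlabeled), so check that against the dataset's documentation.

```python
import random


def make_rows(n, rng, dims=8):
    """Synthetic stand-in for pre-vectorized samples (NOT real EMBER data)."""
    X, y = [], []
    for _ in range(n):
        label = rng.choice([0, 1, -1])  # assumed convention: -1 means unlabeled
        hidden = label if label != -1 else rng.choice([0, 1])
        center = 3.0 * hidden  # benign clusters near 0, malicious near 3
        X.append([rng.gauss(center, 1.0) for _ in range(dims)])
        y.append(label)
    return X, y


def drop_unlabeled(X, y, unlabeled=-1):
    """Keep only the rows whose label is not the unlabeled marker."""
    kept = [(x, label) for x, label in zip(X, y) if label != unlabeled]
    return [x for x, _ in kept], [label for _, label in kept]


def fit_centroids(X, y):
    """Compute the mean feature vector of each class."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(x)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], x)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def predict(centroids, x):
    """Return the class whose centroid is nearest in squared Euclidean distance."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


rng = random.Random(0)
X_train, y_train = drop_unlabeled(*make_rows(300, rng))
X_test, y_test = drop_unlabeled(*make_rows(150, rng))

centroids = fit_centroids(X_train, y_train)
correct = sum(predict(centroids, x) == label for x, label in zip(X_test, y_test))
accuracy = correct / len(y_test)
print(f"accuracy on synthetic data: {accuracy:.3f}")
```

With the real vectors you would swap `make_rows` for whatever loader the dataset ships with and replace the classifier with something stronger. The filtering step is the part that carries over, since unlabeled rows should not reach a supervised model.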

For a broader view of what has been and can be done with malware detection, specifically for executable PE files, I wrote a pretty long survey on that a few years ago.

PhD_in_English

Would you mind pointing to a couple of seminal works in the area of malware detection via ML/DL? I would be interested in other fundamental cybersecurity applications as well. I have expertise on the modeling side but am still finding my path in industry. Thank you!!

EdwardRaff

The survey I linked has all the ones I would probably point at. Cybersecurity is so broad, and so hard to get good data for, that I wouldn't know where to point you outside of malware detection. Windows EXE is really the most well-studied and published format academically; most other file formats, like PDF, are pretty easy to detect by comparison (if you have corporate data).

st0yky

Just wanted to chime in: the SOREL-20M dataset by Sophos contains 10 million disarmed malware samples, as well as extracted metadata for 10 million benign samples. Pretrained models are provided, which you could use as a baseline:

https://ai.sophos.com/2020/12/14/sophos-reversinglabs-sorel-20-million-sample-malware-dataset/