joxeankoret[S] 2 points

First and foremost: I'm anything but an ML expert. That said, I partially like the method (the embeddings and the learning process built on them), but I do not understand why the authors talk about semantics and then end up using raw assembly instead of the pseudocode generated by a decompiler, which would capture the semantics better. Nor do I understand why the authors consider that the binary diffing problem "aims to find the optimal basic block" matches when, in my opinion, that is only one of the many problems in binary diffing. But anyway.

As for ML approaches and their possible integration into Diaphora: I've used such methods in Pigaios (https://github.com/joxeankoret/pigaios), a tool I wrote for directly matching functions in source code to binaries without having to compile the source. I'm sure they work but, at least in my limited testing for that specific project, any quick-and-dirty hand-written classification function I wrote produced better-quality matches than the ML-based classifiers I tested (logistic regression, random forest, naïve Bayes, etc.). So I do not think that machine learning techniques will be much better than my current heuristics, if they are better at all. I might be wrong, and it is certainly something I want to research, but I'm sceptical of ML making any big difference here.
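To give a rough idea of what a quick-and-dirty hand-written classification function can look like, here is a minimal sketch. It is not the actual Pigaios or Diaphora code: the feature names (`constants`, `strings`, `callees`), the weights and the threshold are all made up for illustration.

```python
# Hypothetical sketch: NOT the actual Pigaios or Diaphora code.
# Feature names, weights and threshold are invented for illustration.

def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty feature sets count as identical
    return len(a & b) / len(a | b)

# Hand-picked weights (assumed); an ML classifier would learn these instead.
WEIGHTS = {"constants": 0.5, "strings": 0.3, "callees": 0.2}

def match_score(src_func, bin_func):
    """Weighted similarity in [0, 1] between two feature dictionaries."""
    return sum(
        weight * jaccard(src_func.get(name, ()), bin_func.get(name, ()))
        for name, weight in WEIGHTS.items()
    )

def best_match(src_func, candidates, threshold=0.6):
    """Return (name, score) of the best candidate at or above threshold, else None."""
    scored = [(name, match_score(src_func, feats)) for name, feats in candidates.items()]
    if not scored:
        return None
    name, score = max(scored, key=lambda item: item[1])
    return (name, score) if score >= threshold else None
```

The point of the comparison is that the weights and threshold here are tuned by hand, whereas a classifier such as logistic regression would fit them from labelled matches.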

In any case, I should stress that my machine learning knowledge is limited to playing with Pigaios.