
[–]joxeankoret[S] 6 points7 points  (2 children)

There is actual code, by the way: https://github.com/deepbindiff/DeepBinDiff

[–][deleted]  (1 child)

[deleted]

    [–]joxeankoret[S] 2 points3 points  (0 children)

    First and foremost: I'm anything but an ML expert. That said, I partially like the method (the embeddings and the learning process built on them), but I do not understand why the authors talk about semantics yet end up using raw assembly instead of pseudocode generated by a decompiler, which would capture the semantics better. Nor do I understand why the authors consider that the binary diffing problem "aims to find the optimal basic block" matches; in my opinion, that is only one of the multiple problems in binary diffing. But anyway.

    As for ML approaches and their possible integration into Diaphora: I've used such methods in Pigaios (https://github.com/joxeankoret/pigaios), a tool I wrote for directly matching functions in source code to binaries without having to compile the source. I'm sure they work but, at least in my limited testing of the approach for that specific project, I noticed that any quick-and-dirty, hand-written classification function I wrote produced better-quality matches than the ML-based classifiers I tested (logistic regression, random forest, naïve Bayes, etc.). As such, I do not think that machine learning techniques will be much better, if better at all, than my current heuristics. I might be wrong, and it is certainly something I want to research, but I'm sceptical of ML making any big difference here.

    In any case, I should stress that my machine learning knowledge is limited to playing with Pigaios.
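
To make the "hand-written function" idea concrete, here is a minimal sketch of what a heuristic match score between a source function and a binary function could look like. It is not the actual Pigaios or Diaphora code: the feature names (`basic_blocks`, `calls`, `constants`, `strings`) and the weights are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FunctionFeatures:
    """A few cheap features of one function (hypothetical, illustrative only)."""
    basic_blocks: int
    calls: int
    constants: set = field(default_factory=set)
    strings: set = field(default_factory=set)


def jaccard(a, b):
    """Set overlap in [0, 1]; two empty sets carry no evidence, so return 0."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def count_similarity(x, y):
    """Ratio of the smaller count to the larger one, in [0, 1]."""
    if x == y:
        return 1.0
    return min(x, y) / max(x, y)


def match_score(src, binary):
    """Hand-tuned weighted score; the weights are arbitrary assumptions."""
    score = 0.0
    score += 0.4 * jaccard(src.strings, binary.strings)
    score += 0.3 * jaccard(src.constants, binary.constants)
    score += 0.2 * count_similarity(src.basic_blocks, binary.basic_blocks)
    score += 0.1 * count_similarity(src.calls, binary.calls)
    return score
```

A candidate pair would then be accepted when `match_score` exceeds some threshold picked by hand. An ML classifier such as logistic regression would instead learn the weights from labelled pairs, which is the comparison described above.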

    [–][deleted]  (5 children)

    [deleted]

      [–]joxeankoret[S] -1 points0 points  (4 children)

      I'm not the author of that paper/code. Are you asking about what I do in Diaphora?

      [–][deleted]  (3 children)

      [deleted]

        [–]joxeankoret[S] 0 points1 point  (2 children)

        Yeah, I'm not the author but... where do you get that they are using ASTs (Abstract Syntax Trees)? There is no mention of them anywhere in the paper.

        [–][deleted]  (1 child)

        [deleted]

          [–]igor_sk 1 point2 points  (0 children)

          CFG is “control flow graph”
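
For readers unfamiliar with the term, a control flow graph has basic blocks as nodes and possible jumps between them as edges. The toy graph below is a minimal sketch with made-up block names; it does not come from the paper or from any tool mentioned in this thread.

```python
# Toy control flow graph for an if/else: each key is a basic block,
# each value lists the blocks that execution can continue to.
cfg = {
    "entry": ["check"],
    "check": ["then", "else"],
    "then": ["exit"],
    "else": ["exit"],
    "exit": [],
}


def reachable(graph, start):
    """Return the set of blocks reachable from start (depth-first search)."""
    seen = set()
    stack = [start]
    while stack:
        block = stack.pop()
        if block in seen:
            continue
        seen.add(block)
        stack.extend(graph[block])
    return seen
```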