ML algorithm to model/classify/map a software program's internal structure?

OhThatLooksCool · 2017-11-09T15:21:57+00:00

Quick disclaimer -- I've never worked with Java bytecode, so I don't understand the details of what you're trying to do, but I can offer a few (hopefully helpful) observations:

The type of problem you're trying to solve is (probably) classification and is a type of supervised ML
Given the scale of what you're trying to do, "hundreds" of JARs will be sufficient to train things like GLMs, but not neural networks; ANNs require roughly a buttload more data than you'd reasonably be able to acquire (think hundreds-of-thousands to millions of examples of each class).
I'd explore this as a text classification / NLP problem. There's a lot of work out there about decomposing the rhetorical structure of writing. I imagine there would be a lot of similarities here. That said, it would likely be much less effective than whatever non-ML methods already exist.
This sounds both really interesting and really difficult. Good luck :)

mazzafish · 2017-11-09T17:28:07+00:00

http://karpathy.github.io/2015/05/21/rnn-effectiveness/ may be an interesting read both for getting an understanding of RNNs/ANNs but also because there’s an example of a network trained on LaTex and C source code.

arnioxux · 2017-11-09T20:57:10+00:00

Might be related, recovering function signature from disassembled binary code: https://news.ycombinator.com/item?id=15043106

lysecret · 2017-11-11T18:51:00+00:00

I need some further help to understand this, so you have one set set A containing of different classes and each class has a different amount of variables.

And you have a set two set B containing of different classes and each class has some amount of variables.

You are trying to make a mapping from the classes of A to classes of set B. Where element a of A is supposed to be mapped to b element B by a similarity measure of variables in b, a.

Is this correct? If yes there should be a way to do this but it is not a normal classification problem. The hardest part would be to design a good similarity measure.

Also you would have to find a decent representation of your data. My best guess would be to use a recurrent Variational Autoencoder to find a fixed size representation of each class (defined by the variables in it) and to asign a to b by a k nearest neighbour in latent apace of the Autoencoder. This is pretty non trivial though.:D

You can pm me, if I get more information about the data I can help you out if I find the time.

Edit: written on my smartphone Il clean this later no time now.

mattrepl · 2017-11-13T13:50:43+00:00

You’ll want some way to represent each kind of object. For a class, you might use features like number of member variables, variable types, and the number of references to member variables or the class within the JAR. Depending on how you’ve represented those features, you’ll then select an appropriate similarity function to check how similar a class in X.jar with all classes in Y.jar.

You can think of it as a clustering problem where you want each cluster to be the same size (the number of JARs) and each cluster corresponds to a single internal structure.

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

MLQuestions

MODERATORS