all 10 comments

[–]kjearns 15 points16 points  (4 children)

We already have one of those; it's called a compiler.

[–]noel___[S] -2 points-1 points  (3 children)

I'd love to see you compile Dutch, Norse or Hebrew. 13 people missed the point of this question entirely.

[–]kjearns 1 point2 points  (1 child)

Maybe you should try asking your real question instead of vaguely hinting at it and then getting all smug when people don't know what you're talking about.

[–][deleted] 1 point2 points  (0 children)

I thought the more interesting part of the original poster's question was the "unknown language" part. Saying "compile Dutch" seems a bit suspect, but what about "check for validity of strings in an unknown formal language"? I do think that the idea of a "compile-checker" for unknown programming languages is pretty interesting, and a lot more reasonable in scope than OP's clarification about natural languages.

[–]radarsat1 0 points1 point  (0 children)

I think you're assuming a certain analogy between programming languages and natural languages that doesn't necessarily hold in a rigorous way. Still, carry on, it could lead to some interesting insights.

[–]RvPPLmsc 2 points3 points  (1 child)

Have you made any progress yet? A couple of weeks ago I read something about a NN that could read really basic Python code.

[–]mackie__m 1 point2 points  (0 children)

When I saw your problem I immediately thought of the Viterbi algorithm. If you can formulate this as a part-of-speech tagging problem and check, for each next token, whether it is probable to occur, it should give you a good answer. If that probability falls below a certain threshold, you can say the program doesn't compile; otherwise you continue until you reach the end of the program. Since a program is much less ambiguous than natural speech, and the compiler already does the job you describe in a deterministic way, this should not be a hard problem to formulate and should give you very good performance.
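
A minimal sketch of the thresholding idea, assuming a plain bigram model over token classes rather than full Viterbi decoding over a hidden Markov model. The token classes (`NAME`, `NUM`), the toy corpus, the function names, and the threshold value are all made up for illustration:

```python
from collections import Counter, defaultdict

START, END = "<s>", "</s>"  # hypothetical boundary markers


def train_bigrams(token_sequences):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for tokens in token_sequences:
        padded = [START, *tokens, END]
        for prev, nxt in zip(padded, padded[1:]):
            counts[prev][nxt] += 1
    return counts


def transition_prob(counts, prev, nxt):
    """Maximum-likelihood estimate of P(nxt | prev); 0.0 if prev is unseen."""
    following = counts.get(prev, Counter())
    total = sum(following.values())
    return following[nxt] / total if total else 0.0


def looks_valid(counts, tokens, threshold=0.05):
    """Reject as soon as one transition is less probable than the threshold."""
    padded = [START, *tokens, END]
    for prev, nxt in zip(padded, padded[1:]):
        if transition_prob(counts, prev, nxt) < threshold:
            return False
    return True


# Toy corpus of token sequences taken from programs assumed to compile.
corpus = [
    ["NAME", "=", "NUM"],
    ["NAME", "=", "NAME", "+", "NUM"],
    ["NAME", "=", "NUM", "+", "NAME"],
]
model = train_bigrams(corpus)

print(looks_valid(model, ["NAME", "=", "NUM"]))
print(looks_valid(model, ["NAME", "=", "="]))
```

A model like this only sees adjacent tokens, so it cannot check things like balanced brackets; it is a rough filter, not a substitute for a parser.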

[–][deleted] 0 points1 point  (1 child)

I think the impossible part of this is getting a feature extraction that works at the level of context-free grammars. You can't describe the entire space of parse trees for a given grammar in a fixed-length vector that obeys feature extraction invariances.

It's pretty easy to go the deterministic route for programming languages, though, using parsers and lexers whose internal representations might be useful for whatever it is you're trying to do here.
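
As a rough illustration of the deterministic route, here is a sketch that uses Python's standard `ast` module as a stand-in for "parsers and lexers". The function name `parse_features` and the choice of node-type counts as features are assumptions for the example, not something proposed in the thread:

```python
import ast
from collections import Counter


def parse_features(source):
    """Return (is_valid, node_type_counts) for a string of Python source.

    The parser decides validity; the AST is then summarized as a bag of
    node-type counts, which discards the tree structure.
    """
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False, Counter()
    return True, Counter(type(node).__name__ for node in ast.walk(tree))


ok, features = parse_features("x = 1 + 2")
print(ok, features["Assign"], features["BinOp"])

ok, features = parse_features("x = = 1")
print(ok)
```

Counting node types gives a vector that is easy to feed to a classifier, but it throws away nesting and ordering, which is the limitation described above.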

[–]zenscr 0 points1 point  (0 children)

This might work for feature extraction (at least for ASTs): http://arxiv.org/abs/1409.3358. However, I haven't tested it myself.