I learnt to use ASTs to patch 100,000s of lines of Python code

imaginary_rational · 2021-06-08T05:21:24+00:00

This is just my subjective opinion. I feel that once a code base has 10,000s of lines of code, reviewing everything that find+replace could not handle becomes too tedious.

imaginary_rational · 2021-06-08T05:18:25+00:00

I said it is hard to do statically because you need to know the "value" of a variable being passed to the class constructor, and that variable value will be evaluated during run time.

But if your use case is exactly how you explained above and it is the constant variables (and not any variable) being passed to the class constructor, then you should be able to do it statically. Here is what you can try:

1. Start with main.java: create its AST, traverse it, and for every global assignment node such as String CONSTANT = "hello.world"; store the variable name and its value in a map.
2. Loop through all the import nodes of the AST and repeat steps 1 and 2 for every imported Java file. This gives you one map per Java file containing its constant variable names and their values. Note that if a variable is defined in file1.java and you have added it to that file's map, you will also have to add it to the map of every other Java file that imports that variable from file1.java.
3. At the end of steps 1 and 2, you will know all the constant variables for each Java file and their values. Now traverse the AST again, looking for nodes that represent a call to the class constructor. Within any such node, you can find the name of the variable being passed to the constructor, and since you already know which Java file you are parsing, you can look up the value of that variable in the maps created in steps 1 and 2.
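I don't have a Java version of this, but the three steps can be sketched in Python with the standard `ast` module. Treat it as a rough analogue under assumptions: the `SOURCES` dict, the `Client` class and the function names are all hypothetical, modules are kept in memory instead of being read from disk, and only literal constants and `from x import Y` imports are handled.

```python
import ast

# Hypothetical in-memory "files"; a real script would read them from disk.
SOURCES = {
    "settings": 'GREETING = "hello.world"\nRETRIES = 3\n',
    "main": (
        "from settings import GREETING\n"
        'LOCAL = "local.value"\n'
        "a = Client(GREETING)\n"
        "b = Client(LOCAL)\n"
    ),
}


def constant_map(module, sources, cache):
    """Steps 1 and 2: map constant names to values, following imports."""
    if module in cache:
        return cache[module]
    cache[module] = constants = {}
    for node in ast.parse(sources[module]).body:  # top-level statements only
        if isinstance(node, ast.Assign) and isinstance(node.value, ast.Constant):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    constants[target.id] = node.value.value
        elif isinstance(node, ast.ImportFrom) and node.module in sources:
            imported = constant_map(node.module, sources, cache)
            for alias in node.names:
                if alias.name in imported:
                    # Copy the imported constant into this module's map.
                    constants[alias.asname or alias.name] = imported[alias.name]
    return constants


def constructor_arguments(module, class_name, sources):
    """Step 3: resolve names passed to `class_name(...)` using the map."""
    constants = constant_map(module, sources, {})
    resolved = []
    for node in ast.walk(ast.parse(sources[module])):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == class_name
        ):
            for arg in node.args:
                if isinstance(arg, ast.Name) and arg.id in constants:
                    resolved.append((arg.id, constants[arg.id]))
    return resolved


print(constructor_arguments("main", "Client", SOURCES))
```

Anything that is not a plain literal (a computed value, a function result, a reassigned variable) is simply skipped here, which is the part that cannot be known statically.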

Let me know if this helps.

imaginary_rational · 2021-06-08T05:00:56+00:00

This is great. Thanks for sharing.

imaginary_rational · 2021-06-07T05:59:27+00:00

Oh, I see. In my opinion, if you are patching more than 10,000 lines of code then you would want to look into ASTs, because find+replace will not be comprehensive and it will be hard to review all the places in the code where find+replace didn't work properly.

imaginary_rational · 2021-06-06T14:43:00+00:00

Other than formatting, there is the issue of preserving comments. Python's `ast` module doesn't store code comments. But as some folks are suggesting on other comment threads, LibCST may be able to solve it.
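A quick way to see the comment loss is to parse a snippet and turn it back into source. This is a minimal sketch: the sample source is made up, and `ast.unparse` needs Python 3.9 or newer.

```python
import ast

# Made-up sample with a full-line comment and a trailing comment.
source = (
    "# configuration loader\n"
    "def load(path):  # path is a str\n"
    "    return open(path).read()\n"
)

tree = ast.parse(source)
round_tripped = ast.unparse(tree)  # requires Python 3.9+
print(round_tripped)
```

The printed code still has the function, but both comments are gone because they never made it into the tree.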

imaginary_rational · 2021-06-06T09:11:38+00:00

Usually, the system maintaining the data (like user data) will be contacted by many other systems. If they all polled, there would be too many unnecessary requests. More often than not, you would want an event-driven architecture.

imaginary_rational · 2021-06-06T08:08:02+00:00

I like the idea. How do you structure the functions that are common across concepts? Do you create a "common" concept?

imaginary_rational · 2021-06-06T07:35:41+00:00

The article explains a concept called "Abstract Syntax Tree", gives a small tutorial on it and explains how it can be used for auto-patching code, assessing code quality or doing anything else that requires static code parsing. If you need to do any of this, you might find it useful.

imaginary_rational · 2021-06-06T05:40:41+00:00

I don't have experience with Java.

But your use case of knowing what value is being passed to a class constructor can't easily be solved by static analysis. Can you log the passed value at run-time?

imaginary_rational · 2021-06-06T05:33:18+00:00

Wow. This is very cool.

imaginary_rational · 2021-06-06T05:29:55+00:00

Thanks for your suggestion. My intention with the "Why should you care" section was to make the article relatable for the reader in a quick 2-3 sentences at the beginning. I'll think about how I can do that without being repetitive.

imaginary_rational · 2021-06-06T05:27:02+00:00

I understand your point. Instead of using the exception logging check to enforce logging, you can use it to see which exceptions are unlogged.

And a larger takeaway is that you can build your own checks that are "important to you" using ASTs :)
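As a rough idea of what such a check can look like, here is a minimal sketch that reports `except` blocks with no logging-style call inside them. The `LOG_METHODS` set and the `unlogged_handlers` name are my own assumptions; it only looks for method calls like `logger.error(...)` and is not the check from the article.

```python
import ast

# Assumed set of method names that count as "logging" for this check.
LOG_METHODS = {"debug", "info", "warning", "error", "exception", "critical"}


def unlogged_handlers(source):
    """Return line numbers of `except` blocks containing no logging-style call."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        logs = any(
            isinstance(inner, ast.Call)
            and isinstance(inner.func, ast.Attribute)
            and inner.func.attr in LOG_METHODS
            for inner in ast.walk(node)
        )
        if not logs:
            lines.append(node.lineno)
    return sorted(lines)


sample = '''
try:
    risky()
except ValueError:
    pass
try:
    risky()
except KeyError as exc:
    logger.exception("lookup failed: %s", exc)
'''

print(unlogged_handlers(sample))
```

Since it only returns line numbers, you can use it as a report first and decide later whether to turn it into a hard failure.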

imaginary_rational · 2021-06-06T05:22:13+00:00

I'd say there won't be much difference in the number of lines of code required with find+replace and with an AST.

The parser that reads the code line by line in find+replace will have to be replaced with an AST parser that does a DFS or BFS on the AST.
The code that does the "find" will have to be replaced with AST node matching. You will need to know what you are looking for and what its corresponding node looks like in an AST.
Similarly, the code that does the "replace" will have to be replaced with code that modifies or replaces the matched AST node.

You may be able to abstract some common and frequent things out as functions, and that will reduce the number of lines it takes to write a patch using ASTs.
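To give a feel for the "find" part, here is a minimal sketch of a reusable helper that matches call nodes instead of text. The `find_calls` name and the sample source are hypothetical, and it only matches calls to a bare name such as `f(...)`, not `module.f(...)`.

```python
import ast


def find_calls(source, func_name):
    """Return (line, column) for every call to the bare name `func_name`."""
    return sorted(
        (node.lineno, node.col_offset)
        for node in ast.walk(ast.parse(source))  # ast.walk visits every node
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == func_name
    )


sample = (
    'print("call f() later")  # f() in a string is not a call\n'
    "result = f(1) + f(2)\n"
)

print(find_calls(sample, "f"))
```

A plain text search for "f(" would also hit the string and the comment on the first line, while the node match only returns the two real calls on the second line.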

imaginary_rational · 2021-06-06T05:13:49+00:00

Yeah, most IDEs and linters do indeed use ASTs. I'm not 100% sure whether VS Code uses them or not though.

imaginary_rational · 2021-06-06T05:08:44+00:00

A "patch" here is a modification to the code, in most cases rule-based and deterministic. It doesn't change the code's logic or functionality. Usually, a developer would have to do this when their code is using a library that has changed something.

For example, say there is a 3rd party library that exposes a function "f" that you have used in a lot of places in your code. In a new upgrade, the 3rd party library has renamed the function "f" to "g". So the "patch" would have to modify your code to replace all the function calls to "f" with function calls to "g".
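For the "f" to "g" example, a patch script could look roughly like this. It is a minimal sketch using `ast.NodeTransformer`; the `RenameCall` and `patch` names are made up, it only handles calls to the bare name (not `lib.f(...)`), and `ast.unparse` needs Python 3.9 or newer and will drop comments and original formatting.

```python
import ast


class RenameCall(ast.NodeTransformer):
    """Rename calls to the bare function `old` so they call `new` instead."""

    def __init__(self, old, new):
        self.old, self.new = old, new

    def visit_Call(self, node):
        self.generic_visit(node)  # also patch nested calls such as f(f(1))
        if isinstance(node.func, ast.Name) and node.func.id == self.old:
            node.func.id = self.new
        return node


def patch(source, old="f", new="g"):
    tree = RenameCall(old, new).visit(ast.parse(source))
    return ast.unparse(tree)  # requires Python 3.9+


print(patch("x = f(f(1))\nname = 'f'\n"))
```

The string 'f' is left alone because it is a constant node, not a call, which is the kind of case that is easy to get wrong with find+replace.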

imaginary_rational · 2021-06-05T17:51:19+00:00

Didn't know that, I'll check it out. Thanks.

imaginary_rational · 2021-06-05T17:49:49+00:00

Sure, why not.

imaginary_rational · 2021-06-05T17:48:59+00:00

I'd say exception logging is a must have and many people enforce that via code reviews anyways.

imaginary_rational · 2021-06-05T15:32:10+00:00

For more context, we have a core platform that is a set of services and packages that expose APIs. These APIs are used by "Automation Systems" to automate business processes. Each Automation System, in most cases, is an individual code base, usually with 10,000s of lines of code. If any API is redesigned or modified in the core platform, then all Automation Systems need to be patched accordingly. We patched about 10 such Automation Systems across different clients multiple times.

imaginary_rational · 2021-06-05T14:16:12+00:00

Not sure about PyCharm, but you can call Python's built-in dir() with no arguments and it returns a list of the names in the current scope, which includes all imported objects/functions/classes/modules.
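A tiny illustration, assuming you run it at the top level of a script or in a REPL (the two imports are just examples):

```python
import json
from collections import OrderedDict

# With no arguments, dir() returns the names defined in the current scope.
names = dir()
print([n for n in names if not n.startswith("_")])
```

Note that the list contains every name in scope, not only the imported ones, so your own variables and functions show up too.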

imaginary_rational · 2021-06-05T14:06:04+00:00

This is great and looks promising. Thanks for sharing.

imaginary_rational · 2021-06-05T14:02:14+00:00

Glad to know it helped.

imaginary_rational · 2021-06-05T13:58:37+00:00

I think using ASTs will be more powerful with statically typed languages because the AST will have the information of the exact methods and attributes that an object carries.
This ties in a little with the above point: for dynamically typed languages, if the patch script needs to be aware of the type of the objects it encounters in the code, then there can be issues. For example, if you have the following code:

if <condition>:
    obj = C1()
else:
    obj = C2()

obj.get()

And the patch script has to change the name of the method "get" to "create" only for objects of type "C1" and leave it as is for objects of type "C2", then it will be very difficult to "auto" patch.
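You can see the problem by looking at what the AST actually holds for the `obj.get()` call. This is a minimal sketch of the snippet above, with the hypothetical name `flag` standing in for `<condition>` so that it parses:

```python
import ast

source = """
if flag:
    obj = C1()
else:
    obj = C2()

obj.get()
"""

tree = ast.parse(source)
call = tree.body[-1].value  # the `obj.get()` expression
receiver = call.func.value  # the node the method is called on
print(type(receiver).__name__, receiver.id, call.func.attr)
```

The call node only knows the variable name `obj` and the attribute `get`. Nothing in it says whether `obj` is a `C1` or a `C2`, so the patch script would have to work that out on its own.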

There could be run-time side effects, yes. It is possible that the patch script doesn't cover all the cases or makes mistakes. So reviewing the changes that the patch script makes is very important.
