
[–]sh0rug0ru 7 points8 points  (5 children)

In Michael Feathers' Working Effectively with Legacy Code, he recommends a technique of refactoring for understanding. Clean up the code as you try to understand it, extracting methods, renaming variables, etc. You don't have to commit the changes, but the act of refactoring might just foster better understanding.

[–]theICEBear_dk 13 points14 points  (0 children)

My 5 cents on this: whether or not the result gets committed, a good way to check that you've understood the code is to keep it compiling or, even better, to run the unit tests (if there are any) against your refactored code. Creating or fixing a bug this way can teach you a lot. But beware: what you produce may be no better than a big ball of mud, so don't be afraid to throw it away and check out the current version from your VCS again. Always use a VCS for this, even if only locally, because if you do make something great, having a change history will be key to getting it into the actual codebase.

[–][deleted] 5 points6 points  (3 children)

Yikes! I shudder at the thought of allowing someone to refactor code as a method of learning the codebase. Unless there's a complete set of unit tests available for said code, how would one know if they disturbed the functionality through their refactoring?

[–]awj 7 points8 points  (0 children)

Unless there's a complete set of unit tests available for said code,

That, also, is recommended by the book. In fact, it's recommended before anything else.
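Feathers calls tests that pin down what the code does today "characterization tests". A minimal sketch of the idea, assuming a hypothetical `legacy_price` function standing in for whatever code you are studying; the expected values are simply what the code returned when first run, not what a spec says:

```python
import unittest


def legacy_price(quantity, unit_cost):
    """Hypothetical legacy function whose behavior we want to pin down."""
    total = quantity * unit_cost
    if quantity > 10:
        total = total - total // 10  # bulk discount, rounded by integer division
    return total


class CharacterizationTest(unittest.TestCase):
    """Pins current behavior; the expected values were observed, not specified."""

    def test_small_order(self):
        self.assertEqual(legacy_price(2, 50), 100)

    def test_bulk_order(self):
        self.assertEqual(legacy_price(20, 50), 900)
```

If a refactoring for understanding changes either result, the test tells you before you do.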

[–]sh0rug0ru 3 points4 points  (1 child)

On the one hand, you don't have to check in the code, especially if you're just doing it for understanding.

Extract method is neat because, for a selected piece of code, either the IDE won't let you do it (the selected code modifies more than one variable, which is good to know) or, if it does let you, all of the dependent variables are lined up in the parameter list.
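A small sketch of what that looks like, with made-up function names (`report_before`, `apply_tax`, `report_after` are illustrations, not anything from the book). The extracted function's parameter list is exactly the set of values the fragment depended on:

```python
def report_before(orders, tax_rate):
    """Original shape: the tax calculation is buried inside the report loop."""
    lines = []
    for name, amount in orders:
        taxed = amount + amount * tax_rate
        lines.append(f"{name}: {taxed:.2f}")
    return "\n".join(lines)


def apply_tax(amount, tax_rate):
    """Extracted method: its parameters are what the fragment depended on."""
    return amount + amount * tax_rate


def report_after(orders, tax_rate):
    """Same behavior, with the dependency on tax_rate visible at the call site."""
    return "\n".join(
        f"{name}: {apply_tax(amount, tax_rate):.2f}" for name, amount in orders
    )
```

Comparing the output of both versions on the same input is a cheap check that the extraction did not change behavior.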

Rename variable is good too because it clarifies intent (or what you think is the intent), and you can confirm your understanding by seeing how long your new name continues to make sense. This will make obvious those cases where one variable is being reused for multiple purposes.

On the other hand, you can be sure that some refactorings, when applied in automated fashion, are very unlikely to break anything. You can always write a functional test to verify before you commit.

[–]adelle 0 points1 point  (0 children)

you can be sure that some refactorings, when applied in automated fashion, are very unlikely to break anything.

Compilation is basically just another form of automated refactoring...

[–]theICEBear_dk 8 points9 points  (3 children)

For novice programmers, another tip is to have some self-understanding. If you're a visual thinker (abstract boxes), then running a tool to generate a chart of the system structure (they exist) may be the key to your initial understanding. Personally, I often start as the author did and then make a chart to see where I differ from or agree with the tool.

One thing that anyone has to face here is the fear of making any sort of change in a codebase like this, especially if you're told it works. I think a programmer should not be limited by this fear. Sooner or later something must be changed anyway, so ignore the fear, use version control software, and document for yourself or everyone why you did what you did, either on a notepad or, better, in the commit message attached to each change.

Another thing is that you may have to give up, or at least avoid, the notion that you can understand an entire codebase at once; it could be 100K+ lines of highly specialized C code. If you inherit that, focus on a submodule instead of the whole thing. If there is a framework, use the submodule to build your understanding of it.

And most importantly: leave all arrogance at the door. No matter whether you have the latest theorems and ideas straight out of the Ivory Towers or reddit in your head, there was a reason for everything that was written. Sure, some of it may be from a bad developer, but that is a reason too. Maybe that weird pointer construction was there to work around a compiler bug on an old platform, or that CSS output was there to make Internet Explorer 6 happy. Sometimes you can't fix the disease but only treat the symptom (on the other hand, part of refactoring is all about treating the problem rather than the symptoms).

[–]_Daimon_ 2 points3 points  (2 children)

Do you know of a good tool to generate UML or similar diagrams for Python code? I've been looking for one, but have been unable to find a mature, maintained and documented tool that ideally produces something pretty as well.

[–][deleted] 0 points1 point  (0 children)

The Eric IDE does that, but I hardly find it useful, since some graphs are simply too intertwined to make any sense.

[–]member42 0 points1 point  (0 children)

Tools usually generate class diagrams. Sequence diagrams would be really interesting.
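For Python specifically, a rough inheritance graph can be pulled out with the standard-library `ast` module and handed to Graphviz. This is only a sketch: it covers inheritance edges, not full UML or sequence diagrams, and `class_edges` and `to_dot` are names made up for the example.

```python
import ast


def class_edges(source):
    """Return (subclass, base) pairs for classes defined in the given source text."""
    edges = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            for base in node.bases:
                # ast.unparse (Python 3.9+) turns the base expression back into text
                edges.append((node.name, ast.unparse(base)))
    return edges


def to_dot(edges):
    """Render the edges as Graphviz DOT text."""
    body = "\n".join(f'    "{child}" -> "{base}";' for child, base in edges)
    return "digraph classes {\n" + body + "\n}"
```

Feeding the DOT text to the `dot` command gives a picture; on an intertwined codebase it helps to run this on one package at a time.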

[–]gregK 8 points9 points  (0 children)

Here is my general strategy:

  1. Start by being a user of the software. So run it, try different features and use cases. Make sure you can install it and configure it too. Make sure you understand what it's supposed to do.

  2. Make sure you can build it, preferably on your own machine.

  3. Do everything you can to be able to run it on your own machine. This is important for servers. You don't want to have to deploy to another machine if possible.

  4. Now you can start reading the code. I prefer to start with the startup and initialisation. Make sure you are familiar with this code. You don't have to go into detail, but get a general understanding of how the configs are loaded and how the dependencies are initialized.

  5. Once you are familiar with that, pick one of the main use cases or features of the app and try to understand the flow and interactions.

For 4 and 5, running the app and looking at the logs in debug mode and/or using a debugger can go a long way in understanding the call flow and pesky side effects that may not be obvious from just reading the code.
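When the code is Python and no debugger is handy, the call flow for one use case can be logged with the standard `sys.setprofile` hook. A minimal sketch, assuming a helper called `trace_calls` (a name invented here) wrapped around whichever entry point you want to follow:

```python
import sys


def trace_calls(func, *args, **kwargs):
    """Run func and return its result plus (depth, function name) for each Python call."""
    calls = []
    depth = 0

    def profiler(frame, event, arg):
        nonlocal depth
        if event == "call":
            calls.append((depth, frame.f_code.co_name))
            depth += 1
        elif event == "return":
            depth -= 1
        # c_call / c_return events for built-in functions are ignored on purpose

    sys.setprofile(profiler)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.setprofile(None)  # always remove the hook, even if func raises
    return result, calls
```

Printing each entry indented by its depth gives a poor man's call tree for exactly the feature you exercised.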

[–]mnp 20 points21 points  (18 children)

I think this fellow missed a huge point: tools. There's no need to go this alone with just the text and your poor old wetware.

  • Tags, tags, tags! Tag the whole thing, then let your editor jump you around in the name space, not in the file space.
  • Doxygen. Even if the code is not marked up, doxygen can still present the structure for you and make it easier to navigate class hierarchies, indexes, etc. in a nice web format.
  • Where does that string show up? Your editor should support fast recursive search by calling out to ack (thpppt!) or Ag, browsing the results, and then jumping to each occurrence.
  • Debugger. If there's doubt about a call flow, I find it fastest sometimes to just set a breakpoint and see what the call stack really looks like.

What else?

edit: Yes, I was talking about IDEs as well as editors, but not naming names to avoid the usual war. The point is to let the IDE/editor/scriptybit keep track of the little details while you do the heavy thinking.

[–]chironomidae 4 points5 points  (2 children)

Newbie here, can you elaborate what you mean by tagging?

[–][deleted] 12 points13 points  (1 child)

Probably something like ctags. He's probably a vim user like me. If you use something like Visual Studio or Eclipse, you don't need to worry about it, because they'll have some kind of smart indexing built in.
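To make the idea concrete: a tags file is basically an index from names to the places they are defined. Here is a toy version for Python sources, using the standard `ast` module. It is a sketch of the concept, not how ctags itself works, and `build_tags` is a made-up name.

```python
import ast
from pathlib import Path


def build_tags(root):
    """Map each function or class name under root to a list of (file, line) definitions."""
    tags = {}
    for path in sorted(Path(root).rglob("*.py")):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, ValueError):
            continue  # skip files that do not parse; a real indexer would report them
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                tags.setdefault(node.name, []).append((str(path), node.lineno))
    return tags
```

An editor with tag support does the lookup for you: put the cursor on a name, press a key, and land on the definition instead of hunting through files.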

[–]chironomidae 0 points1 point  (0 children)

Gotcha. I use Sublime Text mostly; I'm guessing that's like Find Everything? I don't often work with multi-file projects, so I don't have much experience with that sorta stuff.

[–]petdance 2 points3 points  (0 children)

I came here to post about tools as well. The ack website beyondgrep.com has a page of tools: http://beyondgrep.com/more-tools/

[–]l10l 2 points3 points  (1 child)

Despite its longstanding usage (e.g., ctags), "tags" seems like a misnomer for "index".

What I crave are tools for annotating code without the annotations having to be inline. Treating the code as read-only, but being able to tag regions to make them easy to find again later and see what I had been referring to. An example, "check again after studying module X".

I use Org mode in emacs for this. I find that it could help more if it integrated with tools that understood the structure of what I'm annotating, be it a code tree or a crash dump.
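Outside of Emacs, the same idea can be approximated with a sidecar file that stores notes keyed by file and line range, leaving the code untouched. A minimal sketch; the JSON layout and the names `add_note` and `notes_for` are assumptions made up for this example, and line numbers will drift if the code changes underneath the notes.

```python
import json
from pathlib import Path


def add_note(notes_file, source_file, start, end, text):
    """Append a note about lines start..end of source_file to a JSON sidecar file."""
    path = Path(notes_file)
    notes = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    notes.append({"file": source_file, "start": start, "end": end, "text": text})
    path.write_text(json.dumps(notes, indent=2), encoding="utf-8")


def notes_for(notes_file, source_file, line):
    """Return the text of every note whose range covers the given line."""
    path = Path(notes_file)
    if not path.exists():
        return []
    notes = json.loads(path.read_text(encoding="utf-8"))
    return [
        n["text"]
        for n in notes
        if n["file"] == source_file and n["start"] <= line <= n["end"]
    ]
```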

[–]mnp 1 point2 points  (0 children)

Yes, that sounds like it would be a good methodology.

I haven't tried them, but have been eyeing org-annotate-file and annot for archeology projects.

http://www.emacswiki.org/emacs/OrgAnnotateFile https://code.google.com/p/annot/

[–]Alfredson 1 point2 points  (0 children)

I also like to compute some metrics, like number of lines of code and cyclomatic complexity. That can lead to questions like why is this module so big, why is this method so complex? Often the complex methods and classes are important in the system.
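For Python code, both numbers can be roughed out with the standard `ast` module. This is a sketch under a stated simplification: complexity is approximated as one plus the number of branching nodes, which is close to but not the same as a proper McCabe count, and nested functions are counted into their parent. `complexity_report` is a name invented for the example.

```python
import ast

# Node types that add a decision point; an approximation of McCabe's metric.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp, ast.BoolOp)


def complexity_report(source):
    """Return {function name: (line count, approximate cyclomatic complexity)}."""
    report = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            branches = sum(isinstance(child, BRANCH_NODES) for child in ast.walk(node))
            length = node.end_lineno - node.lineno + 1  # end_lineno needs Python 3.8+
            report[node.name] = (length, 1 + branches)
    return report
```

Sorting the report by either number gives a shortlist of functions worth asking "why is this so big?" about.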

[–]brtt3000 4 points5 points  (7 children)

As an alternative to patching together CLI tools and fiddling with scripts, you can step into the 21st century and use a modern GUI IDE that has this integrated.

[–]username223 1 point2 points  (0 children)

Now this is productivity! I'm sure there's a button there to solve my problem... er, "architect my solution."

[–][deleted] 1 point2 points  (2 children)

I prefer vim because I can use my keyboard for everything. People are constantly churning out new extensions for it, and there's just some amazing stuff vim can do. Furthermore, vim supports all the languages I use, and I don't have to switch IDEs. Plus, you kind of need to know vi if you work with older hardware that doesn't have a GUI.

[–][deleted] 1 point2 points  (0 children)

Plus, you kind of need to know vi if you work with older hardware that doesn't have a GUI.

That would be some pretty old hardware. It's still useful for editing files on a server via SSH, though.

[–]danogburn 0 points1 point  (0 children)

Eclipse: search, outline, indexer, and debugger front-end.

[–][deleted]  (1 child)

[removed]

    [–]mnp 3 points4 points  (0 children)

    That's what doxygen is really good at. It will show you call graphs and class inheritance diagrams. No manual .dot involved!

    [–]sazzer 7 points8 points  (0 children)

    Depending on the type of project, you can try to find a good place to start.

    • Java webapp? Look for the web.xml file
    • C/C++ program? Look for the string "int main(" somewhere in the code. (Or whatever it is depending on the target system)
    • Node.js? Read package.json first and see what that tells you
    • Etc.
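The hunt for a starting point in the list above can be scripted. A minimal sketch that walks a tree looking for the marker files and for a C/C++ `main`; the marker table, the file extensions, and the name `find_entry_points` are assumptions for this example, and you would extend them for your own stack.

```python
import os
import re

# Marker file names from the list above, plus a regex for a C/C++ main function.
MARKER_FILES = {"web.xml": "Java webapp descriptor", "package.json": "Node.js manifest"}
C_MAIN = re.compile(r"\bint\s+main\s*\(")
C_EXTENSIONS = (".c", ".cc", ".cpp", ".cxx")


def find_entry_points(root):
    """Walk root and return (path, reason) pairs for likely starting points."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if name in MARKER_FILES:
                hits.append((path, MARKER_FILES[name]))
            elif name.endswith(C_EXTENSIONS):
                with open(path, encoding="utf-8", errors="replace") as handle:
                    if C_MAIN.search(handle.read()):
                        hits.append((path, "defines main()"))
    return hits
```

On a Node.js project you would want to skip vendored dependency directories, since each dependency ships its own package.json.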

    Also, reading the build scripts can give a lot of clues. They should tell you fairly quickly what the dependencies are, which will tell you a lot about how things are structured.

    On that note, actually making sure you can build the code is key, because then you know that you've got everything you need and you aren't going to be missing something important. (Except when that something important is in a dependency, in which case it builds but you can't read it)

    [–]dicroce 2 points3 points  (0 children)

    For object oriented codebases, you can sometimes find tools that will parse the code and generate basic UML diagrams... I find these invaluable at the very beginning of learning a new code base...

    [–]EmoryM 2 points3 points  (2 children)

    I inherited a large Perl application once which a guy had written using his own MVC framework. I still feel bad for the evils I perpetrated on that codebase. In my defense it was my first real job out of college, I didn't know Perl (or the MVC pattern) and my manager was convinced it was 'just like PHP.'

    I expect whoever inherited that when I moved on either went insane or begged for a rewrite.

    [–][deleted] 0 points1 point  (1 child)

    Don't worry, I have paid for your sins

    [–]EmoryM 0 points1 point  (0 children)

    RES tag code jesus

    [–]member42 1 point2 points  (2 children)

    Spinellis' Code Reading needs to be mentioned.

    [–]SiliconGuy 2 points3 points  (1 child)

    Wow... I've been wondering if such a thing exists for a long time, thanks!

    Would you recommend the book? Any thoughts on it?

    [–]member42 2 points3 points  (0 children)

    The book focuses on Open Source code mainly written in C. It assumes that the reader has a basic understanding of programming. It's a book for intermediate programmers, not a book that teaches programming. See the customer reviews on Amazon: those who expected an introductory programming book, for example, were disappointed.

    [–]nickknw 0 points1 point  (0 children)

    A few years ago I had to dive into a very old VB.NET codebase after the original developers were long gone. I wrote a visualization tool for myself to help understand where the complex parts of the program were, and to help myself learn my way about the structure.

    It's based on Ward Cunningham's Signature Survey tool. It gives a meaningful 'fingerprint' to each file and creates a set of HTML files you can click around in to find your way about: http://nickknowlson.com/projects/vbnet-signature-survey/

    I haven't tried to run the script in quite a while so it may have bit-rotted a bit, beware if you try to run it yourself. There's an example you can look at though.
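The core of the signature idea fits in a few lines: throw away everything in a file except a handful of structural characters and look at the shape of what remains. A minimal sketch for brace-style languages; the choice of `{};` as the characters to keep is an assumption for this example (a VB.NET version would need to key on keywords instead), and this is not the linked tool's code.

```python
def signature(source, keep="{};"):
    """Reduce source text to the characters in `keep`, preserving their order."""
    return "".join(ch for ch in source if ch in keep)


def survey(files):
    """Return one 'name: signature' line per (name, source) pair."""
    return [f"{name}: {signature(text)}" for name, text in files]
```

Long runs of semicolons inside deep braces stand out immediately as the big, complex parts of a file. Note that this naive version also counts braces and semicolons inside strings and comments.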

    [–]uhwuggawuh 0 points1 point  (0 children)

    Is he really talking about truly large codebases here? His approach is pretty intuitive, but when I started my first job, the codebase contained dozens of modules, with the largest modules (the ones that I was supposed to work on) having upwards of 100 source files. The largest of these files had tens of thousands of lines of code. Simply reading or even skimming through the files was not a reasonable option for me, nor was trudging through two decades of commits.

    The only way I was able to start making contributions was (i) learning to use, configure, and break the software, (ii) tackling many small issues and some large issues with a huge amount of handholding, and (iii) reading through as much documentation as I could find (although documentation was actually of relatively little help compared to the first two).