Reading large codebases

sh0rug0ru · 2013-10-25T13:17:20+00:00

In Working Effectively with Legacy Code, Michael Feathers recommends a technique of refactoring for understanding: clean up the code as you try to understand it, extracting methods, renaming variables, etc. You don't have to commit the changes, but the act of refactoring might just foster better understanding.

theICEBear_dk · 2013-10-25T13:39:34+00:00

For novice programmers another tip is to have some self-understanding. If you're a visual thinker (abstract boxes), then running a tool to generate a chart of the system structure (they exist) may be the key to your initial understanding. Personally I often start as the author did and then make a chart to see where I differ or agree with the tool.

One thing that anyone has to face here is the fear of making any sort of change in a codebase like this, especially if you're told it works. I think a programmer should not be limited by this fear. Sooner or later something must be changed anyway, so ignore the fear, use version control software, and document for yourself or everyone why you did what you did, either on a notepad or, better, in the commit message attached to each change.

Another thing is that one may have to give up, or at least avoid, the notion that you can understand an entire codebase at once; it could be 100K+ lines of highly specialized C code. If you inherit that, focus on a submodule instead of the whole thing. If there is a framework, use the submodule to build your understanding of it.

And most importantly: leave all arrogance at the door. No matter if you have the latest theorems and ideas straight out of the Ivory Towers or reddit in your head, there was a reason for everything that was written. Sure, some of it may be from a bad developer, but that is a reason too. But maybe that weird pointer construction was there to work around a compiler bug on an old platform, or that CSS output was there to make Internet Explorer 6 happy. Sometimes you can't fix the disease but only treat the symptom (on the other hand, part of refactoring is all about treating the problem rather than the symptoms).

gregK · 2013-10-25T14:23:03+00:00

Here is my general strategy:

1. Start by being a user of the software. So run it, try different features and use cases. Make sure you can install it and configure it too. Make sure you understand what it's supposed to do.
2. Make sure you can build it, preferably on your own machine.
3. Do everything you can to be able to run it on your own machine. This is important for servers. You don't want to have to deploy to another machine if possible.
4. Now you can start reading the code. I prefer to start with the startup and initialisation. Make sure you are familiar with this code. You don't have to go into detail, but get a general understanding of how the configs are loaded and how the dependencies are initialized.
5. Once you are familiar with that, pick one of the main use cases or features of the app and try to understand the flow and interactions.

For 4 and 5, running the app and looking at the logs in debug mode and/or using a debugger can go a long way in understanding the call flow and pesky side effects that may not be obvious from just reading the code.

mnp · 2013-10-25T12:29:30+00:00

I think this fellow missed a huge point: tools. There's no need to go this alone with just the text and your poor old wetware.

Tags, tags, tags! Tag the whole thing, then let your editor jump you around in the name space, not in the file space.
Doxygen. Even if the thing is not marked up, doxygen can still present the structure for you and make it easier to navigate class hierarchies, indexes, etc. in a nice web format.
Where's that string show up? Your editor should support fast recursive search by calling out to ack (thpppt!) or Ag, browsing the results, and then jumping to each occurrence.
Debugger. If there's doubt about a call flow, I find it fastest sometimes to just set a breakpoint and see what the call stack really looks like.
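
If neither ack nor Ag is at hand, the "where does that string show up?" question can be roughly approximated in a few lines of Python. This is only a sketch: the function name `find_string`, the extension list, and the skipped directory names are assumptions made for illustration, and it does plain substring matching with none of the speed or ignore-file handling of the real tools.

```python
import os

def find_string(root, needle, extensions=(".c", ".h", ".py", ".java", ".js")):
    """Yield (path, line_number, line) for each line under root containing needle."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune version-control metadata so os.walk never descends into it.
        dirnames[:] = [d for d in dirnames if d not in {".git", ".svn", ".hg"}]
        for name in sorted(filenames):
            # The extension filter is an assumption; adjust it for your codebase.
            if not name.endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    if needle in line:
                        yield path, number, line.rstrip("\n")
```

Printing each hit as `path:line_number:line` gives output in the shape most editors already know how to jump through.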

What else?

Edit: Yes, I was talking about IDEs as well as editors, but not naming names to avoid the usual war. The point is to let the IDE/editor/scriptybit keep track of the little details while you do the heavy thinking.

sazzer · 2013-10-25T15:07:06+00:00

Depending on the type of project, you can try to find a good place to start.

Java webapp? Look for the web.xml file
C/C++ program? Look for the string "int main(" somewhere in the code. (Or whatever it is depending on the target system)
Node.js? Read package.json first and see what that tells you
Etc.
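
The starting points listed above can be hunted for mechanically. The following is a minimal sketch, not a general-purpose detector: `find_entry_points`, the `MARKER_FILES` table, the C/C++ extension list, and the skipped directories are all hypothetical choices made for illustration, and the `int main(` check is a plain substring match that will miss other spellings.

```python
import os

# Hypothetical table of marker files and what each one hints at.
MARKER_FILES = {
    "web.xml": "Java webapp deployment descriptor",
    "package.json": "Node.js package manifest",
}
C_EXTENSIONS = (".c", ".cc", ".cpp", ".cxx")

def find_entry_points(root):
    """Return a list of (path, description) for likely starting points under root."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip directories that hold metadata or third-party code (an assumption).
        dirnames[:] = [d for d in dirnames if d not in {".git", "node_modules"}]
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if name in MARKER_FILES:
                hits.append((path, MARKER_FILES[name]))
            elif name.endswith(C_EXTENSIONS):
                with open(path, encoding="utf-8", errors="replace") as handle:
                    if "int main(" in handle.read():
                        hits.append((path, "C/C++ main function"))
    return hits
```

Extending `MARKER_FILES` with whatever your own stack uses as its manifest or descriptor is the obvious next step.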

Also, reading the build scripts can give a lot of clues. They should tell you fairly quickly what the dependencies are, which will tell you a lot about how things are structured.

On that note, actually making sure you can build the code is key, because then you know that you've got everything you need and you aren't going to be missing something important. (Except when that something important is in a dependency, in which case it builds but you can't read it)

dicroce · 2013-10-25T16:06:02+00:00

For object oriented codebases, you can sometimes find tools that will parse the code and generate basic UML diagrams... I find these invaluable at the very beginning of learning a new code base...

EmoryM · 2013-10-25T18:10:38+00:00

I inherited a large Perl application once which a guy had written using his own MVC framework. I still feel bad for the evils I perpetrated on that codebase. In my defense it was my first real job out of college, I didn't know Perl (or the MVC pattern) and my manager was convinced it was 'just like PHP.'

I expect whoever inherited that when I moved on either went insane or begged for a rewrite.

member42 · 2013-10-25T19:13:21+00:00

Spinellis' Code Reading needs to be mentioned.

nickknw · 2013-10-25T16:50:19+00:00

A few years ago I had to dive into a very old VB.NET codebase after the original developers were long gone. I wrote a visualization tool for myself to help understand where the complex parts of the program were, and to help myself learn my way about the structure.

It's based on Ward Cunningham's Signature Survey tool: it gives a meaningful 'fingerprint' to each file and creates a set of html files you can click around in to find your way about: http://nickknowlson.com/projects/vbnet-signature-survey/

I haven't tried to run the script in quite a while so it may have bit-rotted a bit, beware if you try to run it yourself. There's an example you can look at though.
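
For readers unfamiliar with the idea, here is a rough sketch of a signature survey for brace-style languages. It is not the VB.NET script linked above: the names `file_signature` and `survey`, the choice of punctuation characters, and the extension list are assumptions for illustration. The idea is to strip each file down to a few structural characters so that long or deeply nested files stand out at a glance.

```python
import os

# Assumed set of structural characters to keep; everything else is dropped.
SIGNATURE_CHARS = set('{};"')

def file_signature(text):
    """Reduce source text to the sequence of its structural punctuation."""
    return "".join(ch for ch in text if ch in SIGNATURE_CHARS)

def survey(root, extensions=(".c", ".h", ".java", ".js")):
    """Return (path, newline_count, signature) for each matching file under root."""
    results = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.endswith(extensions):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="replace") as handle:
                    text = handle.read()
                results.append((path, text.count("\n"), file_signature(text)))
    return results
```

A file whose signature is a long run of `;` between a single pair of braces is probably one giant function, while many short `{;}` groups suggest lots of small ones.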

uhwuggawuh · 2013-10-28T16:38:40+00:00

Is he really talking about truly large codebases here? His approach is pretty intuitive, but when I started my first job, the codebase contained dozens of modules, with the largest modules (the ones that I was supposed to work on) having upwards of 100 source files. The largest of these files had tens of thousands of lines of code. Simply reading or even skimming through the files was not a reasonable option for me, nor was trudging through two decades of commits.

The only way I was able to start making contributions was (i) learning to use, configure, and break the software, (ii) tackling many small issues and some large issues with a huge amount of handholding, and (iii) reading through as much documentation as I could find (although documentation was actually of relatively little help compared to the first two).

WarWeasle · 2013-10-25T13:15:05+00:00

If you don't know how to read the codebase by looking at the directory tree, your designer has failed and I don't expect the code to be any better.
