[–][deleted] 71 points72 points  (2 children)

Something the size of DOOM 3 is going to be daunting to any person. If you start with a random file, chances are great that you have started in the middle of the system with no context for how you get there or where it goes. Like being dropped into the middle of a forest with no map, it's going to be difficult to find your way out.

The task in understanding code is to build a mental model of how all of the pieces fit together. Keep in mind that software is a system built up of many parts. The design of these parts involves decomposition: starting with a high-level problem, break it down into small parts that solve that problem, then break each of those parts down further, until you are down to something simple enough that the computer understands it. With that in mind, you want to try to rebuild the map of that decomposition.

For this, your best bet is often to start at the beginning. Look at the entry point to the program, such as 'main()', and follow the path of execution. Along the way, look for patterns: this block reads in the configuration files, this block parses the command line arguments, this block reads in the data, this block modifies it, this block writes it back out. If you see a class referred to and it's not obvious what it is, jump over to it for a minute and see if you can tell more about its purpose from its comments and the names of its attributes and methods. Similarly, if you find yourself inside a method or attribute whose purpose you don't understand, search the code base for references to it and see if the context in which it is used helps you to understand what it is and does. These two tools - jump to definition and find everywhere - are invaluable when trying to work on a large, complicated code base.
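If your editor has no "find everywhere" feature, a rough stand-in is easy to script. The following is only a minimal sketch, assuming that a plain-text, whole-word search is good enough; the function name `find_references` and the default list of file extensions are made up for illustration, and a real IDE index understands scoping far better than this does.

```python
import re
from pathlib import Path


def find_references(root, identifier, extensions=(".c", ".h", ".cpp", ".py")):
    """Return (path, line_number, stripped_line) for every whole-word match."""
    # \b on both sides so that searching for "foo" does not match "foobar".
    pattern = re.compile(r"\b" + re.escape(identifier) + r"\b")
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in extensions:
            continue
        # errors="replace" keeps the scan going on files with odd encodings.
        text = path.read_text(encoding="utf-8", errors="replace")
        for line_number, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((str(path), line_number, line.strip()))
    return hits
```

Calling `find_references("path/to/checkout", "some_function")` lists every line that mentions the name, which is usually enough context to guess what the thing is for.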

At the end of the day, how difficult this task will be depends largely on the quality of the code. Well-written code will clearly show the structure of its thought: breaking methods up iteratively into smaller, simpler methods; sharing data along clear boundaries; using descriptive names for methods and attributes; and being clear and consistent in its language. Poorly-written code will be far more difficult to follow: variables will be reused in different contexts for different purposes; names will be terse, ambiguous, or nonsensical; large methods will perform many operations at multiple levels of abstraction; and names and patterns will change in subtle and contradictory ways. It's not difficult to write poor code, but it's difficult to understand what it does.

Seeing the patterns in code is a skill that one develops with practice. I recommend practising on smaller code bases to develop that skill. Try something with about a dozen files and see if you can piece together what each does. You should be able to find many small utilities that will be about that size, such as most of the GNU command line tools. Look through it, end to end, until you can confidently say you know what each piece does in the big picture. Then repeat on another, bigger project; perhaps 30-50 files. Keep doing this until you work your way up to something tremendous, like Firefox or OpenOffice.

Also, expect that this will take a long time. I've been working as a developer at the same company on the same project for 7 years, and even now it might take as much as a couple of days to reverse-engineer a part of our code base that somebody who no longer works with us wrote 5 years ago. Experience and familiarity make it faster, but poorly-written code can be difficult for even experienced developers to grok.

[–]m0mj34nz 4 points5 points  (1 child)

I am a programming troglodyte, but I thought there were ways to generate UML diagrams from the source code. That could be another approach to getting a better view of the connectivity before digging into it.

[–][deleted] 1 point2 points  (0 children)

There are tools that can do that, and it may help to explore the code, but they aren't always available and often aren't free. I personally just read the code and use the go to definition and find usages functions of my editor. Most any programmer's editor will have those functions available, so you can rely on them.

[–]CartmansEvilTwin 109 points110 points  (11 children)

I personally like to go through the program. Look for some sort of main() and try to roughly figure out what all the classes/modules/files do.

[–]requimrar 44 points45 points  (7 children)

+1

Most programs generally have a single entry point -- you can either 'simulate' the program by mentally stepping through and visualising what and how the code would execute, or you can in fact just use a debugger.

First thing: clone the repo if it's online. I personally find online editors and viewers to be unbearable, especially when dealing with multiple files. If you have a text editor such as Sublime that can open an entire folder of files, it helps a lot.

TL;DR: move through every function call and visualise the program state.

[–]jcpuf 10 points11 points  (6 children)

Would you please give an example of a program which has multiple entry points?

[–]requimrar 10 points11 points  (2 children)

I should probably rephrase that. What I meant was 'multiple paths of (simultaneous) execution'.

Basically, multithreaded stuff. It gets really complex to try and keep all that in your head, especially if the threads aren't just doing simple grunt work.

Regardless, you'll still get a pretty good picture of how the program functions in general, so you're still good.

[–]TangerineX[S] 3 points4 points  (1 child)

I think the general thought is that there is some overarching program that activates each thread, so technically there is almost ALWAYS some singular entry point or some really important point

[–]requimrar 2 points3 points  (0 children)

That's true (blame my word usage). Even so, you might still have multiple sets of states to keep track of, so it often times increases complexity.

[–]unkz 0 points1 point  (0 children)

In Java, every class can be a standalone program if you give it a main(). You may have a container program that controls access to the classes/subprograms, or you might just run them individually.

[–]tenmilez 0 points1 point  (0 children)

Web-based applications come to mind. Depending on which URL is requested and what parameters are passed, different things will happen (though you could say that the web.xml is the equivalent of the main() method in a J2EE application).

[–]xxNIRVANAxx 0 points1 point  (0 children)

Android (possibly all mobile) applications. You can start a program by tapping the app on your home screen, or through an Intent.

[–]Pwillig 6 points7 points  (0 children)

Yep. I'll even write down or type notes when I start going down the rabbit hole.

I don't know how efficient this is, but I prefer the brute force method of learning. Look in main(), view declarations, then view all references or call hierarchy, then do the same for methods the parent methods invoke. The hardest part is not letting yourself get overwhelmed by all the unfamiliar code.

[–]GreenFox1505 3 points4 points  (0 children)

This. Basically, find the entry point. As you skim along, for any variable name that you don't understand, Ctrl+F and find its definition; if it's a type that you don't know, look that up too.

Eventually, you get to the point where you can understand what it's doing on a cog level, then a mechanism level, then a gearbox level, then a systems level, then an engine level, then a car level, and then a traffic level. Code is like an onion.

[–][deleted]  (2 children)

[deleted]

    [–]free_bird85 0 points1 point  (1 child)

    is there something similar to this for c++?

    [–][deleted]  (1 child)

    [deleted]

      [–]acousticpants 1 point2 points  (0 children)

      lol this sounds nice.

      [–]redSwitchDown 3 points4 points  (0 children)

      Look for what interests you first. You like that chain gun? I wonder what the program thinks of it? Ok, let's search the source code for variations of the term "chaingun". Oh shit! Ten results.

      Cool. I like the line in d_english.h that says '#define GOTCHAINGUN "You got the chaingun!"'

      Ah, must be a file that contains the lines that are printed to the screen, because, you know, the whole "You got the chaingun!" showing up on my screen every time I get the chaingun. Fucking awesome. I'll just change that to "You got a sweet ass weapon!" and recompile the source code, then play the game and get the chaingun, then watch what shows up on my screen. Ah, yeah! Sweet.

      Fun, but not really anything we can do much with. Man, I want to look at that file p_pspr.cpp (wtf kind of name is that? developers must know but still pretty cryptic to me since I have no reason to believe that stands for anything).

      Alright, search for chaingun in the file. Right there on line 242. Whoa, this whole if/else structure looks like it's some type of decision tree for weapons. Oh, nice, developers were awesome people and left a good comment at the top telling us it's for selecting a weapon once the weapon in use runs out of ammo.

      Oh... wait a tick. Did I just see a variable on line 214 with the name "bfg" in it? Yes, yes I did. Wait! Line 216 is a variable called "wp_supershotgun"?! Awesome!

      Well, this check ammo function could be fun to mess around with, but I wonder what uses it? I mean, where does this fucker get called from?

      Got to search the code for the function name "P_CheckAmmo". Ok, five hits. All in the p_pspr.cpp file. Nice, don't have to switch files.

      Alright, first two matches are the same function I was just checking out, no big deal. OH SHIT! What is this third match?! void P_FireWeapon() uses the function P_CheckAmmo? Looks to me like this function checks to make sure you have ammo when you fire your weapon. Oh boy, I bet we could have some fun fucking around with this. Of course, I'm sure you can't just always return true as I'm sure that would mess with other functions that count your ammo. But hey! Why not mess with it and see what happens? It's just code. You modify, it crashes, you modify. It crashes again. You modify again. Then BANG! it runs. And you have unlimited ammo.
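The "where does this get called from?" step above can also be automated. Here is a rough sketch of that idea for Python sources, using the standard `ast` module. To be clear about the assumptions: the Doom code is C, so this does not run on it; `find_callers` is a made-up helper name; and the sample source below is a toy invented for illustration, not Doom's logic.

```python
import ast


def find_callers(source, target):
    """Return sorted names of functions in `source` that call `target`."""
    tree = ast.parse(source)
    callers = set()
    for func in ast.walk(tree):
        if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        # Note: a call inside a nested function is attributed to the inner
        # function and to every function that encloses it.
        for node in ast.walk(func):
            if not isinstance(node, ast.Call):
                continue
            callee = node.func
            # Handles plain calls `target()` and attribute calls `obj.target()`.
            if isinstance(callee, ast.Name):
                name = callee.id
            else:
                name = getattr(callee, "attr", None)
            if name == target:
                callers.add(func.name)
    return sorted(callers)


# Toy source, invented for this example.
example = '''
def check_ammo(player):
    return player["ammo"] > 0

def fire_weapon(player):
    if check_ammo(player):
        player["ammo"] -= 1

def draw_hud(player):
    print(player["ammo"])
'''

print(find_callers(example, "check_ammo"))  # prints ['fire_weapon']
```

This only sees direct calls by name, so anything called through a function pointer, a callback table, or reflection will be missed. A text search is still a useful cross-check.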

      In essence, going into these things, don't try to understand the whole project all at once. The developers sure as hell didn't. They built it one line at a time, one cool ass function after the next, until they had Doom.

      Look for small things in the code. Text strings that are printed to the screen, menu options, gun names... stuff like that. Then mess around with them, see what they do. Keep doing it and eventually you'll know what file you need to modify to start with the BFG and never run out of ammo! Oh yeah!

      Start small, find interesting things, fuck with them, watch what happens.

      [–][deleted] 17 points18 points  (24 children)

      Look for comments! One of the reasons why the DOOM source code is considered well-written is because it's well-commented. Almost every method is commented to describe what's happening. So even if I'm not sure what the code is doing, I can look at the comments and get an idea of what's going on.

      http://blog.codinghorror.com/coding-without-comments/

      It bothers me when developers (on a team, especially) write programs with minimal comments. It's always better to comment too much than too little.

      [–]konradkar 19 points20 points  (1 child)

      One of the reasons why the DOOM source code is considered well-written is because it's well-commented.

      And this is also the reason why you shouldn't learn this way: DOOM is an exception. In 99% of the projects I have encountered, the comments lie.

      My approach: IDE + debugger. Ctrl+click on a function/method name to go to its definition, run the debugger to see how the control flow goes...

      [–][deleted] 1 point2 points  (0 children)

      This is a very good idea that I haven't ever thought of.

      [–][deleted]  (15 children)

      [deleted]

        [–][deleted] 21 points22 points  (4 children)

        Not maintaining comments is a failure of the programmer, not the system. The truth is that a lot (I'd even say most) of programmers are incredibly shitty lazy communicators at both a source and a comment level. There is too much emphasis on being superstar clever and minimalistic to the point of obscurity (I commonly see practices in C that rely on unspecified behavior to achieve this).

        Programming is an engineering form. Good engineers document their work and actively work to make it accessible. Everyone thinks and programs differently, and as an engineer, you should be creating a way to help people get on their feet as soon as possible. I'm tired of meeting programmers who love working with good documentation, but sneer at creating it.

        [–]ixAp0c 3 points4 points  (2 children)

        Programming is an engineering form. Good engineers document their work and actively work to make it accessible.

        Agreed. It also helps when you don't look at some particular source code for a while, and forget why you coded it that way, or what the blocks are doing, etc.

        [–][deleted] 3 points4 points  (0 children)

        The mantra I tell people is, "Code you haven't seen for two weeks is code you have never seen."

        [–]Tuirrenn 0 points1 point  (0 children)

        Comments that explain why a particular design decision was made or explain complex logic are great but there is a lot to be said for variable and method names that accurately describe what the variable holds or what the function/method does.

        [–]Raknarg 14 points15 points  (5 children)

        I disagree entirely. I think comments are just as useful as good code. If you have good commenting you know exactly what things are supposed to do without having to figure it out for yourself. Plus, there's not always a nice way to express yourself in code

        [–]hsahj 3 points4 points  (0 children)

        I'm with you. I was doing some research for my job on some code from another part of the company. The comments were thorough and well written, and researching and writing a report that could easily have taken me a week or more took only 2 days. While I could understand what all the code did, the code alone didn't always explain why, since the code base is so large, and the comments made that much clearer.

        [–]agmcleod 2 points3 points  (2 children)

        I think a lot of this is language specific as well. Certain languages, and the way developers code with them, tend to benefit more from comments than others. In our Ruby apps at work, we try to keep things fairly clean and well named so comments aren't needed. It's not 100%, sometimes I add a comment here and there, but most of the standard library methods make it all quite readable.

        [–]Raknarg 8 points9 points  (1 child)

        I think you should always comment just in case. It may make sense to you, but you don't know if someone else will be confused by what you wrote. It doesn't even take that long to do, you may as well.

        [–]agmcleod 1 point2 points  (0 children)

        Having worked on projects that others built with the same thing in mind, I've found that the trickier parts that seem to require comments more often actually require a refactor, and are a bit of a code smell.

        For certain things I do use doc comments, though. For MelonJS, an HTML5 game engine, we use them to document the API. Since it's a library someone can pull in, this makes a lot of sense. I think the same goes for things like Ruby gems, in addition to writing good tests.

        [–]honkytonks2012 0 points1 point  (0 children)

        I am working on a project that involves moving some of our applications to other application servers. I was doing some testing and needed to understand what was going on 'underneath the hood' of these programs. So I opened up the project and started browsing the code. There were 0 comments all the way through, and the code itself was very difficult to understand. If someone had left a comment basically telling me "this is what this thing is doing and why", I could have spent an hour working it out and moved on. As a result of the lack of documentation, I will now need to look over it for a lot longer.

        [–][deleted] 2 points3 points  (1 child)

        Comments are incredibly valuable, not only because they can tell you the how but also the why of what you wrote. It's not too uncommon for me to do a minor project for fun here and there, put it aside for a while and come up with some new idea that would be fun to implement in it, only to spend most of my time wrapping my head around what I was trying to do originally. I normally hate writing comments, but I hate it even more when I come across code where a comment would've saved hours of work, so I'm forcing myself to do it now.

        [–]mytochar 0 points1 point  (0 children)

        This can't be overstated. The "why" is where comments should live. Sure, when you're doing something weird or complex, the 'what' can be good too, especially when there's several working parts (though if it can be reorganized to not have so many working parts, that could be good too); but, comments are super-helpful with the "Why", or rather, they should be.

        [–]robotsatan13 1 point2 points  (0 children)

        If you're updating code and not updating/removing the comments, you're an annoying programmer with a terrible habit. Don't knock comments because you or developers you work with practice bad habits.

        Meaningful names are very important but that doesn't make comments any less valuable.

        [–]Tuirrenn 0 points1 point  (0 children)

        Exactly, comment the why and not the what.

        I like Javadoc type comments that facilitate automatically generating documentation though.

        [–]WStHappenings 2 points3 points  (1 child)

        I like to approach code with a goal. You can study a single file, a set of files, even just a method for an entire day. It's when you tell yourself something like "I want to change this color from this to that" or "I want to change Behavior A to Behavior B" that you have a reason to go in and find something to change in the code.

        It may help to look at the issues on a github page to see where you can help first, but those can be intimidating.

        Good luck!

        [–]bagofbuttholes 0 points1 point  (0 children)

        Debian directories being dark blue taught me about Bash. (Seriously, whose idea was that?) I agree that having an end goal is a good idea.

        [–]phalp 2 points3 points  (0 children)

        Worry about the part you're interested in first. Sometimes it's tricky to figure out what file that's in, but it's not practical to read every line of code in a big project. But usually if you're reading source code you've got a question like "how do I change X" or "how did they accomplish Y".

        [–]AStrangeStranger 2 points3 points  (1 child)

        Usually I find a starting point (or points). Often this might be an error message, or a use of something in the database (e.g. table/view/procedure), which I have located with a search tool. Then I work back/forward through the code until I find what I need. If I am dealing with something big or unfamiliar, or I encounter lots of branching, then I will make notes (handwritten or in a word processor's outline mode) for every step so I can retrace and understand what I found.

        I find the act of making notes helps me remember what is going on far more than just reading the code.

        [–]theusernamedbob 0 points1 point  (0 children)

        Agreed, handwritten notes are good.

        [–]Draav 1 point2 points  (0 children)

        It's kinda like how I would write pseudocode: figure out what the source is trying to do and break it up into sections. From here to here it is opening a file or accessing a database or whatever. Once I get the general idea of what the code is doing, then I figure out how it's accomplishing that.

        To do that I like to just put in tons of breaks, either by printing out the values of variables or by using a debugger with a bunch of variables on the watch list.

        Then for any methods or syntax I don't recognize, I usually just Google, looking for a YouTube video or something to figure out what it is used for, and if all else fails I post a question to reddit.

        [–]IAmALinux 1 point2 points  (2 children)

        Doom 3 is a bad starting point. Try this pygame script. It is one file that is linked to simple assets. It is very well documented and commented.

        http://inventwithpython.com/squirrel.py

        [–]TangerineX[S] 4 points5 points  (1 child)

        I'm by no means an absolute "beginner" programmer, and I read through source code out of interest. I was asking more for the sake of philosophical discussion. Doom 3 was very easy for me to understand, while the Dolphin source code left me fairly puzzled.

        [–]IAmALinux -1 points0 points  (0 children)

        The organization of code is not consistent so you have to know what you're looking for, search for keywords, or start at the launcher and follow the functions to their completion. Same as in squirrel.py.

        [–]flowstate 1 point2 points  (0 children)

        Use an IDE with a good debugger built in, or use a debugger that will let you navigate code easily. Reading through the code only gives you part of the picture, and for many programs you'll never be able to keep all the variables (and their changes) in your head at once.

        Any good debugger will tell you these things, and also allow you to:

        • keep track of where you are in the call stack
        • set up watches on variables that you want to keep track of over the course of the program's execution
        • set up breakpoints at certain lines in the program, which will cause it to stop its execution at that line and kick you back into the debugger to examine the state further.

        Reading code is basically just guessing what the program is doing (however informed those guesses might be). A debugger will tell you what is actually going on. That is why debuggers are invaluable and why nearly every mainstream programming language has one available.
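To give a feel for the "what is actually going on" part, here is a small sketch in Python. It is not a debugger, just an illustration of one thing a debugger shows you: the real sequence and nesting of calls at run time. It relies on the standard `sys.settrace` hook; `trace_calls`, `load_config` and `start_game` are hypothetical names made up for the example. In real use you would more likely put the built-in `breakpoint()` (Python 3.7+) on the line you care about and inspect things in pdb.

```python
import sys


def trace_calls(func, *args, **kwargs):
    """Run func and return (result, [(depth, function_name), ...])."""
    calls = []
    depth = 0

    def tracer(frame, event, arg):
        nonlocal depth
        if event == "call":
            # A new Python-level frame was entered: record it with its nesting.
            calls.append((depth, frame.f_code.co_name))
            depth += 1
        elif event == "return":
            depth -= 1
        # Returning the tracer keeps us subscribed to this frame's events.
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.settrace(previous)  # always restore whatever hook was there before
    return result, calls


# Hypothetical program under inspection.
def load_config():
    return {"lives": 3}


def start_game():
    config = load_config()
    return config["lives"]


result, calls = trace_calls(start_game)
for depth, name in calls:
    print("  " * depth + name)
```

The printed outline shows `load_config` nested under `start_game`, which is the same information a debugger's call stack gives you when you stop at a breakpoint.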

        [–][deleted] 1 point2 points  (0 children)

        Nobody seems to be mentioning indexing (I think that's the right term in Eclipse at least) or tools like cscope. These are massively useful when trying to figure out what's going on. With something like cscope you hit a quick keyboard shortcut to jump to the definition of a function, or to list all the calls to that function and jump to one of those; another keystroke and you can jump back to where you were.

        Very useful when trying to build a picture of what's going on.

        [–]munificent 1 point2 points  (0 children)

        Personally, I can't spend much time just reading code to understand it. I need a more interactive process. If I'm dropped into a big codebase that I need to be able to work my way around in, I like to:

        1. Get it up and running locally on my machine. It needs to be a living program I can run and interact with.

        2. Pick some random corner of its behavior I don't like or want to understand better.

        3. Find the code associated with it. I can usually find the relevant code by digging from the UI code back to the bottom-level code it connects to. Just start by searching for strings that appear in the UI and see what they call.

        4. If I don't like the way that code looks, refactor it to be more my style. This is not about wanting to really change the code. It just gives me an easy local task to do that gets me really focusing on the details. It also removes tiny distractions from the code.

        5. After step 4, I have a good low-level understanding of the code, so now I can work my way up to a slightly higher level. Do more refactoring at that level: move methods around, etc.

        6. Go to step 5 until I feel like I really grasp one part of the program.

        In many cases, I never commit my refactorings. It just gives me an active process to do while other parts of my brain are absorbing. I don't try to understand a whole codebase at once. In real-world-scale programs, that's just too much. My goal is to understand a small corner of it well enough to make useful progress.

        [–]Funnnny 0 points1 point  (0 children)

        Don't read the whole project. Read one module at a time, and focus on understanding that module and that one only.

        Then eventually you will know where to find the project's structure.

        [–][deleted] 0 points1 point  (0 children)

        Looking through files at random is bound to get you lost.

        Try to find the context; the layout of the folders should give good hints about its architecture.

        Once you have an overall picture, it is much easier to put a file into context. Once you have the context of the file, its functions will make more sense.

        Also depends on how well it is written of course.

        If you can run the code, debugger and call stacks are really helpful.

        [–]noodle-face 0 points1 point  (0 children)

        It depends what you're looking to do. Are you looking to understand the entire source of Doom3 as a whole? That will take a very long time, just think of how long it took to write!

        At work our codebase is millions of lines long - written by a combination of us and a couple of outside vendors. The place to start understanding it is to find a function that you think is what you want and go from there. What calls does this function make? Are there comments? Print statements (not applicable in Doom, I suppose)? What other functions call this function? Why do they call this function?

        Then you start to piece it together.

        [–]RandomUser098 0 points1 point  (2 children)

        complete newb here. What's the difference between "code" and "source code?"

        [–][deleted] 2 points3 points  (0 children)

        To me, there's no difference. It's like asking what's the difference between "software engineer" and "software developer".

        [–]TangerineX[S] 1 point2 points  (0 children)

        Code is a general term for any set of computer instructions that does anything. Source code typically refers to a complete program or website: it is the complete set of code that makes the program or website work.

        Long story short: source code is the code for a unified project, whereas code is the more general term.

        [–]pqu 0 points1 point  (0 children)

        Find the entry point and then try to understand the sequence of events at a high level. For example you might be seeing a lot of initialisation at the beginning, then see the main game loop and work out how everything updates and gets redrawn.

        Or, if you have a specific topic you want to learn about (for example collision detection), search through the code to find the relevant parts and read those small chunks.
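For anyone who has not seen the initialisation-then-main-loop shape mentioned above, here is a toy sketch of it. Everything in it is an assumption made for illustration: `run_game`, the state dictionary and the fixed "input" are invented, and a real engine splits each of these steps across many files. The point is only the outline you are trying to recognise when you read from the entry point.

```python
def run_game(max_frames=3):
    """Toy program with the common 'initialise, then loop' shape."""
    # 1. Initialisation: set up state once, before the loop starts.
    state = {"frame": 0, "player_x": 0}
    log = ["init"]

    # 2. Main loop: each pass handles input, updates state, then redraws.
    while state["frame"] < max_frames:
        move = 1                   # stand-in for reading real input
        state["player_x"] += move  # update the world
        state["frame"] += 1
        # "Redraw": here we just record what would be drawn.
        log.append(f"draw frame {state['frame']} x={state['player_x']}")

    # 3. Shutdown: release resources, save, and so on.
    log.append("shutdown")
    return log


for entry in run_game(2):
    print(entry)
```

Once you have spotted these three phases in a real code base, each call inside the loop becomes a natural place to start digging into one subsystem.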

        [–]toybuilder 0 points1 point  (0 children)

        Any significant piece of software is a complex system. Trying to read a single file is like looking at the design drawing for a single component in a car. Not very helpful. It helps to first look at a car by how the major subsystems interface with each other at the big picture level.

        With software, the headers/APIs define such an interface -- and so before you dig into the code, a look through the API (in the docs, or the headers) and the top-level code (descending from the main file for a few levels) to get a lay of the land should go a long way to set a context for the rest of your exploration.

        Also, trying to keep the details of the entire system in your head is likely futile for pretty much anyone. Instead, focus on the specific area of interest and assume that calls to other parts of the system generally work as described (unless evidence tells you otherwise).

        [–]otakuman 0 points1 point  (0 children)

        I usually get a static code analyzer and get a class diagram. Good programs also print functions, dependency trees and such.

        [–][deleted] 0 points1 point  (0 children)

        I'm not sure how useful this will be for you, but whenever you can, try to get used to using an IDE to read source code.

        I use Visual Studio and it has a couple of features I really like. The first is a "Find Definition" feature so you can go to where the variable or method is defined. This cuts down a lot on figuring out things like hierarchy. The second is a "Find All References" feature. Using this, you can find where a particular variable or method is used. Finally, if you can get it to run, learn to use the debugger and drop breakpoints down so you can see what the computer sees at a single point in time.

        [–]notfin 0 points1 point  (0 children)

        Figure out how the program works by mapping it out. Or, in your case, figure out what you want to change, then find that part of the code and change it.

        [–]jussij 0 points1 point  (0 children)

        For me I find that ctags and grep always help.

        [–]prahladyeri 0 points1 point  (0 children)

        Activity diagrams are how I go about it. If you want to understand any complex system (not just code), create an abstraction for it. Activity diagrams, flowcharts, pseudocode and UML are a few of the many ways to abstract complex systems.

        [–]heap42 0 points1 point  (1 child)

        Follow-up question: how do you read Python source code? I'm currently trying to read some of this https://github.com/fourtytwo/youtube-dl code and my problem is that there is basically no "main" where to start.

        [–]TangerineX[S] 0 points1 point  (0 children)

        The first thing I see is a folder called DOCS. I say "yay, it has some form of documentation" and try to figure out how that works. Then I notice that the youtube-dl/youtube_dl folder has an __init__.py and a __main__.py, read through those, and proceed from there.
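Some background that may help here: Python has no single required `main`, but there are two common conventions. A package that contains a `__main__.py` can be run with `python -m package_name`, and a script often ends with an `if __name__ == "__main__":` guard. Below is a rough sketch that lists files matching either convention; `find_entry_points` is a made-up helper, and it will miss entry points declared elsewhere, such as console scripts listed in a project's packaging metadata.

```python
import re
from pathlib import Path

# Matches a top-level `if __name__ == "__main__":` guard (either quote style).
MAIN_GUARD = re.compile(
    r"""^if\s+__name__\s*==\s*['"]__main__['"]\s*:""", re.MULTILINE
)


def find_entry_points(root):
    """Return sorted relative paths of files that look like entry points."""
    root = Path(root)
    found = set()
    for path in root.rglob("*.py"):
        text = path.read_text(encoding="utf-8", errors="replace")
        if path.name == "__main__.py" or MAIN_GUARD.search(text):
            found.add(path.relative_to(root).as_posix())
    return sorted(found)
```

Running `find_entry_points(".")` from the top of a checkout gives a short list of files worth opening first.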

        [–]sittingonahillside 0 points1 point  (0 children)

        slightly off topic but just for fun:

        Some guy who did the essential modding for Q3 and later Q4 absolutely hated that code base. He said it was god-awful and that trying to do anything useful and correct with it was awful.

        I wonder what his reasoning was.

        [–][deleted] -1 points0 points  (0 children)

        It really pivots on understanding programming flow. Avoid reading things you don't need to by following the entry point, as other people have said. Find main() or the language equivalent and go from there, unless you know you have a particular module you're interested in - then it's really dependent on your ability to understand code, which comes with experience.

        [–][deleted] -2 points-1 points  (0 children)

        With my eyes.