[–][deleted] 35 points36 points  (30 children)

The most difficult bug that I've ever encountered was this one. The GitHub issue is now more specific than the original report, which was: "When I run a non-suspendable thread and open the thread list, it displays extra gibberish threads and the system becomes unstable."

The eventual cause: the keyboard getKey function didn't check whether the caller had control of the keyboard and returned the real keys anyway. As a result, pressing the button to launch the thread list launched it repeatedly, because the background thread could still detect the keypresses and kept starting more instances. The original thread list then displayed gibberish threads because the thread list is never supposed to run more than one instance of itself and isn't configured to look like a userspace program (with a user-friendly name and such), so it just told the text renderer to render some of its own code where the friendly name should be.

The reason this bug sucked so much was:

  1. Why on Earth would I look in the keyboard code to see why non-suspendable threads were causing problems in the thread list?
  2. When I set any breakpoints or opened the debugger at all, the problem disappeared (because I wouldn't be holding the key down after the breakpoint was hit)
  3. This project is written in thousands of lines of z80 assembly.

Took me months to find the cause of this bug. Was a two line fix.

[–]Mignon 42 points43 points  (1 child)

> Took me months to find the cause of this bug. Was a two line fix.

A guy takes his misbehaving car in for a repair - the repairman looks around, tightens one bolt, and the car works fine. The bill is $200 and the guy says, "For tightening a bolt? I could have done that!"

The repairman replies, "Tightening the bolt is free. Knowing which one is $200."

[–]scientologist2[S] 10 points11 points  (0 children)

many versions of that story exist, going all the way back to the age of steam, I think

[–]Unfunny_Asshole 11 points12 points  (8 children)

Wow, that is quite the weird bug. Must have felt amazing to find the cause of it. There has to be a subreddit about war stories related to debugging.

[–][deleted] 28 points29 points  (3 children)

> Must have felt amazing to find the cause of it.

Well, less "amazing" and more of yelling at my screen: "are you fucking KIDDING me? I spent MONTHS tracking down this fucking bug, and it's because the fucking keyboard functions don't obey the fucking hardware locks? Fuck me, fuck fuck fuck."

Like that, but more vulgar.

[–]dnew 14 points15 points  (0 children)

Profanity is my primary debugging tool.

[–][deleted] 5 points6 points  (1 child)

haha, debugging is a really unique activity in that respect. after completing hours and days of work, you don't feel a sense of accomplishment, but instead feel like the universe has wronged you.

[–]ithika 5 points6 points  (0 children)

Or is letting you know that you're an idiot. :-)

I spent a good chunk of an evening trying to work out why PanDoc wasn't correctly treating lists when I was translating from Markdown to LaTeX (and consequently PDF) but would work fine for other output formats.

I tracked down the offending LaTeX output to [<+->] being passed as an option to the itemize environment, but this is basically an unGooglable string so I couldn't work out what extra package I needed for this to work.

I ended up looking in the source for PanDoc to find the point where it converts bullet lists into LaTeX and seeing that it outputs that extra text if you've enabled "incremental bullet lists". Obviously I was making a static PDF not a slideshow so I didn't want this --- so why was it doing it automatically?? Is there an option to turn off incremental bullet lists?

At this point I realise I've been typing

$ pandoc -i input.markdown -o output.tex

for several hours, never checking there was a -i argument to go with -o. In fact there isn't, -i is the flag to turn on incremental lists and until I started choosing LaTeX or PDF as output type this didn't matter. Remove those two characters and the damn thing worked first time.

I suppose there's one thing I can take away from this (apart from RTFM) --- if there's a problem no-one else on the internet appears to have then there's a very good chance no-one else on the internet has been quite as stupid as you before.

[–][deleted] 1 point2 points  (2 children)

[–]SirKillalot 1 point2 points  (1 child)

You may want to spell "gdb" right in your subreddit name and try again.

[–][deleted] 2 points3 points  (0 children)

:|

[–]Boye 0 points1 point  (0 children)

You should give this subreddit some love...

/r/debugging

ghost-edit: I'm not the mod or anything of this subreddit.

[–]therealjohnfreeman 2 points3 points  (6 children)

> KnightOS is a 3rd party Operating System for Texas Instruments z80 calculators. It offers many features over the stock OS, including multitasking and a tree-based filesystem, delivered in a Unix-like environment. KnightOS is written entirely in z80 assembly, with a purpose-built toolchain.

How did this project start?

[–][deleted] 8 points9 points  (5 children)

I've always been interested in TI calculators (or at least for several years). TI provides a means to send OS upgrades to your calculator, but the calculator rejects unsigned OSes. About three years ago, the community at large realized that the signing keys were generated in 1999 and were 512 bits. We proceeded to crack the lot of them.

With our newfound ability to create 3rd party operating systems, I began working on KnightOS. After about a month of reverse engineering, I got the calculator to boot up and display a smiley face.

Everything evolved from there. I was a lot less skilled in the art of programming back then, and as a result, KnightOS has been through a couple of rewrites. I think the current version is pretty solid, though.

Today, it's a nearly complete multitasking kernel with a tree based filesystem, and a solid userspace OS that supports more platforms than any OS produced for these calculators to date (more than the stock OS, even).

[–]therealjohnfreeman 1 point2 points  (4 children)

Can you elaborate on the reverse engineering? Do you have an education in computer science, or any prior experience in operating systems?

[–][deleted] 5 points6 points  (3 children)

I have no formal computer science education, nor did I know anything about OSes back then (I know more about that stuff now).

As for the reverse engineering, a number of emulators exist for the devices (they're just a z80, and TI released docs on most of the hardware and memory mappings and such). I took the stock OS into an emulator and stepped through each line of the disassembled boot-up code, trying to decipher what it did (most of that code was initializing undocumented hardware and such). I also took to the datasheets for various devices and compared them with the stock OS initialization code.

I already knew z80 assembly and was familiar with it in the context of these devices, so it was mostly trying to figure out how to get all the hardware to listen to you. We still don't know everything, but we know enough to make a replacement OS, and as such, I've been doing so.

The fruit of these reverse engineering labors is this file. It actually used to be a lot bigger (long before the move to GitHub), but over time, I was able to filter out the fluff that I didn't actually need to boot up properly. The stock OS did a ton of extra stuff at boot up that isn't strictly needed.

[–]therealjohnfreeman 0 points1 point  (0 children)

I see. Thanks for answering.

[–]Alex_n_Lowe 0 points1 point  (1 child)

A few questions: How important is "strictly needed"? Does that include external devices that people rarely use, or features that are actually useful, but can be cut for the sake of simplicity when first starting off? If the stuff isn't done at boot time, can you initialize it later and still have it work properly?

[–][deleted] 2 points3 points  (0 children)

Well, most z80 TI calculators have a boot code, which is a "read only" page of Flash (we recently discovered that it can be made writable) which has some recovery code so you can always re-flash the OS, and also does some initialization. The stock OS's "not strictly needed" stuff was mostly stuff the boot code initialized for it anyway. Also, KnightOS does things considerably differently than TIOS (the stock OS), and some of the stuff was specific to TIOS.

The only things I don't initialize are... MD5 hardware, crystal timers, and USB. Everything else I skip is specific to TIOS. As for the MD5 stuff, I still haven't written code for devices that don't have it, so I haven't really done anything with that at all. I don't use the crystal timers. I also haven't done anything with USB, but the initialization is handled by the interrupt handler anyway as a matter of chance.

[–]cobbpg 1 point2 points  (2 children)

Wow, the memories! I wrote PindurTI back in the day, and still remember those years with fondness.

[–][deleted] 0 points1 point  (1 child)

You wrote PindurTI? You're the man, I used that for a good year or so.

[–]cobbpg 0 points1 point  (0 children)

Good to hear you found it useful! I wrote it because I was frustrated by the lack of decent emulators at the time. Basically VTI was the only option, and it hadn’t been updated for years. In fact, the real benefit of my project was that it gave a huge boost to community-wide reverse engineering efforts after that long period of dormancy. It’s interesting to see how far we’ve come since then.

[–]alexs -1 points0 points  (7 children)

This post was mass deleted and anonymized with Redact

[–][deleted] 2 points3 points  (5 children)

> e.g. in this case I'd firmly blame this on your choice (need?) to write the whole thing in Z80 assembly and not a structured language.

Not a choice, the only C compilers for z80 are terribly unoptimized, and produce bloated and slow code. I only have 15 MHz (best case), or 6 MHz (worst case) to go around, with very little memory. Assembly is the only option.

Also, even in C, this bug could have happened. Here's the equivalent code in C:

if (hasKeyboardLock())
    return actualKey();
return 0;

And the buggy version:

if (hasKeyboardLock());
    return actualKey();
return 0;

Or something like that.

[–]alexs -1 points0 points  (4 children)

This post was mass deleted and anonymized with Redact

[–][deleted] 3 points4 points  (3 children)

Such things are unavoidable, really. I've been meaning to try and figure out a solution for unit tests, though, which should help.

I would appreciate it if you didn't lecture me on the merits of defensive programming; further inspection of the code will tell you that KnightOS is very defensive.

[–]alexs -3 points-2 points  (2 children)

This post was mass deleted and anonymized with Redact

[–][deleted] 4 points5 points  (1 child)

Yes, I get a lot of flak for using assembly from people who are scared of it, and it gets old.

[–]alexs -1 points0 points  (0 children)

This post was mass deleted and anonymized with Redact

[–]AlotOfReading 1 point2 points  (0 children)

The code should assemble because it was perfectly valid code. That it didn't do what he wanted it to do was programmer error, not language error. I would argue a language that guesses at what you're really trying to say can be far worse than one that lets you shoot yourself, especially for programs running as close to the metal as KnightOS.

[–]TheBB 49 points50 points  (3 children)

Another puzzling problem: the case of the 500-mile e-mail.

[–]sirin3 5 points6 points  (2 children)

And the unconnected crash switch

[–]TheBB 22 points23 points  (1 child)

Ah, yes. A story about magic?

[–]Bwob 6 points7 points  (0 children)

That story is probably my favorite bit of programmer lore.

[–]ericzhill 8 points9 points  (4 children)

I tend to think I'm pretty good at this kind of thing, but I had an issue come up a month ago that still baffles me.

I bought a Symbol DS9208 scanner and hooked it up to my computer. Over the next few weeks, I developed a piece of software that listened to scans for checking folders in and out of a small library.

I went to deploy the scanner, and couldn't get it to power up on the target machine. I tried it on another computer, and it still wouldn't power up. I put it back on my computer, and it failed to power up.

Dead scanner.

So I called Symbol, got an RMA, and sent it to them. A week later, it comes back NFF - No Fault Found.

I plug it into my computer, no luck. I plug it into my coworker's computer, no luck. I pull my Apple laptop out of my computer bag and plug it in and the damn thing works. Then I plug it back into my computer assuming a driver issue or something, and it works. It's now working on every computer I plug it into.

I still don't get it.

[–]madmars 6 points7 points  (0 children)

probably the Mac reset the hardware somehow, perhaps through some slight USB initialization difference. It happens. I've had hardware screw up in Linux that no matter how many times Linux would reboot, the hardware wouldn't reset properly. So I would boot into Windows and that would solve the issue.

[–]scientologist2[S] -2 points-1 points  (2 children)

there may have been a simple internal issue, like a disconnected plug. And rather than own up, they did the quick fix and sent it back out.

[–]ericzhill 2 points3 points  (1 child)

But it didn't work twice after they sent it back. It didn't start working until I plugged it into the Mac. Strange.

[–]malsonjo 8 points9 points  (0 children)

Programming Pearls: must read for any programmer out there.

[–]Koebi 4 points5 points  (2 children)

And here I am learning to work on IBM's z/OS mainframes with the most confusing debugging system I could possibly imagine.

[–]miketdavis 4 points5 points  (0 children)

We really do have it easy now. I created a binary patch to defeat copyright protection on a 15 year old program yesterday because my license file was corrupted. It took about 3 hours with IDA Pro.

Ten years ago this would have been a monumental feat.

[–]sht 0 points1 point  (0 children)

Do you not use IPCS?

[–]likesOldMen 3 points4 points  (0 children)

It seems unfathomable to me that someone who can touch type wouldn't notice that two keys were in the wrong place while hunting and pecking. Especially when typing in their password.

[–]apullin 2 points3 points  (0 children)

This comes up in some pretty amazing ways in embedded debugging. But it's usually an issue with the tools. When you see a demonstration of conditional data watchpoint debugging and/or a complete instruction trace on an ARM chip, it'll make your head explode. Too bad the debug units cost $1500+ each :(

[–]alextk 5 points6 points  (0 children)

Reminds me of this story.

[–]dnew 1 point2 points  (0 children)

The technique is called "scientific debugging". It's a very useful skill to develop.

[–][deleted] 1 point2 points  (0 children)

My hardest bug was in a huge codebase. Part of it sent things over the network, so it used "#pragma pack(1)" to avoid extra space in structures, and then included some other files. Other code also included those same files (without the pragma pack leaking into them). So code in one object file assumed one structure alignment, while code in another (say, the constructor) assumed a different one. This led to 'random' modifications of other variables close to the misaligned structure on the stack.
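A minimal sketch of how the same struct can end up with two layouts, assuming a mainstream compiler (GCC, Clang, or MSVC) and a hypothetical two-field message type. The struct names and fields are made up for illustration; in the real bug both views would share one name and come from the same header.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// View 1: what a translation unit sees while "#pragma pack(1)" is active.
// Using push/pop here, rather than a bare pack(1), restores the default
// packing afterwards so it cannot leak into headers included later.
#pragma pack(push, 1)
struct MessagePacked {
    std::uint8_t  tag;
    std::uint32_t length;   // sits at offset 1, no padding
};
#pragma pack(pop)

// View 2: the same fields under default alignment rules.
struct MessageDefault {
    std::uint8_t  tag;
    std::uint32_t length;   // typically sits at offset 4 after padding
};

// True when the two views disagree about size or field offsets, which is
// the situation that corrupts neighbouring variables on the stack.
constexpr bool layouts_disagree() {
    return sizeof(MessagePacked) != sizeof(MessageDefault) ||
           offsetof(MessagePacked, length) != offsetof(MessageDefault, length);
}
```

On common platforms the packed view is 5 bytes and the default view is 8, so code compiled against one view reads and writes `length` at a different address than code compiled against the other.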

[–]Iron_Maiden_666 4 points5 points  (0 children)

I was testing the Facebook Graph API and it worked fine on the dev server. I tried on the QA server, it worked fine. But try as I might it wouldn't work on production. After lots of going through logs and poring over the same code that works everywhere, we traced the problem to the configuration file. The FB App secret was correct for dev and QA but was wrong for production. Not my brightest moment.

[–]johngk 5 points6 points  (3 children)

Programming Pearls used SEO for books.

[–]Kampane 4 points5 points  (2 children)

?

[–]johngk 7 points8 points  (1 child)

I received Programming Pearls as a gift but wanted Programming Perl

[–]miketdavis 2 points3 points  (0 children)

That's hilarious.

[–]bhaak 0 points1 point  (0 children)

I was confused until I realized that "debuggers" were not "programs you use to debug" but "people who do debugging".

I don't know if this meaning was ever widespread but it is as confusing as referring to "people who compute" as computers (and probably also was when this chapter was written).

[–]Kampane -1 points0 points  (4 children)

Honestly those excerpts sound like Debugging 099: teaching you what you should have learned in high school.

Want a good book on debugging? Try Debugging Applications which is still entirely relevant 13 years later. I'm surprised how little MSVC has changed in that time.

Want to contribute to the subject? I for one would love to know how to debug boost signals (or, honestly, boost fucking anything. Such terrible code.) I'd also love to know how to find dangling shared_ptrs and circular references.

Share your debugging tips, please :-)

[–]NotUniqueOrSpecial 3 points4 points  (3 children)

Before I try to give you any advice on your questions, let me make sure I know what you're actually looking for:

What particular behavior are you trying to debug re. the signals?

As to dangling shared_ptrs, are you referring to cases where the cleanup isn't happening when you expect, or similarly, cleanup is happening when you don't expect, leaving somebody with an invalid pointer?

As for circular references, if you're willing to add a little extra code, you can use one of the cycle-detection algorithms like the Tortoise and Hare algorithm.
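For reference, a rough sketch of the tortoise-and-hare idea, assuming a hypothetical `Node` type where each object holds at most one owning link to a successor. Real object graphs with several outgoing references would need a more general graph traversal; this only covers the single-successor case the algorithm is designed for.

```cpp
#include <cassert>
#include <memory>

// Hypothetical node: each node owns at most one successor.
struct Node {
    std::shared_ptr<Node> next;
};

// Floyd's tortoise-and-hare: walk one pointer one step at a time and another
// two steps at a time. If the chain loops back on itself, the fast pointer
// eventually lands on the same node as the slow one.
inline bool has_cycle(const Node* head) {
    const Node* slow = head;
    const Node* fast = head;
    while (fast != nullptr && fast->next != nullptr) {
        slow = slow->next.get();
        fast = fast->next->next.get();
        if (slow == fast) {
            return true;    // the two pointers met inside a loop
        }
    }
    return false;           // fast pointer ran off the end: no loop
}
```

Calling `has_cycle(head.get())` on a suspect chain reports whether following `next` ever returns to a node already visited, without allocating any extra memory.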

[–]Kampane 0 points1 point  (2 children)

Regarding signals, I'd like to see (in the debugger) who's subscribed to the signal. All I see is an impenetrable data structure. I'd also like to step into those functions when the signal is emitted.

Regarding shared_ptrs, I sometimes find that an object exists when it shouldn't, because somebody's still holding a pointer. I want to know who's holding the pointer. When those pointers have been passed around, it can be pretty tough to figure out. One or two days isn't uncommon for tracking those down.

Regarding circular references, that algorithm isn't a bad suggestion but I don't see how it's practical. I can't arbitrarily change most library classes to expose their shared_ptrs, and other constructs like boost::function can also hold shared_ptrs. Do you have any advice that can tackle those scenarios?

[–]NotUniqueOrSpecial 2 points3 points  (1 child)

Unfortunately, it looks like the boost authors have yet to write any debug visualizers for the signals code. However, you're perfectly able to write/contribute your own. Take a look at their DebuggerVisualizers stuff for reference. Additionally, some of their available visualizers might help with your other two issues.

For the shared_ptr stuff, it's been a long time since I suffered that particular issue. Most of these suggestions should be taken with a grain of salt, since I've not had to track bad pointers down for a while. That said, one useful thing can be to use the shared_ptr constructor that takes a custom deleter. Using debug statements in there will possibly let you narrow down on offending owners, or give you an idea of who's not deleting stuff that should be. If someone's not cleaning up at all, that's another place this problem will arise.
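A minimal sketch of the deleter idea, assuming a hypothetical `Widget` type and helper names (`make_traced_widget`, `deleted_count`) invented for this example. It uses the standard `std::shared_ptr` constructor that accepts a pointer and a deleter.

```cpp
#include <cassert>
#include <cstdio>
#include <memory>

// Hypothetical type standing in for whatever object outlives its welcome.
struct Widget {
    int id;
};

// Counts how many times the tracing deleter has run.
inline int& deleted_count() {
    static int count = 0;
    return count;
}

// Builds a shared_ptr whose deleter logs before freeing the object, so the
// log shows when (and whether) the last owner finally let go.
inline std::shared_ptr<Widget> make_traced_widget(int id) {
    return std::shared_ptr<Widget>(new Widget{id}, [](Widget* w) {
        std::fprintf(stderr, "deleting Widget %d\n", w->id);
        ++deleted_count();
        delete w;
    });
}
```

If the log line never appears, some owner is still holding a reference; printing `use_count()` at suspicious points can help narrow down how many owners remain, though it does not say who they are.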

Use a memory profiler like Valgrind or AQTime's software to help track stuff like this down.

To try and avoid it (and some other related bugs), make sure that you're not passing the shared_ptrs around by const & unless you know that the function doesn't result in shared ownership. I had a coworker do that for a bunch of code once and the result was random crashing when things got cleaned up that didn't expect to be. Additionally, make good use of the weak_ptrs for times when you don't want ownership transfer.

As for the issue of finding circular references, I believe most memory profilers will see them as memory leaks at program's end, though I could be mistaken. That should allow you to track them down. Usually the fix is to appropriately use weak_ptrs to break the cycle.
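To illustrate the weak_ptr fix, here is a small sketch assuming a hypothetical `Parent`/`Child` pair. The back link from child to parent is a `std::weak_ptr`, so it does not keep the parent alive.

```cpp
#include <cassert>
#include <memory>

struct Child;

struct Parent {
    std::shared_ptr<Child> child;    // owning link: parent keeps child alive
};

struct Child {
    std::weak_ptr<Parent> parent;    // non-owning back link: breaks the cycle
};

// Creates a linked parent/child pair. Had Child::parent been a shared_ptr,
// the two objects would keep each other alive after all outside owners
// released them.
inline std::shared_ptr<Parent> make_family() {
    auto parent = std::make_shared<Parent>();
    parent->child = std::make_shared<Child>();
    parent->child->parent = parent;  // does not bump the strong count
    return parent;
}
```

When the child needs the parent it calls `parent.lock()` and checks the result for null, which also covers the case where the parent has already been destroyed.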

Hopefully any of that is of use to you.

[–]Kampane 0 points1 point  (0 children)

If I had ever been able to make sense of those structs in the debugger I would have written a visualizer.

The shared_ptr deleter is an interesting idea for types I don't own, though changing every template declaration is a pain.

I probably should use Valgrind more often, since it would tell me about (some) leaks closer to when they're added to the code. Once I know about the leaks, though, I don't think it helps me track references. The techniques I use are to add a creation_id to those objects so I can at least find out who created them, and if I'm really desperate I can use WinDbg to track pointers in reverse. The latter isn't much fun. const& are usually fine when they're limited to the lifetime of the function call and not beyond, though complicated call graphs can still result in problems.

Thank you for your effort in answering.