Get Crash Reports from Your Users - Automatically

by Joel Spolsky

I used to work as a developer on the client software for one of the largest American ISPs. Our software was used by literally millions of people. Even the rarest bug had the potential to affect hundreds or thousands of users. Yet I felt extremely confident when we decided to push the button releasing our latest code. I remember telling my dad, "The beta is looking great; yesterday we only had twelve crashes in all of North America."

Twelve, huh? Not, say, thirteen?

Nope. Twelve. We were using an increasingly popular technique of collecting crash reports from the field automatically, and summarizing them in our bug tracking database, allowing developers to find and fix bugs that only happened in the field. These were usually the kinds of bugs we would never have caught in the test lab, since we couldn't possibly reproduce every bizarre PC configuration our customers might have.

This crash detection and reporting is starting to become more common. Now that Internet Explorer and Windows XP have it built-in, customers are starting to expect their desktop software to take care of reporting its own crashes.

The confidence you get from finding out about every crash, anywhere in the world, is crucial to delivering a high quality product that needs to be used in the wild. For those of us in the consumer software business it's absolutely critical. You can't rely on your customers to tell you about crashes—many of them may not be technical enough, and most of them won't bother to take time off of their own important work to give you a useful crash report unless you make it completely automatic.

Now that I have my own company, I'm a big believer in this technique. Virtually all the code we write at Fog Creek Software has some way of reporting errors back to the development team via our FogBugz bug tracking database. That includes code that we ship to customers, such as CityDesk, a Windows application, and FogBugz itself, which is a web-based application that our customers install on their own web servers. Both of them report crashes via the Internet. Even software that we write for internal use only, such as the e-commerce software that runs the Fog Creek online store, notifies the development team if a crash occurs.

Anatomy of a Crash

OK, so your code crashed. In almost every programming environment, there's some way to recover from the crash at a central location.  At this point, instead of letting the program die, we display a dialog box (see Figure 1).

 

Figure 1: Our automated crash reporting dialog box is short and to the point.

One thing I've learned over the years is that the more questions you ask people, the less likely they are to answer. So we only ask the bare minimum number of questions that we think will help us diagnose the problem. What were you doing? What is your email address? We emphasize that providing an email address is optional, to alleviate privacy concerns. It's amazing what superstitions persist: in the consumer marketplace, there are lots of people who have been taught by the 11 o'clock news never to give out their email address, to avoid spam, and if you require an email address some percentage of these people just won't send in the report.

Almost any other information that you consider important can probably be obtained automatically, for example, what version of operating system they are using, how much RAM they have, etc.

It's important to emphasize the anonymity and privacy of this crash submittal. People working on confidential data may not be willing to submit a crash report if they suspect we're about to upload all their sensitive work, so we provide a link that users can click on to see exactly what we're going to transmit. To avoid even the appearance of impropriety, be careful to tell people about any automatically gathered information that you're transmitting, too.

Collecting Data

The next question is, what data should we collect that will help our developers find the crash? There is a temptation to grab everything. Every bit of system information you can find. The versions of every DLL and COM control on the user's system. Even a complete memory dump (a.k.a. core dump).

After several years working as a developer and never quite knowing what I'm supposed to do with core dumps, I have discovered that collecting this data is not, actually, necessary. Instead, here is the data we collect:

That's it! Over the years we've found that knowing the exact line of code where the code crashed is enough information to fix almost any crash. For those rare cases where this isn't enough information, you can contact one of the users who experienced the crash via email and ask for any additional information that might help. The benefit of gathering so little information is that the crash reporting process is very fast, making users less impatient. Just checking the version numbers of all the DLLs and COM controls can take quite a while, especially when you factor in the upload time over modems, and very rarely provides useful information. Even if you discover that a certain crash only happens with a certain version of one of Microsoft's system DLLs, what are you going to do about it? You still have to fix the code to work around the crash.

Phoning Home

Thanks to the pervasiveness of the Internet, there's almost always one best way to send the information home: over the web. By sending a standard HTTP request, you will get your bug report past almost any kind of firewall customers may have in place. Virtually every programming environment now has built-in libraries to send an HTTP request and get the response back. For example, on Windows, there are built in functions in the WININET library that use Internet Explorer's network transport code to send an HTTP request and get the response. The best thing about these functions is that even if the user has configured his web browser to go through a proxy server, which is common inside firewalls, the WININET calls will automatically go through the proxy server, with no additional work on your part.

For the response part of the HTTP request, our FogBugz server returns a super short XML file which indicates that the report was received, and includes a message which is displayed to the user. More about that in a moment.

If your application is web based, there's something even easier you can do: display a web page containing a form that submits data to your server. It's as simple as changing the action attribute of the form tag to point to a URL on your bug tracking server. See Figure 2.

Figure 2: the web page that appears when FogBugz is about to crash. The HTML for this page contains a form which sends the complete crash report to a URL at Fog Creek Software that collects the crash data.

For certain types of applications, instead of sending the crash report right away, you may want to try writing it out to a file or the registry, and then sending it the next time the user launches the program. I call this technique delayed transmission. Although this will delay the report a bit, it has the advantage that if the crash was severe enough that the application is too messed up to transmit a bug report, you'll still get the report.

All crash reports arrive at Fog Creek via a single URL on our public-facing server. Our bug tracking database receives bug reports via this unique URL. In fact that URL is the only public access to our database; everything else is locked out, so people can submit bugs, but they can't get into the database.

Figure 3: What a bug reported by BugzScout looks like in FogBugz. Note the BugzScout-specific fields "Scout Msg" and "Scout Will", shown highlighted.

Figure 3 shows what a bug looks like when it arrives in our database. We could have set it up so that the bug report automatically goes to a designated member of the development team, but these days rather than interrupting one person, we've set up a virtual person called "CityDesk New Bug." Every once in a while when we want to sift through crash reports, we can just search for all bugs assigned to this virtual person, and decide whether to fix them or not. The ones that we decide to fix are then assigned to a real person.


Identifying Duplicate Crashes

An important aspect of automatic crash collection is that the same crash will probably happen many times to many people, and you don't really want a new bug in your database for every duplicate of the crash. We handle this by constructing a unique string that contains key elements of the crash data.

We are careful to construct this string in a way such that two people with the same crash generate the same string. After some experimentation, we found that the best way to do this is to include the error number, file name, function name, line number, and the version of our software in that string. In Figure 3, the unique string is "Error 91 (global:IsRoot:0) V1.0.32". That means error number 91 occurred in the file named global.bas, in the function IsRoot, on line 0, running version 1.0 of the software, build 32. Incidentally, we always use even build numbers for internal builds that we don't send out to customers, and odd build numbers for all builds that go to customers, so I can tell at a glance that this particular crash happened to a developer and not to a customer.

FogBugz will automatically append future crashes with the same unique string to this case, rather than opening a new case. This helps the programmer see all the duplicates of the same crash in one place.

Designing the format of the unique string can be tricky. In the past, we included the entire text of the crash error message in this string. However, we soon discovered that the error message was translated into different languages. So for every crash report, we would find out about it separately in English, German, Spanish, French, and a few other languages I can't identify! We solved that problem by putting the error message in the body of the crash report but only putting the unique error number in the title, which doesn't vary from language to language, so we get far fewer dupes.

We also set up the title in such a way that it can be easily searched for particular problems. Since we use the format filename:function:lineNumber (note the colons) in the title, it's easy to search for bugs in a particular function just by searching for ":function:". We prepend the letter V to our version number for the same reason; you can search for V1 or V1.0 or V1.0.32. If we had left out the V, a search for version 1 would yield every bug report that happened to have a 1 anywhere in the title.

Once the bug is identified, we can change a flag (Scout Will in the FogBugz interface) from "Continue Reporting" to "Stop Reporting" after which future crashes with the same unique string will just get ignored. We can even set a text message (Scout Msg in the FogBugz interface appears for automatically submitted cases) which will be sent back to all users in the future who have this crash. We use this to suggest workarounds when we've found them. Like, "Hey! Next time don't forget to pat your head and rub your tummy before you save!"

One common cause of duplicate reports is when a crash occurs in the crash handling code itself. This doesn't necessarily mean the crash handling code is buggy - it might just be because the original crash has scrambled something so badly that no code can run successfully any more.

Debugging

During beta testing periods, we try to look at each crash report right away. When the user provides their email address, developers can hit the Reply button in FogBugz to send them an email message on the spot if they need additional information. FogBugz automatically keeps a copy of all the correspondence related to this bug, incoming and outgoing, in the bug report itself.

Once the product is shipping and developers are working on the next major release, they usually can't find time to look at every incoming crash report. Instead we tend to wait a few months to see what crashes are most common, and we only work on the crashes that occurred most often. The disadvantage is that you can't really correspond with a user asking questions about a crash they had several months ago: they just won't remember enough details. But I've found that if the same crash has happened several times, inevitably one of the users has given me enough clues about what they were doing that I can repro the bug in the lab. Indeed, it's very rare to know that a given line of code has crashed without having a good idea for what the problem might be, even if it's hard to figure out the repro case. Once I literally worked my way backwards through the code doing arithmetic and using applied logic to figure out repro steps. Hmm, if this is crashing, then this value must be negative. If it's negative, then this IF statement must have been true. And so on, until I figured out what combination of values led to the crash and realized what must have caused them.

Triage

As soon as you build an automatic crash reporting system, you're going to get a pretty steady stream of crash reports. So good triage skills - deciding which bugs are most important to fix, and ignoring the others - are more important than usual.

CityDesk has about 20,000 copies in use and most of our customers tell us it's rock solid, but still we get a couple of crash reports every day, but many of them only happened once. When I investigated these, I usually discover various signs of bugs that we're probably never going to fix. For example:

And sometimes, you just can't figure out what caused the crash. Especially with crashes that only happened once. That's life. It's important not to get too bogged down in fixing every crash you see. You can get a lot more bang for the buck by focusing on the common crashes. In fact my policy is that I won't even look at a crash that only happened once. We've got bigger fish to fry. If it's not reproducible in the field, it's not likely to be reproducible in the lab.

Shrink-wrapped Versus In-house Software?

So, now that you've seen what's involved, is crash reporting for you? The answer, to some extent, depends on your user base.

If you're developing shrink-wrapped or consumer software, quality is, quite literally, a competitive advantage. And your software will be running in a hostile environment. Consumer PCs are a mess. No two are the same. They've all got slightly different hardware and software configurations. PC companies ship these things with every imaginable piece of junkware preinstalled. And a lot of consumers gleefully download and install every shiny new object they can get their hands on, including those oh-so-clever utilities that actually inject themselves into the process spaces of other running applications. And most home users don't know enough about computers to keep their systems operating well. In such a hostile environment, automatic crash reporting is the only way to get to the level of quality that the market demands.

On the other hand, if you're developing corporate software for in-house use, you're probably not going to get as much value out of automatic crash reporting. Corporate software is usually written to solve a particular problem, at great expense relative to off-the-shelf software. Once the problem is solved, it's not worth spending any more money on that particular project. If the code crashes once a week, it can be annoying, but there may be no business justification to spend several thousand dollars having a developer fix it. It might be nice but it would not be profitable. Idealistic software developers on in-house projects are often disappointed to discover that as soon as their code is "good enough," their managers tell them to stop working on it. It has solved the business problem, even if the quality could be better, and any marginal work has zero return on investment. Still, many corporate software developers are forced to work literally without any QA or testing staff at all, and automatic crash detection may be the only way to get any kind of bug reports at all.

In conclusion, building a robust system to handle crashes from the field, report them, classify them, and track them will delight your customers and pay for itself many times over in the quality of code that you ship.

How Do I Get This Working With FogBugz?

Included in every FogBugz installation is a file called scoutSubmit.asp (that's scoutSubmit.php in non-Windows versions of FogBugz).  This file is the entry point for all automatic bug submissions.  You can create an HTTP post with the correct form values and post directly to this file on your web server.  Alternatively, you can use our free BugzScout ActiveX control which will package up the values for you and submit them via HTTP to your FogBugz installation. See the this article for more details about the BugzScout files that are included with FogBugz. 

Appendix One: Handling Crashes in Visual Basic Code

Because a typical Visual Basic 6.0 event-driven program has so many entry points (one for each event that you handle), the only way to catch crashes which occur anywhere in the code is to add error handling to every function. Here's what all our functions look like:

Private Sub cmd_Click()

On Error GoTo ERROR_cmd_Click
' the actual code for the function '

Exit Sub

ERROR_cmd_Click:

HandleError "moduleName", "cmd_Click"

End Sub

Adding this code can be a pain; luckily there's a utility you can get called ErrorAssist which will add error-trapping code to all your functions automatically. In every case, we just call a global function called HandleError which displays our custom crash dialog. The crash dialog contains the BugzScout ActiveX control. This is a small ActiveX control which comes free with FogBugz, which can be used to transmit bug reports into a FogBugz bug tracking database.

Appendix Two: Handling Crashes in Windows API Code

The Win32 API contains a concept called structured exception handling. When a crash occurs, Windows searches for the current unhandled exception handler function and calls it. If there is no unhandled exception function, it will display the usual user-aggressive "This program has performed an illegal operation" dialog box.

To install your own unhandled exception function, you have to do two things. First, implement your own function of the form

LONG UnhandledExceptionFilter( STRUCT _EXCEPTION_POINTERS *ExceptionInfo );

Then call the SetUnhandledExceptionFilter function, passing in a pointer to your UnhandledExceptionFilter function.

Another way to accomplish the same thing from C++ code is to surround main entry point with a __try/__except clause. Notice the two underlines which cause the compiler to handle structured exceptions which come from low level failures like dereferencing null pointers, not the garden variety C++ exceptions you throw yourself and handle with try and catch.

Search for Using Structured Exception Handling in the Windows Platform SDK or MSDN for more details.

Appendix Three: Handling Crashes in ASP Applications

Microsoft's Internet Services Manager lets you set up a custom error handling page, either HTML or ASP, which is processed for any scripting errors that aren't handled with On Error statements. In particular, when an ASP (Active Server Pages) application crashes or has an unhandled error of any sort, the page is redirected to the error handler for "500;100" errors. In all our applications, we have an ASP page set up to catch 500;100 errors.

That page contains the following key bit of VBScript code:

Set objASPError = Server.GetLastError

Now you'll have an object called objASPError which contains lots of useful data about the crash that just occurred, including the file and the line number.

Sample code for handling ASP errors comes preinstalled on every computer with IIS: look in the \windows\help\iishelp\common directory for a file called 500-100.asp. This simply displays the details of the ASP error to the end-user. Using the 500-100.asp file as a starting point, you can create your own customized message page, containing a form with hidden elements containing the crash data. The form should have an action attribute which submits all the error information to another web page. If you're using FogBugz, you just direct the form to

http://<your FogBugz URL>/scoutSubmit.asp

to open a new case in your bug database.