Tracking down the SSB Dynarec Bug
Yesterday I said I'd provide some more details about the Super Smash Bros. dynarec fix. The actual fix is fairly straightforward, but I thought the process of tracking down the issue was quite interesting and worthy of a couple of blog posts.
When I first started looking at SSB I noted that although the game ran fine without dynarec, it would always hang when trying to enter the main entry with dynarec enabled.
I've been programming professionally for around 6 years now and I can safely say that debugging dynarec bugs is one of the hardest categories of problems I've ever had to work on. For a start, because the code is generated on the fly, you don't have the luxury of source level debugging, and without spending time reverse engineering the original rom image, you don't even know what the generated dynarec code is meant to be doing. It's very much like working blindfolded.
And it gets even worse. I've fixed dynarec problems in the past which were the result of generating incorrect code for a fragment over 500 million instructions into emulation. This would be bad enough, but it can be many thousands of instructions later before this causes emulation finally diverges from the correct path. Just identifying the exact point at which the emulation starts to diverge from the correct sequence of instructions can be like finding a needle in particularly large haystack. While blindfolded
Over the years of trying to debug problems like these I've built up a set of tools and learned a few tricks along the way which you might find quite interesting. Although I'm going to talk about them in the context of tracking down this dynarec issue, I've found some of the techniques useful in solving other problems so you might find other ways of applying them too.
One of the first things I do when trying to identify a dynarec issue with Daedalus is to see if the problem is reproducible on the PC build of the emulator. Although it is possible to use GDB with PSPLink, I've never got this up and running and I'm much more comfortable debugging with Visual Studio. Also, working with the PC build is usually much faster than working with the PSP build (debug builds run around 10x faster on the PC, and build times are much quicker.)
Not all dynarec issues can be debugged in this way - the PSP and PC builds have different code generation back-ends (i.e. MIPS and x86 code generation respectively) so bugs in the MIPS code generation won't usually be reproducible in the PC build. The dynarec system in Daedalus shares a common frontend (trace selection and recording) between the two platforms, which means that if I can reproduce the problem on both platforms, I can narrow down the likely location of the bug to this area.
Fortunately this particular bug manifested itself in both the PC and the PSP builds, so I knew that if I fixed the bug on the PC build, it should fix the PSP build too. What I needed to find out next is what the emulator was doing differently when dynarec was enabled compared to when it was disabled.
If dynarec is running without errors, then the sequence of executed instructions should exactly match that executed with dynarec disabled. If I could log details about all the instructions executed with dynarec disabled, and again with dynarec enabled, I should be able to compare the two logs to figure out the exact point at which dynarec is going out of sync. This all relies on the fact that the emulator is totally deterministic, i.e. that running the emulator twice in succession with the same settings should give exactly the same results.
Unfortunately, for a variety of reasons my dynarec solution doesn't produce identical results to interpretation, the main reason being that for performance reasons I can only handle vertical blank and timer interrupts on the boundaries between fragments. For example, with dynarec disabled, the first vertical blank interrupt might occur exactly on the 625,000th instruction, but with dynarec enabled with might not occur until the 625,015th instruction. This means that the logs diverge at the instant the first VBL fires, and never regain synchronisation.
When I was originally developing the new dynarec system I put a lot of effort into writing a fragment simulator, the idea being that rather than executing the native assembly code for a given trace, I could keep track of the instructions making up the trace and interpret these individually instead. Theoretically fragment simulation is identical to dynarec code execution, even down to the way I handle VBLs and timer interrupts, and it's been very useful at identifying bugs in the dynarec code generation. What's particularly useful about fragment simulation however is that I can enable a setting which makes it handle interrupts exactly in the same way as the non-dynarec core, i.e. interrupts are handled precisely rather than on fragment boundaries.
Essentially Daedalus has four modes of operation:
* Dynarec + fragment execution
* Dynarec + fragment simulation (imprecise interrupt handling)
* Dynarec + fragment simulation (precise interrupt handling)
* Interpretative core
This tool is particularly powerful, because if I can ensure that dynarec+fragment execution is equivalent to dynarec+fragment simulation, and that dynarec+fragment simulation is equivalent to running the interpretative core, then I can use the transitive properties of these relations to ensure that dynarec+fragment execution is equivalent to running the interpretative core. Fragment simulation allows me to bridge the gap between these two modes of operation which would otherwise be very difficult to compare.
I think that's long enough for one post. Tomorrow I'll talk about how I used this technique to help track down the SSB dynarec bug.
-StrmnNrmn