I was playing around with the code generation a couple of evenings ago, and realised that if I made a certain assumption, I could drastically speed up specific types of memory accesses.
When I discussed load/store handling on Sunday, I presented the new code that is typically generated for handling a load such as 'lw $t0, 0x24($t1)' on the N64:
ADDIU a0 = s1 + 0x0024 # add offset to base register
SLT t0 = (a0 BEQ t0 != r0 --> _Trampoline_XYZ123 # branch to trampoline if invalid
ADDU a1 = a0 + s7 # add offset to emulated ram
LW s0 <- 0x0000(a1) # load data
(I'll ignore all the extra code which is generated, and just concentrate on the 5 instructions above which correspond to the expected path of execution.)
Of the 5 instructions that are generated, two - the SLT and BEQ - are just there for performing error handling in the case that the address is invalid, a hardware register (i.e. for memory-mapped I/O), or a virtual address. I'll call this error handling for short.
If we were generating code in an idealised environment where we didn't have to perform error handling, we could drop the SLT/BEQ instructions to get this:
ADDIU a0 = s1 + 0x0024 # add offset to base register
ADDU a1 = a0 + s7 # add offset to emulated ram
LW s0 <- 0x0000(a1) # load data
We could then optimise this even further and perform the offset calculation directly as a part of the LW instruction:
ADDU a1 = s1 + s7 # add offset to emulated ram
LW s0 <- 0x0024(a1) # load data
In this idealised situation we could reduce an emulated load to just two instructions, with no branches. That's a pretty good saving!
The problem is that the environment we're generating code from is not 'ideal', and it's hard to know in advance of time which memory accesses are going to directly access physical ram, and which are going to access hardware registers or require virtual address translation. For that reason, we have to place a guard around every memory access to make sure that it behaves correctly. At least, that was the way I was thinking until earlier in the week.
What I realised on Monday is that I can make an assumption that lets me remove the error handling code for certain types of load/stores. The assumption is that when the N64 accesses any memory through the stack pointer ($sp) register, the address is always going to be valid, physical memory.
The assumption relies on the fact that most roms don't do anything particularly clever with their stack pointers - it gets set up for each thread to point at a valid region of memory then the game just runs along, pushing and popping values from it as the code executes. Of course, if the assumption is wrong then the emulator will just crash and grind to a halt in a unpredictable manner
It was straightforward to add a hack to the code generation to exploit this kind of behaviour, and the results have been better than I expected - I'm seeing at least a 10% speed up, and the code expansion factor (the ratio of generated bytes of instructions to input bytes) has dropped from around 5.0x to 4.0x. Stability has been excellent too - I've run about 8 roms with the hack so far, and all of them have run perfectly.
I think one of the reasons the hack has such an impact is that a lot of the memory accesses in a typical C program are through the stack. Here's an example snippet from the entry to a function on the N64, where the compiler emitted code to store the return address and arguments:
SW ra -> 0x0014(sp)
SW a0 -> 0x0058(sp)
SW a1 -> 0x005c(sp)
SW a2 -> 0x0060(sp)
When I look through disassembly for the roms I'm working on, it's very common to see lots of sequential loads/stores relative to the stack pointer like this.
Previously Daedalus generated around 20 instructions (including 5 branches) for the above snippet. With the hack, the generated code now looks like this:
ADDU t1 = s1 + s7
SW s4 -> 0x0014(t1)
ADDU t1 = s1 + s7
SW s3 -> 0x0058(t1)
ADDU t1 = s1 + s7
SW s2 -> 0x005c(t1)
ADDU t1 = s1 + s7
SW s5 -> 0x0060(t1)
8 instructions, 0 branches. What's more, it looks like with a little work, I could eliminate 3 redundant address calculations:
ADDU t1 = s1 + s7
SW s4 -> 0x0014(t1)
SW s3 -> 0x0058(t1)
SW s2 -> 0x005c(t1)
SW s5 -> 0x0060(t1)
Now that would be efficient
I still want to do lots of testing with the hack. I want to find out if there are roms that don't work with the hack enabled, and how common a problem it is. It's such a significant optimisation though that I'm certain I'll be adding it as an option in Daedalus R13. The results of my testing will probably determine whether I default it to on or off though.
So far Daedalus R13 is shaping up to be significantly faster than R12. I'm still not sure when I'll be ready to release it, but you'll hear about it here first.
-StrmnNrmn