|
August 13th, 2007, 03:16 Posted By: Wally
StrmnNrmn stopped by his blog to post an interesting article about R13 which includes things like trampolines and nice speed ups.
To sum it all up he said he got close to a 15% speed up.
Heres the article:
Dynarec Improvements
I've had a fairly productive week working on optimising the Dynarec Engine. It's been a few months since I worked on improving the code generation (as opposed to simply fixing bugs), so it's taken me a while to get back up to speed.
At the end of each fragment, I perform a little housekeeping to check whether it's necessary to exit from the dynarec system to handle various events. For instance, if a vertical blank is due this can result in me calling out to the graphics code to flip the current display buffers. The check simply involves updating the N64's COUNT register, and checking to see whether there are any time-dependent interrupts to process (namely vertical blank or COMPARE interrupts.)
I had an idea on the train into work on Monday I realised that there were a couple of ways in which I could make this more efficient. Firstly, the mechanism I was using to keep track of pending events was relatively complex, involving maintaining doublely-linked lists of events. I realised that if I simplified this code it would make it much easier for the dynarec engine to update and check this structure directly rather than calling out to C code.
The other idea I had on the train was to split up the function I was calling to do this testing into two different versions. There are two ways that the dynarec engine can be exited - either through a normal instruction, or a branch delay instruction (i.e. an instruction immediately following a branch.) My handler function catered for both of these cases by taking a flag as an argument. I realised that by providing a separate version of this function for each type I could remove the need to pass this flag as an argument, which saved a couple of instructions from the epilogue of each fragment.
These two small changes only took a couple of hours to implement, but yielded a 3-5% speedup on the various roms I tested. They also slightly reduced the amount of memory needed for the dynarec system, improving cache usage along the way.
The next significant optimisation I made this week was to improve the way I was handling the code generation for load/stores. Here's what the generated code for 'lw $t0, 0x24($t1)' looks like in Daedalus R12 (assume t1 is cached in s1, and t0 is cached in s0 on the PSP):
ADDIU a0 = s1 + 0x0024 # add offset to base register
SLT t0 = (a0<s6) # compare to upper limit
ADDU a1 = a0 + s7 # add offset to emulated ram
BNEL t0 != r0 --> cont # valid address?
LW s0 <- 0x0000(a1) # load data
J _HandleLoadStore_XYZ123 # handle vmem, illegal access etc
NOP
cont:
# s0 now holds the loaded value,
# or we've exited from dynarec with an exception
There are a couple of things to note here. Firstly, I use s6 and s7 on the PSP to hold two constants throughout execution. s6 is either 0x80400000 or 0x80800000 depending on whether the N64 being emulated has the Expansion Pak installed. s7 is set to be (emulated_ram_base - 0x80000000). Keeping these values in registers prevents me from using them for caching N64 registers, but the cost is far outweighed by the more streamlined code. As it happens, I also use s8 to hold the base pointer for most of the N64 CPU state (registers, pc, branch delay flag etc) for the same reason.
So the code first adds on the required offset. It then checks that the resulting address is in the range 0x80000000..0x80400000, and sets t0 to 1 if this is the case, or clears it otherwise*. It then adds on the offset (emulated_ram_base - 0x80000000) which gives it the translated address on the psp in a1. The use of BNEL 'Branch Not Equal Likely' is carefully chosen - the 'Likely' bit means that the following instruction is only executed if the branch is taken. If I had used a plain 'BNE', the emulator could often crash dereferencing memory with the following LW 'Load Word'.
Assuming the address is out of range, the branch and load are skipped, and control is passed to a specially constructed handler function. I've called it _HandleLoadStore_XYZ123 for the benefit of discussion, but the name isn't actually generated, it's just meant to indicate that it's unique for this memory access. The handler function is too complex to describe here, but it's sufficient to say that it returns control to the label 'cont' if the memory access was performed ok (e.g. it might have been a virtual address), else it bails out of the dynarec engine and triggers an exception.
When I originally wrote the above code I didn't think it was possible to improve it any further. I didn't like the J/NOP pair, but I saw them as a necessary evil. All 'off trace' code is generated in a second dynarec buffer which is about 3MiB from the primary buffer - too far for a branch which has a maximum range of +/-128KiB. I used the BNEL to skip past the Jump 'J' instruction which can transfer control anywhere in memory.
What I realised over the weekend was that I could place a 'trampoline' with a jump to the handler function immediately following the code for the fragment. Fragments tend to be relatively short - short enough to be within the range of a branch instruction. With this in mind, I rewrote the code generation for load and store instructions to remove the J/NOP pair from the main flow of the trace:
ADDIU a0 = s1 + 0x0024 # add offset to base register
SLT t0 = (a0<s6) # compare to upper limit
BEQ t0 != r0 --> _Trampoline_XYZ123 # branch to trampoline if invalid
ADDU a1 = a0 + s7 # add offset to emulated ram
LW s0 <- 0x0000(a1) # load data
cont:
# s0 now holds the loaded value,
# or we've exited from dynarec with an exception
#
# rest of fragment code follows
# ...
_Trampoline_XYZ123:
# handler returns control to 'cont'
J _HandleLoadStore_XYZ123
NOP
The end result is that this removes two instructions from the main path through the fragment. Although in the common case five instructions are executed in both snippets of code, the second example is much more instruction cache friendly as the 'cold' J/NOP instructions are moved to the end of the fragment. I've heard that there is a performance penalty for branch-likely instructions on modern MIPS implementations, so it's nice to get rid of the BNEL too.
As with the first optimisation, this change yielded a further 3-5% speedup.
The final optimisation I've made this weekend is to improve the way I deal with fragments that loop back to themselves as they exit. Here's a simple example:
8018e014 LB t8 <- 0x0000(a1)
8018e018 LB t9 <- 0x0000(a0)
8018e01c ADDIU a0 = a0 + 0x0001
8018e020 XOR a2 = t8 ^ t9
8018e024 SLTU a2 = (r0<a2)
8018e028 BEQ a2 == r0 --> 0x8018e038
8018e02c ADDIU a1 = a1 + 0x0001
8018e038 LB t0 <- 0x0000(a0)
8018e03c NOP
8018e040 BEQ t0 == r0 --> 0x8018e058
8018e044 NOP
8018e048 LB t1 <- 0x0000(a1)
8018e04c NOP
8018e050 BNE t1 != r0 --> 0x8018e014
8018e054 NOP
I'm not sure exactly what this code is doing - it looks like a loop implementing something like strcmp() - but it's one of the most executed fragments of code in the front end of Mario 64.
The key thing to notice about this fragment is that the last branch target loops back to the first instruction. In R12, I don't perform any specific optimisation for this scenario, so I flush any dirty registers that have been cached as I exit, and immediately reload them when I re-enter the fragment. Simplified pseudo-assembly for R12 looks something like this:
enter_8018e014:
load n64 registers into cached regs
perform various calculations on cached regs
if some-condition
flush dirty cached regs back to n64 regs
goto enter_8018e038
perform various calculations on cached regs
flush dirty cached regs back to n64 regs
if ok-to-continue
goto enter_8018e014
exit_8018e014:
...
enter_8018e038:
...
The key thing to notice is that we load and flush the cached registers on every iteration through the loop. Ideally we'd just load them once, loop as much as possible, and then flush them back to memory before exiting. I've spent the day re-working the way the dynamic recompiler handles situations such as this. This is what the current code looks like:
enter_8018e014:
load n64 registers into cached regs
mark modified regs as dirty
loop:
perform various calculations on cached regs
if some-condition
flush dirty cached regs back to n64 regs
goto enter_8018e038
perform various calculations on cached regs
if ok-to-continue
goto loop
flush dirty cached regs back to n64 regs
exit_8018e014:
...
enter_8018e038:
...
In this version, the registers are loaded and stored outside of the inner loop. They may still be flushed during the loop, but only if we branch to another trace. Before we enter the inner loop, we need to mark all the cached registers as being dirty, so that they're correctly flushed whenever we finally exit the loop.
This new method is much more efficient when it comes to handling tight-inner loops such as the assembly shown above. I still have some work to do in improving my register allocation, but the changes I've made today yield a 5-6% speedup. Combined with the other two optimisations I've described, I'm currently seeing an overall 10-15% speedup over R12.
I'm quite excited about the progress I've made so far with R13. I still have lots of ideas for other optimisations I want to implement for R13 which I'll talk about over the coming days. I don't have any release date in mind for R13 at the moment, so there's no point in asking me yet
-StrmnNrmn
*The SLT instruction is essentially doing 'bool inrange = address >= 0x80000000 && address < (0x80000000+ramsize)'. I think the fact that this can be expressed in a single instruction is both beautiful and extremely fortunate
For more information and downloads, click here!
There are 159 comments - Join In and Discuss Here
|
|