Monday, 29 December 2014

Emulation of Z80 shadow registers

Following on from my previous posting, where I discussed how to model the Z80’s register pairs in QEMU, an emulator using dynamic code translation, this post raises the question of how to handle the Z80’s shadow register set. Firstly, I refer to Ken Shirriff’s blog post on how the Z80’s register file is implemented:
Note that there are two A registers and two F registers, along with two of BC, DE, and HL. The Z80 is described as having a main register set (AF, BC, DE, and HL) and an alternate register set (A′F′, B′C′, D′E′, and H′L′), and this is how the architecture diagram earlier is drawn. It turns out, though, that this is not how the Z80 is actually implemented. There isn’t a main register set and an alternate register set. Instead, there are two of each register and either one can be the main or alternate.
The shadow registers can be quickly swapped with the main ones by means of the EX AF,AF′ and EXX instructions. EXX swaps BC with BC′, DE with DE′ and HL with HL′ but does not swap AF with AF′. (This is even the source of a bug affecting MGT’s DISCiPLE interface, which would fail to properly save or restore the AF′ register when returning to a program.) The IX and IY registers are not shadowed. There’s also an EX DE,HL instruction which swaps DE and HL, but does not swap DE′ and HL′ (i.e. it only swaps DE and HL in the register set you’re currently using).

In a conventional emulator, you’d implement EX and EXX by swapping the main register set with the shadow set as might be suggested by many ‘logical’ descriptions of those instructions, rather than modelling the Z80’s approach of using register renaming through the use of four flip-flops, as described by Ken. There are a few reasons for doing it this way. Firstly, there’s a nasty patent out there which prohibits using pointers (or references) to switch between alternate register sets. Take something that’s been commonplace in hardware for donkey’s years and sprinkle the word ‘software’ on it (and maybe ‘pointer’ in place of ‘multiplexor’) and BANG — instant patent... anyway, I digress.

The other reason for this approach is one of performance. Typically a Z80 emulator will have separate code for handling every possible form of an instruction. That is to say, INC BC and INC DE are treated as completely unrelated instructions, even though their implementations are almost identical. The main loop of the emulator typically consists of a massive switch() statement with cases for the 256 possible byte values that might be read from the address held in the program counter. Instruction decoding and register decoding are performed by that single switch() statement, which is most likely compiled into a jump table. The last thing we would want to do is introduce unnecessary indirection in determining which register set to access. Note that for DE and HL, we would need two levels of indirection, one for selecting between the shadow register sets and another for determining whether DE and HL have been swapped in the currently active register set.
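
As a rough illustration of that style of main loop, here is a minimal sketch in C. The ToyZ80 structure and the toy_step() function are invented for this example and are not taken from Fuse or any other emulator; only a handful of opcodes are shown.

```c
#include <stdint.h>

/* Hypothetical CPU state, for illustration only. */
typedef struct {
    uint16_t bc, de, hl, pc;
    uint8_t mem[65536];
} ToyZ80;

/* Execute one instruction. Returns 1 if the opcode was handled,
 * 0 if it is not one of the few cases in this sketch. */
static int toy_step(ToyZ80 *z)
{
    uint8_t op = z->mem[z->pc++];

    /* One case per opcode byte: the register is baked into each case,
     * so no run-time lookup is needed to find it. */
    switch (op) {
    case 0x00:              break;  /* NOP    */
    case 0x03: z->bc++;     break;  /* INC BC */
    case 0x13: z->de++;     break;  /* INC DE */
    case 0x23: z->hl++;     break;  /* INC HL */
    default:   return 0;            /* not covered by this sketch */
    }
    return 1;
}
```

INC BC and INC DE end up as separate cases with near-identical bodies, which is exactly the duplication described above.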

(Note that Fuse, the Free Unix Spectrum Emulator, has a Perl script to generate the big 256-case switch statements, so as to avoid duplication of source code and any possible inconsistencies between cases (e.g. a typo affecting the emulation of an instruction for only one particular register). This script is pre-run before any source release, so developers running on Windows need not be too concerned unless they plan on modifying the Z80 core itself, and contributing their changes back.)

By swapping registers when emulating EX AF,AF′, EX DE,HL and EXX, we avoid all this extra expense when handling the majority of instructions, and incur a degree of additional overhead for the exchange instructions themselves. Programs are highly unlikely to use the exchange instructions frequently enough for this to be a net slowdown, and even if they did, performance would still be adequate as the cost of actually swapping the registers is not especially high. It’s pretty much a no-brainer: almost every Z80 emulator out there does it this way, and for good reason. I even did it this way in my Z80 front-end for QEMU because it was the simplest approach, because I had a few unresolved questions regarding the use of register renaming, and because I’m not sure how well that patent I mentioned applies to a dynamic translator. (I suspect it doesn’t, but IANAL. And ‘Software!’...)
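
For concreteness, the conventional swapping approach might look something like the following sketch. The SwapRegs structure and the function names are made up for illustration and do not come from my QEMU front-end or any other emulator.

```c
#include <stdint.h>

/* Hypothetical register file: main set plus shadow ("alt") set. */
typedef struct {
    uint16_t af, bc, de, hl;
    uint16_t af_alt, bc_alt, de_alt, hl_alt;
} SwapRegs;

static void swap16(uint16_t *a, uint16_t *b)
{
    uint16_t t = *a;
    *a = *b;
    *b = t;
}

/* EX AF,AF': only the accumulator/flags pair is exchanged. */
static void swap_ex_af(SwapRegs *r)
{
    swap16(&r->af, &r->af_alt);
}

/* EX DE,HL: affects the current set only; the shadows are untouched. */
static void swap_ex_de_hl(SwapRegs *r)
{
    swap16(&r->de, &r->hl);
}

/* EXX: exchanges BC, DE and HL with their shadows, but not AF. */
static void swap_exx(SwapRegs *r)
{
    swap16(&r->bc, &r->bc_alt);
    swap16(&r->de, &r->de_alt);
    swap16(&r->hl, &r->hl_alt);
}
```

Every other instruction can then refer to bc, de and hl directly, and only the exchange instructions pay for moving data around.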

So, why even consider the alternative? Well, for an emulator built upon dynamic translation, such as QEMU, the cost of implementing register renaming as is actually implemented by the Z80 will only be incurred at translation-time (i.e. the first time the code is run), and not at execution-time (i.e. every time the code is run). What’s more, the conventional approach may incur a cost at execution-time which could be saved through the use of register renaming.

Within a translation block (only one tiny part of the program is translated at a time, and translation is performed only when the code actually needs to execute), TCG will track the use of registers to hold results and perform its own (implicit) register renaming similar to that of many modern CPUs (rather than the explicit renaming through the use of an instruction, used by the Z80), so an instruction sequence such as ‘EXX; SBC HL,BC; EXX’ can easily be transformed into the hypothetical SBC HL′,BC′ instruction. Problem solved, then? EXX just boils away so we can stick with the conventional approach? No, as it turns out. This example only works because we swap the register sets back again before the end of the translation block. If we didn’t, QEMU would still need to perform the full exchange operation at some point within the block of code that is generated.

Register renaming would be implemented within the QEMU Z80 target as an extra level of indirection during translation, which applies when decoding Z80 instructions. The four separate flip-flops that Ken describes would become four separate flags as part of our disassembly state, which would indicate which set of AF registers is in use, which set of BC/DE/HL registers is in use, and whether DE and HL have been swapped for each of the two register sets. If these flags are grouped together, we would then have register translation tables of sixteen entries, which would be indexed by the current ‘exchange state’. Some of these could be combined with the existing register pair lookup tables (see regpair[] and regpair2[] in translate.c), which would simply gain an additional index. The EX and EXX instructions would then simply toggle bits in the disassembly state. EXX would then either need to swap the DE/HL flip-flops, or EX DE,HL could perform slightly more work in determining which flip-flop to modify.
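
A sketch of what such a sixteen-entry table could look like is given below. All of the names here (the X_ flag bits, the R_ logical and P_ physical register numbers, rename_tab[]) are hypothetical and are not the identifiers used in translate.c. This version takes the second option mentioned above: EXX toggles a single bit, and EX DE,HL works out which of the two DE/HL flags to flip.

```c
#include <stdint.h>

/* Hypothetical bits of the translation-time 'exchange state'. */
enum { X_AF = 1, X_EXX = 2, X_DEHL0 = 4, X_DEHL1 = 8 };

/* Logical register pairs, as named by Z80 instructions. */
enum { R_AF, R_BC, R_DE, R_HL, R_COUNT };

/* Physical register pairs: two copies of each. */
enum { P_AF0, P_AF1, P_BC0, P_BC1, P_DE0, P_DE1, P_HL0, P_HL1 };

static uint8_t rename_tab[16][R_COUNT];

/* Fill in the logical-to-physical mapping for all sixteen states. */
static void build_rename_tab(void)
{
    for (int s = 0; s < 16; s++) {
        int af_set  = (s & X_AF) != 0;
        int set     = (s & X_EXX) != 0;
        int swapped = (s & (set ? X_DEHL1 : X_DEHL0)) != 0;

        rename_tab[s][R_AF] = (uint8_t)(P_AF0 + af_set);
        rename_tab[s][R_BC] = (uint8_t)(P_BC0 + set);
        rename_tab[s][R_DE] = (uint8_t)((swapped ? P_HL0 : P_DE0) + set);
        rename_tab[s][R_HL] = (uint8_t)((swapped ? P_DE0 : P_HL0) + set);
    }
}

/* The exchange instructions only toggle bits; no data is moved. */
static unsigned xstate_ex_af(unsigned s)    { return s ^ X_AF; }
static unsigned xstate_exx(unsigned s)      { return s ^ X_EXX; }
static unsigned xstate_ex_de_hl(unsigned s)
{
    /* Flip the DE/HL flag belonging to the currently selected set. */
    return s ^ ((s & X_EXX) ? X_DEHL1 : X_DEHL0);
}
```

When decoding an instruction such as INC DE, the translator would look up rename_tab[state][R_DE] once, at translation time, and generate code that operates on that physical register directly.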

The slight pitfall is that if a single EXX appears within a block of code, and that code loops, iterations will alternate in their renaming of the BC/DE/HL registers. To the code translator, this is now considered to be different code which must be translated again. In the somewhat pathological but quite plausible case of going through all sixteen exchange states for large parts of the program, we now end up going through dynamic translation sixteen times instead of just once. We also take a hit in performance in executing previously translated code blocks, as the TB lookup table will have up to sixteen times as many entries, and our chances of finding a match in the TB lookup cache are much reduced.
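
The effect can be seen with a toy model of the lookup, in which the exchange state forms part of the key. This is a simplified stand-in written for this post, not QEMU’s actual TB cache; ToyTBKey, toy_tb_lookup() and toy_run_loop() are invented names, and the value 2 for the EXX bit is an arbitrary choice.

```c
#include <stdint.h>

#define TOY_MAX_TB 64

/* Hypothetical lookup key: code address plus exchange state. */
typedef struct {
    uint16_t pc;
    uint8_t  xstate;
} ToyTBKey;

static ToyTBKey toy_tb_cache[TOY_MAX_TB];
static int toy_tb_count;

/* Returns 1 if a new translation had to be made, 0 on a cache hit. */
static int toy_tb_lookup(uint16_t pc, uint8_t xstate)
{
    for (int i = 0; i < toy_tb_count; i++) {
        if (toy_tb_cache[i].pc == pc && toy_tb_cache[i].xstate == xstate)
            return 0;
    }
    if (toy_tb_count < TOY_MAX_TB) {
        toy_tb_cache[toy_tb_count].pc = pc;
        toy_tb_cache[toy_tb_count].xstate = xstate;
        toy_tb_count++;
    }
    return 1;
}

/* Model a loop whose body contains exactly one EXX.
 * Returns how many translations were needed. */
static int toy_run_loop(uint16_t pc, int iterations)
{
    uint8_t xstate = 0;
    int translations = 0;

    for (int i = 0; i < iterations; i++) {
        translations += toy_tb_lookup(pc, xstate);
        xstate ^= 2;  /* the EXX in the body toggles the BC/DE/HL set */
    }
    return translations;
}
```

However many times the loop runs, the same address is translated twice, once for each setting of the EXX bit.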

Many programs may benefit from renaming without running through all of the exchange states, and even for those that do, the extra compilation may turn out to be worth the expense. The only real way to find out is to try it, but it is also worth considering a hybrid of the two approaches. EX AF,AF′ and EX DE,HL are both short enough operations that we might simply choose to emulate them in the conventional manner. As EXX operates on three register pairs at a time, we might limit renaming to this single exchange. In the worst case, we then only generate twice as much code, but we would gain an advantage in programs that use each register set for separate purposes but still switch between them frequently.
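
The hybrid can be modelled roughly as follows. This is only a sketch with made-up names (HybridRegs and friends), and it keeps the selected set as a run-time field to make the behaviour easy to follow; in the translator, that one bit would instead be part of the disassembly state and known at translation time.

```c
#include <stdint.h>

/* Hypothetical hybrid model: only EXX is renamed. */
typedef struct {
    uint16_t af, af_alt;
    uint16_t bc[2], de[2], hl[2];  /* two physical copies of each pair */
    int set;                       /* copy currently selected by EXX */
} HybridRegs;

/* EXX: toggle the selection bit; no data is moved. */
static void hybrid_exx(HybridRegs *r)
{
    r->set ^= 1;
}

/* EX AF,AF': emulated conventionally, by swapping the values. */
static void hybrid_ex_af(HybridRegs *r)
{
    uint16_t t = r->af;
    r->af = r->af_alt;
    r->af_alt = t;
}

/* EX DE,HL: also a real swap, within the currently selected set. */
static void hybrid_ex_de_hl(HybridRegs *r)
{
    int s = r->set;
    uint16_t t = r->de[s];
    r->de[s] = r->hl[s];
    r->hl[s] = t;
}

/* Read HL from whichever copy is currently selected. */
static uint16_t hybrid_get_hl(const HybridRegs *r)
{
    return r->hl[r->set];
}
```

With a single bit of state, a given block of code has at most two translations rather than sixteen.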
