Monday 29 December 2014

Emulation of Z80 register sets

Way back in 2007, I wrote a Z80 emulation front-end for QEMU. Being based upon dynamic translation, QEMU has a set of front-ends for allowing it to emulate a range of CPUs. The front-ends decode portions of a program’s machine code the first time they run, and make calls to the code generator, which was ‘dyngen’ at the time that I started, but is now ‘TCG’ (Fabrice Bellard’s ‘Tiny Code Generator’).

TCG also requires a back-end for generating machine code for the system upon which the emulator is to be run. For most people, this is probably some variety of x86 (32-bit or 64-bit), although hopefully, in future, the ARM and MIPS back-ends will see more use. Writing a back-end is non-trivial, but a front-end is typically a far more substantial piece of software. (Whilst a TCG back-end for the Z80 is quite feasible, porting QEMU as a whole would be little interest due to a lack of sufficiently powerful Z80-based hardware.) Dyngen was a most amazing hack that ‘worked’ by taking pre-compiled template functions, chopping their prologues and epilogues (entry/exit code) and then simply pasting these together to generate code. Dyngen’s back-end was quite simply GCC itself. Unfortunately, this approach was fragile, and so dyngen was replaced with TCG to allow support for GCC 4, which had changed the way function prologues and epilogues were generated.

When writing the Z80 target, I was faced with various questions as to how to best model the Z80’s register set, which I have still not resolved. The Z80 allows its 8-bit registers to be accessed as register pairs, of 16 bits in length. The B and C registers can be paired together as BC, the D and E registers as DE, the H and L registers as HL, IXh and IXl as IX and IYh and IYl as IY. The A (accumulator) and F (flags) registers can also be accessed as the register pair AF, and sometimes it is useful or necessary to do this (e.g. there is a PUSH AF instruction, but not such thing as PUSH A). The Z80 also has a shadow register set, and register pairs in the shadow set can be ‘swapped’ with those in the main set using the EX AF,AF′ and EXX instructions. I’ll discuss these shadow registers in a subsequent post but for now, I’ll concentrate on the question of whether to model the registers as 8-bit registers or as 16-bit register pairs.

For most operations that use or update the value in AF, I decided to model them as individual registers. Flags are often set independently of the accumulator, so if we modelled the two as a single register pair, we would incur extra processing to preserve the value of A when updating AF. That said, the flags are often only partially updated, so partial updates would simply need to be extended to preserve the full value of A. What complicates matters is that many flag updates can be optimised out entirely, where it can be shown that the flag values that would be calculated will simply be replaced with entirely new values without being checked or stored in any way, as QEMU already implements for x86. The question that is in need of answering here is whether this system of flag optimisation could be sensibly updated to work with an AF register pair instead of the individual F register. One concern is that the regular updates of the A register that occur in pretty much any Z80-based program could slow down the flag optimisation pass during code generation sufficiently to outweigh any benefit in handling AF as a register pair, but there are other detrimental effects that this could have such as poor interaction with TCG’s optimiser.

I modelled the remaining register pairs, BC, DE, HL, IX and IY, as 16-bit register pairs rather than as individual 8-bit registers, as it seemed to me that they would be used frequently enough as pairs for this to be worthwhile, but I made sure that it was a simple change to a #define to switch to handling them as 8-bit registers, so that I could properly evaluate this later on. Operations such as INC H can be easily implemented as HL = HL + 256, so there is some hope for this approach.

That said, there is the question as to how TCG’s optimiser would deal with setting of high and low parts of a register through a sequence such as ‘XOR A; LD D,A; LD E,A; LD H,A; LD L,A’. ‘XOR A’ sets A to be the exclusive-or of itself with itself, i.e. 0, as a single-byte instruction (with the side-effect of updating F). We then set the D, E, H and L registers individually with single-byte instructions. There isn't an ‘LD HL,DE’ instruction although it would make a lot of sense as a synthetic instruction, although you’d have to keep in mind that it could be interrupted mid-way. Even without handling ‘XOR A’ a special case in the front-end, TCG’s constant folding pass will recognise that the instruction sets A to a constant value of 0. This then makes its way through TCG’s constant propagation pass. Ignoring flags (which will often be optimised out), we would have something like this: ‘A = 0; D = 0; E = 0; H = 0; L = 0’. With register pairs, this might expand to something like: ‘A = 0; DE = (((DE & 0x00ff) | (0 << 8)) & 0xff00 | 0); HL = (((HL & 0x00ff) | (0 << 8)) & 0xff00) | 0’ which then would be optimised into: ‘A = 0; DE = 0; HL = 0’.

If we were to load any value other than a 0, or if TCG were unable to identify the value in A as being a constant (say, if the XOR A instruction happened in a different block of code, perhaps before a conditional jump) then I would expect that ‘LD D,H; LD E,L’ could easily turn into ‘DE = (((DE & 0x00ff) | (H << 8)) & 0xff00) | L’. It would probably not be too hard to train the optimiser into recognising the set of bits within a TCG variable that are constant, rather than handling variables as being fully constant or fully variable, but it’s likely that this would be of marginal benefit for a lot of architectures. Still, 16-bit x86 allows access to 8-bit AH and AL values (even if the top 16-bits isn’t so easily accessed in 32-bit x86, nor the top 32-bits in 64-bit x86), so the Z80 is by no means alone in this.

No comments:

Post a Comment