Monday 29 December 2014

Shellshock and Bash

This has been discussed pretty widely already, but I include here my views on ShellShock that I made in a private post to Facebook on 26th September 2014. It might seem loosely worded in places but keep in mind that this is intended to be read by an audience of mixed technical ability. My own recollection is of quite a lot of panic and some amount of technical discussion of the problem, but very little discussion of the practices that allowed this to happen.

Firstly, from my initial posting upon finding out about the bug, a short comment on the state of the software industry:
This ‘ShellShock’ bug that’s got the net in a panic is amazing. It’s 22 years old, affects sites all over the Internet and it’s almost unthinkable that it could exist. The reason is the software that’s affected, Bash, is so cumbersome that hardly anyone has bothered to study it. I might start posting more about the sorry state of the software industry (and the surveillance industry), but I’ll try not to make it seem like ‘boring technical stuff’. This is about governance of infrastructure that affects us all. If your water, gas or electric supply were so badly mismanaged, you’d all be up in arms!
Then, a look into Bash and Shellshock:
What is Bash? Well the name is from the appalling geek humour: “Bourne-Again SHell”. Steve Bourne was the person at Bell Labs who wrote the original shell, and Bash is a rewrite with some extra bits and pieces. The original was okay, but the rewrite has been widely criticised as overly complex and slow. There’s been some pressure to replace it, but this has been resisted by those preferring to take the “easy option”. Many will not have learnt from this that they were taking the hard option all along, and will continue to make the same mistakes.
In computing, a “shell” is just the part of the user interface that comes built into the system. In Windows this would include your Taskbar and Explorer and on a Mac, it’d include the Dock and Finder. This also includes the “open file”, “save file”, “print”, “are you sure?” boxes, and that sort of thing. 
The shell we’re talking about here is much simpler than this. You type the name of a program, press enter, and it runs it. From MS-DOS, maybe you’ve heard of “COMMAND.COM”, or perhaps you’ve seen “CMD.EXE” in Windows. They’re basically the same thing. 
That sounds like a pretty simple thing, right? Well, sure, but it doesn’t sound like a very friendly user interface, does it? Well, it isn’t. As a first step, we might want to run some sequence of programs and have that sequence stored in a list (such as “copy music from CD” and then “add files to music database”). Then again, we may not always want exactly the same sequence in every situation (maybe if I have my MP3 player attached, I want my music copying there, too). Before we even know it, we’ve ended up with what is really just a very badly designed programming language, and not the simple shell that was originally intended. 
This is where it all goes horribly wrong. Due to an almighty ‘oversight’, what is mainly used as a very simple program for running other programs will actually look for instructions to obey in any of the settings that are supplied to it. These settings (which we call “environment variables”) might well be something that a user has typed in to a web page before hitting “submit”. So where a web server just needs to run the shell to send off an email, which might need to be given NAME and ADDRESS settings to compose the mail, there might be nothing stopping the user from entering their name as “() { :; }; rm -fr /” (define a subroutine that does nothing, and then delete all files on the system). 
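The scenario above can be sketched in a few lines of C. This is a minimal sketch, not real web-server code: the NAME field and the `compose_mail` function are my own inventions for illustration, standing in for any CGI-style program that copies untrusted input into the environment before invoking the shell.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* A CGI-style server copies an untrusted form field (the hypothetical
   NAME field) straight into the environment, then shells out to compose
   a mail.  Returns the exit status of the shell command. */
static int compose_mail(const char *name)
{
    setenv("NAME", name, 1);
    /* On a Bash affected by Shellshock, merely importing NAME as an
       exported function definition runs any commands trailing the
       function body; a patched Bash imports it as ordinary, inert text. */
    return system("bash -c 'echo \"composing mail for $NAME\"'");
}
```

On a vulnerable system, calling `compose_mail("() { :; }; rm -fr /")` would execute the `rm` before the `echo` command even started; the web server never asked for the name to be treated as code at all.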
We seem to be caught in a combination of two attitudes: 
  • Conservatism: “Don’t fix it if it’s working, even if it’s probably broken and we just don’t know it yet. We can’t afford to make changes that might have adverse effects [especially if this disrupts our customers at all]”.
  • Recklessness: “We need more cool features and we can always fix it later. Our programming isn’t confusing, you’re just not very good at understanding it”.
Two of the few systems that have replaced Bash with something simpler are Debian, which is a community-run project without commercial concerns, and Ubuntu, which is based upon it. It's interesting that they would take the plunge and advise some caution about upgrading, whereas Apple (Mac OS X), Red Hat, openSUSE and various other systems have not made this decision. 
Much of the problem with Bash would have been easily avoided if its extra behaviour was disabled unless specifically asked for, as we could then have switched to something simpler with much greater confidence. In fact, some means of allowing Bash to throw up warnings when this extra behaviour is used might not be a bad thing to have. I’ve had to get rid of it from systems during my career and it has been frustrating to remove the ‘Bashisms’ that programmers leave in without knowing or caring. 
An especially frustrating part was dealing with software given to us by suppliers: I knew full well that my counterparts in other companies would be going through the same frustrations, yet I was not allowed to spare them that pain by working with the supplier, and I knew that the supplier would not be terribly interested or responsive in dealing with this themselves, or even in accepting changes that I provided to them.
A friend of mine pointed out that the BSDs now avoid using Bash as a system shell, but that only newer versions (and not older versions) of OS X use Bash. My view: “To change their system shell to Bash seems a moronic thing for Apple to have done.”

As an aside, anyone with even the slightest familiarity with C who hasn’t seen the original Bourne Shell source code should go and read it now — but make sure that you have some Optrex and Mindbleach to hand first! Steve Bourne used C macros to #define C into something that can at best be described as a cross between C, shell script and Algol. I suppose somebody had to do it first so that the rest of us could learn not to.

Emulation of Z80 shadow registers

Following on from my previous posting, where I discussed the question of how to model the Z80’s register pairs in QEMU, an emulator using dynamic code translation, this post raises the question of how to handle the Z80’s shadow register set. Firstly, I refer to Ken Shirriff’s blog post on how the Z80’s register file is implemented:
Note that there are two A registers and two F registers, along with two of BC, DE, and HL. The Z80 is described as having a main register set (AF, BC, DE, and HL) and an alternate register set (A′F′, B′C′, D′E′, and H′L′), and this is how the architecture diagram earlier is drawn. It turns out, though, that this is not how the Z80 is actually implemented. There isn’t a main register set and an alternate register set. Instead, there are two of each register and either one can be the main or alternate.
The shadow registers can be quickly swapped with main ones by means of the EX AF,AF′ and EXX instructions. EXX swaps BC with BC′, DE with DE′ and HL with HL′ but does not swap AF with AF′. (This is even the source of a bug affecting MGT’s DISCiPLE interface, which would fail to properly save or restore the AF′ register when returning to a program). The IX and IY registers are not shadowed. There’s also an EX DE,HL instruction which swaps DE and HL, but does not swap DE′ and HL′ (i.e. it only swaps DE and HL in the register set you’re currently using).

In a conventional emulator, you’d implement EX and EXX by swapping the main register set with the shadow set as might be suggested by many ‘logical’ descriptions of those instructions, rather than modelling the Z80’s approach of using register renaming through the use of four flip-flops, as described by Ken. There are a few reasons for doing it this way. Firstly, there’s a nasty patent out there which prohibits using pointers (or references) to switch between alternate register sets. Take something that’s been commonplace in hardware for donkey’s years and sprinkle the word ‘software’ on it (and maybe ‘pointer’ in place of ‘multiplexor’) and BANG — instant patent... anyway, I digress.

The other reason for this approach is one of performance. Typically a Z80 emulator will have separate code for handling every possible form of an instruction. That is to say, INC BC and INC DE are treated as completely unrelated instructions, even though their implementations are almost identical. The main loop of the emulator typically consists of a massive switch() statement with cases for the 256 possible byte values that might be read from the address held in the program counter. We perform instruction decoding and register decoding with a single switch() statement, which will most likely be compiled into a jump table. The last thing we would want to do is introduce unnecessary indirection in determining which register set to access. Note that for DE and HL, we would need two levels of indirection, one for selecting between the shadow register sets and another for determining whether DE and HL have been swapped in the currently active register set.
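The conventional scheme can be sketched as follows. The names (`struct Z80`, `cpu_step`) are mine, not QEMU's or Fuse's; the point is that each opcode gets its own case, the common path touches registers directly with no indirection, and the exchange instructions are just swaps of plain variables:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register file: one variable per pair, with the shadow
   (primed) set stored alongside the main set. */
struct Z80 {
    uint16_t af, bc, de, hl;
    uint16_t af_, bc_, de_, hl_;   /* shadow set */
};

static void swap16(uint16_t *a, uint16_t *b)
{
    uint16_t t = *a; *a = *b; *b = t;
}

/* Execute one opcode, switch-per-opcode style: INC BC and INC DE are
   distinct cases even though their implementations are near-identical. */
static void cpu_step(struct Z80 *z, uint8_t opcode)
{
    switch (opcode) {
    case 0x03: z->bc++; break;                    /* INC BC */
    case 0x13: z->de++; break;                    /* INC DE */
    case 0x08: swap16(&z->af, &z->af_); break;    /* EX AF,AF' */
    case 0xEB: swap16(&z->de, &z->hl); break;     /* EX DE,HL */
    case 0xD9:                                    /* EXX: BC/DE/HL only */
        swap16(&z->bc, &z->bc_);
        swap16(&z->de, &z->de_);
        swap16(&z->hl, &z->hl_);
        break;
    /* ...251 further cases in a real core... */
    }
}
```

Note how EXX deliberately leaves AF alone, and EX DE,HL swaps only the currently active DE and HL, matching the instruction semantics described above.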

(Note that Fuse, the Free Unix Spectrum Emulator, has a Perl script to generate the big 256-case switch statements, so as to avoid duplication of source code and any possible inconsistencies between cases (e.g. a typo affecting the emulation of an instruction for only one particular register). This script is pre-run before any source release, so developers running on Windows need not be too concerned unless they plan on modifying the Z80 core itself, and contributing their changes back.)

By swapping registers when emulating EX AF,AF′, EX DE,HL and EXX, we avoid all this extra expense when handling the majority of instructions, and incur a degree of additional overhead for the exchange instructions themselves. Programs are highly unlikely to use the exchange instructions frequently enough for this to be a net slowdown, and even if they did, performance would still be adequate as the cost of actually swapping the registers is not especially high. It’s pretty much a no-brainer: almost every Z80 emulator out there does it this way, and for good reason. I even did it this way in my Z80 front-end for QEMU because it was the simplest approach, because I had a few unresolved questions regarding the use of register renaming, and because I'm not sure how well that patent I mentioned applies to a dynamic translator. (I suspect it doesn't, but IANAL. And ‘Software!’...)

So, why even consider the alternative? Well, for an emulator built upon dynamic translation, such as QEMU, the cost of implementing register renaming as is actually implemented by the Z80 will only be incurred at translation-time (i.e. the first time the code is run), and not at execution-time (i.e. every time the code is run). What’s more, the conventional approach may incur a cost at execution-time which could be saved through the use of register renaming.

Within a translation block (only one tiny part of the program is translated at a time, and translation is performed only when the code actually needs to execute), TCG will track the use of registers to hold results and perform its own (implicit) register renaming similar to that of many modern CPUs (rather than the explicit renaming through the use of an instruction, used by the Z80), so an instruction sequence such as ‘EXX; SBC HL,BC; EXX’ can easily be transformed into the hypothetical SBC HL′,BC′ instruction. Problem solved, then? EXX just boils away so we can stick with the conventional approach? No, as it turns out. This example only works because we swap the register sets back again before the end of the translation block. If we didn’t, QEMU would still need to perform the full exchange operation at some point within the block of code that is generated.

Register renaming would be implemented within the QEMU Z80 target as an extra level of indirection during translation, that applies when decoding Z80 instructions. The four separate flip-flops that Ken describes would become four separate flags as part of our disassembly state, which would indicate which set of AF registers is in use, which set of BC/DE/HL registers is in use, and whether DE and HL have been swapped for each of the two register sets. If these flags are grouped together, we would then have register translation tables of sixteen entries, which would be indexed by the current ‘exchange state’. Some of these could be combined with the existing register pair lookup tables (see regpair[] and regpair2[] in translate.c), which would simply gain an additional index. The EX and EXX instructions would then simply toggle bits in the disassembly state. EXX would then either need to swap the DE/HL flip-flops between the two sets, or EX DE,HL could perform slightly more work in determining which flip-flop to modify.
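A sketch of how the lookup might work, taking the second of those two options (EX DE,HL picks the flip-flop for the current set). All the names here are hypothetical — the real regpair[] tables in translate.c are indexed differently, and this collapses the sixteen-entry table into a small function for readability:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flags in the disassembly state, one per Z80 flip-flop. */
enum {
    EXST_AF    = 1 << 0,  /* which AF set is current            */
    EXST_MAIN  = 1 << 1,  /* which BC/DE/HL set is current      */
    EXST_SWAP0 = 1 << 2,  /* DE and HL swapped in set 0         */
    EXST_SWAP1 = 1 << 3,  /* DE and HL swapped in set 1         */
};

/* Indices of the six backing 16-bit registers (imagined TCG globals). */
enum { R_BC0, R_DE0, R_HL0, R_BC1, R_DE1, R_HL1 };

/* Which backing register an encoded BC (0), DE (1) or HL (2) refers to
   under the given exchange state. */
static int regpair_lookup(unsigned exst, int pair)
{
    int set = (exst & EXST_MAIN) ? 1 : 0;
    int swapped = exst & (set ? EXST_SWAP1 : EXST_SWAP0);
    static const int base[2][3] = {
        { R_BC0, R_DE0, R_HL0 },
        { R_BC1, R_DE1, R_HL1 },
    };
    if (swapped && pair == 1) pair = 2;       /* DE decodes as HL */
    else if (swapped && pair == 2) pair = 1;  /* HL decodes as DE */
    return base[set][pair];
}

/* The exchange instructions generate no code: they toggle state bits. */
static unsigned op_exx(unsigned exst)      { return exst ^ EXST_MAIN; }
static unsigned op_ex_de_hl(unsigned exst)
{
    return exst ^ ((exst & EXST_MAIN) ? EXST_SWAP1 : EXST_SWAP0);
}
```

The crucial property is that `op_exx` and `op_ex_de_hl` cost nothing at execution time; the price is paid in the number of distinct exchange states a block can be translated under.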

The slight pitfall is that if a single EXX appears within a block of code, and that code loops, iterations will alternate in their renaming of the BC/DE/HL registers. To the code translator, this is now considered to be different code which must be translated again. In the somewhat pathological but quite plausible case of going through all sixteen exchange states for large parts of the program, we now end up going through dynamic translation sixteen times instead of just once. We also take a hit in performance in executing previously translated code blocks, as the TB lookup table will have sixteen times as many entries, and our chances of finding a match in the TB lookup cache are much reduced.

Many programs may benefit from renaming without running through all of the exchange states, and even for those that do, the extra compilation may turn out to be worth the expense. The only real way to find out is to try it, but it is also worth considering a hybrid of the two approaches. EX AF,AF′ and EX DE,HL are both short enough operations that we might simply choose to emulate them in the conventional manner. As EXX operates on three register pairs at a time, we might limit renaming to this single exchange. In the worst case, we then only generate twice as much code, but we would gain an advantage in programs that use each register set for separate purposes but still switch between them frequently.

Emulation of Z80 register sets

Way back in 2007, I wrote a Z80 emulation front-end for QEMU. Being based upon dynamic translation, QEMU has a set of front-ends for allowing it to emulate a range of CPUs. The front-ends decode portions of a program’s machine code the first time they run, and make calls to the code generator, which was ‘dyngen’ at the time that I started, but is now ‘TCG’ (Fabrice Bellard’s ‘Tiny Code Generator’).

TCG also requires a back-end for generating machine code for the system upon which the emulator is to be run. For most people, this is probably some variety of x86 (32-bit or 64-bit), although hopefully, in future, the ARM and MIPS back-ends will see more use. Writing a back-end is non-trivial, but a front-end is typically a far more substantial piece of software. (Whilst a TCG back-end for the Z80 is quite feasible, porting QEMU as a whole would be of little interest due to a lack of sufficiently powerful Z80-based hardware.) Dyngen was a most amazing hack that ‘worked’ by taking pre-compiled template functions, chopping off their prologues and epilogues (entry/exit code) and then simply pasting these together to generate code. Dyngen’s back-end was quite simply GCC itself. Unfortunately, this approach was fragile, and so dyngen was replaced with TCG to allow support for GCC 4, which had changed the way function prologues and epilogues were generated.

When writing the Z80 target, I was faced with various questions as to how to best model the Z80’s register set, which I have still not resolved. The Z80 allows its 8-bit registers to be accessed as register pairs, of 16 bits in length. The B and C registers can be paired together as BC, the D and E registers as DE, the H and L registers as HL, IXh and IXl as IX and IYh and IYl as IY. The A (accumulator) and F (flags) registers can also be accessed as the register pair AF, and sometimes it is useful or necessary to do this (e.g. there is a PUSH AF instruction, but no such thing as PUSH A). The Z80 also has a shadow register set, and register pairs in the shadow set can be ‘swapped’ with those in the main set using the EX AF,AF′ and EXX instructions. I’ll discuss these shadow registers in a subsequent post but for now, I’ll concentrate on the question of whether to model the registers as 8-bit registers or as 16-bit register pairs.
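For a conventional (non-translating) emulator, one common way to get both views of a pair is a union — a sketch only, and one that QEMU itself does not use (TCG variables aren't plain C objects); note that the byte order shown assumes a little-endian host:

```c
#include <assert.h>
#include <stdint.h>

/* An 8-bit and a 16-bit view of the same pair, e.g. BC.  On a
   little-endian host the low byte (C) comes first in memory; a
   big-endian build would need the two struct members reversed. */
union regpair {
    uint16_t w;                    /* the pair: BC          */
    struct { uint8_t lo, hi; };    /* C then B (anonymous struct, C11) */
};
```

This makes 8-bit access free, at the cost of wiring host endianness into the register file; a translator like QEMU instead has to choose one width for its variables and synthesise the other with shifts and masks, which is exactly the trade-off discussed below.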

For most operations that use or update the value in AF, I decided to model them as individual registers. Flags are often set independently of the accumulator, so if we modelled the two as a single register pair, we would incur extra processing to preserve the value of A when updating AF. That said, the flags are often only partially updated, so partial updates would simply need to be extended to preserve the full value of A. What complicates matters is that many flag updates can be optimised out entirely, where it can be shown that the flag values that would be calculated will simply be replaced with entirely new values without being checked or stored in any way, as QEMU already implements for x86. The question in need of answering here is whether this system of flag optimisation could be sensibly updated to work with an AF register pair instead of the individual F register. One concern is that the regular updates of the A register that occur in pretty much any Z80-based program could slow down the flag optimisation pass during code generation sufficiently to outweigh any benefit in handling AF as a register pair; there could also be other detrimental effects, such as poor interaction with TCG’s optimiser.

I modelled the remaining register pairs, BC, DE, HL, IX and IY, as 16-bit register pairs rather than as individual 8-bit registers, as it seemed to me that they would be used frequently enough as pairs for this to be worthwhile, but I made sure that it was a simple change to a #define to switch to handling them as 8-bit registers, so that I could properly evaluate this later on. Operations such as INC H can be easily implemented as HL = HL + 256, so there is some hope for this approach.
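The INC H trick generalises: with a pair held as one 16-bit value, the 8-bit forms become arithmetic and masking on the pair. A sketch (helper names are mine, and flag updates are omitted):

```c
#include <assert.h>
#include <stdint.h>

/* INC H over a 16-bit HL: adding 0x0100 increments only the high byte.
   If H wraps from 0xFF to 0x00, the overflow falls out of the top of
   the uint16_t, so L is never disturbed. */
static uint16_t inc_h(uint16_t hl)
{
    return hl + 0x0100;
}

/* LD L,n over a 16-bit HL: mask out the low byte and merge in n. */
static uint16_t ld_l(uint16_t hl, uint8_t n)
{
    return (hl & 0xff00) | n;
}
```

Operations on the high byte are thus a single add, while low-byte loads need a mask and an or — cheap, but no longer quite as cheap as a plain 8-bit store would be.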

That said, there is the question as to how TCG’s optimiser would deal with setting of high and low parts of a register through a sequence such as ‘XOR A; LD D,A; LD E,A; LD H,A; LD L,A’. ‘XOR A’ sets A to be the exclusive-or of itself with itself, i.e. 0, as a single-byte instruction (with the side-effect of updating F). We then set the D, E, H and L registers individually with single-byte instructions. There isn’t an ‘LD HL,DE’ instruction, though it would make a lot of sense as a synthetic instruction; you’d have to keep in mind that it could be interrupted mid-way. Even without handling ‘XOR A’ as a special case in the front-end, TCG’s constant folding pass will recognise that the instruction sets A to a constant value of 0. This then makes its way through TCG’s constant propagation pass. Ignoring flags (which will often be optimised out), we would have something like this: ‘A = 0; D = 0; E = 0; H = 0; L = 0’. With register pairs, this might expand to something like: ‘A = 0; DE = (((DE & 0x00ff) | (0 << 8)) & 0xff00) | 0; HL = (((HL & 0x00ff) | (0 << 8)) & 0xff00) | 0’ which then would be optimised into: ‘A = 0; DE = 0; HL = 0’.

If we were to load any value other than a 0, or if TCG were unable to identify the value in A as being a constant (say, if the XOR A instruction happened in a different block of code, perhaps before a conditional jump) then I would expect that ‘LD D,H; LD E,L’ could easily turn into ‘DE = (((DE & 0x00ff) | (H << 8)) & 0xff00) | L’. It would probably not be too hard to train the optimiser to recognise the set of bits within a TCG variable that are constant, rather than handling variables as being fully constant or fully variable, but it’s likely that this would be of marginal benefit for a lot of architectures. Still, 16-bit x86 allows access to the 8-bit AH and AL values (even if the top 16 bits aren’t so easily accessed in 32-bit x86, nor the top 32 bits in 64-bit x86), so the Z80 is by no means alone in this.
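That expansion can be checked directly. A small sketch (the helper name is mine) writing out the two masked stores exactly as in the expression above, D from H first and then E from L:

```c
#include <assert.h>
#include <stdint.h>

/* The naive expansion of 'LD D,H; LD E,L' over 16-bit pairs, without
   any optimiser help: two masked read-modify-write operations on DE. */
static uint16_t ld_d_h_ld_e_l(uint16_t de, uint16_t hl)
{
    uint8_t h = hl >> 8;
    uint8_t l = hl & 0x00ff;
    de = (de & 0x00ff) | (h << 8);   /* LD D,H */
    de = (de & 0xff00) | l;          /* LD E,L */
    return de;
}
```

The net effect is simply DE = HL, which is precisely what a bit-level constant-tracking optimiser would be expected to reduce it to; without one, all four masks and both ors survive into the generated code.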