Tuesday, 6 January 2015

GCC is wonderful: diagnostic options you should use

In my previous post, I explored how a few extensions to C provided by GCC (and Clang) can be used to implement a better ARRAY_SIZE macro, with type-safety. There are other wonderful things about GCC, though, including the set of compiler warning and error options that it provides. If you thought that ‘-Wall’ enabled all warnings, you may wish to take a closer look, as you’ll otherwise be missing out on many useful checks that could save you a lot of time and effort.

Of course, it’s not a simple case of enabling all the warning options, making sure nothing gets printed and hey presto (or ‘Abracadabra’), you have perfect code. I would liken using a compiler’s diagnostic options to tying up one’s shoelaces: it won’t guarantee that you won’t trip, and it certainly won’t protect you if you walk out into the street without looking, but it’s still a fairly sensible idea!

For the C language family, these diagnostic options can be divided into a few categories:
  • Warnings about the use of old-style ‘K&R’ or ‘traditional’ (pre-ANSI/ISO) C. Almost any C compiler worth caring about supports C90 (more formally, ISO/IEC 9899:1990, and briefly ANSI X3.159-1989) or newer these days, so unless you depend upon libraries whose headers contain traditional-style declarations (whether accidentally or deliberately), these warnings are safe to enable.
  • Warnings about poor practices that are simply best avoided.
  • Warnings about potential issues that one might not always choose to address, as these could sometimes be considered as false positives.
Quite a few of these tie in with common misconceptions about C, some of which are equally applicable to C++. GCC also has warning and error options for C++, Objective C and Objective C++ which are to be recommended, although I shall not cover them here.

Starting with GCC 4.2, any individual warning option can be promoted to an error. Prior to this, only a few specific warning options had an equivalent error option, although all enabled warnings could be turned into errors with a single option (-Werror), if desired. There are many situations where it is beneficial to allow false positives to be raised by the compiler without causing compilation to fail (such as many of the sub-options specified by -Wextra and a few by -Wall), and it’s certainly nice to avoid causing unnecessary pain to package maintainers and to users building your code from source, if yours is an open source project.

What follows is a run through of some of these options, together with some examples of the sorts of issues that they will help you to avoid. To begin with, I’ll focus on several options that you may wish to enable as early on in a project as possible, as they are the most likely ones to affect header files, and are the strongest candidates for turning into errors. I will then move on to discuss other useful warning options, before briefly mentioning one option that’s not quite as useful as one might otherwise imagine.

-Wwrite-strings

The -Wwrite-strings option isn’t actually a warning option at all, but masquerades as one: its effect is to modify the language, altering the type that is used for string literals in C from ‘char *’ to ‘const char *’, as is used by C++. No additional option is needed to produce warnings when implicitly dropping the const qualifier, as the messages that GCC emits for this are enabled by default.
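As a quick illustration (a contrived example, not from any real codebase), with this option in effect the first declaration below is diagnosed, while the second compiles cleanly:
static char *greeting = "hello";          /* warning: initialisation discards const qualifier */
static const char *salutation = "hello";  /* fine: the literal is treated as const char * */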

You may wonder why on earth C uses ‘char *’ in the first place, especially given that string literals are not typically (at least on systems that use virtual memory) held in memory that is writeable. My suspicion here is that an awful lot of code had already been written by the time the const keyword was added, and it was deemed too big a task to modify all of it (there’d certainly be some precedent for this). To avoid getting stuck in the same mess, I would advise any new project to start off using this option.

If you’re writing a library, you should make sure, at a minimum, to use const anywhere in your headers where a string literal may be used, as you’ll otherwise make a mess for those writing applications in C++. If your test code is written in C (and not C++), make sure it uses -Wwrite-strings to help check for missing const qualifiers.

In many established projects, you may find you get an awful lot of noise from a very small set of functions, which is likely to include wrappers around printf() and family and wrappers around write() or fwrite(). Once you’ve dealt with these, the warnings will start to relate a little more to potentially risky situations, and you might even find a situation where an attempt could be made to modify or call free() upon a string literal, perhaps in some of the less well-tested error paths of the codebase. Once you’ve got into the habit of using const throughout your codebase (to do otherwise would produce warnings when compiling your test code) you may find that it will help to catch a number of issues. Supposing we have a function for parsing IPv6 addresses, declared as:
bool read_ipv6_addr(uint8_t *ipv6_bin, const char *str);
We will have been forced into adding the const qualifier to str as shown above when writing our test code:
uint8_t ipv6_addr_buf[16];
static const uint8_t ipv6_addr_localhost[16] =
    { 0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,1 };
CU_ASSERT(read_ipv6_addr(ipv6_addr_buf, "::1"));
CU_ASSERT(!memcmp(ipv6_addr_buf, ipv6_addr_localhost, sizeof(ipv6_addr_buf)));
Provided we otherwise maintain good discipline and introduce const qualifiers where we know a string will never need to be modified, we can entirely avoid ever accidentally transposing the arguments to read_ipv6_addr(), and the marathon of a debugging session that might otherwise arise from such a trivial mistake.

-Werror=implicit-function-declaration

My suggestion is to turn this into an error, although you could enable it just as a warning, as -Wimplicit-function-declaration. An implicit function declaration occurs at the earliest place where a function is called within a C source file, if the compiler has not already parsed a function declaration (or definition). What this typically means is that you forgot to include the header with the prototype for the function, so the compiler assumes you’ve called it correctly. It works out what the function’s parameters are from the arguments that have been passed to it, and then checks that the arguments of any subsequent calls within the file match the parameters of the implicit declaration. Either that, or you have included the correct header file, and the function prototype is entirely missing from it. In any case, the behaviour is the same: if the function takes a pointer and you’ve forgotten to pass it, the likelihood is one of an instant segfault, and if the code is called infrequently, there’s every chance that this will slip through testing.
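As a minimal illustration, suppose a hypothetical write_log() really takes a message string and a level, but the header declaring it has not been included:
int main(void)
{
    /* No declaration of write_log() has been seen, so the compiler invents
     * an implicit 'int write_log()' and performs no argument checking. */
    write_log(3);    /* the 'const char *' argument is missing entirely */
    return 0;
}
With -Werror=implicit-function-declaration, this refuses to compile instead of crashing at run time.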

Implicit function declarations are so nasty that GCC had an option to flag them as errors long before the release of GCC 4.2 (going back at least as far as GCC 2.95), although it was named -Werror-implicit-function-declaration, and this older name for the option is still supported in current versions of GCC. If you don’t care too much about GCC 4.1 and older, and if you plan on using any of the other -Werror= options described below, you may just as well use the newer syntax consistently.

C++ doesn’t allow implicit function declarations and that’s probably a fair indicator that you don’t want to allow them when using C, either. If you’re using a library function that someone forgot to include in the header files, you will just have to settle for declaring your own prototype, although it’s probably best to do this in a wrapper around the library’s own header file. This can get a little messy, but it’s still better than relying upon implicit declarations.
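A sketch of such a wrapper, assuming a hypothetical library header mylib.h that is missing a prototype for mylib_frobnicate():
/* mylib_wrapper.h: include this instead of <mylib.h> directly */
#ifndef MYLIB_WRAPPER_H
#define MYLIB_WRAPPER_H

#include <mylib.h>

/* Missing from mylib.h; declared here so that calls to it can be checked. */
int mylib_frobnicate(const char *name, int flags);

#endif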

-Werror=strict-prototypes

This option will catch function declarations that are not ANSI C prototypes. If the compiler catches one of these, that means one of two things. Either the function takes no parameters but this has not been properly indicated where the function is declared, or worse, the function does take parameters but these have not been listed in the declaration. We’re basically back to the same behaviour as with implicit function declarations, where the compiler works on the basis of the earliest call to the function within the source file (or the function’s definition, if that happens to be found first).

This tends to throw C++ programmers who are used to thinking of C as a subset of C++, but yes, an empty parameter list in C means ‘don’t check’, whereas in C++ it is equivalent to writing void. If you specify void for the parameter list, the compiler can check to make sure you’re not mistakenly passing parameters to a function that takes none in the belief that it will use them. If the function does actually take parameters, the compiler will detect that the function’s definition (i.e. the function itself) does not match its prototype and will throw a compilation error.

Basically, unless you’re using C++:
int foo();     /* WRONG: K&R declaration, args not checked */
int foo(void); /* RIGHT: ANSI C prototype */
-Werror=missing-prototypes

This catches cases where a function has been defined globally (i.e. without the static keyword) but the prototype for the function has not been found. If the function is only called locally, you can simply mark it as static and move on. Otherwise, you have either forgotten to include the header in which it is declared, or you have included the correct header but again, it has not been declared within that header. Either way, the result is the same: the compiler cannot check that the function prototype and function definition actually match, and so it cannot provide any guarantee that the function is being called correctly.
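For example, in a hypothetical util.c whose header is util.h:
#include "util.h"            /* should declare: int parse_id(const char *s); */

static int helper(int c)     /* file-local, so no prototype is required */
{
    return c - '0';
}

int parse_id(const char *s)  /* -Wmissing-prototypes fires here if util.h
                              * forgot to declare parse_id() */
{
    return helper(s[0]);
}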

-Wformat -Werror=format-security

Certain incorrect uses of the printf() family of functions can remarkably result in security vulnerabilities. Even if security is not a direct concern of yours (perhaps all input that your program will receive can be safely trusted), this will still help to catch some common mistakes. -Wformat itself will check that the argument list matches the format string, and is included as part of -Wall. However, these warnings can sometimes be benign, so I wouldn’t strictly recommend turning them into errors in all cases. The more severe issues to look out for are those where too few arguments are passed, where an integral type is passed but a char * is expected for use with %s, or the wrong type of pointer has been passed for use with the %s specifier.
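By way of illustration, a couple of contrived mistakes of the more severe kind that -Wformat reports:
#include <stdio.h>

void report(int id, const char *name)
{
    printf("%s has id %d\n", id, name);  /* arguments transposed: %s is given an int */
    printf("%s: id %d\n", name);         /* too few arguments for the format string */
}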

-Werror=format-security is another matter, in that it specifically checks for situations where the format string is not a string literal, and no format arguments are passed. In other words, printf(str) has been used where printf("%s", str) or perhaps puts(str) would do just as well. In cases where str is not from a trusted source, this can result in an uncontrolled format string vulnerability, allowing arbitrary code to be executed when a maliciously constructed string is passed to the program.
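A contrived illustration of the pattern it rejects, alongside the safe alternatives:
#include <stdio.h>

void show(const char *str)    /* str may have come from an untrusted source */
{
    printf(str);              /* flagged: str itself is used as the format string */
    printf("%s", str);        /* safe */
    puts(str);                /* also safe (and appends a newline) */
}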

When -Wformat is used, GCC can also check that any wrappers for functions in the printf() family have been called correctly, provided their prototypes have been annotated with the format function attribute, which I shall demonstrate in a later post concerning the language extensions that GCC provides, including function, variable and type attributes.
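As a brief taster (a minimal sketch only; log_msg() is a hypothetical wrapper), the annotation looks like this, telling GCC that argument 1 is the format string and that the values to check against it start at argument 2:
#include <stdarg.h>
#include <stdio.h>

void log_msg(const char *fmt, ...) __attribute__((format(printf, 1, 2)));

void log_msg(const char *fmt, ...)
{
    va_list ap;

    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
}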

-Wall

This is an alias for a whole load of other warning options which catch a whole load of common mistakes, and comes strongly recommended, although I would not suggest turning these into errors, simply because this option includes -Wmaybe-uninitialized which can produce false positives. Actually, the -Wmaybe-uninitialized option was only introduced in GCC 4.7, but in older versions these same false positives are produced with -Wuninitialized, which is also included in -Wall. With GCC 4.7 and later, you could perhaps use -Werror=all -Wno-error=maybe-uninitialized but there is a fair likelihood that it may be inconvenient for you to have GCC treat other warnings that are enabled by -Wall as errors. It is worth keeping in mind that the checking for certain options included in -Wall varies depending on the level of code optimisation in use, and that the precise details of this and the compiler’s propensity to report false positives will vary between versions.

-Werror=parentheses

This helps catch cases where missing brackets or braces affect the behaviour of a program, and is included (as a warning) by -Wall. It actually checks for quite a few different omissions, but the main concerns are the accidental use of the assignment operator (=) instead of the equality operator (==) where a truth value is expected, and common mistakes relating to C’s operator precedence, which, as Dennis Ritchie admitted, is not ideal.
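Two contrived cases of the sort it reports:
int classify(int x, int y, unsigned flags, unsigned mask)
{
    if (x = y)                /* assignment used as a truth value; was == intended? */
        return 1;
    if (flags & mask == 0)    /* parsed as flags & (mask == 0), which is
                               * almost certainly not what was meant */
        return 2;
    return 0;
}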

-Werror=undef

This causes preprocessing to fail if an undefined identifier is used within a ‘#if’ directive, rather than simply substituting 0 and carrying on as if nothing had happened. This is almost always due to a missing include or a missing macro definition, which is probably not something you want ignored. One great thing about this option is that it allows you to switch from using #ifdef with little confidence when checking a definition which you do not currently have enabled (where you could easily have made a typo, or failed to include the appropriate header) to using #if with complete assurance: if a feature is disabled, the check of that feature is still correct and an explicit value of ‘0’ has been set.
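For instance, assuming a hypothetical config.h that defines HAVE_EPOLL as either 0 or 1:
#include "config.h"     /* defines HAVE_EPOLL to 0 or 1 */

#if HAVE_EPOLL          /* with -Werror=undef, a typo here (or a forgotten
                         * include) is an error rather than a silent 0 */
 #include <sys/epoll.h>
#endif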

One option to avoid

The -Wshadow option is just one option that is somewhat less than useful. The tl;dr of the argument that Linus Torvalds makes against it: do you really want a warning (or even compilation failure) whenever you define a local variable named index and you just happen to have included <strings.h>?

In closing

There are quite a few more options that you may wish to use, and my recommendation would be to read the Warning Options page of the GCC manual, if you want to know more.

glibc users may find that they get stricter warnings regarding the use of certain C library functions (including unused return values from read()) if they compile with -D_FORTIFY_SOURCE=2, although Ubuntu enables fortification by default.

If you’re using GCC 4.1, 4.2, 4.3, 4.4 or 4.5, you may also find -fdiagnostics-show-option useful: at the end of each warning message that is printed, it will add the name of the warning option that controls that message, allowing you to look up the entry for that option in GCC’s manual if you wish, but also giving you an easier way of determining which option to disable, should you need to. This became the default in GCC 4.6. (Also, ignore the typo in the documentation for it in GCC 4.1: the option name given here is the correct one.)

If you’re using GCC 4.9, you might choose to enable colourisation of GCC’s diagnostic messages. One way is to simply set the GCC_COLORS environment variable to ‘auto’ (meaning that colourisation will occur when GCC’s output is to a terminal). In GCC 5.0, ‘auto’ will be the default.

There are various static code analysis tools that are worth considering, such as Sparse and Coccinelle (with an appropriate set of semantic match files), and there’s a good deal of run-time analysis that can be helpful, too, e.g. AddressSanitizer, ThreadSanitizer and MemorySanitizer and the venerable Valgrind.

Sunday, 4 January 2015

GCC *is* wonderful: a better ARRAY_SIZE macro

One of the best patch submissions I’ve ever seen is Rusty Russell’s change to Linux’s ARRAY_SIZE macro. For those that haven’t programmed much in C, I’ll briefly explain that the language provides no built-in construct for determining the number of elements in an array. However, it does allow one to determine the size of an array in bytes (well, chars, actually) and also the size of the array’s elements (also in chars) and you can then find the number of array elements by dividing the former by the latter.

This works reasonably well — well enough that to this day, C still does not include a language construct for determining array element counts. However, there’s disagreement as to what to call the macro for doing this, with Unix programmers typically preferring ARRAY_SIZE, Windows programmers sometimes preferring DIM, and some programmers preferring NELEMS or something else entirely. What’s worse, sometimes a macro isn't used at all, leaving the calculation open-coded. Occasionally, when an open-coded calculation is used, the element size is written as a literal value instead of using C’s sizeof keyword, or the element type is explicitly specified instead of letting the compiler work that out itself.

Okay, so maybe that sounds a little cumbersome but not especially bad, right? Actually, yes — the ARRAY_SIZE macro as conventionally defined does not provide good type-safety and the open-coded approach is more fragile, more verbose and provides no improvement in type-safety.

Type-safety of the conventional ARRAY_SIZE macro

The macro is usually defined as follows:
#define ARRAY_SIZE(array) \
    (sizeof(array) / sizeof(*array))
or:
#define ARRAY_SIZE(array) \
    (sizeof(array) / sizeof(array[0]))
As C uses the same syntax for pointer dereferencing and the indexing of arrays, these are equivalent. The macro would be used as follows:
int i;
int a[100];
for (i = 0; i < ARRAY_SIZE(a); i++) {
    a[i] = i;
}
Some C programmers presumably feel that the version of the definition that uses ‘[0]’ better conveys that the parameter to the macro is to be an array. On the contrary, I feel that ‘*’ better conveys that the parameter could be a pointer. Of course, if the parameter is a pointer, the calculation is unlikely to produce the desired value. For example:
int i;
int a[100];
int *p = a;
for (i = 0; i < ARRAY_SIZE(p); i++) { /* WRONG */
    a[i] = i;
}
ARRAY_SIZE(p) expands to sizeof(p) / sizeof(*p). Although sizeof(*p) provides the expected value, sizeof(p) is equivalent to sizeof(int *), and not sizeof(int[100]) as intended. On most general-purpose platforms in use today, the size of any pointer type will be either four chars (for 32-bit systems) or eight chars (for 64-bit systems). Depending on the array’s element size, the result of the integer division will be either 8, 4, 2, 1 or 0, regardless of the number of elements. In the above example, the vast majority of the elements in the array will be left uninitialised.

Competent C programmers are therefore somewhat careful to ensure that they never use ARRAY_SIZE directly with a pointer variable, and are also careful in cases where they might legitimately use the ARRAY_SIZE macro to get the size of an array that is the member of a structure that is referenced through a pointer, as shown:
int i;
struct foo {
    int a[100];
} bar;
struct foo *p = &bar;
for (i = 0; i < ARRAY_SIZE(p->a); i++) {
    p->a[i] = i;
}
It’s also worth considering the fragility of the open-coded approach, but only because it’s still frightfully common. Here, if the array were to change from an array of int to an array of long int, we would need to update the open-coded calculation to suit:
int i;
int a[100];
for (i = 0; i < sizeof(a) / sizeof(int); i++) { /* UGH */
    a[i] = i;
}
There’s simply no reason to burden the programmer with this task, especially when once every so often a failure to do this will result in a severe bug.

One danger of the example above, where the array member p->a is used, is that if the struct were to be changed to include a pointer, instead of an array, the code would still compile without modification. Note that open-coding of the calculation does nothing to prevent this.
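To make that concrete, here is a hypothetical variant of the earlier example after just such a change; it still compiles, but the count is now nonsense:
#define ARRAY_SIZE(array) (sizeof(array) / sizeof((array)[0]))

struct foo {
    int *a;                          /* was: int a[100]; */
};

void fill(struct foo *p)
{
    int i;

    /* ARRAY_SIZE(p->a) is now sizeof(int *) / sizeof(int), typically 1 or 2,
     * so most of the buffer is silently never written. */
    for (i = 0; i < ARRAY_SIZE(p->a); i++) {
        p->a[i] = i;
    }
}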

Rusty Russell’s patch to Linux’s ARRAY_SIZE macro

There are at least three awesome things about Rusty’s patch. These are:
  1. The functionality. With this version of ARRAY_SIZE, one never needs to worry about accidentally taking the array size of a pointer, as any code doing this would quite simply fail to compile when using a sufficiently recent version of GCC (such as GCC 3.1, released in May 2002) or Clang. The error message is somewhat opaque, but it’s better than the bug that would otherwise result.
  2. The subject: ‘[PATCH] Use more gcc extensions in the Linux headers’. It’s brilliantly attention-getting, but also conveys quite well that the submission was more of a request for comment than anything else, at that early stage. There’s a sense of humour about this which is wonderfully disarming, and what followed was a short discussion and a far tidier patch which was accepted soon afterwards. The follow-up patch addressing the lack of documentation is worth a quick mention, too: it simply added the comment /* GCC is awesome. */ — of course, this too was only intended as a bit of light humour, and the eventual change provides a far better explanation.
  3. The directness of the original submission. Rather than attempt to explain it, Rusty simply leaves it as a challenge to the reader! I wouldn’t suggest accepting code in this form, but I found it interesting to read his original patch and to make sense of the change. Oddly, it was probably easier than it would have been to make sense of anybody else’s explanation, or to look at the eventual implementation of the __must_be_array macro.
Rusty’s version of the ARRAY_SIZE macro, with the added ‘documentation’ (and with some whitespace changes), is as follows:
/* GCC is awesome. */
#define ARRAY_SIZE(arr) \
    (sizeof(arr) / sizeof((arr)[0]) \
     + sizeof(typeof(int[1 - 2 * \
           !!__builtin_types_compatible_p(typeof(arr), \
                 typeof(&arr[0]))])) * 0)
I won’t explain this in full, so as to provide some of the same challenge to you, except to say:
  • The initial division is the same as before, so the addend must equal zero to leave this behaviour unchanged.
  • The type of the addend must be size_t (as provided by sizeof) for all expansions of the macro to retain the type of size_t.
  • typeof(int[-1]) is an invalid expression under GCC (as is ‘sizeof(struct { int : -1; })’, upon which current kernels rely).
  • typeof and __builtin_types_compatible_p are both GCC extensions, as you may have guessed!

Usable ARRAY_SIZE macro with type-safety under GCC

My only question: why should we have to go to so much trouble? Well, I’m not sure how far fair use extends (as these macros are from QEMU and Linux, which are distributed under the GPLv2) and IANAL, but this may help:
#if defined(__GNUC__) && defined(__GNUC_MINOR__)
 #define GNUC_VERSION \
     (__GNUC__ << 16) + __GNUC_MINOR__
 #define GNUC_PREREQ(maj, min) \
     (GNUC_VERSION >= ((maj) << 16) + (min))
#else
 #define GNUC_PREREQ(maj, min) 0
#endif
#define BUILD_BUG_ON_ZERO(e) \
    (sizeof(struct { int:-!!(e)*1234; }))
#if GNUC_PREREQ(3, 1)
 #define SAME_TYPE(a, b) \
     __builtin_types_compatible_p(typeof(a), typeof(b))
 #define MUST_BE_ARRAY(a) \
     BUILD_BUG_ON_ZERO(SAME_TYPE((a), &(*a)))
#else
 #define MUST_BE_ARRAY(a) \
     BUILD_BUG_ON_ZERO(sizeof(a) % sizeof(*a))
#endif
#ifdef __cplusplus
 #include <stddef.h> /* for size_t */
 template <typename T, size_t N>
 char ( &ARRAY_SIZE_HELPER( T (&array)[N] ))[N];
 #define ARRAY_SIZE( array ) \
      (sizeof( ARRAY_SIZE_HELPER( array ) ))
#else
 #define ARRAY_SIZE(a) ( \
     (sizeof(a) / sizeof(*a)) \
     + MUST_BE_ARRAY(a))
#endif
With this version, if you take the ARRAY_SIZE of a pointer, GCC 4.9 will give you the error message:
error: negative width in bit-field ‘<anonymous>’
This will also work for Clang (as it curiously defines __GNUC__ and __GNUC_MINOR__):
error: anonymous bit-field has negative width (-1234)
The magic value of −1234 is intended as something of a give-away as to the cause of the error, should it occur, assuming that some poor soul will be shown this line on its own without the subsequent lines of output that backtrack through successive layers of macro expansion. At least this way, you would have something to grep for!

I’ve added the nice hack from Chromium’s version of the macro that will provide somewhat patchy checking for compilers without __builtin_types_compatible_p or typeof. It may be worth noting here that the approach Chromium takes is to use ‘0[a]’ instead of ‘*a’, exploiting the equivalence between pointer dereferencing and array indexing in C (a[i] is equivalent to *(a+i) and therefore to i[a]) but catching operator overloading (using an operator[] method to handle subscripting for a class) as well.

Instead of adopting Chromium’s trick, I’ve also included a portable version for C++ in the above, although this has the disadvantage that it cannot be used with types defined within a function.

What would actually make sense would be for C and C++ (and Objective C, assuming that’s also affected... and let’s just pretend Objective C++ doesn’t exist) to be extended to include a keyword or some syntax for this. In lieu of any provision for this in the standards, I feel that it would make good sense for compilers to add this as an extension. In GCC and Clang, I imagine a __builtin_arraysize function would probably make a whole lot of sense. Yes, I know things are a lot better in C++, but there may still be times when you can’t escape from its C roots, such as when writing language bindings, or when using a library written in C.

Really, this should have been fixed thirty years ago!

Monday, 29 December 2014

Shellshock and Bash

This has been discussed pretty widely already, but I include here my views on ShellShock that I made in a private post to Facebook on 26th September 2014. It might seem loosely worded in places but keep in mind that this is intended to be read by an audience of mixed technical ability. My own recollection is of quite a lot of panic and some amount of technical discussion of the problem, but very little discussion of the practices that allowed this to happen.

Firstly, from my initial posting upon finding out about the bug, a short comment on the state of the software industry:
This ‘ShellShock’ bug that’s got the net in a panic is amazing. It’s 22 years old, affects sites all over the Internet and it’s almost unthinkable that it could exist. The reason is the software that’s affected, Bash, is so cumbersome that hardly anyone has bothered to study it. I might start posting more about the sorry state of the software industry (and the surveillance industry), but I’ll try not to make it seem like ‘boring technical stuff’. This is about governance of infrastructure that affects us all. If your water, gas or electric supply were so badly mismanaged, you’d all be up in arms!
 Then, a look into Bash and Shellshock:
What is Bash? Well the name is from the appalling geek humour: “Bourne-Again SHell”. Steve Bourne was the person at Bell Labs who wrote the original shell, and Bash is a rewrite with some extra bits and pieces. The original was okay, but the rewrite has been widely criticised as overly complex and slow. There’s been some pressure to replace it, but this has been resisted by those preferring to take the “easy option”. Many will not have learnt from this that they were taking the hard option all along, and will continue to make the same mistakes.
In computing, a “shell” is just the part of the user interface that comes built into the system. In Windows this would include your Taskbar and Explorer and on a Mac, it’d include the Dock and Finder. This also includes the “open file”, “save file”, “print”, “are you sure?” boxes, and that sort of thing. 
The shell we’re talking about here is much simpler than this. You type the name of a program, press enter, and it runs it. From MS-DOS, maybe you’ve heard of “COMMAND.COM”, or perhaps you’ve seen “CMD.EXE” in Windows. They’re basically the same thing. 
That sounds like a pretty simple thing, right? Well, sure, but it doesn’t sound like a very friendly user interface, does it? Well, it isn’t. As a first step, we might want to run some sequence of programs and have that sequence stored in a list (such as “copy music from CD” and then “add files to music database”). Then again, we may not always want exactly the same sequence in every situation (maybe if I have my MP3 player attached, I want my music copying there, too). Before we even know it, we’ve ended up with what is really just a very badly designed programming language, and not the simple shell that was originally intended. 
This is where it all goes horribly wrong. Due to an almighty ‘oversight’, what is mainly used as a very simple program for running other programs will actually look for instructions to obey in any of the settings that are supplied to it. These settings (which we call “environment variables”) might well be something that a user has typed in to a web page before hitting “submit”, so where a web server just needs to run the shell to send off an email, which might need to be given NAME and ADDRESS settings to compose the mail, there might be nothing stopping the user from entering their name as “() {:;}; rm -fr /” (define a subroutine that does nothing, and then delete all files on the system). 
We seem to be caught in a combination of two attitudes: 
  • Conservatism: “Don’t fix it if it’s working, even if it’s probably broken and we just don’t know it yet. We can’t afford to make changes that might have adverse effects [especially if this disrupts our customers at all]”.
  • Recklessness: “We need more cool features and we can always fix it later. Our programming isn’t confusing, you’re just not very good at understanding it”.
Two of the few systems that have replaced Bash with something simpler are Debian, which is a community-run project without commercial concerns, and Ubuntu, which is based upon it. It's interesting that they would take the plunge and advise some caution about upgrading, whereas Apple (Mac OS X), Red Hat, openSUSE and various other systems have not made this decision. 
Much of the problem with Bash would have been easily avoided if its extra behaviour was disabled unless specifically asked for, as we could then have switched to something simpler with much greater confidence. In fact, some means of allowing Bash to throw up warnings when this extra behaviour is used might not be a bad thing to have. I’ve had to get rid of it from systems during my career and it has been frustrating to remove the ‘Bashisms’ that programmers leave in without knowing or caring. 
An especially frustrating part was dealing with software given to us by suppliers, knowing full well that my counterparts in other companies would be going through the same frustrations, not being allowed to spare them the frustration by working with the supplier and knowing that the supplier would not be terribly interested or responsive in dealing with this themselves or even in accepting changes that I would provide to them.
A friend of mine pointed out that the BSDs now avoid using Bash as a system shell, but that only newer versions (and not older versions) of OS X use Bash. My view: “To change their system shell to Bash seems a moronic thing for Apple to have done.”

As an aside, anyone who has even the slightest level of familiarity with C who hasn’t seen the original Bourne Shell source code should read about it now — but make sure that you have some Optrex and Mindbleach to hand, first, though! Steve Bourne used C macros to #define C into something that can at best be described as a cross between C, shell script and Algol. I suppose somebody had to do it first so that the rest of us could learn not to.

Emulation of Z80 shadow registers

Following on from my previous posting, where I discussed the question of how to model the Z80’s register pairs in QEMU, an emulator using dynamic code translation, this post raises the question of how to handle the Z80’s shadow register set. Firstly, I refer to Ken Shirriff’s blog post on how the Z80’s register file is implemented:
Note that there are two A registers and two F registers, along with two of BC, DE, and HL. The Z80 is described as having a main register set (AF, BC, DE, and HL) and an alternate register set (A′F′, B′C′, D′E′, and H′L′), and this is how the architecture diagram earlier is drawn. It turns out, though, that this is not how the Z80 is actually implemented. There isn’t a main register set and an alternate register set. Instead, there are two of each register and either one can be the main or alternate.
The shadow registers can be quickly swapped with main ones by means of the EX AF,AF′ and EXX instructions. EXX swaps BC with BC′, DE with DE′ and HL with HL′ but does not swap AF with AF′. (This is even the source of a bug affecting MGT’s DISCiPLE interface, which would fail to properly save or restore the AF′ register when returning to a program). The IX and IY registers are not shadowed. There’s also an EX DE,HL instruction which swaps DE and HL, but does not swap DE′ and HL′ (i.e. it only swaps DE and HL in the register set you’re currently using).

In a conventional emulator, you’d implement EX and EXX by swapping the main register set with the shadow set as might be suggested by many ‘logical’ descriptions of those instructions, rather than modelling the Z80’s approach of using register renaming through the use of four flip-flops, as described by Ken. There are a few reasons for doing it this way. Firstly, there’s a nasty patent out there which prohibits using pointers (or references) to switch between alternate register sets. Take something that’s been commonplace in hardware for donkey’s years and sprinkle the word ‘software’ on it (and maybe ‘pointer’ in place of ‘multiplexor’) and BANG — instant patent... anyway, I digress.

The other reason for this approach is one of performance. Typically a Z80 emulator will have separate code for handling every possible form of an instruction. That is to say, INC BC and INC DE are treated as completely unrelated instructions, even though their implementations are almost identical. The main loop of the emulator typically consists of a massive switch() statement with cases for the 256 possible byte values that might be read from the address held in the program counter. We perform instruction decoding and register decoding with a single switch() statement, which is most likely compiled into a jump table. The last thing we would want to do is introduce unnecessary indirection in determining which register set to access. Note that for DE and HL, we would need two levels of indirection: one for selecting between the shadow register sets and another for determining whether DE and HL have been swapped in the currently active register set.
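A minimal sketch of that conventional arrangement (the identifiers are illustrative and not taken from any particular emulator): a single switch decodes both the instruction and the registers it uses, and EXX is handled by physically swapping the pairs:
#include <stdint.h>

struct z80 {
    uint16_t pc, bc, de, hl;
    uint16_t bc_alt, de_alt, hl_alt;   /* the shadow set */
    uint8_t mem[65536];
};

static void swap16(uint16_t *a, uint16_t *b)
{
    uint16_t t = *a; *a = *b; *b = t;
}

static void step(struct z80 *z)
{
    switch (z->mem[z->pc++]) {
    case 0x03: z->bc++; break;         /* INC BC */
    case 0x13: z->de++; break;         /* INC DE: a separate, near-identical case */
    case 0xd9:                         /* EXX: swap the three pairs outright */
        swap16(&z->bc, &z->bc_alt);
        swap16(&z->de, &z->de_alt);
        swap16(&z->hl, &z->hl_alt);
        break;
    /* ... a case for each of the remaining opcode values ... */
    }
}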

(Note that Fuse, the Free Unix Spectrum Emulator, has a Perl script to generate the big 256-case switch statements, so as to avoid duplication of source code and any possible inconsistencies between cases (e.g. a typo affecting the emulation of an instruction for only one particular register). This script is pre-run before any source release, so developers running on Windows need not be too concerned unless they plan on modifying the Z80 core itself, and contributing their changes back.)

By swapping registers when emulating EX AF,AF′, EX DE,HL and EXX, we avoid all this extra expense when handling the majority of instructions, and incur a degree of additional overhead for the exchange instructions themselves. Programs are highly unlikely to use the exchange instructions frequently enough for this to be a net slowdown, and even if they did, performance would still be adequate as the cost of actually swapping the registers is not especially high. It’s pretty much a no-brainer: almost every Z80 emulator out there does it this way, and for good reason. I even did it this way in my Z80 front-end for QEMU because it was the simplest approach, because I had a few unresolved questions regarding the use of register renaming, and because I'm not sure how well that patent I mentioned applies to a dynamic translator. (I suspect it doesn't, but IANAL. And ‘Software!’...)

So, why even consider the alternative? Well, for an emulator built upon dynamic translation, such as QEMU, the cost of implementing register renaming as is actually implemented by the Z80 will only be incurred at translation-time (i.e. the first time the code is run), and not at execution-time (i.e. every time the code is run). What’s more, the conventional approach may incur a cost at execution-time which could be saved through the use of register renaming.

Within a translation block (only one tiny part of the program is translated at a time, and translation is performed only when the code actually needs to execute), TCG will track the use of registers to hold results and perform its own (implicit) register renaming similar to that of many modern CPUs (rather than the explicit renaming through the use of an instruction, used by the Z80), so an instruction sequence such as ‘EXX; SBC HL,BC; EXX’ can easily be transformed into the hypothetical SBC HL′,BC′ instruction. Problem solved, then? EXX just boils away so we can stick with the conventional approach? No, as it turns out. This example only works because we swap the register sets back again before the end of the translation block. If we didn’t, QEMU would still need to perform the full exchange operation at some point within the block of code that is generated.

Register renaming would be implemented within the QEMU Z80 target as an extra level of indirection during translation, applied when decoding Z80 instructions. The four separate flip-flops that Ken describes would become four separate flags as part of our disassembly state, which would indicate which set of AF registers is in use, which set of BC/DE/HL registers is in use, and whether DE and HL have been swapped for each of the two register sets. If these flags are grouped together, we would then have register translation tables of sixteen entries, which would be indexed by the current ‘exchange state’. Some of these could be combined with the existing register pair lookup tables (see regpair[] and regpair2[] in translate.c), which would simply gain an additional index. The EX and EXX instructions would then simply toggle bits in the disassembly state. EXX would then either need to swap the DE/HL flip-flops, or EX DE,HL could perform slightly more work in determining which flip-flop to modify.
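A rough sketch of how that disassembly state might look (the identifiers are illustrative, not the actual QEMU Z80 target code); the point is that the exchange instructions reduce to toggling bits at translation time:
/* Four renaming flags held in the disassembly context, mirroring the Z80's
 * four flip-flops; they exist only at translation time. */
#define EX_AF       (1 << 0)   /* which AF bank is currently 'main' */
#define EX_BCDEHL   (1 << 1)   /* which BC/DE/HL bank is currently 'main' */
#define EX_DEHL_0   (1 << 2)   /* DE and HL swapped within bank 0 */
#define EX_DEHL_1   (1 << 3)   /* DE and HL swapped within bank 1 */

struct disas_ctx {
    unsigned ex_state;          /* combination of the EX_* flags, 0..15, used
                                 * to index the register translation tables */
};

static void gen_ex_af_af(struct disas_ctx *dc) { dc->ex_state ^= EX_AF; }
static void gen_exx(struct disas_ctx *dc)      { dc->ex_state ^= EX_BCDEHL; }

static void gen_ex_de_hl(struct disas_ctx *dc)
{
    /* toggle the DE/HL flag belonging to whichever bank is currently active */
    dc->ex_state ^= (dc->ex_state & EX_BCDEHL) ? EX_DEHL_1 : EX_DEHL_0;
}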

The slight pitfall is that if a single EXX appears within a block of code, and that code loops, iterations will alternate in their renaming of the BC/DE/HL registers. To the code translator, this is now considered to be different code which must be translated again. In the somewhat pathological but quite plausible case of going through all sixteen exchange states for large parts of the program, we now end up going through dynamic translation sixteen times instead of just once. We also take a hit in performance when executing previously translated code blocks, as the TB lookup table will have sixteen times as many entries, and our chances of finding a match in the TB lookup cache are much reduced.

Many programs may benefit from renaming without running through all of the exchange states, and even for those that do, the extra compilation may turn out to be worth the expense. The only real way to find out is to try it, but it is also worth considering a hybrid of the two approaches. EX AF,AF′ and EX DE,HL are both short enough operations that we might simply choose to emulate them in the conventional manner. As EXX operates on three register pairs at a time, we might limit renaming to this single exchange. In the worst case, we then only generate twice as much code, but we would gain an advantage in programs that use each register set for separate purposes but still switch between them frequently.

Emulation of Z80 register sets

Way back in 2007, I wrote a Z80 emulation front-end for QEMU. Being based upon dynamic translation, QEMU has a set of front-ends for allowing it to emulate a range of CPUs. The front-ends decode portions of a program’s machine code the first time they run, and make calls to the code generator, which was ‘dyngen’ at the time that I started, but is now ‘TCG’ (Fabrice Bellard’s ‘Tiny Code Generator’).

TCG also requires a back-end for generating machine code for the system upon which the emulator is to be run. For most people, this is probably some variety of x86 (32-bit or 64-bit), although hopefully, in future, the ARM and MIPS back-ends will see more use. Writing a back-end is non-trivial, but a front-end is typically a far more substantial piece of software. (Whilst a TCG back-end for the Z80 is quite feasible, porting QEMU as a whole would be of little interest due to a lack of sufficiently powerful Z80-based hardware.) Dyngen was a most amazing hack that ‘worked’ by taking pre-compiled template functions, chopping their prologues and epilogues (entry/exit code) and then simply pasting these together to generate code. Dyngen’s back-end was quite simply GCC itself. Unfortunately, this approach was fragile, and so dyngen was replaced with TCG to allow support for GCC 4, which had changed the way function prologues and epilogues were generated.

When writing the Z80 target, I was faced with various questions as to how to best model the Z80’s register set, which I have still not resolved. The Z80 allows its 8-bit registers to be accessed as register pairs, of 16 bits in length. The B and C registers can be paired together as BC, the D and E registers as DE, the H and L registers as HL, IXh and IXl as IX, and IYh and IYl as IY. The A (accumulator) and F (flags) registers can also be accessed as the register pair AF, and sometimes it is useful or necessary to do this (e.g. there is a PUSH AF instruction, but no such thing as PUSH A). The Z80 also has a shadow register set, and register pairs in the shadow set can be ‘swapped’ with those in the main set using the EX AF,AF′ and EXX instructions. I’ll discuss these shadow registers in a subsequent post but for now, I’ll concentrate on the question of whether to model the registers as 8-bit registers or as 16-bit register pairs.

For most operations that use or update the value in AF, I decided to model them as individual registers. Flags are often set independently of the accumulator, so if we modelled the two as a single register pair, we would incur extra processing to preserve the value of A when updating AF. That said, the flags are often only partially updated, so partial updates would simply need to be extended to preserve the full value of A. What complicates matters is that many flag updates can be optimised out entirely, where it can be shown that the flag values that would be calculated will simply be replaced with entirely new values without being checked or stored in any way, as QEMU already implements for x86. The question that is in need of answering here is whether this system of flag optimisation could be sensibly updated to work with an AF register pair instead of the individual F register. One concern is that the regular updates of the A register that occur in pretty much any Z80-based program could slow down the flag optimisation pass during code generation sufficiently to outweigh any benefit in handling AF as a register pair, but there are other detrimental effects that this could have such as poor interaction with TCG’s optimiser.

I modelled the remaining register pairs, BC, DE, HL, IX and IY, as 16-bit register pairs rather than as individual 8-bit registers, as it seemed to me that they would be used frequently enough as pairs for this to be worthwhile, but I made sure that it was a simple change to a #define to switch to handling them as 8-bit registers, so that I could properly evaluate this later on. Operations such as INC H can be easily implemented as HL = HL + 256, so there is some hope for this approach.
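For instance, a minimal sketch under that model, with plain C standing in for the generated TCG operations and flag handling ignored:
#include <stdint.h>

/* With HL held as one 16-bit value, INC H is an add of 0x0100 that leaves L
 * alone; INC L needs a little masking to avoid carrying into H. */
static inline uint16_t do_inc_h(uint16_t hl)
{
    return (uint16_t)(hl + 0x0100);
}

static inline uint16_t do_inc_l(uint16_t hl)
{
    return (uint16_t)((hl & 0xff00) | ((hl + 1) & 0x00ff));
}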

That said, there is the question as to how TCG’s optimiser would deal with the setting of high and low parts of a register through a sequence such as ‘XOR A; LD D,A; LD E,A; LD H,A; LD L,A’. ‘XOR A’ sets A to be the exclusive-or of itself with itself, i.e. 0, as a single-byte instruction (with the side-effect of updating F). We then set the D, E, H and L registers individually with single-byte instructions. There isn’t an ‘LD HL,DE’ instruction, although it would make a lot of sense as a synthetic instruction, provided you keep in mind that it could be interrupted mid-way. Even without handling ‘XOR A’ as a special case in the front-end, TCG’s constant folding pass will recognise that the instruction sets A to a constant value of 0. This then makes its way through TCG’s constant propagation pass. Ignoring flags (which will often be optimised out), we would have something like this: ‘A = 0; D = 0; E = 0; H = 0; L = 0’. With register pairs, this might expand to something like: ‘A = 0; DE = (((DE & 0x00ff) | (0 << 8)) & 0xff00) | 0; HL = (((HL & 0x00ff) | (0 << 8)) & 0xff00) | 0’, which would then be optimised into: ‘A = 0; DE = 0; HL = 0’.

If we were to load any value other than a 0, or if TCG were unable to identify the value in A as being a constant (say, if the XOR A instruction happened in a different block of code, perhaps before a conditional jump), then I would expect that ‘LD D,H; LD E,L’ could easily turn into ‘DE = (((DE & 0x00ff) | (H << 8)) & 0xff00) | L’. It would probably not be too hard to train the optimiser into recognising the set of bits within a TCG variable that are constant, rather than handling variables as being fully constant or fully variable, but it’s likely that this would be of marginal benefit for a lot of architectures. Still, 16-bit x86 allows access to the 8-bit AH and AL values (even if the top 16 bits aren’t so easily accessed in 32-bit x86, nor the top 32 bits in 64-bit x86), so the Z80 is by no means alone in this.