So you want to be a cOOmpiler writer?


This is intended to be the first in an ongoing series of articles that will look closely at what makes a compiler tick. Each article will cover a different aspect of the compilation process, giving code fragments and discussing the subtleties of both the language being compiled and the compiler itself. I shall focus on six specific areas of interest:

About the author

I have been involved with language design and compiler writing for about 12 years. I started out designing "toy" languages and writing interpreters for them - functional programming was all the rage in those days - and moved on to flow analysis and code optimisation. Soon I was involved with a language translator that took C and translated it into other languages - Turbo Pascal was an early target. This led to a commercial C compiler and eventually to one of the first validated C compilers (in September 1990). Along the way, I was involved with porting a popular COBOL compiler to two new RISC architectures, and generating C code for "farms" of transputers from actuarial code. For the last three years, I have been at Programming Research, writing source code inspection tools for C and more recently C++. This series of articles is based on my experiences with C++ over the last couple of years, as a "compiler" writer, as a C++ user, as a Quality Manager and as a member of J16/WG21 - the joint C++ standards committee.

Designing and Writing an OO-compiler

Under this heading, I hope to fill you with enthusiasm for buying, rather than writing, a C++ compiler. No prior knowledge of compiler techniques should be required - feel free to write in if you disagree! My intention is to explain why I made the decisions I made (and where I got it wrong) and what sort of mechanics are involved, both in compiling an object-oriented language and in writing an occasionally complex piece of software like a compiler. I may even talk about code generation, but I try not to practise it for C++.

C++ Language Features and Techniques

Prior knowledge of C++ is expected, but I was learning as I went too, so I shall examine certain features of the language with both a novice's eye and critical hindsight.

Using C++ to write a compiler

The easiest way to come to terms with a language is to use it to write a compiler for that language. You'll trip over all the things that your users will, so you'll be better informed to help them - you may even add warnings to your compiler to help your users avoid the problem areas! Also, writing a compiler forces you to become very familiar with the language and all its peculiarities. C++ has a lot of those.

Parsing C++ source code

On a scale of 1 to 10 compared to other languages I have parsed in the past, C++ represents a sure-fire 11. Syntactically, it isn't too bad and traditional parsing tools and techniques can be applied, but it makes up for that in semantic complexity. Fortunately, the language itself provides helpful support for simpler solutions to some of the trickier semantic analysis issues.
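As a taste of that semantic complexity, consider this small illustration (my own, not drawn from any product): the parse of the token sequence `a * b` depends entirely on what the name `a` denotes, so the parser cannot proceed without consulting semantic information.

```cpp
#include <cassert>

struct a {};            // here 'a' names a type...
int c = 2, d = 3;       // ...while 'c' and 'd' name objects

void f() {
    a * b = 0;          // "a * b" is a declaration: b is a pointer-to-a
    assert(b == 0);
}

int g() {
    return c * d;       // "c * d" is an expression: a multiplication
}
```

The identical token sequence means two different things in the two functions, which is why a C++ parser must carry symbol-table knowledge even while it is still parsing.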

Code Quality

I'm going to bare my soul and publish fragments of my own code. That means you'll get to see some of the inner workings of the Programming Research product, and where appropriate it will say "copyright: (c) 1994 Programming Research". I will contend that these fragments represent "good practice", and I hope that I will be able to convince you why. I will also show you "bad" fragments and explain why they are so (including a few unfortunate pieces of code that I had no choice about). Since our product detects "bad practice", I will also explain what is involved in detecting some of these abuses - it will give you a feel for what a human code inspector has to go through to do the same thing.

Standardisation: The evolving draft ISO standard

No article of mine would be complete without some discussion of the draft ISO standard. Mainly I'll focus on how the changing standard might affect you and what you can do about it.

What is a compiler?

As far as I am concerned, a compiler is any system that takes source code as input and outputs a "translation" of that source code. The most common form of compiler takes a computer language (e.g., C++) and produces machine code (e.g., a .EXE file). Much of the business of compilers also applies to interpreters - they usually include a phase that compiles from the external form of the input to some internal form that can be "executed".

For the purposes of this article, though, I shall use the most common meaning, and describe what is involved in translating C++ source code into something closer to the machine. I shan't go into much depth on code generation because it's beyond the scope of an article which is essentially about C++.

Phases of Translation

I'll start by explaining what the draft ISO standard says a compiler does and point out the areas on which I shall concentrate. ISO C describes the compilation process as a conceptual series of eight phases, and the draft ISO C++ standard has adopted the same model. I shall quote from the relevant clause in the C++ working paper (25 January 1994 - 94-0027/N0414):

  1. "Physical source file characters are mapped to the source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences are replaced by corresponding single-character internal representations."
    This is typically an identity mapping, although it is often useful from a compiler's point of view to introduce 'special' characters to represent end-of-line. Trigraph replacement is trivial - trigraphs are three-character sequences, beginning with two question marks, that represent single characters not available on all non-English keyboards, e.g., ??/ becomes \ and ??< becomes {.
  2. "Each instance of a new-line character and an immediately preceding backslash character is deleted, splicing physical source lines to form logical source lines. A source file that is not empty shall end in a new-line character, which shall not be immediately preceded by a backslash character."
    This phase is trivial too.
  3. "The source file is decomposed into preprocessing tokens and sequences of whitespace characters (including comments). A source file shall not end in a partial preprocessing token or comment. Each comment is replaced by one space character. New-line characters are retained. Whether each nonempty sequence of white-space characters other than new-line is retained or replaced by one space character is implementation-defined. The process of dividing a source file's characters into preprocessing tokens is context-dependent. For example, see the handling of < within a #include preprocessing directive."
    Commonly called 'lexing', this phase produces tokens from characters. Since this is a well-documented and mechanical phase, I shall not examine it closely.
  4. "Preprocessing directives are executed and macro invocations are expanded. A #include preprocessing directive causes the named header or source file to be processed from phase 1 through phase 4, recursively."
    Phases 1 to 4 are sometimes performed by a separate preprocessing program. For C++, the preprocessor is less important than it was for C and its use is to be viewed with some suspicion (except for #include which, unfortunately, we still cannot do without). I shall not examine preprocessing in any detail.
  5. "Each source character set member and escape sequence in character constants and string literals is converted to a member of the execution character set."
    This phase mechanically converts the content of character constants and string literals and holds few surprises. I shall not mention it again.
  6. "Adjacent character string literal tokens are concatenated and adjacent wide string literal tokens are concatenated."
    Hands up, who has used this feature? Not many, I'll bet. This was introduced by ANSI during the standardisation of C to provide an alternative to some rather unpleasant preprocessor tricks and to help those programmers who were dealing with very long string literals. If you think

    	const char* hi = "Hello" /* and */ "world!";
    looks odd, then avoid this feature. In fact, avoid it anyway!
  7. "White-space characters separating tokens are no longer significant. Each preprocessing token is converted into a token. The resulting tokens are syntactically and semantically analyzed and translated. The result of this process starting from a single source file is called a translation unit."
    Now this is compiling! This single phase is the vast bulk of your average C++ compiler and it is this phase that I shall be explaining in the forthcoming articles.
  8. "The translation units that will form a program are combined. All external object and function references are resolved."
    Oh, and this is the linker, tacked on at the end! For C++ the water is muddied somewhat by the process of instantiating templates on demand when the compiler needs all those pieces of code you haven't written yourself. I shall come back to this in a future article on templates.
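To make phases 1 and 2 concrete, here is a minimal sketch (my own illustration, not the Programming Research code) of trigraph replacement and line splicing, working over an in-memory string. Note the ordering matters: a ??/ at the end of a line becomes a backslash in phase 1, which then splices the line in phase 2.

```cpp
#include <string>

// Phase 1 (sketch): replace each of the nine trigraph sequences
// with the single character it denotes.
std::string replace_trigraphs(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (i + 2 < in.size() && in[i] == '?' && in[i + 1] == '?') {
            char repl = 0;
            switch (in[i + 2]) {
                case '=':  repl = '#';  break;
                case '/':  repl = '\\'; break;
                case '\'': repl = '^';  break;
                case '(':  repl = '[';  break;
                case ')':  repl = ']';  break;
                case '!':  repl = '|';  break;
                case '<':  repl = '{';  break;
                case '>':  repl = '}';  break;
                case '-':  repl = '~';  break;
            }
            if (repl) {
                out += repl;
                i += 2;     // consume all three trigraph characters
                continue;
            }
        }
        out += in[i];
    }
    return out;
}

// Phase 2 (sketch): delete each backslash/new-line pair, splicing
// physical source lines into logical source lines.
std::string splice_lines(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '\\' && i + 1 < in.size() && in[i + 1] == '\n') {
            ++i;            // skip the new-line as well
            continue;
        }
        out += in[i];
    }
    return out;
}
```

A real compiler would, of course, do this character by character as it reads the file rather than building intermediate strings, but the mapping itself really is this simple.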

Next Issue

In the next article of the series, I shall look at phases 1 - 4 in a bit more detail, and present the first few code fragments.