So you want to be a cOOmpiler writer?


The story so far...

When I started this series in Overload 5, I intended to show you what makes a compiler tick and how you design and build such a beast in C++. I was going to illustrate this with snippets of Programming Research code and discussion of some of the "bad practice" issues that our QA C++ product detects. In future articles in this series, I may yet do that, but much has happened since I wrote that introductory article over a year ago, and I want to digress somewhat.

A little diversion

I've been using C++ for about three and a half years and have been involved with the standardisation process all that time, initially attending BSI panel meetings, then ISO meetings, and for the last 18 months ANSI meetings as well. When my company first looked at C++, we had about 200K lines of C in our products - much of it K&R C, but being migrated to ISO C by virtue of our in-house coding standards being automatically applied. Our decision to adopt C++ for future development meant that we would have to deal with mixed-language development, because we intended to reuse many of our generic library components - all written in C. My long-term plan was to migrate all the C code to C++, as I felt this would make maintenance easier, so we reviewed our coding standards for C to outlaw all incompatibilities with C++ and began to apply them incrementally. By the time I started writing this series, we had around 50K lines of C++, and about 60K lines of our C code was compilable as either C or C++. As I write this second instalment, one year on, the balance has changed completely: we have only about 100K lines of pure C code left and about 140K lines of C++.

Why am I telling you this? I mentioned our generic library components above. These are not container classes and so on, but reusable code that deals with command-line argument processing, configuration and message files, data filters and error databases. We have been able to reuse most of our GUI code for three of our four products. These are all "obvious" components in products like ours, and the result is that each of our language analysis products comprises a parsing engine - typically 40K lines - and support and GUI code totalling a further 50K lines. For three products, that makes 3 x 40K + 50K = 170K lines, and it also means that the parsing engine is far and away the greatest contribution to any new product we plan. So we've been looking at parsing abstractions in order to make it easier to build new engines...


A typical compiler

At the heart of a typical compiler are three things: a lexer, a parser and a symbol table. To make compiler writing easier, tools were developed that can generate lexers and parsers from symbol and grammar descriptions. UNIX developers know these as lex and yacc, supplied with every system. There are many variants (e.g., flex, byacc), and PC versions are available too; recently, C++ wrappers have also appeared. In essence, they all work in the same way: you write a description of a symbol, or of a sentence in the language, and lex, or yacc, produces code that "accepts" it.

	/* some tokens using lex: */
	IDENT	[A-Za-z_][A-Za-z0-9_]*
	NUMBER	[0-9]+
	%%
	{IDENT}		return IDENT;
	{NUMBER}	return NUMBER;
	"+"		return PLUS;
	"-"		return MINUS;
	"*"		return TIMES;
	"/"		return DIVIDE;

	/* an expression using yacc: */
	%token IDENT NUMBER PLUS MINUS TIMES DIVIDE
	%%
	expr: expr addop factor | factor ;
	factor: factor mulop term | term ;
	term: IDENT | NUMBER ;
	addop: PLUS | MINUS ;
	mulop: TIMES | DIVIDE ;
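
Once lex and yacc have generated their C code, it is driven from an ordinary main(): yacc emits a function yyparse(), which in turn calls the lexer's yylex() for each token it needs. A minimal driver - the extern "C" is needed only if, like us, you are linking the generated C into a C++ program:

	// a minimal driver for the generated parser
	extern "C" int yyparse();

	int main()
	{
		return yyparse();	// 0 if the input was accepted
	}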

The error-handling in both of these tools is legendarily poor, and not all languages fit easily into the mould of yacc's restricted grammar description (e.g., typedef in C means that the lexer has to be smart enough to tell the parser whether an identifier is a type name or not). The effort involved in making languages like C++ or FORTRAN "yacc-able" is huge. For instance, FORTRAN has no keywords:

	      INTEGER IF(100)
	      I=1
	C     assign to element of array
	C     called IF:
	      IF(I)=2
	C     if statement:
	      IF(I.NE.0) GOTO 10

and it allows non-significant whitespace in identifiers:

	C     start of DO-loop:
	      DO 10 I = 1,3
	C     assign 1.3 to variable
	C     called do10i:
	      DO 10 I = 1.3

In C++, deciding whether a sequence of tokens is a declaration or an expression can involve an arbitrary amount of lookahead.
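
The classic examples (assuming here that T names a type) show why: the token that finally settles the question can be arbitrarily far from the start, and the rule is that anything that can be parsed as a declaration is one:

	T(a)->m = 7;		// expression: cast a to T, then access member m
	T(a)++;			// expression: cast a to T, then increment
	T(*d)(int);		// declaration: d is a pointer to function
	T(e)[5];		// declaration: e is an array of five T
	T(f) = { 1, 2 };	// declaration of f with an initialiser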

In our parsing engines, we use a mix of yacc-like table-driven parsers and hand-crafted recursive descent or state-transition parsers (I'll come back to these terms later in the series). For lexical analysis, we tend to hand-craft because we usually want to add a lot of checks into the lexer to support coding standards (again, more on this later in the series).
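
To give a flavour of the first of those terms now, here is a minimal sketch of a recursive descent parser for the little expression grammar above: one function per grammar rule, with the left recursion re-expressed as iteration. The Token enumeration and the next() and error() helpers are hypothetical scaffolding, not code from our engines:

	enum Token { IDENT, NUMBER, PLUS, MINUS, TIMES, DIVIDE, END };

	Token lookahead;		// most recent token from the lexer
	void next();			// hypothetical: advance to the next token
	void error(const char*);	// hypothetical: report a syntax error

	// term: IDENT | NUMBER
	void term()
	{
		if (lookahead == IDENT || lookahead == NUMBER)
			next();
		else
			error("identifier or number expected");
	}

	// factor: factor mulop term | term
	void factor()
	{
		term();
		while (lookahead == TIMES || lookahead == DIVIDE) {
			next();		// consume the operator
			term();
		}
	}

	// expr: expr addop factor | factor
	void expr()
	{
		factor();
		while (lookahead == PLUS || lookahead == MINUS) {
			next();		// consume the operator
			factor();
		}
	}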


What about reusability?

Since the grammar is different for every language, you clearly cannot reuse the parser. Lexical structure is often similar, so we may make some progress there, and basic symbol-table operations have enough commonality that quite a lot of reuse should be possible. Again, these are "obvious" candidates for reuse because they are fairly concrete. The reason I held this column over from Overload 7 is that I want to look beyond the obvious candidates for some more abstract ones: the sort of candidates touched on by the columns on multiple inheritance and object relationships.

Back on track

In Overload 5, I outlined how a C++ compiler works in terms of the phases of translation, and said that part II would take a slightly closer look at phases 1 to 4. Before diving into the mechanics of each phase, let's take a step back and look at what a phase does.

Each phase is a mapping - in particular, phase 3 is a mapping from a stream of (internal) characters to a stream of tokens. Later phases map streams of tokens to different streams of tokens; earlier phases map streams of characters to different streams of characters. Perhaps we could say:

	class Lexer
	: public Mapping<char, pptoken> ...

But a phase is also a source of characters or tokens or...

	class Lexer
	: public Source<pptoken> ...

What operations does a phase support? You can obviously "get" a character or token...

	class Lexer ... {
	public:
		pptoken get();
	};

But since a phase is also a source, we might expect get() to be inherited from the base class Source<pptoken>.
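
One plausible shape - a sketch only, and the pure virtual get() is my assumption - would be:

	template<class T>
	class Source {
	public:
		// each concrete source says how it produces the next T
		virtual T get() = 0;
	};

	class Lexer : public Source<pptoken> {
	public:
		pptoken get();	// overrides Source<pptoken>::get()
	};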

The client of each phase is the next phase in the sequence - such clients will either be hard-coded (statically bound) or instantiated at run-time (dynamically bound). The former approach might look like:

	class Preprocessor ... {
	private:
		Lexer	source;
	};

whereas the latter approach might look like:

	class Preprocessor ... {
	public:
		Preprocessor(Lexer* source)
		: lexer(source) ...
		...
	private:
		Lexer*	lexer;
	};

or more realistically:

	class Preprocessor ... {
	public:
		Preprocessor(Source<pptoken>* source)
		: lexer(source) ...
		...
	private:
		Source<pptoken>*	lexer;
	};
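
With this version the pipeline can be assembled at run-time. A hypothetical fragment (the default construction of Lexer and the call to get() are assumptions for illustration):

	Lexer lexer;				// a Source<pptoken>
	Preprocessor preprocessor(&lexer);	// reads pptokens from the lexer
	pptoken token = preprocessor.get();	// pulls a token through both phases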

Note that any source of pptokens is an acceptable argument to Preprocessor - it doesn't have to be a phase, i.e., Preprocessor does not need access to any of the mapping machinery in its argument (the sketch after the next listing shows just such a source). Finally, we would want to explicitly write out the abstraction of "phase":

	template<class T>
	class Source { ... };

	template<class From, class To>
	class Mapping { ... };

	template<class From, class To>
	class Phase
	: public Source<To>,
	  public Mapping<From, To> { ... };

	class Lexer
	: public Phase<char, pptoken> { ... };

	class Preprocessor
	: public Phase<pptoken, pptoken> { ... };
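
And, as promised, here is a source of pptokens that is not a phase at all: a canned token stream, of the sort you might use to test a Preprocessor in isolation. The class is invented for illustration:

	// replays a fixed array of tokens; repeats the final
	// (presumably end-of-input) token once exhausted - assumes n > 0
	class CannedSource : public Source<pptoken> {
	public:
		CannedSource(const pptoken* t, int n)
		: tokens(t), count(n), index(0) { }
		pptoken get()
		{ return tokens[index < count ? index++ : count - 1]; }
	private:
		const pptoken*	tokens;
		int		count;
		int		index;
	};

	// a Preprocessor neither knows nor cares:
	// CannedSource canned(tokens, n);
	// Preprocessor preprocessor(&canned);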

I'm going to stop there for now. Elsewhere in this issue, various authors discuss multiple inheritance and virtual. Since both of those play a big part in what comes next, I'd like you all to think about the code above. As an exercise, you might like to try to write the interfaces for the base classes Source and Mapping - what services do they provide and what initial information do they need?