So you want to be a cOOmpiler writer?


At the end of part II, I provided some skeleton classes and asked you to consider what the interfaces should be. I'm going to start by fleshing out one of those interfaces and then look in a little more detail at some aspects of preprocessing. I may come back to the other classes in a future issue but I am no longer in a position to divulge some of the details that I had planned!

At source

The class I want to look at is the base class Source. In a purely abstract sense, all we can say for sure about it is that we can repeatedly 'get' items from a Source until it is 'empty':

	template<class T>
	class Source
	{
	public:
	  virtual T	get() = 0;
	  virtual bool	empty() const = 0;
	  virtual ~Source() { }
	};

The member functions are both "pure virtual" because there can be no sensible generic implementation for them - they must be provided by more derived classes. The class has no data members - no state - so the default constructor (implicitly generated by the compiler) is appropriate here. However, we must provide an explicit virtual destructor because the implicitly generated one would be non-virtual and could lead to problems later on when deleting through a base-class pointer. In many ways, assignment and copy construction are irrelevant because the class has no state information.

What kind of Sources will we be dealing with in reality? We've already said each Phase is a Source, but we also need the lowest-level Source: a file. Or is it? An istream is a more general Source than an ifstream, so perhaps we should consider that first:

	template<class T>
	class StreamSource
	: public Source<T>
	{
	public:
	  StreamSource(basic_istream<T>& i_) : i(i_) { }
	  virtual T	get() { T c; i.get(c); return c; }
	  // eof() is only set after a failed read, so peek ahead instead
	  virtual bool	empty() const
	  { return i.peek() == char_traits<T>::eof(); }
	  virtual ~StreamSource() { }
	private:
	  basic_istream<T>& i;
	};
	StreamSource<char>	charSource(cin);
	StreamSource<wchar_t>	wideSource(wcin);
By templatizing StreamSource and using basic_istream, we take our first steps towards 'global' (internationalized) programs - a useful point to remember.
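To make the idiom concrete, here is a minimal, self-contained sketch of the two classes (using fully qualified standard names, and peeking ahead in empty() since eof() only becomes true after a failed read), together with the drain loop we will keep writing:

```cpp
#include <istream>
#include <sstream>
#include <string>

// Sketch of the Source/StreamSource pair from the text.
template<class T>
class Source
{
public:
  virtual T    get() = 0;
  virtual bool empty() const = 0;
  virtual ~Source() { }
};

template<class T>
class StreamSource : public Source<T>
{
public:
  StreamSource(std::basic_istream<T>& i_) : i(i_) { }
  virtual T    get() { T c; i.get(c); return c; }
  // peek ahead: eof() alone is only set after a read has failed
  virtual bool empty() const
  { return i.peek() == std::char_traits<T>::eof(); }
  virtual ~StreamSource() { }
private:
  std::basic_istream<T>& i;
};

// Drain a Source into a string - the get/empty idiom from the text.
std::string drain(Source<char>& s)
{
  std::string out;
  while (!s.empty())
    out += s.get();
  return out;
}
```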

Whither STL?

Kevlin Henney noted in Overload 9 (Applying the STL mindset) that the phases could probably be rewritten in terms of iterators and, given the above class interfaces, we are going to end up with a lot of code that does something like this:

	Source<T>& st = ...;
	while (!st.empty())
	{
	  T t = st.get();
	  // do stuff with t
	}
STL-style iterators would indeed allow us to rewrite this as:

	Source<T>& st = ...;
	for (Source<T>::iterator i = st.begin();
	     i != st.end(); ++i)
	{
		T t = *i;
		// do stuff with t
	}
By definition, however, a Source<T>::iterator would be an "input iterator" and these are one-pass iterators which come complete with a lot of semantic restrictions. For the initial input phases of preprocessing, this is not a great problem but as the mapping involved in each phase becomes more complex the restrictions associated with input iterators make them unworkable. The approach that I took was to implement all the early phases with the get/empty interface and then collect all the tokens that came out of preprocessing into a list that could be iterated over by the parser:

	Preprocessor*	p = new Preprocessor(filename);
	list<Token>	source;
	while (!p->empty())
	{
		source.push_back(p->get());
	}
	delete p;
	parseProgram(source.begin(), source.end());
This is somewhat simplified because the actual input to the parser is the output of phase six whereas the preprocessor is only really phases 1 to 4. Assuming that we really wanted to implement the input iterator for a Source, let's examine how we'd go about it:

	template<class T>
	class Source
	{
	public:
	  class iterator : public input_iterator<T>
	  {
	  public:
	    iterator(Source<T>* sp_) : sp(sp_) { }
	    T		operator*() { return sp->get(); }
	    iterator&	operator++() { return *this; }
	    iterator	operator++(int) { return *this; }
	    friend bool operator==(const iterator&,
	                           const iterator&);
	  private:
	    Source<T>* sp;
	  };
	  iterator begin() { return iterator(this); }
	  iterator end() { return iterator(0); }
	  // ... as before ...
	};
The implementation of the equality operator is left as an exercise for the reader. Note two things:
  1. this assumes the != operator provided by STL - a template operator that guarantees "x != y" means "!(x == y)",
  2. every time you use operator* on the iterator, the Source is advanced.
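For illustration, that generic inequality is a one-line template; the sketch below (the namespace name is mine) shows how any type that defines only == picks up a consistent !=:

```cpp
// Sketch of the STL's generic inequality operator: any type that
// defines == automatically gets a != that means !(x == y).
// (The draft standard supplies an equivalent in std::rel_ops.)
namespace rel_ops_sketch {
  template<class T>
  bool operator!=(const T& x, const T& y) { return !(x == y); }
}

// A minimal type with only operator== defined.
struct Id {
  int v;
  Id(int v_) : v(v_) { }
};
bool operator==(const Id& a, const Id& b) { return a.v == b.v; }
```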

That last point may surprise you - in fact, the semantics of input iterators are under review by the C++ committee at present and it may turn out that the above implementation violates the intended requirements of an input iterator (I don't believe it does at present). If the requirements change, then the 'current item' would need to be cached within the iterator and a few other adjustments made - I may revisit this in a future article.


Parsing a program

In C++, a program is a sequence of declarations, so it shouldn't surprise you that I implemented that as follows:

	void parseProgram(
		list<Token>::iterator cur,
		list<Token>::iterator eof
	)
	{
	  while (cur != eof)
	  {
	    cur = parseDeclaration(cur, eof);
	  }
	}
parseDeclaration takes a pair of iterators specifying a range, [cur, eof), and returns an iterator that refers to the next, unparsed Token in the list. I will return to this later in the series.
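As a sketch of the convention, here is a hypothetical parseDeclaration that simply eats tokens up to and including the next semicolon, with strings standing in for Tokens - the real function is of course far more involved:

```cpp
#include <list>
#include <string>

// Illustrative stand-in: a Token is just its spelling.
typedef std::string Token;

// Consume one "declaration" - everything up to and including the
// next ';' - and return an iterator to the first unparsed Token,
// i.e. the [cur, eof) convention described in the text.
std::list<Token>::iterator parseDeclaration(
    std::list<Token>::iterator cur,
    std::list<Token>::iterator eof)
{
  while (cur != eof && *cur != ";")
    ++cur;
  if (cur != eof)
    ++cur;                     // step past the ';'
  return cur;
}

// The driving loop, exactly as in the text.
int countDeclarations(std::list<Token>& toks)
{
  int n = 0;
  std::list<Token>::iterator cur = toks.begin();
  while (cur != toks.end())
  {
    cur = parseDeclaration(cur, toks.end());
    ++n;
  }
  return n;
}
```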

Of symbols...

In Overload 8, I commented that the symbol table is an obvious abstraction in a compiler. The main symbol table required in C++ is relatively complicated because it needs to be "scope-aware" but the preprocessor also needs symbol tables - for macros and preprocessor directives. First of all, let's look at what information we need for a preprocessing token: its name (or spelling), its "key" (e.g., IDENTIFIER, HASH, WHITESPACE) and its position in the source code. We need the latter to be able to accurately report the location of warnings that we detect later on in the analysis. For the purposes of preprocessing, only a few different types of token are important, in particular there are no keywords.
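A minimal sketch of such a token, with illustrative names (these are not the article's actual classes):

```cpp
#include <string>

// Illustrative token kinds - only the few that matter to preprocessing;
// note there are no keywords at this stage.
enum TokenKey { IDENTIFIER, HASH, WHITESPACE, PP_NUMBER, OTHER };

// Where the token came from, for accurate diagnostics later on.
struct Position {
  std::string file;
  int line;
  Position(const std::string& f, int l) : file(f), line(l) { }
};

// A preprocessing token: its spelling, its key and its position.
class Token {
public:
  Token(const std::string& name, TokenKey key, const Position& where)
  : name_(name), key_(key), where_(where) { }
  const std::string& name()  const { return name_; }
  TokenKey           key()   const { return key_; }
  const Position&    where() const { return where_; }
private:
  std::string name_;
  TokenKey    key_;
  Position    where_;
};
```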

There are two types of macros: object-like and function-like. In an ideal world we could implement these along the following lines:

	class Macro
	{
	public:
	  Macro(const string& n_, const list<Token>& b_)
	  : name_(n_), body_(b_) { }
	  virtual		~Macro() { }
	  virtual list<Token>	expand() const = 0;
	protected:
	  const string&		name() const { return name_; }
	  const list<Token>&	body() const { return body_; }
	private:
	  const string		name_;
	  const list<Token>	body_;
	};
	class ObjectMacro : public Macro
	{
	public:
	  ObjectMacro(const string& n_, list<Token> b_)
	  : Macro(n_, b_) { }
	  virtual		~ObjectMacro() { }
	  list<Token>		expand() const;
	};
	class FunctionMacro : public Macro
	{
	public:
	  FunctionMacro(const string& n_,
	                list<Token> b_, list<Token> p_)
	  : Macro(n_, b_), params_(p_) { }
	  virtual		~FunctionMacro() { }
	  bool			bind(const list<Token>& args);
	  list<Token>		expand() const;
	private:
	  list<Token>		params_;
	// ...
	};
Then preprocessing would proceed like this:

	  if (token.key() == IDENTIFIER)
	  {
	    MacroTableIterator m =
	        macroTable.find(token.name());
	    if (m != macroTable.end())
	    {
	      if (FunctionMacro* fm =
	          dynamic_cast<FunctionMacro*>(m->second))
	      {
	        // collect arguments from token stream
	        if (!fm->bind(args))
	        {
	           warning(token.where(), BAD_MACRO_CALL,
	                   token.name());
	        }
	      }
	      list<Token> expansion = m->second->expand();
	    }
	  }
Notice how we can bind the macro arguments after downcasting in the case of a function-like macro but always despatch the macro expansion from the base class (the downcast only works if the table stores the macros by pointer, e.g. hash_map<string,Macro*>). Unfortunately, RTTI is not portable enough at the moment to allow us the luxury of implementing macros like this. Instead, I implemented it all in one class (Macro) and provided a method isFunction() that indicated whether binding arguments was a sensible operation - an engineering compromise. macroTable is implemented as a hash table - something sadly missing from the draft standard library - that maps the name of a macro to the implementation of the macro (hash_map<string,Macro>).
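That compromise can be sketched as a single class along these lines (names illustrative; expand() here simply returns the body, with no argument substitution, and bind() is only meaningful when isFunction() is true):

```cpp
#include <list>
#include <string>

typedef std::string Token;   // spelling only, for illustration

// Sketch of the single-class compromise: one Macro type covering both
// object-like and function-like macros, with isFunction() standing in
// for the dynamic_cast of the ideal-world design.
class Macro {
public:
  // Object-like macro.
  Macro(const std::string& name, const std::list<Token>& body)
  : name_(name), body_(body), isFunction_(false) { }
  // Function-like macro.
  Macro(const std::string& name, const std::list<Token>& body,
        const std::list<Token>& params)
  : name_(name), body_(body), params_(params), isFunction_(true) { }
  bool isFunction() const { return isFunction_; }
  // Binding succeeds only for function-like macros with matching arity.
  bool bind(const std::list<Token>& args)
  { return isFunction_ && args.size() == params_.size(); }
  std::list<Token> expand() const { return body_; }
  const std::string& name() const { return name_; }
private:
  std::string      name_;
  std::list<Token> body_;
  std::list<Token> params_;
  bool             isFunction_;
};
```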

The preprocessing operations that affect macros can be easily implemented:

	// #define macroName body
	// #define macroName(args) body
	  MacroTableIterator m = macroTable.find(macro.name());
	  if (m == macroTable.end())
	  {
	    macroTable[macro.name()] = macro;
	  }
	  else
	  {
	    // if args & body are not identical to previous
	    // definition, warn about it
	  }

	// #undef macroName
	  MacroTableIterator m = macroTable.find(macro.name());
	  if (m != macroTable.end())
	  {
	    macroTable.erase(m);
	  }

Service included

Since the #include directive causes phases 1 to 4 to be recursively applied to the specified file, implementation should just be a matter of creating a new preprocessor on that file, 'get'ing all the Tokens from it and priming the owning preprocessor with that list of Tokens - this implies that Preprocessor has a cache:

	bool Preprocessor::empty() const
	{
	  return cache.empty() && lexer.empty();
	}
	Token Preprocessor::get()
	{
	  if (cache.empty())
	  {
	    // perform normal preprocessing
	  }
	  else
	  {
	    Token t = cache.front();
	    cache.pop_front();
	    return t;
	  }
	}

	// #include filename
	  Preprocessor* included = new Preprocessor(filename);
	  while (!included->empty())
	  {
	    cache.push_back(included->get());
	  }
	  delete included;
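A toy model of the cache-first behaviour (characters stand in for Tokens and a string for the lexer's remaining input; all names are mine) shows the included 'tokens' draining before the rest:

```cpp
#include <deque>
#include <string>

// Toy model of the Preprocessor's cache-first get().
class ToyPreprocessor {
public:
  ToyPreprocessor(const std::string& input) : rest(input) { }
  bool empty() const { return cache.empty() && rest.empty(); }
  char get()
  {
    if (!cache.empty())
    {
      char c = cache.front();
      cache.pop_front();
      return c;
    }
    char c = rest[0];          // "normal preprocessing"
    rest.erase(0, 1);
    return c;
  }
  // Prime the cache with an included file's tokens, as for #include.
  void include(const std::string& file)
  {
    ToyPreprocessor inc(file);
    while (!inc.empty())
      cache.push_back(inc.get());
  }
private:
  std::deque<char> cache;
  std::string      rest;
};
```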

Coming soon

Whilst I have obviously glossed over many of the details of preprocessing, I hope that this gives you a feel for what is involved. In part IV, I shall move on to look at representing the C++ type system and some of the engineering considerations involved.