So you want to be a cOOmpiler writer? - part VI

by Sean A. Corfield

Introduction

In previous columns I've looked at representation of information within a preprocessor and a parser but I haven't said much about parsing itself. I maintain that parsers aren't really that difficult so this issue might be light relief for some of you who felt last issue's multiple inheritance scenario was a bit of a nightmare!

In the draft C++ standard, the clause describing statements is just about the shortest clause in the document. Statements are simple. Discuss.

Parsing considerations

Apart from expression statements and declaration statements, C++ begins each statement with a unique keyword: if, while, do, switch, for, goto, return, break, continue. This makes parsing quite simple because we can determine what is expected to come next based on a single token lookahead:

Token token = lexer.get();
switch (token)
{
case IF:     return ifStmt(lexer);
case SWITCH: return switchStmt(lexer);
case OPEN_BRACE:
             return compoundStmt(lexer);
//...
default:
  // must be a declaration or expression
  return declOrExpr(token, lexer);
}

Before you all howl in protest at the multiple returns, let me just say that I wouldn't write it like that - I'm just saving space on the page!

What am I assuming here? This is the body of a general stmt function that takes a lexer as an argument (well, actually a phase six token stream) and returns a statement, or rather a Statement*. The functions called above are similar and each reads from the lexer, constructs a statement and returns a pointer to it. Let's take a look at one of those functions in more detail.

ifStmt is called after we have seen the keyword if so the next token expected is (. After that we expect an expression, a ), and a statement. This is optionally followed by else and another statement. It will look something like this:

Statement* ifStmt(TokenStream& lexer)
{
  Statement* statement = 0;
  Token token = lexer.get();
  if (token == OPEN_PAREN)
  {
    Expression* condition =
                         expression(lexer);
    if (condition) // parse succeeded
    {
      if (lexer.get() == CLOSE_PAREN)
      {
        Statement* trueStmt = stmt(lexer);
        if (trueStmt)
        {
          Statement* falseStmt = 0;
          if (lexer.lookahead() == ELSE)
          {
            lexer.get(); // skip else
            falseStmt = stmt(lexer);
          }
          statement = new IfStmt(condition,
                                trueStmt,
                                falseStmt);
        }
      }
    }
  }
  return statement;
}

For simplicity I've omitted error reporting. As an exercise, you might like to consider different ways to implement robust error reporting and recovery.

As you can see, this calls the general statement parsing function recursively to handle substatements. This top-down parsing approach is called recursive descent parsing and is a common and simple technique that allows parsers to be written for complex languages.

Other parsing techniques

Another approach for parsers is to use yacc (Yet Another Compiler Compiler) that takes a grammar description and produces a parser that is a complicated state machine. Instead of the fairly readable code above, a state machine parser uses a switch statement to determine what to do with any given token based on the current "state". A state machine to parse C code has several hundred states. For C++, the grammar is too complicated and ambiguous for a mechanical translator like yacc and some fairly "clever" tricks need to be performed by the lookahead to seed the token stream with "hint" tokens. Cfront is based on such an approach.

What to do with statements

Having built our statement, what would we like to do with it? If we're going to do source code analysis (my original project, you may recall), we would want a method to check the semantics of statements and their subcomponents. For an if statement, that might look like this:

void IfStmt::semantics()
{
  condition->semantics();
  trueStmt->semantics();
  if (falseStmt)
    falseStmt->semantics();
  if (condition->type() == ASSIGNMENT)
    warning("Possible = instead of ==?");
}

Perhaps, instead, we wish to generate assembly language from it - compile it. A compile method might look like this:

void IfStmt::compile()
{
  condition->compile();
  Label* label = getLabel();
  branchIfZero(label);
  trueStmt->compile();
  if (falseStmt)
  {
    Label* end = getLabel();
    branch(end);
    label->write();
    falseStmt->compile();
    end->write();
  }
  else
  {
    label->write();
  }
}

The same recursive approach that we used in the parser appears naturally in the methods that operate on the statements produced by the parser.

Separation for reuse

You might be asking why doesn't the parser perform the desired operations as it builds each statement? I hope you can answer that question yourself with a little thought - separating the operations from the parser means that the parser can be reused for any other tools we may want to build that accept the same source language. However, as written, the parser is still quite closely linked to the actual operations because it has to know about the types of the statements. In order to produce a truly generic parser we need to somehow abstract out the specific types and their operations. Since the parser creates the objects directly, the solution is slightly harder to find than it might first appear.

A common approach for providing different behaviours for a common type is to use inheritance with the varying behaviour implemented in different derived classes. However, this means that the specific derived class objects must be created directly by the parser. That's precisely what we are trying to avoid.

Offhand, I can think of two solutions, I'm sure there are others. The first solution came to mind because of my fondness for templates. The second solution came about as I tried to make the first solution more elegant.

A parser template

For each different application of our parser, we end up with a separate method in each derived statement class that performs a specific operation (code generation, source code analysis etc). If the parsing functions were templates, the operation could be a template parameter somehow, which could be used to create the appropriate derived class instance, e.g.,

template<class Operation>
Stmt<Operation>* ifStmt(PhaseSix& lexer)
{
  Stmt<Operation>* statement = 0;
  Token token = lexer.get();
  if (token == OPEN_PAREN)
  {
    Expr<Operation>* condition =
              expression<Operation>(lexer);
    if (condition) // parse succeeded
    {
      if (lexer.get() == CLOSE_PAREN)
      {
        Stmt<Operation>* trueStmt =
                    stmt<Operation>(lexer);
        if (trueStmt)
        {
          Stmt<Operation>* falseStmt = 0;
          if (lexer.lookahead() == ELSE)
          {
            lexer.get(); // skip else
            falseStmt =
                    stmt<Operation>(lexer);
          }
          statement = new
              IfStmt<Operation>(condition,
                                trueStmt,
                                falseStmt);
        }
      }
    }
  }
  return statement;
}

Note that the Operation template parameter is not used in the signature of the parsing functions - we rely on explicit qualification of the template function calls. Unfortunately, this is a relatively new feature and not widely supported. Below I'll show how to work around this.

What is the Operation class and how does it solve our problem? Since we are trying to add a method to our framework of statement classes, Operation can be used to provide a base class for Stmt as follows:

template<class Operation>
class Stmt : public Operation
{
//...
};
template<class Operation>
class IfStmt : public Stmt<Operation>
{
//...
};
class Compiler
{
public:
  virtual void compile() = 0;
};
class Analyser
{
public:
  virtual void semantics() = 0;
};
// use these to read and compile a
// statement:
stmt<Compiler>(lexer)->compile();
// and to read and analyse a statement:
stmt<Analyser>(lexer)->semantics();

This technique of deriving from a template parameter allows us to introduce methods and other information above an existing class and can be very powerful, e.g., when a class is written that depends on, as yet unknown, abstractions or functionality.

To provide the actual method in the derived classes, we can specialise it:

template<> // recent syntax for
           // specialisation
void IfStmt<Compiler>::compile()
{
// as above
}
template<>
void IfStmt<Analyser>::semantics()
{
// as above
}

Unfortunately, we need to declare the method in each of the derived classes that we intend to specialise it for which means we also have to specialise those derived classes:

template<>
class IfStmt<Compiler>
: public Stmt<Compiler>
{
public:
  // other methods
  void compile();
};

This isn't really very elegant but at least the parser is now generic.

A parser factory

This mixture of compile-time polymorphism (templates) and run-time polymorphism doesn't work too well in this instance. Our problem lies in having to select the operation (compile, analyse) at compile-time. If we have a hierarchy of statements that "compile" and another hierarchy of statements that "analyse", can we choose at run-time which one we want? Yes, we can use a statement factory to create the appropriate objects for us.

class StatementFactory
{
public:
  virtual ~StatementFactory() { }
  virtual Statement*
          newIfStmt( Expression*,
                     Statement*,
                     Statement* ) = 0;
  //...
};
class CompilerStatementFactory
 : public StatementFactory
{
public:
  CompilerStatementFactory() { }
private:
  Statement* newIfStmt( Expression* e,
                        Statement* t,
                        Statement* f )
  { return CompilerIfStmt( e, t, f ); }
  //...
};

The parser would then be constructed with the appropriate factory object and would use

statement = myFactory->newIfStmt( a,b,c );

instead of

statement = new IfStmt<Operation>( a,b,c );

(or however we created the specific types of statement objects). Note that the abstract base class has a virtual destructor but no constructors - the default will be sufficient (we could declare a protected default constructor if we really wanted to be more precise). Furthermore, the derived factory classes have only a public constructor and then all the methods are private - why? Because they are only ever called through the public methods in the base class - they are "merely" an implementation detail.

More genericity, please waiter!

Some people are never satisfied! We now have a parser that can be told, at run-time, to build either a compiler representation or an analyser representation. Can we actually make it more generic? How about if we could build a generic representation and then tell that what to do?

If we consider what operations we need to perform, we see that we can use a trick very much like that for the parser factory to defer the operation until run-time. This time we use a generic operation method within each statement class that delegates to a factory-like object method for each derived statement class's operation method:

void IfStmt::operation(
  const Operation& anOperation)
{
  anOperation->ifStmt(condition,
                     trueStmt, falseStmt );
}
class Operation
{
public:
  virtal void ifStmt( Expression*,
                      Statement*,
                      Statement* ) = 0;
  //...
};
class CompilerOperation : public Operation
{
private:
  void ifStmt( Expression*,
               Statement*, Statement* );
};
void CompilerOperation::ifStmt(
  Expression* condition,
  Statement*  trueStmt,
  Statement*  falseStmt
)
{
  condition->operation( *this );
  Label* label = getLabel();
  branchIfZero(label);
  trueStmt->operation( *this );
  if (falseStmt)
  {
    Label* end = getLabel();
    branch(end);
    label->write();
    falseStmt->operation( *this );
    end->write();
  }
  else
  {
    label->write();
  }
}

Static vs. dynamic

I hope this instalment has shown how sometimes a dynamic (run-time) solution can be more elegant than a static (compile-time) one. I hope you've also seen a pattern in the solutions above, where a (deep) class hierarchy can be mapped onto the methods in a (shallow) factory hierarchy and the resulting combination provides an elegant double-despatch, polymorphic in both the type of the "factory" and, in our case, type of the statement.

Next time

In the dark prehistory of this irregularly scheduled column, I threatened articles on templates, overloading and a myriad other things. For the next couple of issues I'm going to take a break and then go back and critically review some of the material in the first six articles - now is a good time to send me comments and / or questions on what you've read so far. It's also a good time to make requests about what you'd like to see covered in future articles in this column.