Dragon - an object oriented parser generator

Users guide

by Bjoern Lemke, Draft 01.09.2007

Dragon comes under the GNU Copyright
(C)opyright 2006, 2007 by Bjoern Lemke


Also there are ( beside of lex and yacc) a lot of parser generation tools are available, this is another implementation of a powerful parser generator. The dragon approach results in a consequent object oriented integrated scanner and parser solution and scales also for very large grammar defintions.

Since dragon parses LR(1) grammars, the generation algorithm is quite cpu intensive building the sets of LR(1) elements. Using LR(1) instead of LALR decreases efficiency of the parser but avoids some reduce/reduce conflicts in some cases. For huge grammar defintions, state-of-the-art hardware ( especially CPU ) is strongly recommended.

Dragon provides a very clean and structured way for the integration application specific context code. Further, dragon combines the lexical and syntactical analysis and the corresponding code generation. The grammar is completely defined in a single description file.

As a result, dragon generates C++ class code that can be in included as a parser class. From version 1.2, dragon also supports generation of Java code. The code integration and usages is quite similar.

From dragon version 1.3, LALR parser generation mode is also supported and is used as default.

If you have built and tested the dragon software and find it useful ( or not ), I appreciate some feedback from you. Thanks in advance !

Table of Contents (TOC)

1. Who should use it ?
2. Getting Started
2.1. ... Java code generation
2.2. ... C++ code generation
3. Grammar definition
3.1. ... Header
3.2. ... Tokenset
3.3. ... Productions
4. Generating the parser
4.1. ... Parser conflicts
4.1.1. ... Shift/reduce conflict
4.1.2. ... Reduce/reduce conflict
5. Semantic actions
5.1. ... Sample action method
6. Advanced scanner feed
7. Parser analysis mode
8. Conclusion

1. Who should use it ?

Back to TOC

Dragon is an appropriate tool for those C++ and Java programmers, who need a parser for any specific grammar ( LR(1) ). Since the integration can be done very smooth, a complete parser application can be written very fast.
To use dragon, you should be familiar with the concept of context free ( programming ) languages in terms of lexical and syntactical analysis. A basic understanding of the principles of a scanner and parser modul in general is required to understand and use this parser generator.

2. Getting started

Back to TOC

To get into the flow of dragon, we start with a small sample grammar description which can be feeded directly into dragon. The grammar given below can be created using any simple text editor.
@ Sample Dragon description 
@ 
HEADER 
SEPIGNORE '\t' 
SEPIGNORE ' '
END

TOKENSET 
a  :  a
b  :  b
c  :  c    
END 

PRODUCTIONSET
Start	:	S a  ; A1
S	:	C C ; A2
C	:	b C ; A5
C	:	c ; A3
END
As we look to the grammar description more closely, the file consists of three sections.

At the beginning, there is a header section beginning with a HEADER-Token and ending with the first END-Token. Any lines above the HEADER-Token line are ignored and can be used for comments.

Inside the header section, some general rules are declared for the parsing procedure. In our example, two whitespace separator tokens are declared (Space and Tab). This declaration tells dragon scanner to treat these tokens just as token separators and don't return them to the dragon parser. In general, you can use any kind of characters for token separators, but in most cases, whitespace characters like SPACE and TAB seem to be appropriate.

The next section beginning with the keyword TOKENSET describes the set of lexical elements the dragon scanner should detect. Each line inside this section contains a regular expression describing the token detection rule and a corresponding token, which is returned by the scanner on detection. In the sample above, we declare three tokens a, b and c which are returned on the corresponding regular expressions as defined above. Note, that the notation of the regular expressions is very similar to BNF ( Backus Naur Form) and is described more detailed later on.

At last, the language grammar is described in the section opened with the keyword PRODUCTIONSET. Each line consists of the production name, a first separator ( ':') , the production defintion, a second separator ( ';') and the semantic action. The second separator and the semantic action are optional and can be omitted. As we look to the production definition, we see a sequence of known token symbols and other known production names building the production.

2.1. Java code generation

Back to TOC

Based on this grammar defintion, we will generate now the parser code using the dragon generator tool. For this, cut the code written above and paste it in a file called G1.def. From a Unix shell, invoke now dragon with
game > dragon -p G1 
Reading Header ...
Reading Terminals ...
Reading Productions ...
Checking Productions ...
Analysing ...
Starting LALR analysis ...
Reached LR(0) elements: 
1...3...4...5...
Lookahead calculation ...
X...X...
Closure creation ...
.....
Creating syntax table ...
Reached LR(1) elements: 
0...5
Generating Code ...
This produces the file G1.java with a class description for the parser class G1. For a complete parser application, this class must be derived and some methods have to implemented.
class G1Main extends G1
{
	String input = new String (c c a);
	int i=0;

	G1Main()
	{

	}
	public char nextChar() 
	{  
		if ( i < input.length() )
		{
			char c = input.charAt(i);
			i++;
			return c;
		}
		else
			return 0;
	};

	public void backChar() {}
	public static void main(String[] args)  
	{
		G1Main g1 = new G1Main();
		try {	
			g1.parse();

		} catch ( Exception e ) {
			e.printStackTrace();
		}
	}
}

2.2. C++ code generation

Back to TOC

To get C++ parser code, you have to ommit the -j option during for the dragon generation call.
$ dragon -p G1
Reading Header ...
Reading Terminals ...
Reading Productions ...
Checking Productions ...
Analysing ...
Reached LR(1) elements: 
1...5...10...11...
Generating Code ..
This produces the following files

To complete the code for the compiler, you still have to add a main routine, scanner feed routines and some semantic actions. This can be done with the following program
//
// g1main.cc
// ---------
// Testmodul for dragon
//
// by Bjoern Lemke, (C)opyright 2003 by Bjoern Lemke

#include Chain.h
#include G1.h
#include Exception.h

class G1Main  : public G1 {
public:
    G1Main(char *pC) : G1()  {  _pC = pC; _i = 0;  }
    char nextChar() {
        if (_pC[_i])
        {
            _i++;
            return (_pC[_i-1]);
        }
        return 0;
    }
    void backChar() {  _i--;  return;  }
    void A1() { cout << A1 is called << endl; }
    void A2() { cout << A2 is called << endl; }
private:
    int _i;
    char *_pC;  
};

main(int argc, char** argv)
{
    G1Main g(argv[1]);

    try {
      if (g.parse())
        cout << Parse OK << endl;
      else
        cout << Parse Failed << endl;
    }
    catch (Exception e)
    {
        cout << e.asString() << endl;
    }
}
Now you can compile and assemble the modules to an executable
$ g++ -I../base/ -c g1main.cc
$ g++ -I ../base /-c G1.cc
$ g++ -o g1 g1main.o G1.o ../base/libBase.a
Note: The libBase library is available as opensource on www.lemke-it.com

3. Grammar definition

Back to TOC

This section gives a detailed description of the dragon grammar file. Dragon provides a grammar description language, that combines a part for regular expressions for token recognition and one for the production description. Furthermore a header defines the overall behaviour of the parser. The grammar is written down in a regular ascii text file. For parser code generation, the grammar file is feed to dragon.

3.1. Header

In the header part, special signs and tokens are declared. The header is introduced with the keyword HEADER and is closed with the keyword END .

Between the header boundaries, the following declarations can occur

The SEPIGNORE declaration followed by a character definition is used to declare separator signs, which the parser uses for token separation. Normally, white characters ( like ' ' and '\t' ) are used as SEPSIGN characters. But for special purposes, any character can be used.

The SEPSIGN declaration tells the parser to treat the defined character as a token separator but also return the sign as a valid token. To work correctely, the corresponding sign must be declared later on in the tokenset as a valid token.

In some cases, regular expressions are not powerful enough to define parser tokens. For example, string values with any contents cannot be defines in a simple way. For this kind of tokens the IGNORE clause can be used. Followed by a token identifier, the token can later be used in the grammar but must not be defined in the token set by a regular expression. The IGNORE token can later be set up by the user written nextChar streaming method.

3.2. Tokenset

The tokenset part is introduced with the TOKENSET keyword and closed with the END keyword.All tokens, that should be used in the grammar must be defined in the tokenset ( except the IGNORE tokens defined in the header ). Tokens are defined as regular expressions followed by a “:” sign followed by the token identifier. The token identifier is later used in the grammar. Regular expressions for the dragon parser generator are defined in a similar way like for other lexical analying tools. In general an expression is a concatenation of several characters. There are special characters which tells the lexical analyer how to build up the final state machine. Special character are the folllowing
Control Character Meaning
* The following character ( group ) can occur N times in the analyes string. N can be a value >= 0
[ ... ] The character ( group ) s within the brackets are alternatives. One of them can occur in the analysed string
'....' The surrounding quotes define a character group. Any character except the single quote char can occur between the quotes.
[c1-c2] One of the characters from c1 to c2 in their numerical ascii ordering can occur in the analysed string.
( ... | ... | ...... ) Alternative character groups can be defined within the round bracket. Several alternatives are speparated by the '|' character.
Several regular expression elements can be combined as needed. A typical example is definition of valid integer numbers. This can be done with the following expression
( 0 | [1-9]*[0-9] )

3.3. Productions

A set of productions defines the language, that should be parsed. The production defintions occur in the production set block which is introduced with the keyword PRODUCTIONSET and is closed with the END keyword. A single production consists of a production name, a derivation and an optional semantic action. In this sense, a grammar production must match the following form
< production >   :   < derivation >   ;   < semantic action >
A derivation is a sequence of valid production and token id's. An empty derivation is allowed. A semantic action is the name of a method, which must be implemented later on as part of the inherited parser class. We will discuss the semantic action later on more detailed. A sample production set could be as followed
A   :   a b B ; doA
B   :   b ; doB
A and B are the defined productions and a and b are valid tokens defined in the token block. The doA and doB are methods, which are called, if the production could be reduced by the parser later on while analysing input. A production can have one or more derivations, which means the production id on the left side of the colon can appear more than one time. In this sense, we can extend the sample above to the following valid grammar
A   :   a b B   ; doA
A   :   a a B   ; doAnotherA
B   :   b   ; doB 

4. Generating the parser

Back to TOC

The tokenset and the productionset must be written in a file called < parsername >.def. This file is used as input for the dragon parser generator. You just call the dragon from a command line with
$ dragon -p < parsername > ( without the .def postfix )
If the description is correct, this call results in the generation of two files ( < parsername >.h and < parsername >.cc ), containing appropriate C++-Source code. You can compile this classes using a C++-Compiler ( gcc ) and the base library, which is provided in a separate package.

4.1. Parser conflicts

As mentioned in the introduction, dragon is limited to LR(1) grammars. Sometimes, this leads to conflicts which cannot be resolved by dragon. In such cases, dragon fails with an exception message containing either an shift/reduce or reduce/reduce conflict.

Then the grammar description must be changed to make the parser generation work. In the following, we discuss to typical cases, where the grammar description is not appropriate and mus be changed.

4.1.1 Shift/reduce conflict

If dragon fails with a shift/reduce-conflict, the analyse algorihm has reached a state, where it cannot decide to shift ahead to read more input or to reduce at the current point to an appropriate production, because both decisions are valid. In this case, the grammar is ambiguous. Sample:
TOKENSET
'a'     :	a
'b'     : 	b
'c'     :   c
END

PRODUCTIONSET
Start   :       C a
C       :       c a
C       :       c a a
END  
With the grammar defined above, we will follow the parsers algorithm to parse the word caaa. With the first sign c, the token c is detected. There is no valid production to reduce, so the parser goes ahead to read more input. Now the sign a is read and the token a is detected. With the next read token a the parser results in the conflict. It either could reduce the stream with the production C-> ca, because the second token a is a valid follow up token ( Start -> Ca ) production C. But the parser also could shift ahead to read more input, because there is another valid production for the stream ( C-> caa ). There is no way for the LR-1 parser to resolve this situation. As a consequence, generating the parser code for this grammar would result in the following error message
$ dragon -p G2
Reading Header ...
Reading Terminals ...
Reading Productions ...
Checking Productions ...
Analysing ...
Reached LR(1) elements: 
1...3...5...6...
Dragon : shift/reduce conflict with production C-> c a.[a]
To solve the problem, you should modify the grammar in way, that the parser can derive a valid word of the described language in a unique way.
TOKENSET
'a'     :	a
'b'     : 	b
'c'     :   c
END

PRODUCTIONSET
Start   :       C
C       :       c a a 
C       :       c a a a
END  

4.1.2 Reduce/reduce conflict

If dragon fails with a reduce/reduce-conflict, the algorithm has reached a state where more than one production could be reduced, so dragon can not decide, which one to take. Sample:
TOKENSET
'a'     :       a
'b'     :       b
'c'     :       c
END

PRODUCTIONSET
Start   :       S a  ; A1
S       :       C b ; A2
S       :       D b
C       :       c a ; A3
D       :       c a
END

5. Semantic actions

Back to TOC

While parsing an input file, after each reduction of a grammar production a semantic action can be performed. A semantic action in terms of dragon is implemented as a derived method of the parser class. The action method are defined as virtual in the main parser class without any input args. The corresponding token values of the reduced production can be read with the provided

5.1 Sample action method

We will extend the method A1 of the G1 grammar sample in a way that the tokens are printed to stdout, which have been scanned by the scanner.
void A1() {

   Chain* pToken = getTokenList().First();
   while ( pToken )
   {
	cout << Token is :  << *pToken;
   }    
}
With the getTokenList() method provided by the parser class G1, the current list of tokens are provided to the semantic action. The list contains each token in string format ( base class Chain ) and the first element in the list is the last reduced token.

6. Advanced scanner feed

Back to TOC

The complete dragon parser implemention requires the implementation of the virtual nextChar and backChar method of the parser super class. The nextChar provides the next character from the input stream to the scanner. The backChar puches back the actual character provided to the scanner back to the stream ( for parsing resolution, dragon uses lookahead characters to resolve ambivalent analyse states). Furthermore special token definitions like string tokens must be defined within the nextChar method. The following sample implements a small parser parsing a comma separated list of string values ( string values are embedded into single quote signs ). First, we give the dragon parser definition.
HEADER
IGNORE STRINGVAL
SEPIGNORE '\t'
SEPIGNORE ' '
SEPSIGN ','
END

TOKENSET
',' :	KOMMA
END

PRODUCTIONSET
Start         :   StringList
StringList    :	  String KOMMA StringList
StringList    :	  String 
String        :	  STRINGVAL ; printString
END
This defintion is stored into the the file StringParser.def and is then processed by dragon.
$ dragon -p StringParser
As already discussed, the files StringParser.h StringParser.cc should be produced by the generation step. Now we have to implement the main modul for the string parser.
#define MAXSTRINGLEN 100

class StringParserImp : StringParser() {

public:

	StringParser(char* inputStream) {
		_pC = inputStream;
		_i=0;
	}
	~StringParser();
	
	char nextChar() {
	    if (_pC_i]) {
		if (_pC[_i] == '\'') {
	    	   setReserved(STRINGVAL);
	    	  _i++;
	    	   int j=0;
	         while ( _pC[_i] != '\'') {
		     _stringBuf[j] = _pC[_i];
		     j++;
		    _i++;
		    if (_i == MAXSTRINGLEN) {
 		      throw Exception(StringParserImp, stringbuf exceeded);
		    }
	         }
	         _stringBufLen = j+1;
	         _stringBuf[j] = 0;
	         _i++;
	         return 0;
	       }
	       _i++;
	       return (toupper(_pC[_i-1]));
          }
          return 0;
	}
	void backChar() {
		_i--;	
	}
	void printString()	{
		cout << String is  << _stringBuf << endl;	
	}
private:
	char* _stringBuf[MAXSTRINGLEN];
	int _stringBufLen;
	char* _pC;
	int _i;
}
If the nextChar method detects a quote character ( ' ), it calls the super class method setReserved with the detected token value ( in our case STRINGVAL ). This call informs the superclass to take the given token value as the next token value. Note: The token value given to the setReserved method must correspond to the token IGNORE statement given in the parser defintion file.

The value of the the string is stored in a dedicated character array _stringBuf managed by the derived parser implementation class. If the string value is completely scanned, a stream value of zero is returned from the nextChar method to the scanner. Finally the main routine has to be written down. This may look like the following
#include StringParser.h
#include StringParserImp.h

main(int argc, char** argv)
{
	StringParserImp stringParser(argv[1]);
	
	try {
   		stringParser.parse();
	}
	catch Exception( e )
	{
		ListT msgList = e.getMsgList();
		Chain *pS = msgList.First();
		while (pS)
		{
		    cout << *pS << endl;
		    pS = msgList.Next();
		}
		exit(1);
	}	
}

7. Parser analysis mode

Back to TOC

Dragon supports two algorithms for analysing the given grammar and create a corresponding parser module. The default way is to use the LALR method, which is much faster and produces smaller parse tables. In some cases, it might be useful to a LR(1) analyse method ( to avoid some conflicts ). For this, you sould use dragon in with the LR1 parse analysis mode
$ dragon -p G1 -j -m lr1
Reading Header ...
Reading Terminals ...
Reading Productions ...
Checking Productions ...
Analysing ...
Reached LR(1) elements: 
1...5...10...11...
Generating Code ..

8. Conclusion

Back to TOC

With the dragon parser generator, another powerful code genereation tools is provided to the open source community. It enables the parser programmer to focus on the defintion of the grammar and on the implementation of all semantic methods