The Dragon Parser Generator

3   Grammar definition

This section describes the dragon grammar file in detail. Dragon provides a grammar description language that combines a regular expression part for token recognition with a part for the production descriptions. In addition, a header defines the overall behaviour of the parser.

The grammar is written in a plain ASCII text file. To generate parser code, the grammar file is fed to dragon.

3.1   Header

In the header part, special signs and tokens are declared. The header is introduced with the keyword HEADER and closed with the keyword END.

Between the header boundaries, the following declarations can occur:

The SEPIGNORE declaration, followed by a character definition, declares a separator sign that the parser uses for token separation. Normally, whitespace characters (like ' ' and '\t') are used as SEPIGNORE characters, but any character can be used for special purposes.

The SEPSIGN declaration tells the parser to treat the defined character as a token separator and also to return the sign as a valid token. To work correctly, the corresponding sign must be declared later on in the tokenset as a valid token.

In some cases, regular expressions are not powerful enough to define parser tokens. For example, string values with arbitrary contents cannot be defined in a simple way. For this kind of token, the IGNORE clause can be used. It is followed by a token identifier, and the token can then be used in the grammar without being defined in the token set by a regular expression. The IGNORE token is instead set up by the user-written nextChar streaming method.
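The difference between the two separator declarations described above can be illustrated with a small sketch. This is not dragon code and not the generated scanner; the function name split_tokens and its parameters sep_ignore and sep_sign are hypothetical and only mimic the described behaviour: both kinds of sign end the current token, but only SEPSIGN-style signs are returned as tokens themselves.

```python
def split_tokens(text, sep_ignore=" \t", sep_sign=","):
    """Split text into tokens (illustrative only, names are assumptions).

    sep_ignore: characters that separate tokens and are dropped.
    sep_sign:   characters that separate tokens and are also returned.
    """
    tokens = []
    current = ""
    for ch in text:
        if ch in sep_ignore or ch in sep_sign:
            # Any separator closes the token collected so far.
            if current:
                tokens.append(current)
                current = ""
            # Only SEPSIGN-style separators become tokens themselves.
            if ch in sep_sign:
                tokens.append(ch)
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


print(split_tokens("a, b\tc"))
```

In this sketch the comma plays the role of a SEPSIGN character, which is why it would also need a matching token definition in the tokenset of a real grammar.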

3.2   Tokenset

The tokenset part is introduced with the TOKENSET keyword and closed with the END keyword. All tokens that are used in the grammar must be defined in the tokenset (except the IGNORE tokens defined in the header). A token is defined as a regular expression, followed by a “:” sign, followed by the token identifier. The token identifier is later used in the grammar. Regular expressions for the dragon parser generator are defined in a similar way as for other lexical analysis tools. In general, an expression is a concatenation of several characters. Some special characters tell the lexical analyser how to build up the final state machine. The special characters are the following:

Control character        Meaning
*                        The following character (group) can occur N times in the analysed string, where N >= 0.
[ ... ]                  The characters (groups) within the brackets are alternatives. One of them can occur in the analysed string.
'....'                   The surrounding quotes define a character group. Any character except the single quote character can occur between the quotes.
[c1-c2]                  One of the characters from c1 to c2, in their numerical ASCII ordering, can occur in the analysed string.
( ... | ... | ...... )   Alternative character groups can be defined within the round brackets. The alternatives are separated by the '|' character.

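Several regular expression elements can be combined as needed. A typical example is the definition of valid integer numbers. Because * applies to the character group that follows it, the second alternative in the expression below reads as one digit from 1 to 9 followed by any number of digits from 0 to 9: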

( 0 | [1-9]*[0-9] )
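For readers used to the postfix star of most regular expression tools, the expression above can be cross-checked with an equivalent conventional pattern. The sketch below uses Python's re module and assumes, as the table states, that the dragon star applies to the following group; the name INTEGER and the helper is_integer are illustrative and not part of dragon.

```python
import re

# Conventional (postfix star) equivalent of the dragon expression
# ( 0 | [1-9]*[0-9] ), under the assumption that the dragon '*'
# repeats the group that FOLLOWS it.
INTEGER = re.compile(r"0|[1-9][0-9]*")


def is_integer(text):
    """Return True if the whole string is a valid integer token."""
    return INTEGER.fullmatch(text) is not None


for sample in ["0", "7", "105", "05", ""]:
    print(repr(sample), is_integer(sample))
```

Under this reading, a single 0 and numbers without a leading zero are accepted, while a string such as 05 is rejected.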

3.3   Productions

A set of productions defines the language that should be parsed. The production definitions occur in the production set block, which is introduced with the keyword PRODUCTIONSET and closed with the END keyword. A single production consists of a production name, a derivation and an optional semantic action. A grammar production must therefore match the following form:

< production >   :   < derivation >   ;   < semantic action >

A derivation is a sequence of valid production and token ids. An empty derivation is allowed. A semantic action is the name of a method, which must be implemented later on as part of the inherited parser class. Semantic actions are discussed in more detail later on. A sample production set could look as follows:

A   :   a b B ; doA
B   :   b ; doB

A and B are the defined productions, and a and b are valid tokens defined in the token block. doA and doB are methods that are called when the parser reduces the corresponding production while analysing input. A production can have more than one derivation, which means the production id on the left side of the colon can appear more than once. The sample above can therefore be extended to the following valid grammar:

A   :   a b B   ; doA
A   :   a a B   ; doAnotherA
B   :   b   ; doB
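To make the role of the semantic actions more concrete, the following sketch lists which action names would be triggered for a given token sequence of the sample grammar. It is a simple backtracking recognizer written for illustration only; it is not the parsing algorithm dragon generates, and the names PRODUCTIONS, derive and actions_for are assumptions. An action is recorded once all parts of its derivation have been matched, which mirrors the order in which productions are reduced.

```python
# The sample grammar as data: (production, derivation, semantic action).
PRODUCTIONS = [
    ("A", ["a", "b", "B"], "doA"),
    ("A", ["a", "a", "B"], "doAnotherA"),
    ("B", ["b"], "doB"),
]
PRODUCTION_NAMES = {name for name, _, _ in PRODUCTIONS}


def derive(symbol, tokens, pos):
    """Yield (next_pos, actions) for each way symbol matches from pos.

    Works for grammars without left recursion, such as the sample.
    """
    if symbol not in PRODUCTION_NAMES:
        # A token id matches exactly one input token.
        if pos < len(tokens) and tokens[pos] == symbol:
            yield pos + 1, []
        return
    for name, derivation, action in PRODUCTIONS:
        if name != symbol:
            continue
        states = [(pos, [])]
        for part in derivation:
            states = [
                (next_pos, done + more)
                for cur_pos, done in states
                for next_pos, more in derive(part, tokens, cur_pos)
            ]
        for end_pos, done in states:
            # The action fires after the whole derivation is matched.
            yield end_pos, done + [action]


def actions_for(tokens, start="A"):
    """Return the action names in reduction order, or None on a mismatch."""
    for end_pos, actions in derive(start, tokens, 0):
        if end_pos == len(tokens):
            return actions
    return None


print(actions_for(["a", "b", "b"]))
print(actions_for(["a", "a", "b"]))
```

In this sketch the inner production B completes before the enclosing A, so doB is recorded before doA or doAnotherA.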