headerphoto

Introduction

This is not a place to learn about parser generators and formal language parsing in general. An Internet or online book store search will turn up many good books and publications for this. The references below are some of the ones I have found most useful but other newer books are available and should be considered. The discussion here is a simple, non-technical description of how APG works in order to understand its primary features.

APG was originally designed to generate recursive-descent parsers directly from the ABNF grammar specification of Context-Free languages (RFC5234). In response to a number of practical parsing problems that commonly arise, the method that has evolved differs from Context-Free parsing in several significant respects. The first of these is disambiguation. APG disambiguates with the “prioritized-choice” rule – if several alternates are available only the first successful match is accepted. A related constraint is that repetitions always match the longest possible string – alternatives are not considered. While these constraints may seem restrictive, in practice, by taking these disambiguating features carefully into account while writing a grammar, it is rarely difficult to accurately describe the desired language phrases. But the really powerful changes come when it is recognized that the formalism developed from the ABNF rules can easily be extended to include phrase-recognition elements or operations beyond the original, Context-Free set. Syntactic predicates - conditional parsing based on look-ahead phrases - and User-Defined Terminals (UDTs) are examples of this that will be discussed below.

[top]

Parsing Basics

An ABNF grammar defines a language with a set of rules. Taken together, this set of rules additionally defines the complete set of phrases and alphabet characters that make up the language sentences. Each rule definition is built from seven functionally-independent elements. The rules can be represented as a syntax tree, the nodes of which are the rule elements. The ABNF specification spells out explicitly what operations take place at each node. The seven ABNF elements (and the APG designation) are: Rule Name(RNM), Terminal Literal String(TLS), Terminal Binary String(TBS), Terminal Value Range(TRG), Repetition(REP), Concatenation(CAT) and Alternation(ALT).

The method underlying APG is simply to recognize that the ABNF definitions of these seven elements can be described by assigning operations to each of the syntax tree nodes. The APG generator reads the grammar definition and translates each operation into a set of numeric codes - operation codes or opcodes. The generated parser then uses this table of opcodes to traverse the syntax tree of nodes while processing an input language string.

We can illustrate this with a simple example.

SENTENCE = (RED / GREEN) (PAPER / LEAVES)
RED = "red "
GREEN = "green "
PAPER = "paper"
LEAVES = "leaves"

This grammar defines:
alphabet: {" ", "a", "d", "e", "g", "l", "n", "p", "r", "s", "v"}
or in ASCII alphabet character codes (lower and upper case)
{32,65,68,69,71,76,78,80,82,83,86,97,100,101,103,108,110,112,114,115,118}
phrases: {"red ", "green ", "paper", "leaves"}
language (the set of sentences): {"red paper", "red leaves", "green paper", "green leaves"}

The syntax tree is shown in Figure 1. Figure 1.
Figure 1.
A recursive-descent parser makes a "depth-first" traversal of the nodes. That is, beginning at the top, root node, the traveral moves down to the left-most, previously unvisited node. From the terminal nodes the direction is reversed upward to the node's parent. In this way, each non-terminal node is visited at least twice, once from above and once or more from below.

The operation at each node is to give the parser instructions on a) what to do and b) where to go next. For example, when approached from above each non-terminal node simply instructs the parser to proceed on to the left-most, unvisited node below. When approached from below each non-terminal operator has its own special operation to do. Simply put, ALT nodes select from alternate phrases and CAT nodes concatenate phrases. RNM nodes are named phrases which, on their own, do nothing. Their only operation is to give the user or parser the opportunity to do something with the recognized phrase. For example, build a table of names or an XML file of tagged phrases. This important processing is done in APG with rule name callback functions and is discussed in the Rule Name Callbacks section below. User-Defined Terminals (UDTs) are an important generalization of this rule name, callback concept.

But terminal nodes have much simpler operations. They simply try to match a pre-defined phrase and return to the parent. It is worth noting here that APG has been designed from the beginning to be a phrase-recognition parser. The terminal nodes are not single-character recognizers. They are hand-written phrase recognizers which either succeed or fail to recognize the entire phrase that defines them.

In general then, terminal nodes are phrase recognizers and non-terminal nodes are phrase manipulators.

[top]

Rule Name Callbacks

Rule name callback functions allow the user to interact with the parser at each defined phrase. The RNM nodes will call a user-written function twice for each node, first on the downward traversal of the node before any parsing of the phrase has been done and then again on the upward traversal after the Parser's phrase recognition work for the node is done. There are several types of actions that rule callback functions can accomplish.

  • downward traversal (pre-parsing of the branch below the node): Administrative work meaning any type of table or data buffer set up or similar data handling that does not affect the parsing of the string. Side calculations that are done parallel to the parsing process.
  • downward traversal: Override the Parser completely. That is, do your own phrase matching and skip the parsing completely. In this case, the rule callback us functioning as a UDT. However, implementation of a UDT in this manner is not obvious and a little clumsy. UDTs were developed to overcome this shortcoming.
  • upward traversal (post-parsing of the branch below): Again, any administrative work that might be convenient to do parallel to the parsing process.
  • upward traversal: Examine the Parsers's phrase matching result and decide whether to accept or reject it. For example, there may be non-CFG requirements that need to be enforced. This might be done by outright rejection, returning no match instead of match. Or you have another chance here for overriding the phrase recognition completely, and returning some phrase other than what the Parser found.

For details about writing and using rule name callback functions see the User's Guide for the target language of your choice.

[top]

User-Defined Terminals (UDTs)

Let's slightly rewrite the example grammar above.
SENTENCE = (RED / GREEN) (PAPER / u_LEAVES)
RED = "red "
GREEN = "green "
PAPER = "paper"

The ABNF terminal LEAVES has now been replaced with the User-Defined Terminal u_LEAVES. Since the underscore is not a valid character in ABNF, the parser generator has no trouble recognizing it as a UDT(*). And since it is user written it needs no ABNF rule to define it.

The new syntax tree now has a new operator UDT, as shown in Figure 2. Figure 2.
Figure 2.
As the name implies, this is a terminal node. Its only operation is to call a user-written, phrase-recognition function and return the result to the parent node. The User's Guides has details on the prototype for these functions and how to identify them to the Parser.

u_LEAVES has some "safe" uses and some "unsafe" uses. Both are very powerful enhancements to the parser.

A "safe" use would be to write it so that it simply recognizes the phrase "leaves" only is optimized to do it much faster than the ABNF terminal does. This use does not change the language that the grammar defines in any way. It just does it much faster. For example, in the SIP example released with the APG code I've managed to increase the parsing speed by a factor of 10 without changing in any way the language the original ABNF grammar defined.

An "unsafe" use would be, for example, to have it look backwards in the sentence string and if it sees "red" then it only recognizes "faced" and if it sees "green" it only recognizes "cheese". That is, it completely alters the language that the parser recognizes in a context-sensitive way.

A search for discussions of handwritten parsers will turn up many, many criticisms of them. The primary complaint is that there is generally no way to know what, exactly, the language is that is being parsed. But handwritten parsers also have two important advantages. The first being that handwritten parsers are usually much faster. The second is that they are capable of parsing non-Context-Free Grammars. For probably both of these reasons the GNU gcc compiler is, in fact, a handwritten, recursive-descent parser[gcc]. And it is for both of these reasons that UDTs have been introduced to APG.

To recap, there are three types of uses envisioned for the UDTs.

  • "safe" - to simply speed up the parsing process.
  • "unsafe" - to incorporate a few non-CFG language requirements into a grammar that is mostly CFG.
  • "totally dangerous" - to simply provide a recursive-descent backbone to a completely handwritten parser. UDTs are normally terminal nodes. However, there is a backdoor in APG to let them move on down the syntax tree. See the User's Guides for the Parser functions vExecuteRule() and vExecuteUdt(). You should really know what you are doing before using them, but you could build a parser for a completely arbitrary language with no ABNF grammar to guide it at all.

(*) UDT names can begin with either the "u_" or "e_" prefix. The difference between them is important. In order for the parser generator to determine the grammar's attributes it is very important for it to know which rules accept empty strings and which don't. Since UDTs are user written, there is no way for the parser generator to know whether a given UDT will or won't accept empty strings. Therefore, the convention has been adopted that only UDTs whose names begin with "e_" are allowed to accept empty strings. Those beginning with "u_" are not allowed to accept empty strings. This convention is enforced by the parser. That is, if a user-written function, u_myfunc() for example, returns a phrase length of zero, the Parser will fail with a terminal error and exit with a call to the currently installed alert handler.

[top]

Syntactic Predicates

Once you start playing around with writing ABNF grammars for various kinds of phrases, it doesn't take long before you want a phrase that Context-Free grammars can't express. Take a simple example - the C language comment of the form /*comment*/ The comment can contain any characters, including '*' and '/', but the combination '*/' must be recoginzed as the end. If we try the following it will always fail: c-comment = begin comment end
begin = "/*"
comment = *any
end = "*/"
any = %d9 / %d10 / %d13 / %d32-126

While this is a Context-Free phrase, because of our disambiguation rules, the comment rule will consume the longest phrase possible - which in this case is the entire remainder of the string. That is because the comment rule includes the end rule. When I first encountered this problem in version 4.0 I noticed that because of the operator formalism APG uses, there was an easy solution. I simply modified the 'repeat' operator (REP) to look for a certain phrase and stop when it finds it - a "repeat-until" operator. That looked like, comment = *any !end

It was a simple modification to the REP operator it accomplished its purpose.

Later I learned about syntactic predicates, first introduced by Parr and Quong[5] and later extended by Ford[4]. They defined the AND(&) and NOT(!) operators to look ahead for a specific phrase and only continue parsing if the phrase was found (&) or not found (!). Using their formalism, the "repeat-until" operator can be expressed in terms of the "standard" syntactic predicate NOT(!) operator. eg. comment = *(!end any) end APG versions > 4.0 have used the standard syntactic predicate operators.

The "comment" example is actually expressible as a Context-Free grammar. It is APG's disambiguation rule that makes it impossible to parse without syntactic predicates. However, there is a well-known example that we can use to demostrate that syntactic predicates really do extend languages beyond the Context-Free class. The grammar for anbncn, n>0 is well known as an example of a non-Contex-Free phrase. It cannot be described with a Context-Free grammar. However, using the AND(&) syntactic predicate we can accomplish it as such AnBnCn = &Prefix ConsumeAs BnCn
Prefix = AnBn "c"
ConsumeAs = *"a"
AnBn = "a" [AnBn] "b"
BnCn = "b" [BnCn] "c"
What happens here? "&Prefix" looks ahead to see that "anbn" exists and is followed by "c". If it does, the parser backtracks and proceeds, sucking up the string of "a"s, then finding bncn. You can try this out for yourself with the anbncn example on the interactive APG page.

[top]

Attributes

One of the first things you learn about recursive-descent parsers is that you can't parse a left-recursive grammar.
Start = Start "s" / "y" The parser will recurse infinitely (or more practically, until a stack overflow exception is issued) before it ever recognizes a single character of a single phrase. Left recursiveness is considered here to be an attribute of the grammar. It is an important attribute to know about, since it is a fatal flaw in the grammar. But there are other, equally important attributes.

In the following two sections below I'll define the attribute types and the recursion types which you will need to understand some of the information displayed by the parser generator. However, if you want to know the nitty gritty of how it is all done you will have to read the code.

[top]

Attribute Definitions

  • Left Recursiveness (fatal) -
    Start = Start "s" / "y" Fatal as discussed above.
  • Right Recursiveness -
    Start = "s" Start / "y" Not fatal but nice to know.
  • Nested Recursiveness -
    Start = "s" Start "s" / "y" Not fatal but important to know, since it is what distiguishes regular from context-free grammars. If this attribute is false the grammar is regular, if true the grammar is context-free.
  • Cyclic (fatal) -
    Start = Start Valid grammar syntax but defines no phrases.
  • Infinite (fatal)-
    Start = "s" Start This defines only infinite string phrases. Therefore, the parser will fail for every finite input string or sentence.
  • Empty -
    Start = "s" Start / "" The language described by this grammar includes empty strings. Not fatal, but knowing whether a rule's phrases can be empty or not is critical for determining the recursive attributes.

Note that it is possible for a rule to have more than one true attribute at the same time. For example, the infinite and empty rules above are also right-recursive. The other important thing to know is that these are aggregate attributes. That is, they only mean that an attribute can happen, not that it always does. For example, the empty example grammar can accept the empty string "", but it can also accept the non-empty string "s".

[top]

Types of Recursion

NOTE: The recursive types discussed here are only displayed with the C/C++ Version 6.3.

In the parser generator's displayed results you will see, in addition to the attributes, the rule names listed under several different types. These are:

  • N_RECURSIVE - non-recursive. The rule never refers to itself.
  • R_RECURSIVE - recursive. The rule refers to itself.
  • MR_RECURSIVE - mutually recursive. Sets of rules. Within each set, each rule refers to every other rule in the set including itself.
  • NMR_RECURSIVE - non-recursive but refers to one or more mutually-recursive rules.
  • RMR_RECURSIVE - recursive but also refers to one or more mutually-recursive rule sets of which it is not a member.

The reasons for these distinctions in recursive rule types is that they play an important role in the determination of the attributes. In particular, the mutually recursive sets create a catch-22 situation in that they present the seemingly impossible task of needing the attributes of one member of the set in order to determine the attributes of the others and vice versa. There is a way around this, which I won't go into here. But since the determination of which rules fall into which type must be done to determine the attributes, I've included an option in the parser generator to display them. If you are not interested in attributes do not use the /dattributes command line option.

[top]

The ABNF for SABNF Grammar

; ABNF for APG 6.3 SABNF
; Symbol alphabet is ASCII (code points 9, 10, 13, 32-126).
; SABNF variations from RFC 5234:
;  restrictions:
;    incremental alternates, rule2 =/ rule1, are not allowed
;    prose values, <prose value>, are not allowed
;    only the first successful alternate, left to right, is accepted
;      - (the "prioritized choice" rule)
;    only the longest possible match to repetitions is accepted
;
;  relaxations: 
;    forgiving line endings - CRLF / LF / CR
;
;  super set additions:
;    syntactic predicate operators, & and !, have been added
;    User-Defined Terminals of the form, u_rule and e_rule, have been added
;
File          = *(BlankLine/Rule/RuleError)
BlankLine     = LineEnd / (%d59/sp) *(sp / %d33-126) LineEnd
Rule          = RuleName owsp (IncAlt/DefinedAs) owsp Alternation owsp LineEnd
RuleError     = *(sp / %d33-126 / LineContinue) LineEnd
Alternation   = MultipleAlt / SingleAlt
SingleAlt     = Concatenation
MultipleAlt   = Concatenation 1*(AltOp Concatenation)
Concatenation = owsp (MultipleCat / SingleCat)
SingleCat     = Repetition
MultipleCat   = Repetition 1*(CatOp Repetition)
Repetition    = Rep / And / Not / Element
Rep           = RepOp Element
And           = AndOp Element
Not           = NotOp Element
Element       = TrgOp   /
                TbsOp   /
                TlsOp   /
                UdtOp   /
                RnmOp   /
                Group   /
                Option  /
                ProsVal
Group         = Lparen  Alternation Rparen
Option        = Loption Alternation Roption
ProsVal       = %d60 *(%d32-61/%d63-126) %d62
DefinedAs     = %d61
IncAlt        = %d61.47

RuleName      = alphanum
RnmOp         = alphanum
UdtOp         = udt-empty / udt-non-empty
udt-non-empty = %d117.95 alphanum
udt-empty     = %d101.95 alphanum
RepOp         = (rep-min StarOp rep-max) /
                (rep-min StarOp)         /
                (StarOp rep-max)         /
                 StarOp                  /
                 rep-min-max
AltOp         = owsp %d47
CatOp         = wsp
StarOp        = %d42
AndOp         = %d38
NotOp         = %d33
TrgOp         = %d37 ((Dec dmin %d45 dmax) /
                      (Hex xmin %d45 xmax) /
                      (Bin bmin %d45 bmax))
TbsOp         = %d37 ((Dec dString *(%d46 dString)) /
                      (Hex xString *(%d46 xString)) /
                      (Bin bString *(%d46 bString)))
TlsOp         = Ltls *TlsChar Rtls
TlsChar       = %d32-33/%d35-126 ; tab not allowed

Lparen        = %d40 owsp
Rparen        = owsp %d41
Loption       = %d91 owsp
Roption       = owsp %d93
rep-min       = rep-num
rep-min-max   = rep-num
rep-max       = rep-num
rep-num       = 1*(%d48-57)
Ltls          = %d34
Rtls          = %d34
dString       = dnum
xString       = xnum
bString       = bnum
Dec           = (%d68/%d100)
Hex           = (%d88/%d120)
Bin           = (%d66/%d98)
dmin          = dnum
dmax          = dnum
bmin          = bnum
bmax          = bnum
xmin          = xnum
xmax          = xnum
dnum          = 1*(%d48-57)
bnum          = 1*%d48-49
xnum          = 1*(%d48-57 / %d65-70 / %d97-102)

; Basics
alphanum      = (%d97-122/%d65-90) *(%d97-122/%d65-90/%d48-57/%d45)
owsp          = *space
wsp           = 1*space
sp            = %d9 / %d32
space         = sp      /
                comment /
                LineContinue
comment       = %d59 *(sp / %d33-126)
LineEnd       = %d13.10 / %d10 / %d13
LineContinue  = (%d13.10 / %d10 / %d13) %d32

[top]

References

[1] Alfred V. Aho, Ravi Sethi, Monica S. Lam and Jeffrey D. Ullman, 
    "Compilers: Principles, Techniques, and Tools",
    2nd Edition, Addison-Wesley, 2007
    
[2] John E. Hopcroft and Jeffrey D. Ullman, 
    "Introduction to Automata Theory, Languages and Computation", 
    Addison-Wesley, 1979
    
[3] R Gregory Taylor, 
    "Models of Computation and Formal Languages", 
    Oxford University Press, 1998
    
[4] Bryan Ford, 
    "Parsing Expression Grammars: 
    A Recognition-Based Syntactic Foundation", 
    Proceedings 31st ACM Symposium on 
    Principles of Programming Languages, 
    pp. 111-122, 2004
    
[5] Parr, Terence J. and Quong, Russell W.,
    "Adding Semantic and Syntactic Predicates To LL(k): pred-LL(k)",
    Proceedings of the 5th International Conference on Compiler Construction,
    pp. 263-274, 1994