APG - An ABNF Parser Generator

Version 5.0 Description
(Released October 15, 2007)

1 Summary

Version 5.0 continues the development of APG with the following new features:

  • optional, partially-predictive parsing tables (PPT) can reduce parsing times by factors of 2-4
  • completely re-written in ANSI C, it is smaller, faster and more portable
  • generates parsers in C or C++
  • recognizes the same set of languages as Parsing Expression Grammars (PEG)
  • horizontal as well as depth-first traversal of the AST for more efficient syntax-directed translation in many cases
  • grammar metrics and parsing statistics have been expanded and improved to include measures of determinism and linearity
  • recursion analysis is more thorough and informative
  • released under the GNU General Public License, version 2 or higher, for better compatibility with other Open Source projects.


2 Introduction

APG 5.0 is a generator of recursive-descent parsers with partially-predictive parsing tables, backtracking, prioritized-choice disambiguation and syntactic and semantic predicates. Its parsers differ from the usual implementation of recursive-descent parsing in several significant ways.

  • Parsers are generated from grammars written in a superset of the standard ABNF grammar formalism. Grammars published in ABNF, as most Internet standards[4] are, for example, can be used as input to APG directly without having to translate them into a special format.
  • Semantic actions are written into call back functions which are automatically called at the appropriate times by the parser. The grammar remains pure grammar, uncluttered by language-dependent code.
  • The operations of alternation, concatenation and repetition are done with formal operators that occupy nodes in the syntax tree. This standardizes their behavior and eliminates the need to hand code them into the recursive rule name (non-terminal) functions.
  • Syntactic predicate operators have been added which enhance the ABNF set of operators (APG grammars are formally a superset of ABNF), enabling the parsing of some non-context-free grammars.
  • Simple, intuitive restrictions on the repetition and alternation operations simplify, speed and disambiguate the parsing operations.

APG 5.0 gets a performance boost from two improvements. Most significantly, partially-predictive parsing tables (PPT) have been added as an optional feature. PPTs come with a small footprint trade-off, but are input-string independent and can reduce the number of parsing operations and execution times by factors of 2 to 4, depending on the grammar. A second, modest but significant improvement is due to a redefinition of the alternation and concatenation operators, reducing the number of parsing operations (opcodes) generated for any given grammar.

Seven complete applications are included. There are a couple of simple but instructive examples. One illustrates how to generate and use parsers in C++ and the other demonstrates the ability to parse some non-context-free grammars. However, the non-trivial sample applications are drawn from Internet protocols. In particular, they focus on the increasingly important media session (e.g. Internet phone call) protocols. They start with relatively simple but instructive examinations of the IPv6 addressing scheme and Universal Resource Identifiers (URI). Fairly complete studies of the SDP, SIP and MEGACO grammars are then given. No claim is made that the grammars and parsers are interoperable with industry-standard applications, but they should be good foundations for building such.

In the description of APG 5.0 I've made an effort to use standard notation and terminology and to place APG into context with other standard methods. However, the discussion is brief, informal and in the first person. A more complete and rigorous write up is planned but not promised.

Any use of APG 5.0 constitutes agreement to the terms and conditions of the GNU General Public License, version 2 or higher. APG 5.0 comes with ABSOLUTELY NO WARRANTY. It is free software, and you are welcome to redistribute it under certain conditions. For warranty and conditions see the GNU General Public License version 2, a copy of which is included with this release and may be found at http://www.gnu.org/licenses/old-licenses/gpl-2.0.html or by writing to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. For commercial use or any other use not permitted in the license agreement, please contact us to discuss custom arrangements.


3 Description

The descriptions here will be brief with no pretense of completeness. I'm assuming that the reader is, at a minimum, familiar with references [1-3] and I defer to those for more complete referencing of the large body of other relevant work. Nonetheless, I will try to use conventional terminology as far as possible and place it in context with prior work. If you have corrections, suggested references or other comments, I would be happy to hear from you.

APG 5.0 parsers use recursive-descent parsing with backtracking, prioritized-choice disambiguation, syntactic and semantic predicates and partially-predictive parsing tables. None of these concepts is new. It is the particular combination of these concepts and their implementation that distinguishes APG 5.0 from other parser generators. Especially notable are:

  • the generalization of terminals from single alphabet symbols to character-range and character-string operators
  • the generalization of non-terminals from production names to rule (production) name, alternation, concatenation, repetition and syntactic predicate operators
  • the compact form of the partially-predictive parsing tables (PPT)
  • the use of call back functions to separate the semantic actions from the grammar definition
  • the Single-Expansion Syntax Tree (SEST) and its visual intuitiveness and algorithmic usefulness for PPT and many other kinds of derivations
  • the equivalence of APG 5.0 parsers to those of Parsing Expression Grammars (PEG) and hence the useful set of completeness and other properties that distinguish PEG.


3.1 Terminology

APG terminals, non-terminals and syntax trees are somewhat different from the standard text book definitions. I'll try to present just enough terminology here to describe them.

Range: A range is a contiguous set of integers defined with brackets as [n,m]. The brackets imply that n <= m and that both n and m are included in the range.

Alphabet: An alphabet (Σ) is a finite set of N symbols (σ), Σ = {σi}, i = 0, 1, …, N-1. The parsers generated by APG manipulate only the indices, i. It is strictly up to the application to map the indices to specific symbols.

Character: A character is any single symbol in the alphabet. Here a character refers only to the symbol index and is simply an integer in the range [0,N-1].

String: A string is an ordered list of symbols taken from an alphabet. Here a string is simply an array of integers. The length of a string is the number of characters in the list or integers in the array.
Empty String: An empty string is a string of zero length, usually represented as epsilon (ε).
Sub-string: A sub-string is any contiguous list of characters taken from within a string.
Prefix: A prefix is any sub-string which includes the first character of the string.
Suffix: A suffix is any sub-string which includes the last character of the string.

Language: A language is any set of strings over an alphabet.

Grammar: Informally, a grammar is any set of rules which generates a language. ABNF[4], BNF, EBNF[5] and PEG[3] are commonly used sets of rules for defining grammars. APG is developed using a superset of the ABNF rules.


3.2 SABNF Rules and Operators

Table 1. gives a summary of the set of seven operations defining the ABNF[4] rules plus an additional operation for syntactic predicates. This is formally a superset of ABNF, and the term SABNF will be used in this document to avoid confusion about what RFC 4234 does and does not define. A complete definition of SABNF, using a pair (lexer and parser) of ABNF grammars, is given in Appendix A. There are three terminal operations and five non-terminal operations. The names of these operators — TRG, RNM, etc. — are chosen to be consistent with the naming convention in the accompanying code. Sections 3.2.1, 3.2.2 and Table 1. give quick definitions of the operators. Section 3.2.3 provides a more detailed discussion of them and important cautionary notes on grammar-writing.


3.2.1 Terminal ABNF Operators

In BNF, the alphabet symbols are the only terminals. ABNF also defines strings of alphabet symbols and APG treats them as additional terminal operators. The terminal operations are primary operators defining the specific characters and strings of characters that can be combined to construct a language string.

Range: TRG - the range operator defines a range [n, m] of characters, any one of which is valid in the language string. It is the BNF equivalent of n|n+1|…|m-1|m.

Binary String: TBS - the binary string operator defines any string of characters taken from the alphabet. It is the BNF equivalent of a concatenation of single-character terminal symbols.

Literal String: TLS - the literal string operator also defines a string of characters taken from the alphabet. It is only used in the special case where the alphabet consists of the ASCII character set and defines a case-insensitive string. Its BNF equivalent is a somewhat lengthy set of alternations and concatenations on upper and lower case letters.
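The three terminal operations above can be sketched as matching functions over an array of alphabet-symbol indices (Section 3.1). This is a minimal, hypothetical sketch, not APG's actual code; the function names and the matched-length/-1 return convention are my own assumptions.

```c
#include <stddef.h>
#include <ctype.h>

/* Each operator returns the number of characters matched, or -1 for no match. */

/* TRG: match one character in the inclusive range [min, max] */
long trg(const int *in, size_t len, size_t pos, int min, int max) {
    if (pos < len && in[pos] >= min && in[pos] <= max) return 1;
    return -1;
}

/* TBS: match a binary string of symbol indices exactly */
long tbs(const int *in, size_t len, size_t pos, const int *str, size_t n) {
    size_t i;
    if (pos + n > len) return -1;
    for (i = 0; i < n; i++)
        if (in[pos + i] != str[i]) return -1;
    return (long)n;
}

/* TLS: match an ASCII literal string, ignoring letter case */
long tls(const int *in, size_t len, size_t pos, const char *str, size_t n) {
    size_t i;
    if (pos + n > len) return -1;
    for (i = 0; i < n; i++)
        if (tolower(in[pos + i]) != tolower((unsigned char)str[i])) return -1;
    return (long)n;
}
```

Note that TRG with min = max covers the single-character case, which is why (as discussed in Section 3.2.3) TBS strings are always two or more characters long.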


3.2.2 Non-terminal SABNF Operators

In BNF, the rule or production name is considered to be the only non-terminal. APG also treats alternation, concatenation, repetition and syntactic predicates as non-terminal operations and provides an APG library function for each.

Rule Name: RNM - the rule name operator is a primary operator, and as in BNF, is a set of like-named productions. Its appearance in a rule indicates that the definition of the named rule is to be used in its place. APG uses it as a branching operator, branching to the list of operations defining the named rule and returning to the next operation when it has completed.

Alternation: ALT - the alternation operator is a binary operator indicating that either the operator on its left or right may be used. If more than one ALT operator appears in an expression, all alternates are collected into an ordered list of children which the ALT operator tests from left to right until a match is found.

Concatenation: CAT - the concatenation operator is a binary operator indicating that the sub-strings matched by the operations on the left and right are to be concatenated. If a rule alternate (production) has more than one CAT operator, APG collects all operands into an ordered list of children, concatenating the resulting sub-strings matched by each. If any child fails the entire CAT operator fails.

Repetition: REP - the repetition operator, n*m, is a unary operator indicating that the operation on its right is to be repeated up to m times or until failure. The repetition succeeds if the number of matched sub-strings, x, is such that n <= x <= m. Otherwise it fails.

Syntactic Predicate: PRD - ! and & are the two forms of the unary syntactic predicate operator. Both operators attempt to match the operator to its right and both then immediately backtrack to the starting character. The difference is that & returns the empty string on success and "no match" on failure while ! returns the empty string on failure and "no match" on success.
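The two predicate forms can be sketched as wrappers around any matching expression. This is a hypothetical illustration, assuming an expression is represented as a function returning the matched length or -1; the names are illustrative, not APG's.

```c
#include <stddef.h>

/* An "expression" is abstracted as a function returning the matched
   length at pos, or -1 for no match. */
typedef long (*expr_fn)(const int *in, size_t len, size_t pos);

/* &E: empty string (length 0) if E matches here, no match otherwise.
   Nothing is consumed in either case — the predicate always backtracks. */
long prd_and(expr_fn e, const int *in, size_t len, size_t pos) {
    return e(in, len, pos) >= 0 ? 0 : -1;
}

/* !E: empty string if E fails here, no match otherwise. */
long prd_not(expr_fn e, const int *in, size_t len, size_t pos) {
    return e(in, len, pos) >= 0 ? -1 : 0;
}

/* example expression: matches the single character 'x' */
long match_x(const int *in, size_t len, size_t pos) {
    return (pos < len && in[pos] == 'x') ? 1 : -1;
}
```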


3.2.3 Operator Notes

This section provides some additional details and some important caveats about the APG operators.

Representing Single Characters

The ABNF single character %dn (or %xn, %bn) is treated as if it were written %dn-n and interpreted as a TRG operator. TBS is not used for the single character simply because TRG results in slightly more efficient code. Therefore, TBS strings are always two or more characters in length.

Representing the empty string

ABNF does not provide for a direct representation of the empty string. In principle, the repetition operation 0*0E or equivalently 0E would result in an empty string representation. However, the expression E is never invoked and the syntax generates unused operations. APG instead uses the empty literal string, "", to directly represent the empty string. The repetition 0*0 is treated as a grammar syntax error.

Alternations

The ALT operator is the source of ambiguities in BNF, as both the left and right expressions may be successful. APG eliminates the ambiguity by ignoring the right expression if the left is successful. If a rule has more than one ALT operator, APG collects all operands into an ordered list of children, selecting the first successful child and ignoring the remainder. I previously termed this "first-success disambiguation" but will use the term "prioritized choice" in conformance with ref. [3]. It is an open question[3] whether or not the prioritized-choice interpretation of the ALT operation restricts the set of languages that can be represented to less than the context-free set. An examination of many BNF grammars shows that a prioritized-choice interpretation will leave some valid sub-strings inaccessible. However, my own experience is that, with a careful examination of the language specification, a prioritized-choice grammar can be written to match it without need of the inaccessible sub-strings. This is, of course, no guarantee that this will always be the case.

Repetitions

The REP operator is the most complex of the set with many interesting properties — some that are advantageous and some that require careful consideration to avoid unwanted and unexpected behavior. The repetition n*mE is short hand notation for the alternation

E^m / E^(m-1) / … / E^(n+1) / E^n, where E^k denotes the concatenation of k copies of E.

If n=0, then 0*mE = 1*mE / ε and the repetition can match the empty string. (See the discussion of empty strings above.) APG never backtracks over a successful match, which is effectively the same as the "greedy" interpretation of the PEG repetition operators ?, * and +. Note also that if E returns the empty string, then n*mE returns the empty string for all n,m such that 0 <= n <= m. This is because ε = ε^0 = ε^n = ε^m.
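The greedy behavior can be sketched as a simple loop. This is an illustrative sketch, not APG's implementation; the expression-as-function convention and all names are assumptions, and the empty-match case follows the rule stated above (ε repeated is still ε).

```c
#include <stddef.h>

typedef long (*expr_fn)(const int *in, size_t len, size_t pos);

/* Greedy n*m repetition: repeat E until it fails or m matches are made,
   never backtracking over a successful match.  Returns the total number
   of characters consumed, or -1 if fewer than n matches were made. */
long rep(expr_fn e, long n, long m, const int *in, size_t len, size_t pos) {
    long count = 0, consumed = 0;
    while (count < m) {
        long r = e(in, len, pos + consumed);
        if (r < 0) break;       /* E failed: stop repeating */
        if (r == 0) return 0;   /* E matched empty: n*mE is the empty string */
        consumed += r;
        count++;
    }
    return count >= n ? consumed : -1;
}

/* example expression: matches the single character 'g' */
long match_g(const int *in, size_t len, size_t pos) {
    return (pos < len && in[pos] == 'g') ? 1 : -1;
}
```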

Left recursion: It is well known that recursive-descent parsers fail on grammars that are left recursive. Fortunately, left recursion can always be eliminated, but the repetition operator both simplifies and optimizes this procedure. Consider the simple left-recursive grammar

G = G "g" / "x"

The normal procedure for removing left recursion is to rewrite the grammar as

G = "x" A
A = "g" A / ε

This adds an additional, right-recursive rule to the grammar that serves no useful semantic purpose. On the other hand, using a repetition operator simplifies the grammar to

G = "x" *"g"

The use of the repetition operator a) introduces no extraneous, semantically useless rules, b) simplifies the grammar, c) removes recursion altogether and d) works equally well for removing right recursion. Table 2. illustrates removing recursion with the repetition operator for a representative set of elementary rules.

Overlapping languages: Consider the simple grammar

G = *"a" "a"

The obvious intent here is to represent a string of one or more "a"s. Without the prioritized-choice rule this would work, even though the parsing would be quite inefficient. However, with the prioritized-choice interpretation, this grammar fails to match any string at all. The first term consumes the entire string and the concatenation always fails. In simple cases like this the fix is easy.

G = "a" *"a" = 1*"a"

does the job. The problem is with the case

G = *A B

If A and B both have complex definitions, the languages they generate may overlap in subtle and unobvious ways. Now suppose we have a string "ab" and we wish to interpret "a" semantically as an A phrase and "b" semantically as a B phrase. If "b" falls into that netherland where the languages of A and B overlap, a prioritized-choice parser will fail. (An exhaustive parser will not fail, but is less efficient and presents the alternate problem of interpreting multiple, ambiguous successes.) Many times I have puzzled over parsing failures on strings that looked perfectly correct, only to discover a subtle overlapping of the A and B languages. Syntactic predicates can help, but when using repetitions, the rules A and B need to be carefully examined in advance.
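A hand-coded sketch makes the failure concrete. Assuming the greedy, never-backtracking repetition semantics described above, the first function below mirrors G = *"a" "a" and rejects every string, while the second mirrors the repaired G = 1*"a". Both are illustrative demonstrations, not APG output.

```c
#include <stddef.h>
#include <string.h>

/* G = *"a" "a" under greedy, prioritized-choice semantics:
   the repetition consumes every "a" and never backtracks, so the
   concatenated "a" always fails.  Returns 1 if the whole string
   matches G, 0 otherwise. */
int match_g_bad(const char *s) {
    size_t pos = 0, len = strlen(s);
    while (pos < len && s[pos] == 'a') pos++;   /* greedy *"a" */
    if (pos < len && s[pos] == 'a') pos++;      /* CAT with "a": never reached */
    else return 0;
    return pos == len;
}

/* The repaired grammar G = 1*"a" works as intended. */
int match_g_good(const char *s) {
    size_t pos = 0, len = strlen(s);
    while (pos < len && s[pos] == 'a') pos++;   /* greedy 1*"a" */
    return pos > 0 && pos == len;
}
```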

Syntactic Predicates

In the previous APG version 4.0 a relatively simple modification of the repetition operator, called the "repeat-until" operator, was made out of necessity to solve a specific problem. The 4.0 notation used was

G = *A !B

The interpretation was, "repeat matching A until B is found." This turns out to be a limited, specialized form of a syntactic predicate[6]. In APG version 5.0 I have standardized the syntactic predicate operation to conform with the PEG definition and added a fifth non-terminal operator, PRD, to handle it. The "repeat-until" operation in the 5.0 notation is

G = *(!B A) B

NOTE: When parsing speed is important, syntactic predicates should be used sparingly. Since they always backtrack over the expression they parse, they can be inefficient. In my experience, with a little grammar-writing discipline, they are rarely needed.


3.3 Relationship to Parsing Expression Grammars

Even though APG begins with context-free grammars and PEG pointedly does not, it is interesting to note that with version 5.0 they both appear to recognize the same set of languages. The next section gives the argument for that equivalence. The following section describes APG's facility for statistics gathering to assess the degree of parsing linearity for a given grammar and input string.

3.3.1 Language Equivalence of APG and PEG

The original goal of APG was nothing more than to generate parsers from ABNF grammars. But, by trial and error, the experience of the previous four releases has resulted in some expedient restrictions on and additions to the full set of context-free grammars that ABNF describes. Those restrictions and additions turn out to be nearly identical with the starting assumptions of PEG. They are:

  • alternations are disambiguated with a first-success rule — the prioritized-choice assumption of PEG
  • repetitions never backtrack on matched strings — the "greedy" feature of PEG repetitions
  • an additional "repeat-until" modification of repetitions was found useful for some parsing situations — a specialized, limited form of PEG's "not" syntactic predicate operator.

With the generalization of APG's "repeat-until" operator to conform with the syntactic predicates defined by PEG, the argument that the two formalisms parse the same set of languages, while not a rigorous mathematical proof, is just a matter of showing that every PEG operator can be duplicated with SABNF operators and vice versa. This is rather trivially done in Tables 3 and 4.


3.3.2 Parsing Linearity

Packrat parsing[7] guarantees linear parsing times — t = O(n), where t is the parsing time and n is the length of the input string — with an efficient means of memoizing the results of parsing each rule at each character in the string (memoization). When the parser hits a rule a second time with the same string character, it pulls the result out of the tables rather than repeating the parse of the rule. APG 5.0 does not attempt to do Packrat parsing, but it does have a feature for collecting statistics on how often the duplication occurs. This is not a run-time option; it requires rebuilding the APG library with _APG_CFG_PACKRAT defined. The parsing statistics can then be displayed with the ApgUtilities program vPrintPackratRuleStats(). The complete results for a parse of a typical MEGACO[8] protocol message are shown in Fig. 1. Of the possible 105546 rule/character combinations, only 53 were visited more than once, and those occur only at the leafy fringe of the parse tree where little would be lost by repeating the parse anyway. Using the PPT facility, this is reduced to only 8 multiple visits. PPTs, as the "partial" in the name implies, will rarely result in a completely linear parser, but they can significantly reduce the degree of non-linearity. Statistics for the Internet protocol problems in the sample applications indicate that the parsing of these messages is nearly linear and would get little, if any, benefit from a Packrat parser. That is not to say, of course, that other problems are not highly non-linear. This APG 5.0 feature is intended only as a discovery tool.

3.4 SABNF Syntax Trees

The text books define graphical representations of context-free languages variously as derivation trees, parse trees, concrete syntax trees and abstract syntax trees[1]. None of these quite describe the type of tree that I have found most useful in the development of APG. At the risk of some confusion but for lack of a better term, I will call the trees that I use for both method development and parsing algorithms simply syntax trees.


3.4.1 Syntax Trees

The chief difference between the APG syntax tree and the commonly-defined parse tree is that a parse tree describes only a single derivation. For the replacement of each rule only a single production is used. That is, for each ALT operator, only a single alternate appears in the parse tree. In an APG syntax tree every terminal and non-terminal operator appears as a node in the tree, including all alternate children of an ALT operator. The APG syntax tree therefore represents every possible derivation simultaneously as opposed to only one. As a simple example, using all eight of the SABNF operators, Fig. 2 illustrates the syntax tree of the grammar for a C-style comment (the symbol alphabet is the ASCII character set)

comment = begin *any end ; C-style comment
begin = %d47.42 ; TBS representation of "/*"
end = "*/" ; TLS representation of "*/"
any = !(begin / end) %d32-127 ; any string excluding delimiters

In practice, of course, begin or "/*" would be allowed in a comment, but that is stretched a bit here for illustration purposes. Grammars with recursive rules are examined in a later section.


3.4.2 Abstract Syntax Trees

The trees used for syntax-directed translation are Abstract Syntax Trees (AST) and conform to the definition in ref [1]. That is, the nodes of an AST contain only semantically interesting programming or translational constructs. For APG parsers that means an AST has only the RNM operators of semantic interest as nodes and each node has its matched phrase (sub-string) attached. For example, Fig. 5 illustrates the AST for a parse of the string "/*ab*/" with the above grammar. Note that, in principle, REP keeps its matched repetitions as sibling children but in practice REP itself is not kept in the AST.


3.5 The Single-Expansion Syntax Tree (SEST)

The definition of the syntax tree given above has an obvious problem when the grammar contains recursive rules. There is nothing to stop the rule name replacements and the tree has infinite depth. Intuitively, one feels that a single expansion of each rule should be sufficient to expose all of the grammar's characteristics. Heuristically, we know from the pumping lemma[2] for CFGs that for sufficiently long strings, some phrases or sub-strings must be repeated, and the proof entails repeated expansions of recursive rules. In fact, it would not seem inappropriate to call the rule name leaf nodes (defined below) the "pumps" of the pumping lemma.

A simple example, shown in Fig. 4, illustrates the point.

S = "s" S / "y"

defines the language s^n y, n >= 0. A single expansion of S reveals the two cases n = 0 and n = 1. Further expansions of S add no new phrases to the string.

Without attempting any serious mathematics, it is postulated here that the SEST contains information mathematically equivalent to that of a push-down automaton for the same language. Whereas the language properties in a push-down automaton are state-based, the information in a SEST is phrase-based. True or not, the SEST has at least proven sufficient for the purposes of APG — recursion analysis and computation of the PPTs. We therefore define the Single-Expansion Syntax Tree (SEST) for the general case. The possibility of mutually-recursive rules requires us to take some care with the definition.


3.5.1 Definition

As a working example let's extend the grammar for C-style comments to include nested comments. To that end we use

comment = begin *any (comment / "") end ; C-style nested comments
begin = %d47.42 ; TBS representation of "/*"
end = "*/" ; TLS representation of "*/"
any = !end %d32-127 ; any string except "*/"

Here again, we stretch the definition a bit for illustration purposes. In practice we would probably use [comment *any] rather than the grouping (comment/""). However, the grouping illustrates an explicit empty string, and the second *any term just complicates the syntax tree without illustrating any new features. Note that the statistics in Fig. 3 now indicate at least one of everything.

First, we define a branch of the syntax tree as the unique sequence of nodes leading from the start rule (the tree root node) to any single terminal node.

We define a rule name root node as the first occurrence of a specific rule name in a given branch. That is, its node forms the root of a sub-tree.

We define a rule name leaf node as the second occurrence of any specific rule name in a given branch. That is, if a given rule name has already appeared once in the branch above it, it is a leaf node and treated as a terminal node. The branch ends here.

We can now define the single-expansion syntax tree (SEST) as the syntax tree whose leaves are all terminal operators TRG, TBS, TLS or RNM leaf nodes.
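The rule name root/leaf distinction above can be sketched as a branch-tracking expansion. This hypothetical sketch abstracts a grammar to just the rule names each rule references and counts only the rule-name nodes of the SEST, ignoring the operator nodes; none of the names are APG's.

```c
#define MAX_RULES 16

/* A rule is abstracted to the list of rules it references. */
typedef struct {
    int refs[MAX_RULES];
    int nrefs;
} Rule;

static long expand(const Rule *g, int rule, int *on_branch) {
    long nodes = 1;                 /* count this rule-name node */
    int i;
    if (on_branch[rule]) return 1;  /* second occurrence on this branch:
                                       a rule name leaf node — stop here */
    on_branch[rule] = 1;            /* rule name root node: open on branch */
    for (i = 0; i < g[rule].nrefs; i++)
        nodes += expand(g, g[rule].refs[i], on_branch);
    on_branch[rule] = 0;            /* close when leaving the branch */
    return nodes;
}

/* count the rule-name nodes of the SEST rooted at the start rule */
long sest_rule_nodes(const Rule *g, int start) {
    int on_branch[MAX_RULES] = {0};
    return expand(g, start, on_branch);
}
```

For the mutually-recursive grammar S = A / B, A = "a" B / "y", B = "b" A / "y" (encoded as rules 0, 1, 2), this sketch counts seven rule-name nodes: S, plus an A root with a B sub-tree ending in an A leaf, plus a B root with an A sub-tree ending in a B leaf.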

The reason for carefully specifying a single given branch in the definition of rule name leaves is that in the case where two or more recursive rules mutually refer to one another, the sub-trees of the rule name root nodes may differ in different branches. This is illustrated in Fig. 6 for the, somewhat artificial, grammar:

S = A / B
A = "a" B / "y"
B = "b" A / "y"

We see there that the sub-trees of the rule name root node A are different in the two different branches.


3.5.2 Depth-First Traversal as a Formula Derivation Technique

Although the details of the algorithms are not given here (they are, of course, in the code, albeit obscurely), I want to mention that both the recursive attributes of a grammar and the partially-predictive parsing tables are evaluated by defining appropriate node operations during one or more depth-first traversals of the SEST.

I also note in passing that preliminary work indicates that this same technique - node operations on a depth-first traversal of the SEST - appears to be sufficient for deriving the handles and items used in building the transition matrix of bottom-up parsing methods as well. In fact, the derivations seem to me to be more visual and intuitive than the text book derivations I have seen.

The technique also indicates an easy definition of a recursive-ascent[9-10] equivalent of the push-down-automaton, bottom-up methods. The primary reason for not pursuing this to develop a "bottom-up APG" is that the bottom-up approach seems to lose its primary advantage over top-down methods once the terminals are generalized from single alphabet characters to strings of characters, as the TBS and TLS terminal operators do. My understanding is that the primary advantage of the bottom-up methods is their ability to follow multiple branches simultaneously as the input string characters are consumed, one by one. This simultaneous pursuit of two branches vanishes when the two consume different strings, even though those strings may have identical prefixes.


3.5.3 Limitations

It is easy to see that the number of nodes in the SEST will rapidly become large as the tree depth grows. In fact, a crude study of this growth for a sequence of small to large grammars indicates the growth is larger than N = O(D^10), where D is the maximum SEST depth and N is the number of SEST nodes. In practice, I have found that for somewhere between 130 < D < 190 the recursion analysis and PPT computation times become impractically long. However, once these one-time computations have been completed, the actual parsing of an input string does not reflect this computational effort and can be quite efficient. This is because the parse tree for a single derivation can be quite small even though the SEST is huge. Therefore, APG can still be used on large grammars if recursion analysis and PPT computations are skipped. Of course, you lose the speed improvements that PPTs give, but you can still build a small, viable APG parser. This situation is no different from the problems encountered by bottom-up methods for large grammars. The number of states, in this case, can grow impractically large as well.


3.6 Opcodes

The primary function of the parser generator is to recognize each of the APG operators in the grammar and convert it to a numerical format called an operator code, or opcode. A list of opcodes and supporting data is generated. The supporting data is a string table, a rule list and the numerical data in each opcode.

3.6.1 String Table

This is just a byte array of binary data. Within it are all of the rule names, binary and literal strings. The rule names are informational, null-terminated ASCII strings. The strings used by the TBS and TLS operators are simply arrays of 8-bit integers.

3.6.2 Rule List

For each rule name an additional data structure is defined. The rule list is an array of these structures. The structure contains a pointer to the rule name in the string table, a pointer to the first opcode of the rule in the opcode table and a pointer to the callback function for the rule (callback functions are discussed below.)

3.6.3 Opcodes

The table of opcodes is a contiguous set of opcode lists, one list for each named rule. There is roughly one opcode in the list for each term in the rule name definition. The opcodes within a list are rearranged according to operator precedence and useless and redundant opcodes are removed. Useless opcodes are ALT and CAT operators with only a single child and REP(n,m) operators with n = m = 1. Only one ALT per grouping is kept, since all alternates in the group are handled as siblings to a single ALT. Similarly, only a single CAT is retained for a concatenation of two or more terms. The supporting data for each operator is:

  • TRG: min and max of the character range (min = max for a single character).
  • TBS: pointer into the string table and length of the binary string.
  • TLS: pointer into the string table and length of the literal string.
  • RNM: The index into the rule list for this rule.
  • CAT: pointer to the first and last opcodes to be concatenated.
  • ALT: pointer to the first and last alternate opcodes.
  • REP: min and max of the repetition defined. Pointer to the child opcode.
  • PRD: "and" or "not" form and a pointer to the child opcode.
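The supporting data listed above suggests a tagged-union layout. The following is a hypothetical C rendering; the type and field names are illustrative and do not correspond to APG's actual identifiers.

```c
#include <stddef.h>

typedef enum {
    OP_TRG, OP_TBS, OP_TLS, OP_RNM, OP_CAT, OP_ALT, OP_REP, OP_PRD
} OpType;

typedef struct {
    OpType type;
    union {
        struct { int min, max; } trg;                 /* character range      */
        struct { size_t offset, length; } str;        /* TBS/TLS string table */
        struct { size_t rule_index; } rnm;            /* index into rule list */
        struct { size_t first, last; } list;          /* CAT/ALT child range  */
        struct { long min, max; size_t child; } rep;  /* n*m and child opcode */
        struct { int is_not; size_t child; } prd;     /* & (0) or ! (1) form  */
    } u;
} Opcode;
```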

3.7 Semantic Actions as Call Back Functions

A call back function is optionally associated with each named rule. Each is a static function, to prevent naming conflicts, and a "shell" function is generated for each by APG. These functions are called by the parser at various stages of its attempts to match a sub-string of the input string. Semantic actions are implemented as user-written code in the call back functions. The call back functions are called twice during string parsing - traversal of the syntax tree - and twice during translation - traversal of the AST.


3.7.1 Semantic Actions During Parsing

The call back function for a given rule is called twice during the parsing or string-matching process, once just before an attempt to match the rule begins and once upon completion. Translations of the input string can be done during the parsing process, although care must be taken to allow for possible backtracking. The call just before can be used for any initialization that needs to be done for the translation. However, this call can also be used as a "semantic" form of syntactic predicate. For example, consider the rule for comment in Section 3.4.1 above. Suppose the call back function for comment has been called and the look-ahead character is "/". Semantically we know that the next character must be "*" or the match to comment will fail. We can easily check the next character and return failure if it is not "*", sparing the parser the time and trouble of attempting the match syntactically. As demonstrated in the IPv6 sample application, this can lead to a significant performance boost.
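The look-ahead check for the comment rule might look like the following. The callback signature and context structure here are hypothetical — APG's real interface differs — but the fail-fast logic is the point.

```c
#include <stddef.h>

/* Hypothetical parsing context passed to a pre-parse callback. */
typedef struct {
    const char *input;   /* input string           */
    size_t length;       /* input length           */
    size_t offset;       /* current parse position */
} CallbackCtx;

/* Pre-parse callback for the 'comment' rule: if the look-ahead
   character is "/", the next character must be "*" or the rule
   cannot possibly match.  Returns 0 to let the parser proceed
   with the normal syntactic match, -1 to report "no match" early. */
int comment_pre(const CallbackCtx *ctx) {
    if (ctx->offset < ctx->length && ctx->input[ctx->offset] == '/') {
        if (ctx->offset + 1 >= ctx->length || ctx->input[ctx->offset + 1] != '*')
            return -1;   /* semantic predicate: fail fast */
    }
    return 0;
}
```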

3.7.2 Semantic Actions During AST-Directed Translation

During parsing the parser can optionally construct an AST. A call to the semantic analysis function will then do a depth-first traversal of the AST, again calling the call back functions for each rule once just before the rule's branch is traversed and once on completion.

3.7.3 Controlling the Calls

Because grammars often have many rule names that are not of semantic interest, it is undesirable to make calls to call back functions for all rules. Therefore, facilities are provided to control the creation of the call back functions. The user can specify which functions are to be called during syntax analysis and which are to be called during semantic analysis. If a function is to be called during neither, its generation is omitted entirely. To optimize parsing performance, it is important to consider carefully which functions are actually needed and to use these facilities to eliminate those that are not.

3.8 Partially-Predictive Parsing Tables (PPT)

The use of predictive parsing tables is well known but useful only for deterministic, or LL(1), grammars. However, for non-deterministic grammars (those requiring a backtracking parser) it is often the case that some or many of the alternate choices are still deterministic. For example, consider the grammar for alphanumeric strings (for simplicity we allow only the upper case letters A, B and C and the digits 1, 2 and 3.)

alphanum = alpha *(alpha / digit)
alpha = "A" / "B" / "C"
digit = "1" / "2" / "3"

We can easily see that the rules alpha and digit are deterministic. alphanum is not deterministic, since even if the look ahead character is A, B or C, there may be other characters ahead in the string that are not alphanumeric. However, it is partially deterministic. If the look ahead character is not alphabetic, we know for sure that the match will fail. If it is alphabetic, the match might succeed or it might not, and we have to go ahead and parse the alphanum rule non-deterministically as usual. Therefore, we can generate a partially-deterministic parser by associating a partially-predictive parsing vector with each rule name. Each vector has an entry for every character in the symbol alphabet, including a virtual end-of-string (EOS) character, and each entry takes one of four possible values.

Table 5 illustrates the vectors for the rule names in the example above. The action of the RNM operator is then modified to first check the vector for the rule. If the vector value is deterministic, the full parsing of the rule is skipped. If not, the rule is parsed normally.

In practice a partially-predictive vector is associated with every opcode, not just the RNM opcodes. The partially-predictive parsing table (PPT) is the 2-D collection of all vectors for all opcodes. The size of the PPT is O(N x M), where N is the number of symbol alphabet characters (including the virtual EOS character) and M is the number of opcodes. The vectors are created once, at parser generation time. In the sample applications the addition of PPTs provides performance gains by factors of 2-4. In the worst case among the examples, the parser's static data grows from 39K to 87K, which is not particularly significant on today's computers with gigabytes of memory.
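The vector idea for the alphanum example can be sketched in C. The four values and all names here are illustrative assumptions, not the actual APG definitions.

```c
/* A sketch of PPT vectors for the alphanum example. The enum values
   and names are assumptions for illustration; see the APG source for
   the real definitions. */
#define ALPHABET 257 /* 256 characters plus a virtual EOS character */

typedef enum {
    PPT_NOMATCH, /* deterministic: the rule cannot match          */
    PPT_MATCH,   /* deterministic: the rule will match            */
    PPT_EMPTY,   /* deterministic: matches only the empty string  */
    PPT_ACTIVE   /* non-deterministic: parse the rule normally    */
} PPT_VAL;

void vBuildVectors(PPT_VAL *spAlpha, PPT_VAL *spAlphanum) {
    int i;
    for (i = 0; i < ALPHABET; i++) {
        spAlpha[i] = PPT_NOMATCH;    /* alpha is fully deterministic      */
        spAlphanum[i] = PPT_NOMATCH; /* a non-alphabetic look ahead fails */
    }
    spAlpha['A'] = spAlpha['B'] = spAlpha['C'] = PPT_MATCH;
    /* an alphabetic look ahead might or might not match alphanum */
    spAlphanum['A'] = spAlphanum['B'] = spAlphanum['C'] = PPT_ACTIVE;
}

/* the modified RNM operator checks the vector first */
int iCanSkipParse(const PPT_VAL *spVec, unsigned uiChar) {
    return spVec[uiChar] != PPT_ACTIVE;
}
```

Only the PPT_ACTIVE entries force a normal, possibly backtracking, parse of the rule.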


3.9 Recursion Analysis

It is well known that recursive-descent parsers fail for left-recursive grammars. Additionally, some grammar rules define only infinite strings and others define no strings at all. For simple grammars, whether or not these pathological conditions exist is usually obvious on inspection. However, for complex grammars with long derivations it may not be so obvious, and it is helpful to have an algorithm that checks for them ahead of time. And while we are at it, there are a number of other grammar attributes that are interesting to discover. The parser generator can optionally determine these attributes prior to parser generation. The attributes it checks are the yes/no answers to the following questions:

Does the grammar accept empty strings?
Does the grammar accept finite strings?
Is the grammar recursive?
Is the grammar left-recursive?
Is the grammar right-recursive?
Is the grammar nested-recursive?
Is the grammar cyclic?

Note that the attributes are start-rule dependent and that a grammar may exhibit more than one "yes" attribute. Table 6 shows some elementary examples of grammars exhibiting each of these attributes.

3.10 The Bootstrap

The original bootstrap for APG 5.0 was written as an APG 4.0-generated parser. However, there were many intermediate stages in the development and many stages used its predecessor as the bootstrap. As a consequence, there is no single existing bootstrap program for APG 5.0. In its final form, however, APG 5.0 serves as its own generator. The generator is built from two APG parsers, one for the pre-processor which generates tokens from the grammar and one for the token parser. It currently reproduces both of these parsers exactly, line for line of code, when the grammars of Appendix A are used for the input.


4 Implementation

The native language is C. The parser generator, the parser library and the generated parsers are all written in C. However, there is a simple C++ wrapper to encapsulate the basic parser and the generator has an option to output the generated parser as a C++ class.

4.1 Coding Practice

APG uses typedefs for the common unsigned integers:

uchar — unsigned char
ulong — unsigned long
ushort — unsigned short

ints are rarely used; ulong is by far the most commonly used integer type. Because most integers are ulongs, the Booleans:

APG_TRUE = (ulong)1
APG_FALSE = (ulong)0

are defined and are used in most Boolean expressions.
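In header form these conventions amount to something like the following (paraphrased; the actual APG header may differ in detail):

```c
/* the common typedefs and Booleans described above (paraphrased;
   the actual APG header may differ in detail) */
typedef unsigned char  uchar;
typedef unsigned short ushort;
typedef unsigned long  ulong;

#define APG_TRUE  ((ulong)1)
#define APG_FALSE ((ulong)0)
```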

4.1.1 Hungarian Notation

The pros and cons of Hungarian notation are controversial, and the choice usually boils down to a matter of preference. I find it convenient to be able to recognize a variable's type by simply looking at its name. I use the following Hungarian notation fairly consistently for variable names, and for function names to indicate the data type of their return values:

c — char
cp — pointer to a char
uc — uchar
ucp — pointer to a uchar
ul — ulong
ulp — pointer to a ulong
us — ushort
usp — pointer to a ushort
s — struct
sp — pointer to a struct
pfn — pointer to a function
i — int
g_ — global variable
I don't like global variables, but occasionally convenience trumps avoidance. In these cases I try to use the g_ prefix.


4.1.2 Macros

I personally find it difficult to read someone else's code when it is heavily populated with unfamiliar macros. Nevertheless, I use them, primarily for three purposes:

  1. to provide simple, consistent, single-line checking of error conditions with location identification,
  2. to gather statistics and do tracing and
  3. to easily remove all of the code they generate, when it is unneeded, without a lot of conditional #if-#endif pairs cluttering the statement flow.

The naming conventions for the macros should make their purpose obvious with very little APG usage experience.


4.1.3 Error Checking

Error checking is done with a variety of xxxASSERT macros which test a condition and report the file and line number on a false condition. In particular there is a liberal sprinkling of DASSERT() macros throughout. These provide debug checking only and produce no code at all in the release versions. Commonly used assertion macros are:

DASSERT() — debug assertions
VASSERT() — vector component assertions
PASSERT() — parser component assertions
CASSERT() — catalog component assertions
xxxASSERT() — specialized assertions in the examples,
            — xxx = SPD, SIP, etc.
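A minimal sketch of the DASSERT() idea follows. It is illustrative only; the real macro reports the file and line number on a false condition, and here a global counter stands in for that reporting.

```c
/* for this illustration only, force a "debug" build */
#define _APG_CFG_DEBUG

unsigned long g_ulAssertFails = 0; /* global, hence the g_ prefix */

#ifdef _APG_CFG_DEBUG
/* debug build: test the condition (the real macro also records
   __FILE__ and __LINE__ on failure) */
#define DASSERT(cond) do { if (!(cond)) g_ulAssertFails++; } while (0)
#else
/* release build: no code at all */
#define DASSERT(cond)
#endif
```

With the macro undefined for release, every DASSERT() in the source compiles away to nothing, which is purpose 3 of the macro list above.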

4.1.4 Use of the "goto" Statement

The goto statement is generally regarded as bad coding practice and for good reason. However, a conflicting good programming practice is to provide a single point of exit from functions. From nested iteration loops, "goto" is almost the only way out without seriously contorting the code. Even when there are no loops, the use of "goto" facilitates simple code for an early exit from a function — often because of detection of an error condition, but in other situations as well. I make heavy use of "goto", but in a consistent and single-purposed manner, usually hidden in ASSERT macros.

Most APG functions end with an "APG_EXIT:" label and use "goto APG_EXIT;" as a means of early exit to the single "return" statement at the end of the function. Through the use of ASSERT macros, which exit on error with "goto APG_EXIT;", it is easy to emulate the stack unwinding provided by C++ but missing from the C language, with a call stack trace of the error locations kept in the error history of the memory component.
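The pattern looks like this in a simple, self-contained example (illustrative; in real APG code the gotos are usually hidden inside the ASSERT macros):

```c
#include <stdlib.h>
#include <string.h>

/* An illustration of the single-exit "goto APG_EXIT" pattern. The
   function copies a string, upper-casing ASCII letters, and exits
   early on any error. */
char *cpDupUpper(const char *cpIn) {
    char *cpOut = NULL;
    size_t uiIndex, uiLen;
    if (cpIn == NULL) goto APG_EXIT;   /* early exit: bad input */
    uiLen = strlen(cpIn);
    cpOut = (char *)malloc(uiLen + 1);
    if (cpOut == NULL) goto APG_EXIT;  /* early exit: allocation failed */
    for (uiIndex = 0; uiIndex < uiLen; uiIndex++) {
        char c = cpIn[uiIndex];
        cpOut[uiIndex] = (c >= 'a' && c <= 'z') ? (char)(c - ('a' - 'A')) : c;
    }
    cpOut[uiLen] = 0;
APG_EXIT:                              /* the single point of exit */
    return cpOut;
}
```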

4.1.5 Statistics Gathering

I want statistics gathering to be completely removed from the released versions, without burdening the source with lots of conditionals, and macros are a convenient means of doing this. Statistics gathering is primarily limited to the parser and the parser operators.


4.1.6 APG Library

The two things I missed most from C++ were the automatic stack unwinding on a thrown exception for error reporting and recovery, and the dynamic buffering of the <vector> template from the Standard C++ Library. The "memory" component facilitates error reporting and recovery, and the "vector" component replaces the <vector> template. These are the lowest-level components, upon which most other components are built. The descriptions of the APG Library components given below are general descriptions. See the source code for usage details.

4.1.7 Components

Components serve much the same purpose as C++ classes. Data encapsulation is easily achieved with an explicit "this" pointer which takes the form of an opaque context pointer. APG does not rely heavily on inheritance anyway, and simply passing low-level component pointers to the creation of higher-level components is adequate. Polymorphism is rarely needed, but occasionally function pointers are defined and conditional assignment of the functions they point to is quite adequate. The component standard is to assign a unique name to the component, define a constructor and destructor and member functions using the name as a base. For example, usage of the memory component might look like,

// construct a memory component
void* vpMemCtx = vpMemCtor();

// allocate a buffer of 100 ulongs
ulong* ulpBuffer = (ulong*)vpMemAlloc(vpMemCtx, 100 * sizeof(ulong));

// report an error
ulErrReport(vpMemCtx, 999, __LINE__, __FILE__);

// destroy the memory component, freeing all of its allocations
vMemDtor(vpMemCtx);
NOTE: Even though error reporting is done through a memory component context, the functions use the "Err" naming conventions rather than "Mem" due to historical origin.

The library is designed for compactness. No I/O functions are used; this keeps the parser components to a minimum footprint. The primary purpose of the accompanying ApgUtilities library is to provide some optional but common functions for displaying parser information. The component descriptions below are given in order from lowest level to highest. Thread safety has not been explicitly tested, but all components should be thread safe as long as no two threads access the same component context.


4.1.8 Tools

Currently the only tool provided is a 32-bit cyclic redundancy check, taken from RFC 3309. This is used occasionally as a simple hash function for data integrity checking. Tools use no other library facilities.

4.1.9 Alerts

The lowest level of the APG Library is an alert handler. Alerts are a means of reporting a serious error condition and capturing the source code file name and line number where it occurred. vAlertHandler() is the public function for sending alerts. The default behavior is simply to exit the program, so it is recommended that you override the default with an application-specific procedure. It is also highly recommended that you write a function for intercepting structured exceptions (access violations, divide by zero, etc.) and call the alert handler from within it. This is an operating-system-dependent procedure and is therefore not provided in the APG Library.

4.1.10 Memory Component

The main purpose of the memory component is to provide garbage collection. It has two constructors. The default constructor uses the CRT functions malloc() and free() as memory allocators; a second constructor allows the user to supply a custom allocator. Its destructor frees all memory that has been allocated through that instance of the memory component. Because it makes use of notoriously slow linked lists, it should not be used in production loops; the APG library uses it for initial allocations and as a final clean-up tool. The memory component also provides error reporting functions. Their purpose is to keep a history of error reports in a memory-allocation-independent manner, chiefly a trace of the call stack with error locations (file and line number) during an unwinding of the call stack. These functions make no use of memory allocation, so they may safely be used to report an out-of-memory error.
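The garbage-collection idea can be sketched with a few lines of C. The names here are invented for illustration, not the memory component's actual API: every allocation is pushed onto a linked list owned by the context, so the destructor can free everything at once.

```c
#include <stdlib.h>

/* A simplified sketch of the memory component's garbage collection
   (invented names, not the actual API). */
typedef struct MemLink { struct MemLink *spNext; } MemLink;
typedef struct { MemLink *spHead; } MemCtx;

void *vpMemAllocSketch(MemCtx *spCtx, size_t uiBytes) {
    MemLink *spLink = (MemLink *)malloc(sizeof(MemLink) + uiBytes);
    if (spLink == NULL) return NULL;
    spLink->spNext = spCtx->spHead; /* push onto the context's list */
    spCtx->spHead = spLink;
    return (void *)(spLink + 1);    /* the user's data follows the link */
}

void vMemDtorSketch(MemCtx *spCtx) { /* free all allocations at once */
    MemLink *spLink = spCtx->spHead;
    while (spLink != NULL) {
        MemLink *spNext = spLink->spNext;
        free(spLink);
        spLink = spNext;
    }
    spCtx->spHead = NULL;
}
```

The per-allocation list walk is also why, as noted above, this approach does not belong in production loops.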


4.1.11 Vector Component

The purpose of the vector component is to provide a dynamic data buffer, that is, a data buffer to which data can be continuously added without needing to know in advance how much data the buffer will have to hold. APG parsers make heavy use of the vector component. For efficiency, the initial allocation should be chosen to minimize the number of times the vector must actually re-allocate the data buffer.
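The dynamic-buffer idea can be sketched as follows (a simplified illustration with invented names, not the vector component's actual API):

```c
#include <stdlib.h>
#include <string.h>

/* A simplified sketch of a dynamic buffer. The buffer grows by
   doubling, so a generous initial allocation means fewer
   re-allocations. */
typedef struct {
    char *cpData;   /* the buffer                */
    size_t uiSize;  /* bytes currently in use    */
    size_t uiAlloc; /* bytes currently allocated */
} VecSketch;

int iVecPush(VecSketch *spVec, const void *vpData, size_t uiBytes) {
    if (spVec->uiSize + uiBytes > spVec->uiAlloc) {
        size_t uiNew = spVec->uiAlloc ? spVec->uiAlloc * 2 : 64;
        char *cpNew;
        while (uiNew < spVec->uiSize + uiBytes) uiNew *= 2;
        cpNew = (char *)realloc(spVec->cpData, uiNew);
        if (cpNew == NULL) return 0; /* grow failed, vector unchanged */
        spVec->cpData = cpNew;
        spVec->uiAlloc = uiNew;
    }
    memcpy(spVec->cpData + spVec->uiSize, vpData, uiBytes);
    spVec->uiSize += uiBytes;
    return 1;
}
```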

4.1.12 Parser Component

The parser component is the APG parser. All other components exist solely to support the parser component. The parser generator produces relocatable parser data. Therefore, an initialization function, ulParserInit() — which is independent of any parser component context — must be called once per process before any parser component can be constructed. The constructor does not make a call to this function. The reason for this is to protect against multiple threads calling this function simultaneously. It should be called once per process at startup before any thread creation. The parser constructor will fail if ulParserInit() has not been previously called.

4.1.13 Catalog Component

The parser component provides a facility — ulParserSemanticAnalysis() — for doing a depth-first traversal of the AST. Other types of AST traversals are, of course, possible. The concept of a catalog is a means of traversing the AST horizontally at an arbitrary tree depth, with limited up and down movements. The catalog component provides the functions for realizing the catalog concept. Examples of its utility can be found in the sample applications.


4.1.14 Error Component

The error component is different from, and independent of, the error history reporting done through the memory component. It is a higher-level component intended for applications that need text-annotated error and warning histories without stopping the application. The APG parser generator makes considerable use of this component. It should never be used to report a memory allocation failure.

4.2 APG Utilities

The utility functions are for the most part useful functions for displaying results, but unnecessary for the actual parsing of strings. The purpose of keeping them in a separate library is to keep all I/O functions out of the APG Library.


4.2.1 Display Utilities

These functions are ready-made displays of states, statistics, errors and debugging information.

4.2.2 File Utilities

Applications often need to determine the size of a file, allocate a buffer and read and write the file. Here are a few helpers to do just that.
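A typical helper of this kind might look like the following (a generic sketch, not the actual ApgUtilities function):

```c
#include <stdio.h>
#include <stdlib.h>

/* Find a file's size, allocate a buffer one byte larger, read the
   whole file and null-terminate it. Returns NULL on any failure. */
char *cpReadFile(const char *cpName, unsigned long *ulpSize) {
    char *cpBuf = NULL;
    FILE *spFile = fopen(cpName, "rb");
    if (spFile == NULL) return NULL;
    if (fseek(spFile, 0, SEEK_END) == 0) {
        long lSize = ftell(spFile);
        if (lSize >= 0 && fseek(spFile, 0, SEEK_SET) == 0) {
            cpBuf = (char *)malloc((size_t)lSize + 1);
            if (cpBuf != NULL && fread(cpBuf, 1, (size_t)lSize, spFile) == (size_t)lSize) {
                cpBuf[lSize] = 0;
                if (ulpSize != NULL) *ulpSize = (unsigned long)lSize;
            } else {
                free(cpBuf);
                cpBuf = NULL;
            }
        }
    }
    fclose(spFile);
    return cpBuf;
}
```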

4.2.3 IPv4 and IPv6 Hand-Written Parsers

A hand-written IPv6 parser really has nothing to do with APG in particular. But it was developed for use in the IPv6 sample application and may have general usage beyond that specific example. This just seemed like a good spot to save it off for future use.


5 Usage

5.1 The Grammar

For optimum parser performance it is important to limit the calling of the call back functions to only those of semantic interest — those for which semantic action code is actually written. This control is provided for in the format of the input grammar file. The grammar file is optionally built from three named sections — the grammar section, the syntax analysis control section and the semantic analysis control section. These sections are illustrated for the simple alphanumeric grammar used earlier.

[SECTION GRAMMAR]
alphanum = alpha *(alpha / digit)
alpha = "A" / "B" / "C"
digit = "1" / "2" / "3"

[SECTION SYNTAX]
[EXCLUDE]
ALL_RULES

[SECTION SEMANTICS]
[INCLUDE]
alphanum


The grammar section contains the actual SABNF grammar for which a parser is to be generated. The syntax section contains a list of rule name call back functions to be called during syntax analysis, and the semantics section contains a list of rule names to be called during semantic analysis. The meaning of the directives [INCLUDE], [EXCLUDE] and ALL_RULES should be apparent. If a rule name does not appear in the syntax section, it is not called during syntax analysis. Ditto for the semantics section. If a rule name appears in neither the syntax nor the semantics section, no call back function is generated for it at all. In the example above, only one call back function, for alphanum, is generated. No call back functions are called during syntax analysis and the AST will have only a single node, for alphanum. If no section directives appear in the grammar file at all, the default is to generate call back functions for all rule names.

5.2 The Parser Generator

5.2.1 Command Line Options

Options have the general form:

/name or /name:value

Except for the help option "?", all option names must begin with a slash, must be lower case and may not be abbreviated. Option values are case sensitive. Options not taking a value are true/false flags, with a default value of "false" if the option is not present and "true" if it is. The ordering of the options is irrelevant. If an option appears more than once, only the last (right-most) occurrence is recognized. If any option errors are detected, or if a help option is present, a help screen is printed to the console and the program terminates. "/in" is the only required option; all others are optional.

help

?, /? and /h are all equally recognized as a help request option. A help screen is printed to the console and the program terminates.

files

/in, /out and /log are the three file name options. File name lengths, including the null terminator and any file name extensions, must be less than 128 characters.

/in:filename specifies the input grammar file name. It is the only required option.

/out:parsername specifies the name to use for the parser. It is case sensitive and is used in several ways: it defines a uniquely named parser component or class and unique source code file names. For example, assume we've used the option /out:MyParser. For a C-language parser component we get the file names MyParser.h and MyParser.c. MyParser.h will declare a variable vpMyParserInit which can be used to initialize and construct a parser component. For a C++ parser class, the file names will be MyParser.h and MyParser.cpp. MyParser.h will declare a parser class named ApgMyParser.

/log:filename specifies that all parser generator output is to be redirected to the named file. The default is to direct all output to the console screen.

parser type

/ppt requests the parser component to use partially-predictive parsing tables. The default is to not use them.

/cpp requests a C++ parser class to be generated. The default output is a C-language parser component.

recursion analysis

/startrule:int - The results of recursion analysis are dependent on the start rule. This option specifies which rule to use as the start rule. The default is int = 0 which is the first rule defined in the grammar. To select a different start rule, especially for large grammars, it may be necessary to run the generator once with the default to see how APG indexes the different rule names.

/sestmax:int - Recursion analysis requires a traversal of the syntax tree (SEST). For some grammars the SEST can have an extremely large number of nodes, causing the generator to appear to hang. This parameter specifies the cutoff value at which to break out of the traversal. If not present, a default value of 1,000,000 is used.

/nr — no recursion analysis. This option will skip recursion analysis.

displays

All display options begin with /d.

/dgrammar — Displays the input grammar with one-base line numbers and character count numbers. These are very useful for identifying grammar errors.

/dwarnings — Displays all grammar warnings. Implies /dgrammar since the warnings reference the line numbers.

/derrors — Displays all grammar errors. Implies /dwarnings (and hence /dgrammar.)

/dr — Displays recursion analysis of the start rule.

/drv — Displays verbose recursion analysis. Displays a recursion analysis for each rule as the start rule.

/dmetrics — Displays metrics or statistics for the grammar, SEST and PPTs if /ppt has been specified.

/dcharmap — Displays a character map. Prints a map of all alphabet symbol indexes, with a yes/no indication of whether the index is used or not by the grammar. If the index is in the range [32, 127] it prints the corresponding ASCII character.

/dv — Verbose display. Sets to "true" the options /dgrammar, /dwarnings, /derrors, /dmetrics and /dr.

/dopcodes — Displays an ASCII translation of each opcode. Primarily used as a generator debugging option, it may be instructive in some instances to examine the opcodes in detail. May generate an excessive amount of output.

/dppts — Displays a mapping of each PPT. Primarily used as a generator debugging option, it may be instructive in some instances to examine the PPTs in detail. May generate an excessive amount of output.


5.3 The Parser

The output of the parser generator is a header file (name.h) and a source code file (name.c or name.cpp), where "name" is the name supplied in the /out:name option. The header file defines identifiers for all of the rule names and a uniquely-named void pointer to be used for parser initialization and construction. Both the header and source files contain a number of comment pairs. Any user code written between the comment pairs will be preserved by the parser generator on re-generation. This facilitates the iterative process of grammar debugging.

The parser function ulParserSyntaxAnalysis() also has a true/false flag to turn on a tracing function. This will print a trace of the syntax tree traversal during parsing. This can also be quite helpful for debugging, but can generate copious output. (If the input string is ASCII text with CRLF line enders, the utility function vDisplaySourceLines() can also be helpful in locating the line and character position of errors.)

The source code file contains the initialization data and call back functions. The initialization data is an opaque array of binary data. The string table, rule list, opcodes and PPTs are all there. The data is re-locatable, requiring a single call to an initialization function to "link" it to the running process before any parser component can be constructed.

The call back functions are do-nothing functions on the initial generation, but any user code written into them is preserved on re-generation. If the source is chosen to be C++ code, the source code file also has a constructor, destructor and member functions for syntax and semantic analysis. These functions should not be modified by the user. Additional documentation is in the APG library source code as template comments preceding each function. The seven sample applications should give sufficient examples of usage.


6 Sample Applications

Each of the seven sample applications is in its own directory in the release directory .\Apg5_0\Samples. Each directory has one or more "*.bnf" files that define the necessary grammars and a "CommandLine.bat" file with a script for building the application parser files. Each directory also has a "*.dsp" file defining a Microsoft Visual C++ 6.0 project. Sections 6.1 and 6.2 are simple demonstrations. They are hopefully a sufficient introduction to the use of APG 5.0 for the experienced programmer. Sections 6.3 — 6.7 examine the protocols for Internet media sessions and will hopefully serve as a starting point for application development.

6.1 Non-Context Free Grammars

The classic text book example of a non-context-free grammar is the set of strings of the form a^n b^n c^n. The SABNF grammar is

S = &(A !"b") *"a" B
A = "a" [A] "b"
B = "b" [B] "c"

A recognizes the context-free string a^n b^n, and B recognizes b^n c^n. The first term, &(A !"b"), recognizes a^n b^n but does not consume it. Without the !"b" term, the grammar would also recognize a^n b^m c^m, m > n. The term *"a" then consumes all of a^n. Finally, B recognizes b^n c^n and consumes it.
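The reasoning above can be made concrete with a small hand-coded recognizer for this grammar (an illustration only; an APG parser implements the operators formally). Each rule function returns the number of characters matched starting at lPos, or -1 on failure:

```c
#include <string.h>

static long lRuleA(const char *cp, long lPos, long lLen) { /* A = "a" [A] "b" */
    long lSub;
    if (lPos >= lLen || cp[lPos] != 'a') return -1;
    lSub = lRuleA(cp, lPos + 1, lLen);
    if (lSub < 0) lSub = 0; /* [A] is optional */
    if (lPos + 1 + lSub >= lLen || cp[lPos + 1 + lSub] != 'b') return -1;
    return lSub + 2;
}
static long lRuleB(const char *cp, long lPos, long lLen) { /* B = "b" [B] "c" */
    long lSub;
    if (lPos >= lLen || cp[lPos] != 'b') return -1;
    lSub = lRuleB(cp, lPos + 1, lLen);
    if (lSub < 0) lSub = 0;
    if (lPos + 1 + lSub >= lLen || cp[lPos + 1 + lSub] != 'c') return -1;
    return lSub + 2;
}
int iMatchS(const char *cp) { /* S = &(A !"b") *"a" B */
    long lLen = (long)strlen(cp), lPos = 0, lN;
    lN = lRuleA(cp, 0, lLen);                 /* the & predicate: match A ...   */
    if (lN < 0) return 0;
    if (lN < lLen && cp[lN] == 'b') return 0; /* ... and require !"b" after it  */
    while (lPos < lLen && cp[lPos] == 'a') lPos++; /* *"a" consumes the a's     */
    lN = lRuleB(cp, lPos, lLen);              /* B recognizes and consumes b^n c^n */
    return (lN >= 0 && lPos + lN == lLen) ? 1 : 0;
}
```

Note that the predicate in iMatchS matches A without advancing lPos, exactly mirroring the non-consuming behavior of the & operator.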

6.2 C++ Parser Output

This is a simple example of how to generate and use the C++ form of the parser generator's output files.

6.3 IPv6 Internet Addresses

Each of the four protocols, URI, SDP, SIP and MEGACO, includes a grammar for the IPv6 Internet address. The grammar, though relatively small and simple, has some kinks that were a challenge to overcome with the "prioritized-choice" disambiguation rule. It is a grammar that is both significant and small enough to allow some informative experimentation and comparison with hand coding. Therefore, the published grammar (RFC 2373) has been re-written with a number of objectives:

  • correct published grammar errors
  • prioritized-choice disambiguation
  • capture all forms, including prefixes, compression and IPv4 addresses in the syntax analysis
  • capture all phrases in the syntax analysis, including validation of integer ranges
  • facilitate "semantic" look ahead — a semantic version of syntactic predicates

The example verifies the accuracy of the syntax-only parsing, and compares timing with and without PPTs, with semantic look ahead and hand coding. There are many reasons to prefer parser-generated parsers to hand-coded ones, but here is a powerful reminder that speed weighs heavily in favor of hand coding.

This application does four tests. The specific tests can be controlled from the command line. An empty command line runs all four tests by default. Otherwise, the numbers 1, 2, 3 or 4 will run only the numbered test.

Test 1: This is a parsing accuracy test. It will parse all of the addresses in the file "AddressesWithErrors.txt." This file contains a number of correct and incorrect addresses. They are parsed with four different parsers to verify the accuracy of the syntax analysis.

Test 2: This test compares parsing statistics with and without semantic look ahead. For this test, Apg.lib must be built with _APG_CFG_DEBUG defined.

Test 3: This is a timing test comparing parsing times among four different parsers. For this test, Apg.lib should be built with _APG_CFG_DEBUG undefined; otherwise it can take a long time to run.

Test 4: This tests parsing with a hand-coded parser.

6.4 Uniform Resource Identifiers (URI)

The Uniform Resource Identifier (URI) is a fundamental part of most Internet protocols, and it is worthwhile to have a well-defined and tested grammar and parser for it rather than constantly reproducing somewhat different definitions as sub-grammars.

6.5 Session Description Protocol (SDP)

The Session Description Protocol (SDP, RFC 4566) provides a standard representation of media details when initiating multi-media teleconferences, voice-over-IP calls, streaming video and other sessions. As such, an SDP parser is a required component of many other Internet protocol parsers. In this example, a prioritized-choice grammar is given, an APG parser is constructed and eight different examples of its use are illustrated. Each example is controlled from the command line with a single integer 1-8 as the only command line argument.

6.5.1 Syntax Accuracy

In addition to prioritized-choice, a number of other modifications and simplifications have been made to the published grammar to facilitate other parsing features — in particular, "lazy" parsing and sub-field parsing during semantic analysis. Therefore, it is important to establish that the grammar and APG parser perform syntax analysis correctly. A representative sample of syntactically correct and incorrect SDP announcements are parsed in full — that is, the full announcement is parsed with the full grammar — to demonstrate the accuracy of the grammar and parser.

6.5.2 Lazy Parsing

Applications often need only a few specific sub-fields of the full announcement. Lazy parsing is a technique for parsing only those lines and fields of interest, saving the unneeded effort of a full announcement parse. To this end, a simple, fast, hand-coded parser has been written which does nothing other than identify the location and type of the announcement lines. A specific individual line can then be parsed separately by selecting the rule for that line as the start rule and parsing only the text of the line. This technique is employed in the Session, Origin and Connection tests.

6.5.3 The Catalog Component

The Catalog Component is a new feature of APG version 5.0. Its purpose is two-fold. First, to enable a horizontal traversal of the AST. That is, if we are interested only in the input string phrase associated with a rule that is deep in the AST, we can proceed directly to it without having to start at the tree root and traverse down to it. Second, to identify phrases and sub-phrases by name without the need for an elaborate and grammar-specific series of hand-coded structures. Since many rule names define lists of phrases rather than a single phrase, the catalog defines iterators over the lists, even if there is only one phrase in a list. The Origin, Media, Connection and URI tests demonstrate the use of the Catalog Component. (For more discussion of the Catalog Component, see Section 6.6 below.)

6.5.4 Parsing sub-fields during AST-directed translation

Some rule names define phrases that are relatively complex in their own right; the IP address and the URI are two such examples. Therefore, it is useful to be able to use well-defined, reusable parsers for these fields. The technique is to parse out these phrases as simple text fields and only perform a detailed parse if and when needed during semantic analysis, that is, during the AST-directed translation of the announcement. The Connection and URI tests demonstrate this technique for the IP address and URI phrases.

6.5.5 Timing comparisons of parsers with and without PPTs

The Timing test does a direct comparison of full-announcement parsing between APG parses with and without PPTs. For SDP announcements, PPTs improve the parsing speed by a factor of nearly 3.

6.5.6 Node statistics comparisons of parsers with and without PPTs

The Statistics test must be run with a parser and APG library built with the _APG_CFG_DEBUG macro defined. This is simply an informational test to see detailed comparisons of visits to the various types of syntax tree nodes for APG parsers with and without PPTs.

6.6 Session Initiation Protocol (SIP)

"There are many applications of the Internet that require the creation and management of a session, where a session is considered an exchange of data between an association of participants" [RFC 3261]. Over the past 5-6 years the Session Initiation Protocol (SIP, RFC 3261) has emerged as the protocol of choice for voice, video, instant messaging and data in many industries. This example, therefore, examines the parsing of SIP messages in some detail. In addition to simply demonstrating the use of APG, several other aspects of the SIP problem are examined.

Although I don't consider myself to be a SIP expert by any means, I have had the opportunity to work on a couple of different commercial SIP applications. One problem in particular that stood out for me is that they have all had large files of tediously-written structures, one for each SIP message header and with each structure containing SIP-specific members named and typed for the bits and pieces of data that each contains. Complicating matters further is that many of the headers, as well as their bits and pieces, can be present as lists of zero or more occurrences. The APG Catalog Component was developed specifically to address this problem. The catalog organizes the phrases associated with any rule name as iterators over the list of occurrences and provides pointers to the phrase's parent as well as an iterator over its immediate children. This allows access to each message header in a direct fashion with access to each bit and piece of header information.

Although the SIP problem was the motivator for the Catalog Component, it is a protocol-independent tool and can be used to navigate the AST generated by a parser for any grammar. The highlights of this sample application are:

  • PPTs can speed the parsing of SIP messages by a factor of more than 4
  • "lazy" parsing can speed the parsing of a SIP message by a factor of 1.4, even when all message headers are parsed in their entirety
  • APG and its unambiguous, prioritized-choice grammar correctly parse all of the syntactically-correct messages from the SIP "torture tests" [RFC 4475]
  • All occurrences of all bits and pieces of all message headers are easily located, identified and displayed for each SIP message using the Catalog Component and no specialized structures
  • No semantic analysis and no AST-directed translation were required (the AST is, however, required for construction of the Catalog Component)
  • The size of the full SIP test application, including three different SIP parsers, is 292KB

This application does five different tests, controlled with a single integer 0-4 on the command line.

Test 0: not really a test; runs a function which translates the torture test examples from the RFC 4475 format to a binary file of actual SIP messages.

Test 1: tests the pre-parser which prepares a message for lazy parsing.

Test 2: tests for syntax analysis accuracy when parsing messages in full.

Test 3: tests for syntax analysis accuracy when lazy parsing individual message header values.

Test 4: does a timing comparison of the various parsing methods. Should be run with a build of Apg.lib done with _APG_CFG_DEBUG undefined; otherwise, it can take a long time.

6.7 Media Gateway Control Protocol (MEGACO)

Many of the e-mails I have received have been from developers around the world needing a MEGACO[8] parser. In fact, APG-generated MEGACO parsers have been used in a number of commercial applications. This example simply provides a prioritized-choice MEGACO grammar and an APG-generated parser and gives a timing comparison between standard and PPT versions. Use of PPTs speeds the parsing by a factor of approximately 2.7. This application does two tests, controlled with integers 1 or 2 on the command line.

Test 1: Does a timing comparison of parsers with and without PPTs. Should be run with a build of Apg.lib done with _APG_CFG_DEBUG undefined.

Test 2: Compares parsing statistics of parsers with and without PPTs. Must be run with a build of Apg.lib done with _APG_CFG_DEBUG defined.


7 References

1.  Alfred V. Aho, Ravi Sethi and Jeffrey D. Ullman, 
    "Compilers: Principles, Techniques, and Tools",
    Addison-Wesley, 1986 and the 2nd Edition, 2007, by 
    Alfred V. Aho, Ravi Sethi, Monica S. Lam and Jeffrey D. Ullman
2.  John E. Hopcroft and Jeffrey D. Ullman, 
    "Introduction to Automata Theory, Languages and Computation", 
    Addison-Wesley, 1979
3.  Bryan Ford, 
    "Parsing Expression Grammars: 
    A Recognition-Based Syntactic Foundation", 
    Proceedings 31st ACM Symposium on 
    Principles of Programming Languages, 
    pp. 111-122, 2004
4.  RFC 4234, The Internet Engineering Task Force (IETF) 
    Requests for Comment:4234, 
    See the IETF repository page, http://www.ietf.org/rfc.html
5.  ISO/IEC 14977 : 1996(E)
6.  Terence J. Parr and Russell W. Quong, 
    "Adding semantic and syntactic predicates 
    to LL(k) – pred-LL(k)", 
    In Proceedings of the International Conference on 
    Compiler Construction, 
    Edinburgh, Scotland, April 1994 
    (This is generally cited as the first use of 
    syntactic predicates. Thanks to the anonymous 
    referees of a manuscript I wrote for pointing this 
    relationship out to me.)
7.  Bryan Ford, 
    "Packrat parsing: Simple, powerful, lazy, linear time", 
    in Proceedings  of the 2002 International Conference on 
    Functional Programming,
    Oct. 2002.
8.  ITU-T Recommendation H.248.1, version 3, Sept. 2005.
9.  George H. Roberts, 
    "Recursive Ascent: An LR Analog to Recursive Descent", 
    SIGPLAN Notices, Vol. 23, No. 8, pp. 23, 1988
10. Larry Morell and David Middleton, 
    "Recursive-ascent parsing", 
    Journal of Computing Sciences in Colleges, 
    Vol. 18, No. 6, pp. 186, 2003

8 Appendix A ABNF for SABNF


8.1 Lexer Grammar

; Pre-processing grammar - tokenize the RFC 4234 grammar file 
; - symbol alphabet is ASCII characters 9, 10, 13 & 32-127 
; Deviations from RFC 4234 
; 1) grammar file must be null terminated (last character must be 0 (zero)) 
; 2) Lines may end in CR, CRLF or null if last line 
; 3) Rule definitions must be flush left 
;    - start in first character of new line 
; 4) Horizontal tabs are assigned a tab value of one 
;    - they are converted to a single space 
; 5) A line continuation is treated as a single space 
;    - it may form a concatenation operator 
; Prior to parsing the ASCII grammar file must have the following preparation 
; 1) All characters verified as valid grammar characters. 
;    - only CR(%d13), LF(%d10), tab(%d9) and printing characters (%d32-127) 
; 2) All horizontal tabs (%d9) converted to spaces (%d32). 
; 3) The last line must end in CR or CRLF. 
;    - LF added if not present 
; NOTE: using the prioritized-choice disambiguation rule 
;       - the ordering of the alternatives is very important 
File          = *Line 
Line          = Blank     / 
                Rule      / 
                LineError 
Blank         = *%d32 [comment] BlankLineEnd 
LineError     = *(%d32-127) LineEnd 
Rule          = (NameDef/NameDefError) 
                 *wsp (DefAlt/Def/DefError) 
                 *wsp *Token *wsp LineEnd 
NameDef       = alpha *(alpha/%d48-57/%d45) ; -> TOK-NAMEDEF 
DefAlt        = %d61.47                  ; "=/"          ; -> TOK-DEFALT 
Def           = %d61                     ; "="           ; -> TOK-DEF 
NameDefError  = 1*%d33-127               ; any but space ; -> TOK-ERROR 
DefError      = 1*%d33-127               ; any but space ; -> TOK-ERROR 
OpAlt         = *wsp %d47 *wsp           ; "/"           ; -> TOK-ALT 
OpNot         = %d33                     ; "!"           ; -> TOK-NOT 
OpAnd         = %d38                     ; "&"           ; -> TOK-AND 
OpRparen      = *wsp %d41                ; ")"     ; -> TOK-RPAREN 
OpRoption     = *wsp %d93                ; "]"     ; -> TOK-ROPTION 
OpCat         = 1*wsp                    ; concatenation ; -> TOK-CAT 
OpLparen      = %d40 *wsp                ; "("     ; -> TOK-LPAREN 
OpLoption     = %d91 *wsp                ; "["     ; -> TOK-LOPTION 
OpRep         = %d42                     ; "*"           ; -> TOK-REP 
OpNumber      = 1*%d48-57                ; -> TOK-NUM 
dRange        = OpDec dmin %d45 dmax     ; %d0-9         ; -> TOK-TRG 
dString       = OpDec dnum 1*(%d46 dnum) ; %d0.9         ; -> TOK-TBS 
dChar         = OpDec dmin               ; %d0-9         ; -> TOK-TRG 
xRange        = OpHex xmin %d45 xmax     ; %x0-F         ; -> TOK-TRG 
xString       = OpHex xnum 1*(%d46 xnum) ; %x0.F         ; -> TOK-TBS 
xChar         = OpHex xmin               ; %x0-F         ; -> TOK-TRG 
bRange        = OpBin bmin %d45 bmax     ; %b0-1         ; -> TOK-TRG 
bString       = OpBin bnum 1*(%d46 bnum) ; %b0.1         ; -> TOK-TBS 
bChar         = OpBin bmin               ; %b0-1         ; -> TOK-TRG 
Name          = alpha *(alpha/%d48-57/%d45) ; -> TOK-NAME 
OpTls         = %d34 *(%d32-33/%d35-127) %d34 ; "text"  ; -> TOK-TLS 
ProsVal       = %d60 *(%d32-61/%d63-127) %d62 ; <text>  ; -> TOK-ERROR 
TokenError    = 1*%d33-127                    ; any but space ; -> TOK-ERROR 
Token         = OpAlt     / 
                OpAnd     / 
                OpNot     / 
                OpRparen  / 
                OpRoption / 
                OpCat     / 
                OpLparen  / 
                OpLoption / 
                OpRep     / 
                OpNumber  / 
                dRange    / 
                dString   / 
                dChar     / 
                xRange    / 
                xString   / 
                xChar     / 
                bRange    / 
                bString   / 
                bChar     / 
                OpTls     / 
                Name      / 
                ProsVal    / 
                TokenError 
OpDec         = %d37 (%d68/%d100)             ; %d 
OpHex         = %d37 (%d88/%d120)             ; %h 
OpBin         = %d37 (%d66/%d98)              ; %b 
dmin          = 1*(%d48-57) 
dmax          = 1*(%d48-57) 
dnum          = 1*(%d48-57) 
bmin          = 1*%d48-49 
bmax          = 1*%d48-49 
bnum          = 1*%d48-49 
xmin          = 1*(%d48-57 / %d65-70 / %d97-102) 
xmax          = 1*(%d48-57 / %d65-70 / %d97-102) 
xnum          = 1*(%d48-57 / %d65-70 / %d97-102) 
alpha         = %d97-122/%d65-90 ; A-Z or a-z 
wsp           = %d32       / 
                comment    / 
                LineContin 
comment       = %d59 *(%d32-127) 
LineContin    = (%d13.10/%d13) 1*%d32 
LineEnd       = %d13.10/%d13                  ; -> TOK-LINEEND 
BlankLineEnd  = %d13.10/%d13 


8.2 Parser Grammar

; ABNF for SABNF - generate v5.0 APG opcodes 
; symbol alphabet is SABNF tokens generated by lexer (pre-parser) 
File          = *(Rule/RuleError) 
                 Alternation TOK-LINEEND 
RuleError     = *(%d1-20) TOK-LINEEND 
Alternation   = Concatenation *(TOK-ALT Concatenation) 
Concatenation = Repetition *(TOK-CAT Repetition) 
Repetition    = Rep       / 
                Repeat    / 
                Predicate / 
                NoRep 
Rep           = dmin [max-reps] Element / 
                TOK-REP dmax Element 
Repeat        = TOK-REP Element 
NoRep         = Element 
Predicate     = (TOK-AND / TOK-NOT) Element 
dmin          = TOK-NUM 
dmax          = TOK-NUM 
max-reps      = TOK-REP [dmax] 
Element       = TOK-TRG / 
                TOK-TLS / 
                TOK-TBS / 
                TOK-RNM / 
                Group   / 
                Option 
Group         = TOK-LPAREN  Alternation TOK-RPAREN 
Option        = TOK-LOPTION Alternation TOK-ROPTION 
TOK-NUM     = %d1 
TOK-DEF     = %d3 
TOK-DEFALT  = %d4 
TOK-LPAREN  = %d5 
TOK-RPAREN  = %d6 
TOK-MIN     = %d9 
TOK-MAX     = %d10 
TOK-RNM     = %d11 
TOK-ALT     = %d12 
TOK-CAT     = %d13 
TOK-REP     = %d14 
TOK-AND     = %d15 
TOK-NOT     = %d16 
TOK-TRG     = %d17 
TOK-TBS     = %d18 
TOK-TLS     = %d19 
TOK-ERROR   = %d20 
