headerphoto

APG - An ABNF Parser Generator

Version 6.3 Description
(Released June 29, 2012)

Summary

APG Version 6.3 is a complete re-write of Version 6.1. Its primary new features are:

  • 64-bit operating systems.
  • Variable-width alphabet character codes - 8, 16, 32 and 64 bits wide.
  • User-Defined Terminals (UDTs).
  • Additional profiler optimizations.
  • Improved attributes algorithm.

The advantages of these are, briefly:

  • Version 6.3 parsers can be generated and run on both 32- and 64-bit operating systems.
  • Previous versions of APG limited the alphabet character codes to 8 bits. With the new, expanded widths it is now possible, for example, to parse UNICODE characters in their simplest, non-encoded UTF-32 form.
  • UDTs are discussed in detail below.
  • Careful profiling and the indicated bottle neck optimizations alone increased the parsing speed by approximately 25%.
  • The improved attributes algorithm cleaned up some time consuming inefficiences of the previous version. For example, the attributes computation for the MEGACO grammar could take up to 30 minutes. That's not a misprint - minutes. The new algorithm does it in milliseconds.

To understand the advantages and disadvantages of UDTs it is necessary to understand the role of the UDT operator and the UDT callback functions in APG. These will be discussed in the next two sections followed by a discussion of the related rule callback functions. Since rule and UDT callback functions are what give APG most of its power, a good understanding of both is necessary.

The discussions below will be brief. It is assume that the reader has read reference [1] at a minimum. And preferably also ref. [2] and or ref. [3] or equivalent and ref. [4].

[top]

1.0 The Parsing Process

For context, let me first give a decidedly non-mathematical, simplistic view of recursive-descent parsing. Grammars define the formal language sentences and the characters and phrases that the sentences are made up from. For example: SENTENCE = (RED / GREEN) (PAPER / LEAVES)
RED = "red "
GREEN = "green "
PAPER = "paper"
LEAVES = "leaves"

This grammar defines:
alphabet: {" ", "a", "d", "e", "g", "l", "n", "p", "r", "s", "v"}
or in ASCII alphabet character codes (lower and upper case)
{32,65,68,69,71,76,78,80,82,83,86,97,100,101,103,108,110,112,114,115,118}
phrases: {"red ", "green ", "paper", "leaves"}
sentences: {"red paper", "red leaves", "green paper", "green leaves"}

Grammars can also be thought of as defining a syntax tree made up of terminal and non-terminal nodes, shown in Figure 1. Figure 1.
Figure 1.
A recursive-descent parser makes a "depth-first" traversal of the nodes. That is, beginning at the top, root node, the traveral moves down to the left-most, previously unvisited node. From the terminal nodes the direction is reversed upward to the node's parent. In this way, each non-terminal node is visited at least twice, once from above and once or more from below.

The nodes of the syntax tree can be thought of as parsing operators. That is, the operation at each node is to give the parser instructions on a) what to do and b) where to go next. For example, when approached from above each non-terminal node simply instructs the parser to proceed on to the left-most, unvisited node below. When approached from below each non-terminal operator has its own special operation to do. Simply put, ALT nodes select from alternate phrases and CAT nodes concatenate phrases. RNM nodes are simply way stations in the syntax tree which give the user the opportunity to examine and process a few of the intermediate phrases that the parser generates. This important processing is done in APG with rule name callback functions and is discussed in the Rule Name Callbacks section below.

But terminal nodes have much simpler operations. They are only approached from one direction, namely above. What to do? Try to match its pre-specified phrase. Where to go next? Return to the parent. That's it.

In general then, terminal nodes are phrase recognizers and non-terminal nodes are phrase manipulators.

[top]

2.0 User-Defined Terminals (UDTs)

Let's now slightly rewrite the example grammar above.
SENTENCE = (RED / GREEN) (PAPER / u_LEAVES)
RED = "red "
GREEN = "green "
PAPER = "paper"

The ABNF terminal LEAVES has now been replaced with the User-Defined Terminal u_LEAVES. Since the underscore is not a valid character in ABNF, the parser generator has no trouble recognizing it as a UDT(*). And since it is user written it needs no ABNF rule to define it.

The new syntax tree now has a new operator UDT, as shown in Figure 2. Figure 2.
Figure 2.
As the name implies, this is a terminal node. Its only operation is to call a user-written, phrase-recognition function and return the result to the parent node. The User's Guide has details on the prototype for these functions and how to identify them to the Parser.

u_LEAVES has some "safe" uses and some "unsafe" uses. Both are very powerful enhancements to the parser.

A "safe" use would be to write it so that it simply recognizes the phrase "leaves" only is optimized to do it much faster than the ABNF terminal does. This use does not change the language that the grammar defines in any way. It just does it much faster. For example, in the SIP example released with the APG code I've managed to increase the parsing speed by a factor of 10 without changing in any way the language the original ABNF grammar defined.

An "unsafe" use would be, for example, to have it look backwards in the sentence string and if it sees "red" then it only recognizes "faced" and if it sees "green" it only recognizes "cheese". That is, it completely alters the language that the parser recognizes in a context-sensitive way.

A search for discussions of handwritten parsers will turn up many, many criticisms of them. The primary complaint is that there is generally no way to know what, exactly, the language is that is being parsed. But handwritten parsers also have two important advantages. The first being that handwritten parsers are usually much faster. The second is that they are capable of parsing non-Context-Free Grammars. For probably both of these reasons the GNU gcc compiler is, in fact, a handwritten, recursive-descent parser[gcc]. And it is for both of these reasons that UDTs have been introduced to APG.

To recap, there are three types of uses envisioned for the UDTs.

  • "safe" - to simply speed up the parsing process.
  • "unsafe" - to incorporate a few non-CFG language requirements into a grammar that is mostly CFG.
  • "totally dangerous" - to simply provide a recursive-descent backbone to a completely handwritten parser. UDTs are normally terminal nodes. However, there is a backdoor in APG to let them move on down the syntax tree. See the User's Guide for the Parser functions vExecuteRule() and vExecuteUdt(). You should really know what you are doing before using them, but you could build a parser for a completely arbitrary language with no ABNF grammar to guide it at all.

(*) UDT names can begin with either the "u_" or "e_" prefix. The difference between them is important. In order for the parser generator to determine the grammar's attributes it is very important for it to know which rules accept empty strings and which don't. Since UDTs are user written, there is no way for the parser generator to know whether a given UDT will or won't accept empty strings. Therefore, the convention has been adopted that only UDTs whose names begin with "e_" are allowed to accept empty strings. Those beginning with "u_" are not allowed to accept empty strings. This convention is enforced by the parser. That is, if a user-written function, u_myfunc() for example, returns a phrase length of zero, the Parser will fail with a terminal error and exit with a call to the currently installed alert handler.

[top]

3.0 Rule Name Callbacks

Rule name callback functions also provide a lot of extra parsing power to APG. The RNM nodes will call a user-written function twice for each node, first on the downward traversal of the node before any parsing of the phrase has been done and then again on the upward traversal after the Parser's phrase recognition work for the node is done. There are several types of actions that rule callback functions can accomplish.

  • downward traversal (pre-parsing of the branch below the node): Administrative work meaning any type of table or data buffer set up or similar data handling that does not affect the parsing of the string. Side calculations that are done parallel to the parsing process.
  • downward traversal: Override the Parser completely. That is, do your own phrase matching and skip the parsing completely. In this case, the rule callback us functioning as a UDT. However, implementation of a UDT in this manner is not obvious and a little clumsy. UDTs were developed to overcome this shortcoming.
  • upward traversal (post-parsing of the branch below): Again, any administrative work that might be convenient to do parallel to the parsing process.
  • upward traversal: Examine the Parsers's phrase matching result and decide whether to accept or reject it. For example, there may be non-CFG requirements that need to be enforced. This might be done by outright rejection, returning no match instead of match. Or you have another chance here for overriding the phrase recognition completely, and returning some phrase other than what the Parser found.

Rule name callback functions are written and identified to the Parser in exactly the same manner as UDTs. See the User's Guide for details. Especially the Demo example.

[top]

4.0 Attributes

One of the first things you learn about recursive-descent parsers is that you can't parse a left-recursive grammar.
Start = Start "s" / "y" The parser will recurse infinitely (or more practically, until a stack overflow exception is issued) before it ever recognizes a single character of a single phrase. Left recursiveness is considered here to be an attribute of the grammar. It is an important attribute to know about, since it is a fatal flaw in the grammar. But there are other, equally important attributes.

In the following two sections below I'll define the attribute types and the recursion types which you will need to understand some of the information displayed by the parser generator. However, if you want to know the nitty gritty of how it is all done you will have to read the code.

[top]

4.1 Attribute Definitions

  • Left Recursiveness (fatal) -
    Start = Start "s" / "y" Fatal as discussed above.
  • Right Recursiveness -
    Start = "s" Start / "y" Not fatal but nice to know.
  • Nested Recursiveness -
    Start = "s" Start "s" / "y" Not fatal but important to know, since it is what distiguishes regular from context-free grammars. If this attribute is false the grammar is regular, if true the grammar is context-free.
  • Cyclic (fatal) -
    Start = Start Valid grammar syntax but defines no phrases.
  • Infinite (fatal)-
    Start = "s" Start This defines only infinite string phrases. Therefore, the parser will fail for every finite input string or sentence.
  • Empty -
    Start = "s" Start / "" The language described by this grammar includes empty strings. Not fatal, but knowing whether a rule's phrases can be empty or not is critical for determining the recursive attributes.

Note that it is possible for a rule to have more than one true attribute at the same time. For example, the infinite and empty rules above are also right-recursive. The other important thing to know is that these are aggregate attributes. That is, they only mean that an attribute can happen, not that it always does. For example, the empty example grammar can accept the empty string "", but it can also accept the non-empty string "s".

[top]

4.2 Types of Recursion

In the parser generator's displayed results you will see, in addition to the attributes, the rule names listed under several different types. These are:

  • N_RECURSIVE - non-recursive. The rule never refers to itself.
  • R_RECURSIVE - recursive. The rule refers to itself.
  • MR_RECURSIVE - mutually recursive. Sets of rules. Within each set, each rule refers to every other rule in the set including itself.
  • NMR_RECURSIVE - non-recursive but refers to one or more mutually-recursive rules.
  • RMR_RECURSIVE - recursive but also refers to one or more mutually-recursive rule sets of which it is not a member.

The reasons for these distinctions in recursive rule types is that they play an important role in the determination of the attributes. In particular, the mutually recursive sets create a catch-22 situation in that they present the seemingly impossible task of needing the attributes of one member of the set in order to determine the attributes of the others and vice versa. There is a way around this, which I won't go into here. But since the determination of which rules fall into which type must be done to determine the attributes, I've included an option in the parser generator to display them. If you are not interested in attributes do not use the /dattributes command line option.

[top]

5.0 User's Guide

The User's Guide is an HTML document that can be accessed here. [User's Guide]
If you will be using it often you should bookmark this link.

The User's Guide document has been generated automatically from specially formatted comments in the code itself using the doxygen document generator for C/C++ code.

Clicking on the "Directories" link in the User's Guide header will display links to the various APG sections. These are briefly described here.

  • ApgLib: Documentation of the Parsing library and the UDT library. Note that due the the fact that the alphabet character width must be determined at compile time, ApgLib is compiled separately with each project. A static ApgLib library is not generated.
  • ApgUtilities: Documentation of the APG Utilities library. The ApgLib library contains no references to the C Run-Time Library I/O functions (so long as none of _APG_CFG_DEBUG, _APG_CFG_TRACE or _APG_CFG_STATS are defined.) The APG utilities are primarily display functions but also contains some basic file I/O helpers and a timer component used in the timing tests.
  • Generator: Documentation for using the parser generator to generate C or C++ parsers from your SABNF grammar.

The remaining directories are for the examples of usage. You can copy or examine the code in these examples to get yourself up and running with your own parser.

  • Demo: Simple demonstrations of setting up the basic operations, adding your own rule and UDT callback functions and using the UDT functions in the UDT library.
  • CppDemo: How to generate and use a C++ parser.
  • MEGACO: Many APG users use it in their MEGACO parsers. This is an example of how to set up a MEGACO C++ parser.
  • SIP: One of my goals for UDTs was to gain a 10-fold parsing speed increase for a commonly used grammar of some significance. The Session Initiation Protocol (SIP, RFC 3261) easily meets that standard and I have been successful in achieving the 10-fold increase. (10-fold in the debug version and close to 10 in the release, compiler-optimized version.)
  • WideCharacters: A demonstration of using an alphabet character width greater than 8 bits. This example parses UNICODE in its unencoded format UTF-32. That is, the alphabet characters codes are the full 32-bit UNICODE values.

[top]

5.1 Setting Up and Using the APG Distribution

Files:
  APG Verson 6.3 is distributed in four different compression formats:
  apg-6.3.tar.gz, 
  apg-6.3.zip,
  apg-6.3.tar.bz2 and
  apg-6.3.tar.lzma.

  Decompression of any of these will create the file structure:
  ./apg-6.3/ApgLib
  ./apg-6.3/ApgUtilities
  ./apg-6.3/CppDemo
  ./apg-6.3/Demo
  ./apg-6.3/Generator
  ./apg-6.3/MEGACO
  ./apg-6.3/SIP
  ./apg-6.3/WideCharacters

  ./ApgLib and ./ApgUtilies are library files. They are not built as static libraries
  as in previous versions due to the character width being a compile-time choice. They
  are instead to be compiled separately with each project.

  ./Generator has the source code files for the parser Generator. It builds the executable "apg".

  The remaining directories have source code for the accompanying usage examples. They can be
  executed, examined, modified and used as examples of how to set up your project and
  use the ApgLib API.

  ./Generator and each example contain *.bnf files for all of the SABNF grammars used
  and the bash script files "runtests" and "generateGrammars".
  "runtests" will execute the example after it has been built.
  "generateGrammars" will regenerate all of the parsers in the example.
  For operating systems other than Linux these scripts will have to be translated
  into the scripting language of your choice.

  ./CppDemo and ./MEGACO are examples of how to use the C++ parsers. All other examples use C.

  ./Demo is a "Hello World" example illustrating most of the API features including the new UDTs.

  ./WideCharacters is an example of using 32-bit characters.

  ./SIP is an extensive study of the SIP (RFC 3261) grammar. It successfully attempts
  to achieve a 10-fold improvement in parsing speed through the use of UDTs for the simpler
  CFG phrase constructs such as white space and alphanum.

Installation:

  1. Linux:
     See the file INSTALL for general Linux installation instructions.

     Specifically, use the commands:
     tar -xzf apg-6.3.tar.gz
     cd apg-6.3
     ./configure
     make
     make install

     The installation will install
     /usr/local/bin/apg

     apg is the parser generator. For usage, execute "apg --help" or
     see the documentation at www.coasttocoastresearch.com.

  2. Code::Blocks:
     The file ./apg-6.3/apg.workspace is a Code::Blocks workspace file for the entire system.
     Code::Blocks is Open Source (http://www.codeblocks.org/) and runs interchangeably
     on 32- and 64-bit Linux as well as Windows.

     run Code::Blocks
     open the workspace file ./apg-6.3/apg.workspace
     from the "Build" menu select "Build workspace"

  3. Other:
     Use the IDE or build process of your choice.
    

[top]

Appendix A: References

[1] Alfred V. Aho, Ravi Sethi, Monica S. Lam and Jeffrey D. Ullman, 
    "Compilers: Principles, Techniques, and Tools",
    2nd Edition, Addison-Wesley, 2007
    
[2] John E. Hopcroft and Jeffrey D. Ullman, 
    "Introduction to Automata Theory, Languages and Computation", 
    Addison-Wesley, 1979
    
[3] R Gregory Taylor, 
    "Models of Computation and Formal Languages", 
    Oxford University Press, 1998
    
[4] Bryan Ford, 
    "Parsing Expression Grammars: 
    A Recognition-Based Syntactic Foundation", 
    Proceedings 31st ACM Symposium on 
    Principles of Programming Languages, 
    pp. 111-122, 2004

[top]

Appendix B: The ABNF for SABNF Grammar

; ABNF for APG 6.3 SABNF
; Symbol alphabet is ASCII (code points 9, 10, 13, 32-126).
; SABNF variations from RFC 4234:
;  restrictions:
;    incremental alternates, rule2 =/ rule1, are not allowed
;    prose values, <prose value>, are not allowed
;    only the first successful alternate, left to right, is accepted (the "prioritized choice" rule)
;    only the longest possible match to repetitions is accepted
;
;  relaxations: 
;    forgiving line endings - CRLF / LF / CR
;
;  super set additions:
;    syntactic predicate operators, & and !, have been added
;    User-Defined Terminals of the form, u_rule and e_rule, have been added
;
File          = *(BlankLine/Rule/RuleError)
BlankLine     = LineEnd / (%d59/sp) *(sp / %d33-126) LineEnd
Rule          = RuleName owsp (IncAlt/DefinedAs) owsp Alternation owsp LineEnd
RuleError     = *(sp / %d33-126 / LineContinue) LineEnd
Alternation   = MultipleAlt / SingleAlt
SingleAlt     = Concatenation
MultipleAlt   = Concatenation 1*(AltOp Concatenation)
Concatenation = owsp (MultipleCat / SingleCat)
SingleCat     = Repetition
MultipleCat   = Repetition 1*(CatOp Repetition)
Repetition    = Rep / And / Not / Element
Rep           = RepOp Element
And           = AndOp Element
Not           = NotOp Element
Element       = TrgOp   /
                TbsOp   /
                TlsOp   /
                UdtOp  /
                RnmOp   /
                Group   /
                Option  /
                ProsVal
Group         = Lparen  Alternation Rparen
Option        = Loption Alternation Roption
ProsVal       = %d60 *(%d32-61/%d63-126) %d62
DefinedAs     = %d61
IncAlt        = %d61.47

RuleName      = alphanum
RnmOp         = alphanum
UdtOp         = udt-empty / udt-non-empty
udt-non-empty = %d117.95 alphanum
udt-empty     = %d101.95 alphanum
RepOp         = (rep-min StarOp rep-max) /
                (rep-min StarOp)         /
                (StarOp rep-max)         /
                 StarOp                  /
                 rep-min-max
AltOp         = owsp %d47
CatOp         = wsp
StarOp        = %d42
AndOp         = %d38
NotOp         = %d33
TrgOp         = %d37 ((Dec dmin %d45 dmax) / (Hex xmin %d45 xmax) / (Bin bmin %d45 bmax))
TbsOp         = %d37 ((Dec dString *(%d46 dString)) / (Hex xString *(%d46 xString)) / (Bin bString *(%d46 bString)))
TlsOp         = Ltls *TlsChar Rtls
TlsChar       = %d32-33/%d35-126 ; tab not allowed

Lparen        = %d40 owsp
Rparen        = owsp %d41
Loption       = %d91 owsp
Roption       = owsp %d93
rep-min       = rep-num
rep-min-max   = rep-num
rep-max       = rep-num
rep-num       = 1*(%d48-57)
Ltls          = %d34
Rtls          = %d34
dString       = dnum
xString       = xnum
bString       = bnum
Dec           = (%d68/%d100)
Hex           = (%d88/%d120)
Bin           = (%d66/%d98)
dmin          = dnum
dmax          = dnum
bmin          = bnum
bmax          = bnum
xmin          = xnum
xmax          = xnum
dnum          = 1*(%d48-57)
bnum          = 1*%d48-49
xnum          = 1*(%d48-57 / %d65-70 / %d97-102)

; Basics
alphanum      = (%d97-122/%d65-90) *(%d97-122/%d65-90/%d48-57/%d45)
owsp          = *space
wsp           = 1*space
sp            = %d9 / %d32
space         = sp      /
                comment /
                LineContinue
comment       = %d59 *(sp / %d33-126)
LineEnd       = %d13.10 / %d10 / %d13
LineContinue  = (%d13.10 / %d10 / %d13) %d32

[top]