Interactive APG Help
The ABNF Grammar
The Input String
Configuring the Generator
Configuring the Parser
Understanding the Output
Tracing the Parser
The Traced Records
The Parser Filter
The Display Filter
The Interactive APG Examples
Appendix - ABNF for SABNF Grammar
Interactive APG was designed to help write and debug ABNF grammars and the language sentences they define. APG grammars are defined here. If you are new to recursive-descent parsing a good place to start is this Wikipedia synopsis and the references it mentions. You can learn more about how APG works here.
The layout of the Interactive APG page is intended be intuitive enough that anyone experienced with using parser generators should be able to look at a couple of examples and be off and running. But here are a few notes for new and experienced users alike.
- Getting Started: The tutorials are intended to give a quick introduction to the basic control of Interactive APG. The User's Guide gives a complete, though rather dry, description of all of the windows, buttons and configuration controls.
The Interactive APG page has three display windows. The first is a text area for entering the SABNF grammar you want to check. The second is a text area for entering the language sentence you want to parse. The third area is a read-only window for displaying results. Above each of the three windows there is a Full Screen button and above the top two windows are also Select and Help buttons.
Full Screen: Pops up a full–screen display of what is in the window. It is a static snapshot of the window's contents. It is not dynamically updated if the window's contents change.
Select: Selects all of the text in the window.
Help: Pops up this help page.
The top window, labeled "ABNF Grammar" is for grammar checking. Simply type or paste your grammar into this window and click the Generate button.
If the grammar syntax is not correct the x indicator will be displayed in red and the "Input String" buttons are disabled. The Message Log will be displayed in the output window below with error messages. The error messages will, in most cases, be sufficient to locate the grammar's syntax errors.
If the grammar syntax is correct the √ indicator will be displayed in green, a parser for the grammar is generated and the buttons above the "Input String" window are enabled.
The second window, labeled "Input String" is for checking language sentences, usually referred to here as simply the input string, against the above grammar. Simply type or paste your string into this window and click the "Parse" button.
If the string is not a valid language sentence the x indicator will be displayed in red and the parser's final state will appear in the output window below. The error messages, if any, will not be very useful in this case. To debug input string errors, see Tracing the Parser below.
If the string is a valid language sentence the √ indicator will be displayed in green.
Every attempt has been made to make the syntax checking and error reporting of the generator sufficient for grammar debugging. Therefore, there should be no need to do any tracing or otherwise configuring of it. However, the generator is itself an APG parser and you can if you like, with the help of the ABNF grammar in the appendix, trace and/or otherwise configure and analyze its output just the same as you would for the parser. To do so, refer to Configuring the Parser below. The only difference is that since input to the generator is always ASCII there is no "Input Mode" option.
Click the "Configuration" button above the "Input String" window and a configuration window will pop up. At the top and bottom of the window are three buttons, Save, Cancel and Reset which will perform the obvious tasks.
Save: Clicking Save will save the selected configuration and apply it to the next click of the Parse button.
Cancel: Clicking Cancel will close the window, ignoring any selection changes that may have been made.
Reset: Clicking Reset will set all window selections to the original default configuration and close the window.
Display Mode: Your choice here will determine the format of the output from the Parsed String, Parsed Phrases and Display Trace output buttons. These are discussed in more detail below.
- auto: The application will examine the first few characters of the string to be displayed and make a best guess (not always accurate) as to which display mode, ascii or hex, to use.
- ascii: The output is assumed to be ASCII characters. Character codes 9, 10 and 13 are displayed as color-coded "TAB", "LF" and "CR" respectively. Character codes 32-126 are displayed as ASCII characters. All other codes are display as color-coded "xAB" where "AB" is the hexidecimal value of the code.
- hex: The output is treated as binary and displayed in rows of 24 hexidecimal values followed by the ASCII interpretation of values 32-126.
Input Mode: Like Display Mode, the selections are "auto", "ascii" and "hex". This mode, however, controls the interpretation of the "Input String" window contents.
- auto: The application will examine the first few characters in the window and make a best guess (not always accurate) as to the most appropriate mode.
- ascii: The input string is assumed to be ASCII.
- hex: The input string is assumed to be formatted 8-bit bytes, or octets.
Unlike the grammar input, which is always ASCII, no assumptions can be made about what the generated parser is expecting. The grammar may describe purely binary strings. However, an HTML textarea can only reliably accept ASCII text. Even worse, ASCII is not very reliable either. In particular, browsers usually normalize the line endings – some to CRLF and some, like Firefox, to simply LF.
Therefore, a special format has been provided to describe binary strings to the parser. The format is simply space-, comma- and/or line break-delimited integers. The delimiters may occur in multiple numbers and may be any mixture of the three. The integers may be in decimal (no leading zeros) or hexidecimal (eg. 0xFF) format or any mixture them.
Writing input strings in this format can be tedious and an external utility is helpful. Writing such a utility in your favorite language is not difficult and a C-language utility, bin2bytecode, is provided in the APG 6.1 release. The UTF-8 example and some of the SIP examples demonstrate the use of this format.
Save Parsed Phrases: Check this box if you want to examine the phrases matched by the rule name operators in the grammar. This will enable the Parsed Phrases button for phrase viewing. If not checked the Parsed Phrases button will remain disabled. Saving parsed phrases is somewhat resource intensive and is off by default.
Save Parser Trace: Tracing the parser's pathway through the syntax tree is the primary means of debugging a grammar or input string. It is also extremely resource intensive and is off by default. When checked, the Display Trace and Display Filter buttons are enabled after the parse has finished. Otherwise, they remain disabled.
The remaining configuration entries are ignored if the Save Parse Trace box is unchecked. When checked they control the tracing process and are discussed in the Tracing the Parser section below.
If you want more information after parsing than just "red light/green light" this is the place to get it. Since the Generator is also a parser (and translator for the ABNF-for-SABNF grammar given in appendix below) its output is has the same format as that of the Parser.
The output window does double duty for both. Initially the control buttons are disabled. After clicking Generate at least some of the control buttons will be enabled. After clicking Parse they will be enabled with a blue, color-coded margin bar. Except as noted, the description of the output window information is the same in both cases.
The final parser state is simply EMPTY, MATCH or NOMATCH with the obvious meanings. This is followed by the number of characters in the string and the number that were successfully matched.
For the MATCH state, these two numbers will be the same. For the NOMATCH state, the number of characters matched will usually be a pretty good estimate of where in the string the failure occurred.
This output option will be selected automatically after each successful parse.
This will give an annotated display of the parser's input. This will be the ABNF grammar in the case of the Generate button and the input string in the case of the Parse button.
In "ascii" display mode, each line will be printed, preceeded by the line number, the character number of the first character in the line and the number of characters in the line. This can be useful in identifying where in the string an error may have occurred.
If the "Save Parsed Phrases" configuration option has been checked, this can be used to view the parsed phrases within the context of the entire string being parsed.
The drop-down box below will list all of the rule names for which phrases were found, followed in parentheses by the number of times the rule name was matched. Select a rule name and click the Parsed Phrases button. The input string will be displayed with the matched phrases highlighted in blue.
In the case of recursive rules, there may be several phrases nested within one another. In this case, the nested phrases are displayed alternatively in light and dark blue.
If a matched phrase is empty, a green epsilon (ε) will be inserted into the string text at the location where the empty phrase was found.
The Parser Statistics is a count of how many times each node type is processed during the traversal through the syntax tree. The rows and columns give a breakdown of the node types and the results of the node visits.
EMPTY: counts the number of times the node was matched empty.
MATCH: counts the number of times the node was matched non-empty.
NOMATCH: counts the number of times the node failed to match any phrase at all.
ALL: the total number of visits.
BACKTRACK: counts the number of time the parser was forced to backtrack. Note that backtracking only occurs at ALT, REP and PRD nodes. For alternations, ALT, this counts the number of failed alternates. Note that even if the grammar is LL(1) backtracking may occur since no look up table for the correct alternate is used. It is always trial-and-error beginning with the left-most alternate. For repetitions, REP, this may mean that the repetition failed completely, but one backtrack is the normal ending after the last matched repetition of the phrase has been found. Syntactic predicates, PRD, always backtrack.
The columns indicate the counts for the individual node types with TOTAL giving the sum of all. The value for the TOTAL/ALL entry is the grand total of all node visits and, as we will see under Tracing the Parser below, is a useful number when tracing. The total number of traced records is twice this number - one record as the node is opened going down the syntax tree and one record as the final state is determined and the node is closed going up the tree.
The second part of the table counts the number of node visits to the rule name, RNM, nodes for each individual rule name. The rule names are ordered descendingly according to the number of node visits. Rule names with a node count of zero (0) are not listed. I will just mention here that this table is invaluable when taking advantage of the option provided in APG version 6.x to handwrite parsers for the individual rule name phrases. In fact, dealing with that problem is the reason for this section of the statistics table. It tells you, at a glance, which rule names would be most advantagous to hand write parsers for.
It is well known that recursive-descent parsers cannot parse left-recursive grammars. They also cannot parse cyclic or infinite grammars. APG will do a sanity check on the grammar and report any problems.
Altogether the grammar is examined for six different attributes.
- left recursion: A rule name operator for the defined rule appears as the left-most, non-empty term of at least one alternate branch of the syntax tree. e.g. S = S "s" / "x"
- nested recursion: A rule name operator for the defined rule appears as an interior term of at least one alternate branch of the syntax tree. Interior meaning that there is at least one non-empty term to both the left and right of the rule name operator. e.g. S = "a" S "b" / "ab"
- right recursion: A rule name operator for the defined rule appears as the right-most, non-empty term of at least one alternate branch of the syntax tree. e.g. S = "s" S / "y"
- empty: At least one branch of the syntax tree will accept an empty string. e.g. S = "x" / *"s"
- infinite: At least one branch of the syntax tree has no terminal alternate. e.g. S = "s" S
- cyclic: At least one branch of the syntax tree is terminated by a single term, that being the rule name operator for the defined rule itself. e.g. S = S
After clicking Generate, you can see the grammar attributes by clicking "Grammar Attributes" in the output section.
The first table is titled "Self-Recursive Attributes". Self-recursive means that the recursiveness of all rules other than the start rule is ignored. Briefly, the syntax tree of each rule definition is separately expanded as if it were the start rule. Each rule name operator appearing in the expansion is expanded only once. For recursive rules, the second time they appear on any tree branch they are treated as special terminals. Rule names other than the start rule are treated as regular terminals. Start rule names appearing for a second time are given cyclic attributes. These attribute are then traced back up the syntax tree branch, being modified at each node with node-type dependent rules. The final rule name attributes are those remaining upon returning to the root node.
An x in a table cell indicates that the rule does not have the attribute. A check mark indicates that it does have the attribute. Attributes errors (left, infinite and cyclic) are checked in red, normal attributes are checked in green. Rules with errors are listed first, followed by all other rules listed alphabetically.
The next table is titled "Start Rule Cumulative Attributes". The start rule is always assumed to be the first rule in the grammar. Even though the start rule itself may be clean, if it refers to a pathological rule in its definition it can't be parsed. The attributes of all rules referenced by the start rule are examined and if any referenced rule has a pathological attribute (left, cyclic, infinite) the start rule is assumed to have that attribute as well.
The last section is titled "Rules not referenced by the start rule". It is not a syntax error to define a rule which is never referenced by the start rule. In a few cases I've even found it useful to do so. However, in most cases unused rules simply take up resources better used somewhere else and should be eliminated. It is quite easy for un-referenced rules to creep in, for example, when trying to patch together a small grammar from a large document. (For example, when collecting all the rules for the e-mail address example from RFC 5322 it was easy to sweep up many rules which are never used by the "addr-spec" start rule.)
During parsing an error message log is kept. If any errors are detected, this output will be selected automatically at the end of the parse. Especially when checking grammar syntax, this is where to look for grammar errors.
The Display Filter button is disabled unless the "Save Parser Trace" option is checked in the configuration before parsing. If enabled it can be used to filter the saved trace lines which will actually be displayed. See the Tracing the Parser section below for further details.
The Display Trace button is disabled unless the "Save Parser Trace" option is checked in the configuration before parsing. If enabled it displays a subset of the trace lines. See the Tracing the Parser section below for further details.
Debugging a grammar, once it is syntactically correct, can at times be very frustrating. Sometimes it is a simple as noting the number of matched characters in the Parsing Statistics and then locating the last matched character in the Parsed String display. A bad character in the string might be immediately obvious.
More often than not the problem is much more subtle than that and we need to know what the parser was actually doing when the error occurred. The parser trace provides a written history of the parser's exact pathway through the syntax tree as the input string is parsed character by character.
Two records of information are generated for each node of the syntax tree visited by the parser – one when the node is opened on the way down the tree and one when the node is closed and the final state determined on the way back up. Even if time and memory resources were not a problem, the shear number of the records – a million or more is not unusual – can make error location like looking for a needle in a hay stack. Filtering the traced records and a good deal of intuition are often required.
Experience shows that,
- the CPU can easily generate many more records than memory can store
- memory can store many more records than the browser can display.
Therefore, a double filtering technique is used to help focus the limited number of displayed records on the error location.
The first filter is configured with the Parser's configuration (the Configure popup window) and controls which records are actually stored in memory.
The second filter is configured with the "Display Filter" popup window and controls which of the stored records are actually displayed.
We will take a look at those configuration pop ups in a moment, but first let's look at the information that is in the traced records. A good way to follow along is to click any example, let's say the Expressions example just to be on the same page.
- Click Generate
- Click Configure next to Parse
- Check "Save Parser Trace"
- under Traced Operators click Check All
- Display Trace
- Full Screen
At the top of the window are two tables. The first is the "PARSER FILTER" table. It summarizes the results of the parser's filtering of the totality of all of the syntax tree node visits.
max saved records: This is the maximum number of records the parse will save in memory. It defaults to 100,000 but is configurable.
unfiltered records: This is the total number of trace records. As mentioned previously, it is always double the number of node visits in the TOTAL/ALL cell of the "Parser Statistics" table. The unfiltered records are numbered consecutively from 0. These record numbers are unchanged by any of the applied filters and are always used to identify the records in the trace display. They will be referred to as the permanent record numbers to distinguish from all other filtered numbering systems.
operator filtered records: The filtering takes place in two stages. The first stage is to filter out selected operator node types. This is the number of remaining records after the operator filter has been applied.
first: This is the permanent record number of the first operator-filtered record.
last: This is the permanent record number of the last operator-filtered record.
line-range filtered records: The second stage of filtering is to save only a selected range of the lines remaining from the operator filtering. This is the number of lines remaining after both operator and line-range filtering has been done. This is the number of the records actually saved.
first: This is the permanent record number of the first record actually saved in memory.
last: This is the permanent record number of the last record record actually saved in memory.
The second table is the "Display Filter" table. It summarizes the results of the display's filtering of the saved records.
max displayed records: This is the maximum number of records which will be displayed. The default is 1000, but is configurable.
saved records: This is the number of records actually saved. It will always be identical to the line-range filtered records from the previous table.
operator filtered records: This is the number of records remaining after filtering on a further-selected set of operator node types.
first: This is the permanent record number of the first operator-filtered record.
last: This is the permanent record number of the last operator-filtered record.
line-range filtered records: This is the number of records actually displayed. It is the number remaining after selecting a specified line-range from the above operator-filtered records.
first: This is the permanent record number of the first record actually displayed.
last: This is the permanent record number of the last record record actually displayed.Finally we come to the records themselves. The column headings are described here and also by scrolling to the end of the output page.
- (a) This is a sequential numbering of the displayed records.
- (b) The permanent record number of this record.
- (c) The permanent record number of the matching record. For open node records (↓) this will be the matching-node closed record number. Similarly for closed node records (↑) this will be the matching-node open record number. If the matching record has been clipped by the record range selection in the filter this will just be a double dash (--).
- (d) This is the offset into the input string of the phrase being examined.
- (e) For a matched phrase this is the number of characters in the phrase. For a failed phrase matching it will be approximately the number of characters matched before the failure was discovered. For open nodes (↓), the number of characters matched is copied from the matching closed node record.
- (f) This is the relative tree depth of the node. This is also indicated by the indentation of the operator type name, but counting spaces when looking for a matching node is tedious and having the number printed as well is handy.
- (g) This identifies the parsing direction and the final state for closed nodes. A down arrow (↓) indicates that the parser is moving down the syntax tree and the node is just being opened. An up arrow (↑) indicates that the parser is moving up the syntax tree. The state is indicated with N for a phrase not matched, M for a matched non-empty phrase and E for a matched empty phrase.
- operator: This indicates the node type - ALT, CAT, REP, PRD, TRG, TLS, TBS or RNM(rule name). The indentation of the node indicates the relative syntax tree depth at which it was found. The indentation has a period every other space to aid in visually scanning down for a matching node.
- phrase: The last column is the phrase itself, or at least the first few characters of it. This will be up to 48 characters for "ascii" display mode. If fewer than 48 characters remain, and end-of-string (¶) character will be displayed. Up to 16 characters are displayed for "hex" display mode. If there are fewer than 16 characters remaining in the string, dots will be used to fill out the rest of the table. A second row is also given to display the ASCII equivalent of values in the range [32, 126]. If the node represents a matched phrase, the matched characters are highlighted in blue.
The parser filter determines which traced records are actually saved. Here we will describe how it's configured. The filtering is done in two stages. In the first stage specifies that only records for those node visits corresponding to selected operator types are to be saved. The second stage specifies that only a consecutive range of those remaining after the first stage are to be saved.
Record Range Filter: After operator-type filtering there are often far more records remaining than the browser can save in memory. This section is used to select only a range of those remaining records.
max record: This puts an upper limit on the number of lines that will be saved. Regardless of all other options selected, the saving of trace records stops after this number has been saved. The number of records the browser can save depends, of course, on the browser and the machine it is running on. 100,000 seems to work well for a modestly configured, reasonably new machine. If the script hangs during parsing you may need to reduce this limit.
first N records: Select this radio button to save the first lines of the operator-filtered subset. The "first record" box can be used to specify the first record in the range. The number in this box refers to the permanent record number. The "number of records" is the number of records following the first, specified above, to save.
last N records: Select this radio button to save the last of the operator-filtered subset. The "number of records" is the number of records previous to the last to save. This option is selected by default as it more likely than not that parsing errors will halt the parser and hence be found in the last few records.
record range: Select this radio button to save a user-specified range of records. The "first record" specifies the first permanent record number of the saved range. The "last record" specifies the last permanent record number of the saved range.
Operator Filter: This section is used to specify which of the operator node types to save records for.
Node Type Operators: The syntax tree has eight types of nodes as discussed in the APG documentation. The rule name operator, RNM, plays a significant special role and is configured in the next section. The remaining seven are listed here. Trace records are saved only for those whose boxes are checked. The records for these nodes are often uninteresting and are unchecked by default. "Check All" and "Uncheck All" check or clear the boxes for all seven of the operators in this section.
Rule Name Operators: The syntax tree will have RNM nodes for every rule name referenced by the start rule of the grammar. Often the behavior of the parser for only one or a few rule names is of interest and the uninteresting trace records for the remainder can be excluded here. This section lists all rule names alphabetically. Records will be saved only for those rule names with checked boxes. They are all checked by default. "Check All" and "Uncheck All" check or clear boxes in this section.
The Display Filter configuration is similar to the Parser Configuration in many respects. However, whereas the parser filter controls the trace records that are saved, the display filter controls which of those saved records are actually displayed.
The Save, Cancel and Reset buttons, the Display Mode and Record Range Filter sections have the same meaning as those in the Parser Configuration.
Final States: When the parser attempts to match the phrase defined by an operator node, the final state may be matched, not matched or empty. Only records for the checked states will be displayed. As mentioned previously, there are two records generated for each node visit – one as the node visit begins going down the syntax tree and one when it ends coming up, determining the final state. For any node visit ending with an selected final state both of these records will be displayed. For unselected states both will be ignored. The reason for not having this option in the parser configuration is due to a design decision regarding the visual presentation of this pair of traced records.
Operator Filter: These sections operate exactly the same as they do in the Parser Configuration with one additional feature. For any operator not selected in the Parser Configuration there are, obviously, no records saved and hence are unavailable to the display. Therefore, operators corresponding to unsaved records have their check boxes disabled in the Display Filter Configuration.
A number of examples are given here to illustrate the use of Interactive APG. Most are drawn from the examples I've used in the various versions of APG that have previously been released. However, a couple are new and have been developed especially for Interactive APG.
Each example will pre-load a grammar and input string into their respective text areas or provide a drop-down list for selection from multiple choices. Explanatory text annotates the input windows. You can click "Generate" and "Parse" to verify that they work and then modify them as you wish if you like.
Links to each of them are given in the Interactive APG menu.