__Backus-Naur Form__

P__reface__

Recently, I spotted a post enquiring about the existence of a BNF for some flavour of BASIC. Since BASIC newsgroups seem to attract a lot of new and inexperienced programming hobbyists, it's doubtful that any more than a handful of the people who read that post understood the question. Not that you need an advanced degree or anything, it's just that BNF doesn't come up much in conversation with BASIC programmers. Which is unfortunate, because the BNF -- or Backus-Naur form -- paints a wonderfully compact and powerful picture of a language.

The purpose of this monograph is to introduce the reader to the Backus-Naur form of notation through a series of easy-to-understand examples. This paper is by no means an exhaustive treatment; it is meant to be nothing more than a bare-bones introduction to the subject. Enjoy!

H__istory__ <yawn>

Around 1959, John Backus -- one of the thirteen people on the Algol 60 committee -- proposed what is today know as the Backus-Naur form of notation for describing part of the syntax of Algol 60. Peter Naur, of the University of Copenhagen, is associated with this notation because of his modifications and extensive use of BNF as editor of the Algol 60 report.

The ideas behind BNF were advanced by Noam Chomsky, a linguist, in 1956. So why
isn't it called CBNF? I'll bet Noam would like to hear the answer to *that* one.

I __Never Meta-Language I Didn't Like__

We will begin by examining how BNF may be used to describe digits and integer
constants. In BNF, the fact that **1** is a digit is expressed by

(R1)<digit> ::= 1

"(R1)" is just a label. The term **<digit>** is delineated by angular brackets
to signify that it is a meta-linguistic entity, i.e., it is part of the language we are using to
describe the language under discussion. Meta-linguistic terms delineated by angular
brackets are typically referred to as nonterminal symbols, or nonterminals. Characters
of the language under discussion, such as the digit **1**, are referred to as terminals.

The rule (R1) is called a production (or rewriting) rule. Its left part is a nonterminal, and
its right part is a nonempty, finite sequence of terminals and nonterminals. The symbol
"**::=**" is read "may be composed of" (or something similar) so the rule (R1) is read

A digit may be composed of the sequence of symbols:1

To allow the digits **0** and **1** we could write

We can combine these two rules using the symbol<digit> ::= 0

<digit> ::= 1

<digit> ::= 0 | 1

A more comprehensive set of rules might take the form

(R2.1)<constant> ::= <digit>

(R2.2)<constant> ::= <constant> <digit>

(R2.3)<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

The rules (R2.x) form a grammar for the language of **<constant>**s. The sentences
of the language are the sequences of terminals that can be derived from the nonterminal
**<constant>**. If we use the symbol **=>** to indicate the process of
rewriting, the sentence **187** may be derived as follows:

<constant> |
=> <constant> <digit> |
(R2.2) |

=> <constant> 7 |
(R2.3) | |

=> <constant> <digit> 7 |
(R2.2) | |

=> <constant> 87 |
(R2.3) | |

=> <digit> 87 |
(R2.1) | |

=> 187 |
(R2.3) |

Let " **=>*** " represent a sequence of zero or more single derivations (or rewrites)
using the production rules of our grammar. The previous derivation may then be restated:

<constant> =>* 187.

A__ Grammar for Simple Arithmetic Expressions__

The following grammar uses addition, subtraction, multiplication, parenthesized expressions, and integer constants as operands.

(R3.1)<expression> ::= <expression> + <expression>

(R3.2)<expression> ::= <expression> - <expression>

(R3.3)<expression> ::= <expression> × <expression>

(R3.4)<expression> ::= ( <expression> )

(R3.5)<expression> ::= <constant>

According to this grammar, we derive **<expression> =>* 3 × ( 4 - 7 )** as follows:

<expression> |
=> <expression> × <expression> |
(R3.3) |

=> <expression> × ( <expression> ) |
(R3.4) | |

=> <expression> × ( <expression> - <expression> ) |
(R3.2) | |

=> <constant> × ( <expression> - <expression> ) |
(R3.5) | |

=> <constant> × ( <constant> - <expression> ) |
(R3.5) | |

=> <constant> × ( <constant> - <constant> ) |
(R3.5) | |

=> <digit> × ( <constant> - <constant> ) |
(R2.1) | |

=> <digit> × ( <digit> - <constant> ) |
(R2.1) | |

=> <digit> × ( <digit> - <digit> ) |
(R2.1) | |

=> 3 × ( <digit> - <digit> ) |
(R2.3) | |

=> 3 × ( 4 - <digit> ) |
(R2.3) | |

=> 3 × ( 4 - 7 ) |
(R2.3) |

The *syntax tree* for **<expression> =>* 3 × ( 4 - 7 )** has the form:

<expression>/ | \<expression>×<expression>| / | \<constant><expression>-<expression>| | |<digit><constant><constant>| | |3<digit><digit>| |47

The problem with syntax trees is that they don't specify the order in which some
production rules are applied. For instance, the syntax tree above doesn't indicate
whether **<digit> ::= 4** precedes **<digit> ::= 7**, or vice versa.

Every derivation has a corresponding syntax tree, but more than one derivation can correspond to the same tree. These derivations are considered to be equivalent. For example,

and

<expression>=> <expression> + <expression>=> <constant> + <expression>=> <constant> + <constant>=> <digit> + <constant>=> <digit> + <digit>=> 2 + <digit>=> 2 + 3

both correspond to the same syntax tree and are therefore considered to be equivalent. <

<expression>=> <expression> + <expression>=> <expression> + <constant>=> <constant> + <constant>=> <constant> + <digit>=> <digit> + <digit>=> <digit> + 3=> 2 + 3

G__arden of Ambiguitiy__

Consider **<expression> =>* <expression> × <expression> - <expression>.***both* of the syntax trees

and<expression>/ | \<expression>×<expression>/ | \<expression>-<expression>

<expression>/ | \<expression>-<expression>/ | \<expression>×<expression>

Any grammar that allows more than one syntax tree for some sentence is called *ambiguous*.
The term is apt because if a sentence has two different syntax trees then it can be parsed in two
different ways. This can result in a particular interpretation being assigned arbitrarily, or
worse, inadvertently. Obviously, whether **2 × 4 - 3 = 5****2**
depends upon how we parse the sentence.

D__ead Presidents__

Fortunately, it isn't difficult to specify an unambiguous grammar for simple arithmetic expressions. The solution does, however, require us to introduce a few new nonterminals.

<expression>::= <term> | <expression> + <term> | <expression> - <term><term>::= <factor> | <term> × <factor><factor>::= <constant> | ( <expression> )<constant>::= <digit><constant>::= <constant> <digit><digit>::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

From this grammar, it's clear that multiplication takes precedence over addition and subtraction. Thus,
if you draw the syntax tree for **<expression> =>* 2 + 3 × 4****14**, not **20**. We can, of course, use parentheses to supercede
precedence.

E__xtensions__

As usual, people just can't leave well enough alone. In an effort to *improve* on BNF,
some smart-ass suggested using braces to indicate repetition, i.e., **{xyz}** denotes
zero or more occurrences of the sequence of symbols **xyz.**

Using this extension, we can rewrite our grammar for simple arithmetic expressions as

<expression>::= <term> { + <term> | - <term> }<term>::= <factor> { × <factor> }<factor>::= <constant> | ( <expression> )<constant>::= <digit> { <digit> }<digit>::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

T__he Last Word__

I admire Harlan Ellison's work, but I'll never understand why he insists on being such a monumental dickweed.

C'ya**,**

RudeJohn

*"I'm rude . It's a job."*