Backus-Naur Form

Preface

Recently, I spotted a post enquiring about the existence of a BNF for some flavour of BASIC. Since BASIC newsgroups seem to attract a lot of new and inexperienced programming hobbyists, it's doubtful that any more than a handful of the people who read that post understood the question. Not that you need an advanced degree or anything, it's just that BNF doesn't come up much in conversation with BASIC programmers. Which is unfortunate, because the BNF -- or Backus-Naur form -- paints a wonderfully compact and powerful picture of a language.

The purpose of this monograph is to introduce the reader to the Backus-Naur form of notation through a series of easy-to-understand examples. This paper is by no means an exhaustive treatment; it is meant to be nothing more than a bare-bones introduction to the subject. Enjoy!

History <yawn>

Around 1959, John Backus -- one of the thirteen people on the Algol 60 committee -- proposed what is today know as the Backus-Naur form of notation for describing part of the syntax of Algol 60. Peter Naur, of the University of Copenhagen, is associated with this notation because of his modifications and extensive use of BNF as editor of the Algol 60 report.

The ideas behind BNF were advanced by Noam Chomsky, a linguist, in 1956. So why isn't it called CBNF? I'll bet Noam would like to hear the answer to that one.

I Never Meta-Language I Didn't Like

We will begin by examining how BNF may be used to describe digits and integer constants. In BNF, the fact that 1 is a digit is expressed by

(R1)    <digit> ::= 1

"(R1)" is just a label. The term <digit> is delineated by angular brackets to signify that it is a meta-linguistic entity, i.e., it is part of the language we are using to describe the language under discussion. Meta-linguistic terms delineated by angular brackets are typically referred to as nonterminal symbols, or nonterminals. Characters of the language under discussion, such as the digit 1, are referred to as terminals.

The rule (R1) is called a production (or rewriting) rule. Its left part is a nonterminal, and its right part is a nonempty, finite sequence of terminals and nonterminals. The symbol "::=" is read "may be composed of" (or something similar) so the rule (R1) is read

A digit may be composed of the sequence of symbols: 1

To allow the digits 0 and 1 we could write

<digit> ::= 0
<digit> ::= 1
We can combine these two rules using the symbol |, read as "or", to obtain
<digit> ::= 0 | 1

A more comprehensive set of rules might take the form

(R2.1) <constant> ::= <digit>
(R2.2) <constant> ::= <constant> <digit>
(R2.3) <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

The rules (R2.x) form a grammar for the language of <constant>s. The sentences of the language are the sequences of terminals that can be derived from the nonterminal <constant>. If we use the symbol => to indicate the process of rewriting, the sentence 187 may be derived as follows:

 => (R2.2) => 7 (R2.3) => 7 (R2.2) => 87 (R2.3) => 87 (R2.1) => 187 (R2.3)

Let " =>* " represent a sequence of zero or more single derivations (or rewrites) using the production rules of our grammar. The previous derivation may then be restated:

<constant> =>* 187.

A Grammar for Simple Arithmetic Expressions

The following grammar uses addition, subtraction, multiplication, parenthesized expressions, and integer constants as operands.

(R3.1) <expression> ::= <expression> + <expression>
(R3.2) <expression> ::= <expression> - <expression>
(R3.3) <expression> ::= <expression> × <expression>
(R3.4) <expression> ::= ( <expression> )
(R3.5) <expression> ::= <constant>

According to this grammar, we derive <expression> =>* 3 × ( 4 - 7 ) as follows:

 => × (R3.3) => × ( ) (R3.4) => × ( - ) (R3.2) => × ( - ) (R3.5) => × ( - ) (R3.5) => × ( - ) (R3.5) => × ( - ) (R2.1) => × ( - ) (R2.1) => × ( - ) (R2.1) => 3 × ( - ) (R2.3) => 3 × ( 4 - ) (R2.3) => 3 × ( 4 - 7 ) (R2.3)

The syntax tree for <expression> =>* 3 × ( 4 - 7 ) has the form:

```
<expression>
/    |     \
<expression>     ×      <expression>
|                  /    |     \
<constant>     <expression> - <expression>
|               |              |
<digit>        <constant>     <constant>
|               |              |
3            <digit>        <digit>
|              |
4              7
```

The problem with syntax trees is that they don't specify the order in which some production rules are applied. For instance, the syntax tree above doesn't indicate whether <digit> ::= 4 precedes <digit> ::= 7, or vice versa.

Every derivation has a corresponding syntax tree, but more than one derivation can correspond to the same tree. These derivations are considered to be equivalent. For example,

 => + => + => + => + => + => 2 + => 2 + 3
and
 => + => + => + => + => + => + 3 => 2 + 3
both correspond to the same syntax tree and are therefore considered to be equivalent. <shrug>

Garden of Ambiguitiy

Consider <expression> =>* <expression> × <expression> - <expression>. This sentence satisfies both of the syntax trees

```
<expression>
/    |     \
<expression>     ×      <expression>
/    |     \
<expression> - <expression>
```
and
```
<expression>
/    |     \
<expression>     -      <expression>
/    |     \
<expression> × <expression>
```

Any grammar that allows more than one syntax tree for some sentence is called ambiguous. The term is apt because if a sentence has two different syntax trees then it can be parsed in two different ways. This can result in a particular interpretation being assigned arbitrarily, or worse, inadvertently. Obviously, whether 2 × 4 - 3 = 5 or 2 depends upon how we parse the sentence.

Fortunately, it isn't difficult to specify an unambiguous grammar for simple arithmetic expressions. The solution does, however, require us to introduce a few new nonterminals.

 ::= | + | - ::= | × ::= | ( ) ::= ::= ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

From this grammar, it's clear that multiplication takes precedence over addition and subtraction. Thus, if you draw the syntax tree for <expression> =>* 2 + 3 × 4 you'll find that the result is 14, not 20. We can, of course, use parentheses to supercede precedence.

Extensions

As usual, people just can't leave well enough alone. In an effort to improve on BNF, some smart-ass suggested using braces to indicate repetition, i.e., {xyz} denotes zero or more occurrences of the sequence of symbols xyz.

Using this extension, we can rewrite our grammar for simple arithmetic expressions as

 ::= { + | - } ::= { × } ::= | ( ) ::= { } ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

The Last Word

I admire Harlan Ellison's work, but I'll never understand why he insists on being such a monumental dickweed.

C'ya,
RudeJohn
"I'm rude. It's a job."