Yacc: Yet Another Compiler-Compiler

Stephen C. Johnson
AT&T Bell Laboratories
Murray Hill, New Jersey 07974

ABSTRACT

Computer program input generally has some structure; in fact, every computer program that does input can be thought of as defining an ``input language'' which it accepts. An input language may be as complex as a programming language, or as simple as a sequence of numbers. Unfortunately, usual input facilities are limited, difficult to use, and often are lax about checking their inputs for validity.

Yacc provides a general tool for describing the input to a computer program. The Yacc user specifies the structures of his input, together with code to be invoked as each such structure is recognized. Yacc turns such a specification into a subroutine that handles the input process; frequently, it is convenient and appropriate to have most of the flow of control in the user's application handled by this subroutine.

The input subroutine produced by Yacc calls a user-supplied routine to return the next basic input item. Thus, the user can specify his input in terms of individual input characters, or in terms of higher level constructs such as names and numbers. The user-supplied routine may also handle idiomatic features such as comment and continuation conventions, which typically defy easy grammatical specification.

Yacc is written in portable C. The class of specifications accepted is a very general one: LALR(1) grammars with disambiguating rules.

In addition to compilers for C, APL, Pascal, RATFOR, etc., Yacc has also been used for less conventional languages, including a phototypesetter language, several desk calculator languages, a document retrieval system, and a Fortran debugging system.

0: Introduction

Yacc provides a general tool for imposing structure on the input to a computer program. The Yacc user prepares a specification of the input process; this includes rules describing the input structure, code to be invoked when these rules are recognized, and a low-level routine to do the basic input. Yacc then generates a function to control the input process. This function, called a parser, calls the user-supplied low-level input routine (the lexical analyzer) to pick up the basic items (called tokens) from the input stream. These tokens are organized according to the input structure rules, called grammar rules; when one of these rules has been recognized, then user code supplied for this rule, an action, is invoked; actions have the ability to return values and make use of the values of other actions.

Yacc is written in a portable dialect of C[1] and the actions, and output subroutine, are in C as well. Moreover, many of the syntactic conventions of Yacc follow C.

The heart of the input specification is a collection of grammar rules. Each rule describes an allowable structure and gives it a name. For example, one grammar rule might be

        date  :  month_name  day  ','  year   ;

Here, date, month_name, day, and year represent structures of interest in the input process; presumably, month_name, day, and year are defined elsewhere. The comma ``,'' is enclosed in single quotes; this implies that the comma is to appear literally in the input. The colon and semicolon merely serve as punctuation in the rule, and have no significance in controlling the input. Thus, with proper definitions, the input

        July  4, 1776

might be matched by the above rule.

An important part of the input process is carried out by the lexical analyzer. This user routine reads the input stream, recognizing the lower level structures, and communicates these tokens to the parser. For historical reasons, a structure recognized by the lexical analyzer is called a terminal symbol, while the structure recognized by the parser is called a nonterminal symbol. To avoid confusion, terminal symbols will usually be referred to as tokens.

There is considerable leeway in deciding whether to recognize structures using the lexical analyzer or grammar rules. For example, the rules

        month_name  :  'J' 'a' 'n'   ;
        month_name  :  'F' 'e' 'b'   ;
                 . . .
        month_name  :  'D' 'e' 'c'   ;

might be used in the above example. The lexical analyzer would only need to recognize individual letters, and month_name would be a nonterminal symbol. Such low-level rules tend to waste time and space, and may complicate the specification beyond Yacc's ability to deal with it. Usually, the lexical analyzer would recognize the month names, and return an indication that a month_name was seen; in this case, month_name would be a token.

Literal characters such as ``,'' must also be passed through the lexical analyzer, and are also considered tokens.
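To illustrate the second approach, here is a minimal C sketch of a helper that a hand-written lexical analyzer might call once it has collected a word. The function name is_month_name and the choice of full English month names are assumptions made for this illustration; they do not come from Yacc itself. When the helper reports a match, the lexical analyzer would return the token number for month_name.

```c
#include <string.h>

/* Hypothetical helper: return 1 if word is a full English month name,
 * 0 otherwise.  The caller (the lexical analyzer) would return the
 * month_name token when this function reports a match. */
int is_month_name(const char *word)
{
    static const char *months[] = {
        "January", "February", "March", "April", "May", "June",
        "July", "August", "September", "October", "November", "December"
    };
    size_t i;

    for (i = 0; i < sizeof months / sizeof months[0]; i++) {
        if (strcmp(word, months[i]) == 0)
            return 1;
    }
    return 0;
}
```

With this division of labor the grammar needs only the single symbol month_name, rather than one rule per spelling.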

Specification files are very flexible. It is relatively easy to add to the above example the rule

        date  :  month '/' day '/' year   ;

allowing

        7 / 4 / 1776

as a synonym for

        July 4, 1776

In most cases, this new rule could be ``slipped in'' to a working system with minimal effort, and little danger of disrupting existing input.

The input being read may not conform to the specifications. These input errors are detected as early as is theoretically possible with a left-to-right scan; thus, not only is the chance of reading and computing with bad input data substantially reduced, but the bad data can usually be quickly found. Error handling, provided as part of the input specifications, permits the reentry of bad data, or the continuation of the input process after skipping over the bad data.

In some cases, Yacc fails to produce a parser when given a set of specifications. For example, the specifications may be self contradictory, or they may require a more powerful recognition mechanism than that available to Yacc. The former cases represent design errors; the latter cases can often be corrected by making the lexical analyzer more powerful, or by rewriting some of the grammar rules. While Yacc cannot handle all possible specifications, its power compares favorably with similar systems; moreover, the constructions which are difficult for Yacc to handle are also frequently difficult for human beings to handle. Some users have reported that the discipline of formulating valid Yacc specifications for their input revealed errors of conception or design early in the program development.

The theory underlying Yacc has been described elsewhere.[2, 3, 4] Yacc has been extensively used in numerous practical applications, including lint,[5] the Portable C Compiler,[6] and a system for typesetting mathematics.[7]

The next several sections describe the basic process of preparing a Yacc specification; Section 1 describes the preparation of grammar rules, Section 2 the preparation of the user supplied actions associated with these rules, and Section 3 the preparation of lexical analyzers. Section 4 describes the operation of the parser. Section 5 discusses various reasons why Yacc may be unable to produce a parser from a specification, and what to do about it. Section 6 describes a simple mechanism for handling operator precedences in arithmetic expressions. Section 7 discusses error detection and recovery. Section 8 discusses the operating environment and special features of the parsers Yacc produces. Section 9 gives some suggestions which should improve the style and efficiency of the specifications. Section 10 discusses some advanced topics, and Section 11 gives acknowledgements. Appendix A has a brief example, and Appendix B gives a summary of the Yacc input syntax. Appendix C gives an example using some of the more advanced features of Yacc, and, finally, Appendix D describes mechanisms and syntax no longer actively supported, but provided for historical continuity with older versions of Yacc.

1: Basic Specifications

Names refer to either tokens or nonterminal symbols. Yacc requires token names to be declared as such. In addition, for reasons discussed in Section 3, it is often desirable to include the lexical analyzer as part of the specification file; it may be useful to include other programs as well. Thus, every specification file consists of three sections: the declarations, (grammar) rules, and programs. The sections are separated by double percent ``%%'' marks. (The percent ``%'' is generally used in Yacc specifications as an escape character.)

In other words, a full specification file looks like

        declarations
        %%
        rules
        %%
        programs

The declaration section may be empty. Moreover, if the programs section is omitted, the second %% mark may be omitted also; thus, the smallest legal Yacc specification is

        %%
        rules

Blanks, tabs, and newlines are ignored except that they may not appear in names or multi-character reserved symbols. Comments may appear wherever a name is legal; they are enclosed in /* . . . */, as in C and PL/I.

The rules section is made up of one or more grammar rules. A grammar rule has the form:

        A  :  BODY  ;

A represents a nonterminal name, and BODY represents a sequence of zero or more names and literals. The colon and the semicolon are Yacc punctuation.

Names may be of arbitrary length, and may be made up of letters, dot ``.'', underscore ``_'', and non-initial digits. Upper and lower case letters are distinct. The names used in the body of a grammar rule may represent tokens or nonterminal symbols.

A literal consists of a character enclosed in single quotes ``'''. As in C, the backslash ``\'' is an escape character within literals, and all the C escapes are recognized. Thus

        '\n'    newline
        '\r'    return
        '\''    single quote ``'''
        '\\'    backslash ``\''
        '\t'    tab
        '\b'    backspace
        '\f'    form feed
        '\xxx'  ``xxx'' in octal

For a number of technical reasons, the NUL character ('\0' or 0) should never be used in grammar rules.

If there are several grammar rules with the same left hand side, the vertical bar ``|'' can be used to avoid rewriting the left hand side. In addition, the semicolon at the end of a rule can be dropped before a vertical bar. Thus the grammar rules

        A       :       B  C  D   ;
        A       :       E  F   ;
        A       :       G   ;

can be given to Yacc as

        A       :       B  C  D
                |       E  F
                |       G
                ;

It is not necessary that all grammar rules with the same left side appear together in the grammar rules section, although it makes the input much more readable, and easier to change.

If a nonterminal symbol matches the empty string, this can be indicated in the obvious way:

        empty :   ;

Names representing tokens must be declared; this is most simply done by writing

        %token   name1  name2 . . .

in the declarations section. (See Sections 3, 5, and 6 for much more discussion). Every name not defined in the declarations section is assumed to represent a nonterminal symbol. Every nonterminal symbol must appear on the left side of at least one rule.

Of all the nonterminal symbols, one, called the start symbol, has particular importance. The parser is designed to recognize the start symbol; thus, this symbol represents the largest, most general structure described by the grammar rules. By default, the start symbol is taken to be the left hand side of the first grammar rule in the rules section. It is possible, and in fact desirable, to declare the start symbol explicitly in the declarations section using the %start keyword:

        %start   symbol

The end of the input to the parser is signaled by a special token, called the endmarker. If the tokens up to, but not including, the endmarker form a structure which matches the start symbol, the parser function returns to its caller after the endmarker is seen; it accepts the input. If the endmarker is seen in any other context, it is an error.

It is the job of the user-supplied lexical analyzer to return the endmarker when appropriate; see Section 3, below. Usually the endmarker represents some reasonably obvious I/O status, such as ``end-of-file'' or ``end-of-record''.

2: Actions

With each grammar rule, the user may associate actions to be performed each time the rule is recognized in the input process. These actions may return values, and may obtain the values returned by previous actions. Moreover, the lexical analyzer can return values for tokens, if desired.

An action is an arbitrary C statement, and as such can do input and output, call subprograms, and alter external vectors and variables. An action is specified by one or more statements, enclosed in curly braces ``{'' and ``}''. For example,

        A       :       '('  B  ')'
                                {       hello( 1, "abc" );  }

and

        XXX     :       YYY  ZZZ
                                {       printf("a message\n");
                                        flag = 25;   }

are grammar rules with actions.

To facilitate easy communication between the actions and the parser, the action statements are altered slightly. The symbol ``dollar sign'' ``$'' is used as a signal to Yacc in this context.

To return a value, the action normally sets the pseudo-variable ``$$'' to some value. For example, an action that does nothing but return the value 1 is

                {  $$ = 1;  }

To obtain the values returned by previous actions and the lexical analyzer, the action may use the pseudo-variables $1, $2, . . ., which refer to the values returned by the components of the right side of a rule, reading from left to right. Thus, if the rule is

        A       :       B  C  D   ;

for example, then $2 has the value returned by C, and $3 the value returned by D.

As a more concrete example, consider the rule

        expr    :       '('  expr  ')'   ;

The value returned by this rule is usually the value of the expr in parentheses. This can be indicated by

        expr    :        '('  expr  ')'         {  $$ = $2 ;  }

By default, the value of a rule is the value of the first element in it ($1). Thus, grammar rules of the form

        A       :       B    ;

frequently need not have an explicit action.

In the examples above, all the actions came at the end of their rules. Sometimes, it is desirable to get control before a rule is fully parsed. Yacc permits an action to be written in the middle of a rule as well as at the end. This rule is assumed to return a value, accessible through the usual mechanism by the actions to the right of it. In turn, it may access the values returned by the symbols to its left. Thus, in the rule

        A       :       B
                                {  $$ = 1;  }
                        C
                                {   x = $2;   y = $3;  }
                ;

the effect is to set x to 1, and y to the value returned by C.

Actions that do not terminate a rule are actually handled by Yacc by manufacturing a new nonterminal symbol name, and a new rule matching this name to the empty string. The interior action is the action triggered off by recognizing this added rule. Yacc actually treats the above example as if it had been written:

        $ACT    :       /* empty */
                                {  $$ = 1;  }
                ;

        A       :       B  $ACT  C
                                {   x = $2;   y = $3;  }
                ;

In many applications, output is not done directly by the actions; rather, a data structure, such as a parse tree, is constructed in memory, and transformations are applied to it before output is generated. Parse trees are particularly easy to construct, given routines to build and maintain the tree structure desired. For example, suppose there is a C function node, written so that the call

        node( L, n1, n2 )

creates a node with label L, and descendants n1 and n2, and returns the index of the newly created node. Then a parse tree can be built by supplying actions such as:

        expr    :       expr  '+'  expr
                                {  $$ = node( '+', $1, $3 );  }

in the specification.
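The text does not show node itself, so the following is only one plausible sketch of it. The table size MAXNODES, the marker NONE for a missing descendant, and the struct layout are all assumptions made for this illustration; the only properties taken from the text are the three arguments and the returned index.

```c
#define MAXNODES 1000    /* assumed table size */
#define NONE     (-1)    /* assumed marker for "no descendant" */

struct tnode {
    int label;           /* label of the node, e.g. '+' */
    int left;            /* index of the first descendant, or NONE */
    int right;           /* index of the second descendant, or NONE */
};

static struct tnode nodes[MAXNODES];
static int nnodes = 0;   /* number of nodes created so far */

/* Create a node with the given label and descendants.
 * Returns the index of the new node, or NONE if the table is full. */
int node(int label, int n1, int n2)
{
    if (nnodes >= MAXNODES)
        return NONE;
    nodes[nnodes].label = label;
    nodes[nnodes].left = n1;
    nodes[nnodes].right = n2;
    return nnodes++;
}
```

Because the result is an integer index rather than a pointer, it fits the integer values that actions return in these examples.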

The user may define other variables to be used by the actions. Declarations and definitions can appear in the declarations section, enclosed in the marks ``%{'' and ``%}''. These declarations and definitions have global scope, so they are known to the action statements and the lexical analyzer. For example,

        %{   int variable = 0;   %}

could be placed in the declarations section, making variable accessible to all of the actions. The Yacc parser uses only names beginning in ``yy''; the user should avoid such names.

In these examples, all the values are integers: a discussion of values of other types will be found in Section 10.

3: Lexical Analysis

The user must supply a lexical analyzer to read the input stream and communicate tokens (with values, if desired) to the parser. The lexical analyzer is an integer-valued function called yylex. The function returns an integer, the token number, representing the kind of token read. If there is a value associated with that token, it should be assigned to the external variable yylval.

The parser and the lexical analyzer must agree on these token numbers in order for communication between them to take place. The numbers may be chosen by Yacc, or chosen by the user. In either case, the ``# define'' mechanism of C is used to allow the lexical analyzer to return these numbers symbolically. For example, suppose that the token name DIGIT has been defined in the declarations section of the Yacc specification file. The relevant portion of the lexical analyzer might look like:

        yylex(){
                extern int yylval;
                int c;
                . . .
                c = getchar();
                . . .
                switch( c ) {
                        . . .
                case '0':
                case '1':
                  . . .
                case '9':
                        yylval = c-'0';
                        return( DIGIT );
                        . . .
                        }
                . . .

The intent is to return a token number of DIGIT, and a value equal to the numerical value of the digit. Provided that the lexical analyzer code is placed in the programs section of the specification file, the identifier DIGIT will be defined as the token number associated with the token DIGIT.

This mechanism leads to clear, easily modified lexical analyzers; the only pitfall is the need to avoid using any token names in the grammar that are reserved or significant in C or the parser; for example, the use of token names if or while will almost certainly cause severe difficulties when the lexical analyzer is compiled. The token name error is reserved for error handling, and should not be used naively (see Section 7).

As mentioned above, the token numbers may be chosen by Yacc or by the user. In the default situation, the numbers are chosen by Yacc. The default token number for a literal character is the numerical value of the character in the local character set. Other names are assigned token numbers starting at 257.

To assign a token number to a token (including literals), the first appearance of the token name or literal in the declarations section can be immediately followed by a nonnegative integer. This integer is taken to be the token number of the name or literal. Names and literals not defined by this mechanism retain their default definition. It is important that all token numbers be distinct.

For historical reasons, the endmarker must have token number 0 or negative. This token number cannot be redefined by the user; thus, all lexical analyzers should be prepared to return 0 or negative as a token number upon reaching the end of their input.
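The conventions of this section can be gathered into one small, self-contained sketch. It is not the lexical analyzer of any particular program: it reads from a string instead of calling getchar, and the names input and set_input are hypothetical helpers added for the illustration. DIGIT is given the value 257 on the assumption that it is the first declared token name, and yylval is defined here only so that the fragment stands alone; in a real specification both would come from Yacc.

```c
#include <ctype.h>

#define DIGIT 257                 /* assumed: first number Yacc assigns to a name */

int yylval;                       /* value of the token just returned */
static const char *input = "";    /* hypothetical input buffer for this sketch */

/* Hypothetical helper: point the lexical analyzer at a new string. */
void set_input(const char *s)
{
    input = s;
}

int yylex(void)
{
    int c;

    /* Skip blanks, tabs, and newlines between tokens. */
    while (*input == ' ' || *input == '\t' || *input == '\n')
        input++;

    if (*input == '\0')
        return 0;                 /* endmarker: token number 0 */

    c = (unsigned char)*input++;
    if (isdigit(c)) {
        yylval = c - '0';         /* numerical value of the digit */
        return DIGIT;
    }
    return c;                     /* literal: token number is the character value */
}
```

Once the input is exhausted, every further call returns 0, so the parser sees the endmarker however many times it asks.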

A very useful tool for constructing lexical analyzers is the Lex program developed by Mike Lesk.[8] These lexical analyzers are designed to work in close harmony with Yacc parsers. The specifications for these lexical analyzers use regular expressions instead of grammar rules. Lex can be easily used to produce quite complicated lexical analyzers, but there remain some languages (such as FORTRAN) which do not fit any theoretical framework, and whose lexical analyzers must be crafted by hand.

4: How the Parser Works

Yacc turns the specification file into a C program, which parses the input according to the specification given. The algorithm used to go from the specification to the parser is complex, and will not be discussed here (see the references for more information). The parser itself, however, is relatively simple, and understanding how it works, while not strictly necessary, will nevertheless make treatment of error recovery and ambiguities much more comprehensible.

The parser produced by Yacc consists of a finite state machine with a stack. The parser is also capable of reading and remembering the next input token (called the lookahead token). The current state is always the one on the top of the stack. The states of the finite state machine are given small integer labels; initially, the machine is in state 0, the stack contains only state 0, and no lookahead token has been read.

The machine has only four actions available to it, called shift, reduce, accept, and error. A move of the parser is done as follows:

1. Based on its current state, the parser decides whether it needs a lookahead token to decide what action should be done; if it needs one, and does not have one, it calls yylex to obtain the next token.

2. Using the current state, and the lookahead token if needed, the parser decides on its next action, and carries it out. This may result in states being pushed onto the stack, or popped off of the stack, and in the lookahead token being processed or left alone.

The shift action is the most common action the parser takes. Whenever a shift action is taken, there is always a lookahead token. For example, in state 56 there may be an action:

                IF      shift 34

which says, in state 56, if the lookahead token is IF, the current state (56) is pushed down on the stack, and state 34 becomes the current state (on the top of the stack). The lookahead token is cleared.

The reduce action keeps the stack from growing without bounds. Reduce actions are appropriate when the parser has seen the right hand side of a grammar rule, and is prepared to announce that it has seen an instance of the rule, replacing the right hand side by the left hand side. It may be necessary to consult the lookahead token to decide whether to reduce, but usually it is not; in fact, the default action (represented by a ``.'') is often a reduce action.

Reduce actions are associated with individual grammar rules. Grammar rules are also given small integer numbers, leading to some confusion. The action

                .       reduce 18

refers to grammar rule 18, while the action

                IF      shift 34

refers to state 34.

Suppose the rule being reduced is

        A  :  x  y  z  ;

The reduce action depends on the left hand symbol (A in this case), and the number of symbols on the right hand side (three in this case). To reduce, first pop off the top three states from the stack (In general, the number of states popped equals the number of symbols on the right side of the rule). In effect, these states were the ones put on the stack while recognizing x, y, and z, and no longer serve any useful purpose. After popping these states, a state is uncovered which was the state the parser was in before beginning to process the rule. Using this uncovered state, and the symbol on the left side of the rule, perform what is in effect a shift of A. A new state is obtained, pushed onto the stack, and parsing continues. There are significant differences between the processing of the left hand symbol and an ordinary shift of a token, however, so this action is called a goto action. In particular, the lookahead token is cleared by a shift, and is not affected by a goto. In any case, the uncovered state contains an entry such as:

                A       goto 20

causing state 20 to be pushed onto the stack, and become the current state.

In effect, the reduce action ``turns back the clock'' in the parse, popping the states off the stack to go back to the state where the right hand side of the rule was first seen. The parser then behaves as if it had seen the left side at that time. If the right hand side of the rule is empty, no states are popped off of the stack: the uncovered state is in fact the current state.

The reduce action is also important in the treatment of user-supplied actions and values. When a rule is reduced, the code supplied with the rule is executed before the stack is adjusted. In addition to the stack holding the states, another stack, running in parallel with it, holds the values returned from the lexical analyzer and the actions. When a shift takes place, the external variable yylval is copied onto the value stack. After the return from the user code, the reduction is carried out. When the goto action is done, the external variable yyval is copied onto the value stack. The pseudo-variables $1, $2, etc., refer to the value stack.
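The interplay of the two stacks can be sketched in a few lines of C. This is an illustration of the idea only, not the code Yacc generates: the array names, the depth STACKMAX, the helper dollar, and the state numbers used in the usage note are all assumptions, and bounds checks are omitted for brevity. The goto state is passed in by the caller, whereas a real parser would look it up from the uncovered state.

```c
#define STACKMAX 150           /* assumed stack depth; no overflow checks here */

static int states[STACKMAX];   /* state stack */
static int values[STACKMAX];   /* value stack, running in parallel */
static int top = -1;           /* index of the current state; -1 when empty */

/* Shift: push the new state together with the token's value (yylval). */
void shift(int new_state, int yylval)
{
    top++;
    states[top] = new_state;
    values[top] = yylval;
}

/* Value of $i for a rule with n symbols on its right side.
 * Must be called before the reduction pops anything. */
int dollar(int i, int n)
{
    return values[top - n + i];
}

/* Finish a reduction after the user code has run: pop n entries,
 * then push the goto state together with the action's value (yyval). */
void reduce(int n, int goto_state, int yyval)
{
    top -= n;
    top++;
    states[top] = goto_state;
    values[top] = yyval;
}
```

For the rule expr : '(' expr ')' with the action $$ = $2, the user code would read dollar(2, 3) while all three entries are still on the stack, and reduce(3, g, value) would then replace them with a single entry carrying that value, where g is the goto state.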

The other two parser actions are conceptually much simpler. The accept action indicates that the entire input has been seen and that it matches the specification. This action appears only when the lookahead token is the endmarker, and indicates that the parser has successfully done its job. The error action, on the other hand, represents a place where the parser can no longer continue parsing according to the specification. The input tokens it has seen, together with the lookahead token, cannot be followed by anything that would result in a legal input. The parser reports an error, and attempts to recover the situation and resume parsing: the error recovery (as opposed to the detection of error) will be covered in Section 7.

It is time for an example! Consider the specification

        %token  DING  DONG  DELL
        %%
        rhyme   :       sound  place
                ;
        sound   :       DING  DONG
                ;
        place   :       DELL
                ;

When Yacc is invoked with the -v option, a file called y.output is produced, with a human-readable description of the parser. The y.output file corresponding to the above grammar (with some statistics stripped off the end) is:

        state 0
                $accept  :  _rhyme  $end

                DING  shift 3
                .  error

                rhyme  goto 1
                sound  goto 2

        state 1
                $accept  :   rhyme_$end

                $end  accept
                .  error

        state 2
                rhyme  :   sound_place

                DELL  shift 5
                .  error

                place   goto 4

        state 3
                sound   :   DING_DONG

                DONG  shift 6
                .  error

        state 4
                rhyme  :   sound  place_    (1)

                .   reduce  1

        state 5
                place  :   DELL_    (3)

                .   reduce  3

        state 6
                sound   :   DING  DONG_    (2)

                .   reduce  2

Notice that, in addition to the actions for each state, there is a description of the parsing rules being processed in each state. The _ character is used to indicate what has been seen, and what is yet to come, in each rule. Suppose the input is

        DING  DONG  DELL

It is instructive to follow the steps of the parser while processing this input.

Initially, the current state is state 0. The parser needs to refer to the input in order to decide between the actions available in state 0, so the first token, DING, is read, becoming the lookahead token. The action in state 0 on DING is ``shift 3'', so state 3 is pushed onto the stack, and the lookahead token is cleared. State 3 becomes the current state. The next token, DONG, is read, becoming the lookahead token. The action in state 3 on the token DONG is ``shift 6'', so state 6 is pushed onto the stack, and the lookahead is cleared. The stack now contains 0, 3, and 6. In state 6, without even consulting the lookahead, the parser reduces by rule 2.

                sound  :   DING  DONG

This rule has two symbols on the right hand side, so two states, 6 and 3, are popped off of the stack, uncovering state 0. Consulting the description of state 0, looking for a goto on sound,

                sound   goto 2

is obtained; thus state 2 is pushed onto the stack, becoming the current state.

In state 2, the next token, DELL, must be read. The action is ``shift 5'', so state 5 is pushed onto the stack, which now has 0, 2, and 5 on it, and the lookahead token is cleared. In state 5, the only action is to reduce by rule 3. This has one symbol on the right hand side, so one state, 5, is popped off, and state 2 is uncovered. The goto in state 2 on place, the left side of rule 3, is state 4. Now, the stack contains 0, 2, and 4. In state 4, the only action is to reduce by rule 1. There are two symbols on the right, so the top two states are popped off, uncovering state 0 again. In state 0, there is a goto on rhyme causing the parser to enter state 1. In state 1, the input is read; the endmarker is obtained, indicated by ``$end'' in the y.output file. The action in state 1 when the endmarker is seen is to accept, successfully ending the parse.
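The walk just described can be replayed mechanically. The sketch below hard-codes the seven states of the y.output listing as C conditionals; it is a hand transcription for illustration, not the table-driven code Yacc emits. The token numbers 257 to 259, the function names parse and goto_state, and the use of an array of tokens in place of calls to yylex are assumptions of this sketch.

```c
enum { END = 0, DING = 257, DONG = 258, DELL = 259 };  /* assumed token numbers */
enum { RHYME, SOUND, PLACE };                          /* nonterminal symbols */

/* Rules 1 to 3 (index 0 unused): left side and length of right side. */
static const int rule_lhs[] = { -1, RHYME, SOUND, PLACE };
static const int rule_len[] = {  0,     2,     2,     1 };

/* Goto entries copied from states 0 and 2 of the listing. */
static int goto_state(int state, int nonterminal)
{
    if (state == 0 && nonterminal == RHYME) return 1;
    if (state == 0 && nonterminal == SOUND) return 2;
    if (state == 2 && nonterminal == PLACE) return 4;
    return -1;                  /* not reachable with this grammar */
}

/* Return 1 if the tokens (terminated by END) are accepted, 0 on error. */
int parse(const int *tokens)
{
    int stack[16];              /* state stack; ample for this grammar */
    int sp = 0;                 /* index of the current state */
    int lookahead = -1;         /* -1 means no lookahead token is held */

    stack[0] = 0;
    for (;;) {
        int state = stack[sp];
        int rule = 0;

        /* States 4, 5, and 6 reduce without consulting the lookahead. */
        if (state == 4) rule = 1;
        else if (state == 5) rule = 3;
        else if (state == 6) rule = 2;

        if (rule != 0) {
            sp -= rule_len[rule];                        /* pop the right side */
            state = goto_state(stack[sp], rule_lhs[rule]);
            stack[++sp] = state;                         /* goto: lookahead untouched */
            continue;
        }

        if (lookahead < 0)
            lookahead = *tokens++;                       /* stands in for yylex() */

        if (state == 0 && lookahead == DING) stack[++sp] = 3;
        else if (state == 2 && lookahead == DELL) stack[++sp] = 5;
        else if (state == 3 && lookahead == DONG) stack[++sp] = 6;
        else if (state == 1 && lookahead == END) return 1;   /* accept */
        else return 0;                                       /* error */

        lookahead = -1;                                  /* a shift clears the lookahead */
    }
}
```

Called on the sequence DING, DONG, DELL, END, the function passes through the same stack contents as the walk above and accepts; the incorrect strings in the exercise that follows lead it to the error branch.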

The reader is urged to consider how the parser works when confronted with such incorrect strings as DING DONG DONG, DING DONG, DING DONG DELL DELL, etc. A few minutes spent with this and other simple examples will probably be repaid when problems arise in more complicated contexts.

5: Ambiguity and Conflicts

A set of grammar rules is ambiguous if there is some input string that can be structured in two or more different ways. For example, the grammar rule

        expr    :       expr  '-'  expr

is a natural way of expressing the fact that one way of forming an arithmetic expression is to put two other expressions together with a minus sign between them. Unfortunately, this grammar rule does not completely specify the way that all complex inputs should be structured. For example, if the input is

        expr  -  expr  -  expr

the rule allows this input to be structured as either

        (  expr  -  expr  )  -  expr

or as

        expr  -  (  expr  -  expr  )

(The first is called left association, the second right association).
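The choice matters because subtraction gives different results under the two structures. The two C functions below, whose names are invented for this illustration, spell out each grouping.

```c
/* Left association:  ( a - b ) - c */
int left_assoc(int a, int b, int c)
{
    return (a - b) - c;
}

/* Right association:  a - ( b - c ) */
int right_assoc(int a, int b, int c)
{
    return a - (b - c);
}
```

For the input 9 - 4 - 2, left association gives (9 - 4) - 2 = 3, while right association gives 9 - (4 - 2) = 7.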

Yacc detects such ambiguities when it is attempting to build the parser. It is instructive to consider the problem that confronts the parser when it is given an input such as

        expr  -  expr  -  expr

When the parser has read the second expr, the input that it has seen:

        expr  -  expr

matches the right side of the grammar rule above. The parser could reduce the input by applying this rule; after applying the rule, the input is reduced to expr (the left side of the rule). The parser would then read the final part of the input:

        -  expr

and again reduce. The effect of this is to take the left associative interpretation.

Alternatively, when the parser has seen

        expr  -  expr

it could defer the immediate application of the rule, and continue reading the input until it had seen

        expr  -  expr  -  expr

It could then apply the rule to the rightmost three symbols, reducing them to expr and leaving

        expr  -  expr

Now the rule can be reduced once more; the effect is to take the right associative interpretation. Thus, having read

        expr  -  expr

the parser can do two legal things, a shift or a reduction, and has no way of deciding between them. This is called a shift / reduce conflict. It may also happen that the parser has a choice of two legal reductions; this is called a reduce / reduce conflict. Note that there are never any ``Shift/shift'' conflicts.

When there are shift/reduce or reduce/reduce conflicts, Yacc still produces a parser. It does this by selecting one of the valid steps wherever it has a choice. A rule describing which choice to make in a given situation is called a disambiguating rule.

Yacc invokes two disambiguating rules by default:

1. In a shift/reduce conflict, the default is to do the shift.

2. In a reduce/reduce conflict, the default is to reduce by the earlier grammar rule (in the input sequence).

Rule 1 implies that reductions are deferred whenever there is a choice, in favor of shifts. Rule 2 gives the user rather crude control over the behavior of the parser in this situation, but reduce/reduce conflicts should be avoided whenever possible.

Conflicts may arise because of mistakes in input or logic, or because the grammar rules, while consistent, require a more complex parser than Yacc can construct. The use of actions within rules can also cause conflicts, if the action must be done before the parser can be sure which rule is being recognized. In these cases, the application of disambiguating rules is inappropriate, and leads to an incorrect parser. For this reason, Yacc always reports the number of shift/reduce and reduce/reduce conflicts resolved by Rule 1 and Rule 2.

In general, whenever it is possible to apply disambiguating rules to produce a correct parser, it is also possible to rewrite the grammar rules so that the same inputs are read but there are no conflicts. For this reason, most previous parser generators have considered conflicts to be fatal errors. Our experience has suggested that this rewriting is somewhat unnatural, and produces slower parsers; thus, Yacc will produce parsers even in the presence of conflicts.

As an example of the power of disambiguating rules, consider a fragment from a programming language involving an ``if-then-else'' construction:

        stat    :       IF  '('  cond  ')'  stat
                |       IF  '('  cond  ')'  stat  ELSE  stat
                ;

In these rules, IF and ELSE are tokens, cond is a nonterminal symbol describing conditional (logical) expressions, and stat is a nonterminal symbol describing statements. The first rule will be called the simple-if rule, and the second the if-else rule.

These two rules form an ambiguous construction, since input of the form

        IF  (  C1  )  IF  (  C2  )  S1  ELSE  S2

can be structured according to these rules in two ways:

        IF  (  C1  )  {
                IF  (  C2  )  S1
                }
        ELSE  S2

or

        IF  (  C1  )  {
                IF  (  C2  )  S1
                ELSE  S2
                }

The second interpretation is the one given in most programming languages having this construct. Each ELSE is associated with the last preceding ``un-ELSE'd'' IF. In this example, consider the situation where the parser has seen

        IF  (  C1  )  IF  (  C2  )  S1

and is looking at the ELSE. It can immediately reduce by the simple-if rule to get

        IF  (  C1  )  stat

and then read the remaining input,

        ELSE  S2

and reduce

        IF  (  C1  )  stat  ELSE  S2

by the if-else rule. This leads to the first of the above groupings of the input.

On the other hand, the ELSE may be shifted, S2 read, and then the right hand portion of

        IF  (  C1  )  IF  (  C2  )  S1  ELSE  S2

can be reduced by the if-else rule to get

        IF  (  C1  )  stat

which can be reduced by the simple-if rule. This leads to the second of the above groupings of the input, which is usually desired.

Once again the parser can do two valid things - there is a shift/reduce conflict. The application of disambiguating rule 1 tells the parser to shift in this case, which leads to the desired grouping.
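The effect of preferring the shift, that each ELSE pairs with the last preceding ``un-ELSE'd'' IF, can be modeled with a stack of open IFs. The following C sketch is a simplified model for a single nested statement, not code produced by Yacc; the one-character token encoding and the name match_else are assumptions made for illustration.

```c
#include <stddef.h>

#define MAX_IFS 64   /* arbitrary nesting limit for this sketch */

/*
 * tokens is a string in which 'i' stands for IF ( cond ), 's' for a
 * simple statement, and 'e' for ELSE.
 * For every ELSE at index k, match[k] receives the index of the IF it
 * pairs with, or -1 if there is none; all other entries are set to -1.
 * match must have room for one int per token.
 * Returns the number of ELSEs that were paired.
 */
int match_else(const char *tokens, int *match)
{
    int open_ifs[MAX_IFS];   /* indices of IFs that have no ELSE yet */
    int depth = 0;
    int matched = 0;
    size_t k;

    for (k = 0; tokens[k] != '\0'; k++) {
        match[k] = -1;
        if (tokens[k] == 'i' && depth < MAX_IFS) {
            open_ifs[depth++] = (int)k;
        } else if (tokens[k] == 'e' && depth > 0) {
            match[k] = open_ifs[--depth];   /* nearest open IF wins */
            matched++;
        }
    }
    return matched;
}
```

For the string "iises", which encodes IF ( C1 ) IF ( C2 ) S1 ELSE S2, the ELSE at index 3 is paired with the IF at index 1, the inner one.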

This shift/reduce conflict arises only when there is a particular current input symbol, ELSE, and particular inputs already seen, such as

        IF  (  C1  )  IF  (  C2  )  S1

In general, there may be many conflicts, and each one will be associated with an input symbol and a set of previously read inputs. The previously read inputs are characterized by the state of the parser.

The conflict messages of Yacc are best understood by examining the verbose (-v) option output file. For example, the output corresponding to the above conflict state might be:

        23: shift/reduce conflict (shift 45, reduce 18) on ELSE

        state 23

                  stat  :  IF  (  cond  )  stat_         (18)
                  stat  :  IF  (  cond  )  stat_ELSE  stat

                 ELSE     shift 45
                 .        reduce 18

The first line describes the conflict, giving the state and the input symbol. The ordinary state description follows, giving the grammar rules active in the state, and the parser actions. Recall that the underline marks the portion of the grammar rules which has been seen. Thus in the example, in state 23 the parser has seen input corresponding to

        IF  (  cond  )  stat

and the two grammar rules shown are active at this time. The parser can do two possible things. If the input symbol is ELSE, it is possible to shift into state 45. State 45 will have, as part of its description, the line

        stat  :  IF  (  cond  )  stat  ELSE_stat

since the ELSE will have been shifted in this state. Back in state 23, the alternative action, described by ``.'', is to be done if the input symbol is not mentioned explicitly in the above actions; thus, in this case, if the input symbol is not ELSE, the parser reduces by grammar rule 18:

        stat  :  IF  '('  cond  ')'  stat

Once again, notice that the numbers following ``shift'' commands refer to other states, while the numbers following ``reduce'' commands refer to grammar rule numbers. In the y.output file, the rule numbers are printed after those rules which can be reduced. In most states, there will be at most one reduce action possible in the state, and this will be the default command. The user who encounters unexpected shift/reduce conflicts will probably want to look at the verbose output to decide whether the default actions are appropriate. In really tough cases, the user might need to know more about the behavior and construction of the parser than can be covered here. In this case, one of the theoretical references [2, 3, 4] might be consulted; the services of a local guru might also be appropriate.

6: Precedence

There is one common situation where the rules given above for resolving conflicts are not sufficient; this is in the parsing of arithmetic expressions. Most of the commonly used constructions for arithmetic expressions can be naturally described by the notion of precedence levels for operators, together with information about left or right associativity. It turns out that ambiguous grammars with appropriate disambiguating rules can be used to create parsers that are faster and easier to write than parsers constructed from unambiguous grammars. The basic notion is to write grammar rules of the form

        expr  :  expr  OP  expr

and

        expr  :  UNARY  expr

for all binary and unary operators desired. This creates a very ambiguous grammar, with many parsing conflicts. As disambiguating rules, the user specifies the precedence, or binding strength, of all the operators, and the associativity of the binary operators. This information is sufficient to allow Yacc to resolve the parsing conflicts in accordance with these rules, and construct a parser that realizes the desired precedences and associativities.

The precedences and associativities are attached to tokens in the declarations section. This is done by a series of lines beginning with a Yacc keyword: %left, %right, or %nonassoc, followed by a list of tokens. All of the tokens on the same line are assumed to have the same precedence level and associativity; the lines are listed in order of increasing precedence or binding strength. Thus,

        %left  '+'  '-'
        %left  '*'  '/'

describes the precedence and associativity of the four arithmetic operators. Plus and minus are left associative, and have lower precedence than star and slash, which are also left associative. The keyword %right is used to describe right associative operators, and the keyword %nonassoc is used to describe operators, like the operator .LT. in Fortran, that may not associate with themselves; thus,

        A  .LT.  B  .LT.  C

is illegal in Fortran, and such an operator would be described with the keyword %nonassoc in Yacc. As an example of the behavior of these declarations, the description

        %right  '='
        %left  '+'  '-'
        %left  '*'  '/'

        %%

        expr    :       expr  '='  expr
                |       expr  '+'  expr
                |       expr  '-'  expr
                |       expr  '*'  expr
                |       expr  '/'  expr
                |       NAME
                ;

might be used to structure the input

        a  =  b  =  c*d  -  e  -  f*g

as follows:

        a = ( b = ( ((c*d)-e) - (f*g) ) )

When this mechanism is used, unary operators must, in general, be given a precedence. Sometimes a unary operator and a binary operator have the same symbolic representation, but different precedences. An example is unary and binary '-'; unary minus may be given the same strength as multiplication, or even higher, while binary minus has a lower strength than multiplication. The keyword %prec changes the precedence level associated with a particular grammar rule. %prec appears immediately after the body of the grammar rule, before the action or closing semicolon, and is followed by a token name or literal. It causes the precedence of the grammar rule to become that of the following token name or literal. For example, to make unary minus have the same precedence as multiplication the rules might resemble:

        %left  '+'  '-'
        %left  '*'  '/'

        %%

        expr    :       expr  '+'  expr
                |       expr  '-'  expr
                |       expr  '*'  expr
                |       expr  '/'  expr
                |       '-'  expr      %prec  '*'
                |       NAME
                ;

A token declared by %left, %right, and %nonassoc need not be, but may be, declared by %token as well.

The precedences and associativities are used by Yacc to resolve parsing conflicts; they give rise to disambiguating rules. Formally, the rules work as follows:

1. The precedences and associativities are recorded for those tokens and literals that have them.

2. A precedence and associativity is associated with each grammar rule; it is the precedence and associativity of the last token or literal in the body of the rule. If the %prec construction is used, it overrides this default. Some grammar rules may have no precedence and associativity associated with them.

3. When there is a reduce/reduce conflict, or there is a shift/reduce conflict and either the input symbol or the grammar rule has no precedence and associativity, then the two disambiguating rules given at the beginning of the section are used, and the conflicts are reported.

4. If there is a shift/reduce conflict, and both the grammar rule and the input character have precedence and associativity associated with them, then the conflict is resolved in favor of the action (shift or reduce) associated with the higher precedence. If the precedences are the same, then the associativity is used; left associative implies reduce, right associative implies shift, and nonassociating implies error.
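Rules 3 and 4 amount to a small decision procedure for shift/reduce conflicts. The C sketch below restates them as a function; it is an illustration of the rules as worded here, not Yacc's source, and the names (resolve_shift_reduce, the enum constants) and the convention that precedence 0 means ``none declared'' are assumptions.

```c
enum assoc  { ASSOC_LEFT, ASSOC_RIGHT, ASSOC_NONASSOC };
enum action { ACT_SHIFT, ACT_REDUCE, ACT_ERROR };

/*
 * Resolve one shift/reduce conflict.
 * rule_prec   precedence of the grammar rule (0 = none declared)
 * token_prec  precedence of the input token  (0 = none declared)
 * token_assoc associativity of the token's level; tokens of equal
 *             precedence share it, since they come from the same line
 * *reported   set to 1 if the default rule was used, so that the
 *             conflict would be counted and reported; otherwise 0
 */
enum action resolve_shift_reduce(int rule_prec, int token_prec,
                                 enum assoc token_assoc, int *reported)
{
    *reported = 0;

    if (rule_prec == 0 || token_prec == 0) {
        *reported = 1;          /* rule 3: fall back to the default, shift */
        return ACT_SHIFT;
    }
    if (token_prec > rule_prec)
        return ACT_SHIFT;       /* rule 4: the higher precedence wins */
    if (token_prec < rule_prec)
        return ACT_REDUCE;

    switch (token_assoc) {      /* equal precedence: use associativity */
    case ASSOC_LEFT:
        return ACT_REDUCE;
    case ASSOC_RIGHT:
        return ACT_SHIFT;
    default:
        return ACT_ERROR;       /* nonassociating */
    }
}
```

With '+' at level 1 and '*' at level 2, both left associative, a parser that has seen expr + expr and is looking at '*' shifts, while one that has seen expr * expr and is looking at '+' reduces.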

Conflicts resolved by precedence are not counted in the number of shift/reduce and reduce/reduce conflicts reported by Yacc. This means that mistakes in the specification of precedences may disguise errors in the input grammar; it is a good idea to be sparing with precedences, and use them in an essentially ``cookbook'' fashion, until some experience has been gained. The y.output file is very useful in deciding whether the parser is actually doing what was intended.

7: Error Handling

Error handling is an extremely difficult area, and many of the problems are semantic ones. When an error is found, for example, it may be necessary to reclaim parse tree storage, delete or alter symbol table entries, and, typically, set switches to avoid generating any further output.

It is seldom acceptable to stop all processing when an error is found; it is more useful to continue scanning the input to find further syntax errors. This leads to the problem of getting the parser ``restarted'' after an error. A general class of algorithms to do this involves discarding a number of tokens from the input string, and attempting to adjust the parser so that input can continue.

To allow the user some control over this process, Yacc provides a simple, but reasonably general, feature. The token name ``error'' is reserved for error handling. This name can be used in grammar rules; in effect, it suggests places where errors are expected, and recovery might take place. The parser pops its stack until it enters a state where the token ``error'' is legal. It then behaves as if the token ``error'' were the current lookahead token, and performs the action encountered. The lookahead token is then reset to the token that caused the error. If no special error rules have been specified, the processing halts when an error is detected.

In order to prevent a cascade of error messages, the parser, after detecting an error, remains in error state until three tokens have been successfully read and shifted. If an error is detected when the parser is already in error state, no message is given, and the input token is quietly deleted.

As an example, a rule of the form

        stat    :       error

would, in effect, mean that on a syntax error the parser would attempt to skip over the statement in which the error was seen. More precisely, the parser will scan ahead, looking for three tokens that might legally follow a statement, and start processing at the first of these; if the beginnings of statements are not sufficiently distinctive, it may make a false start in the middle of a statement, and end up reporting a second error where there is in fact no error.

Actions may be used with these special error rules. These actions might attempt to reinitialize tables, reclaim symbol table space, etc.

Error rules such as the above are very general, but difficult to control. Somewhat easier are rules such as

        stat    :       error  ';'

Here, when there is an error, the parser attempts to skip over the statement, but will do so by skipping to the next ';'. All tokens after the error and before the next ';' cannot be shifted, and are discarded. When the ';' is seen, this rule will be reduced, and any ``cleanup'' action associated with it performed.
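The skipping behavior of the stat : error ';' rule can be pictured as a scan over the remaining tokens. The following C sketch models only that scan, under the assumption that tokens are plain ints and that the offending token is itself discarded unless it is the ';'; the name skip_to_semicolon is hypothetical.

```c
#include <stddef.h>

/*
 * tokens[err_pos] is the token at which the error was detected.
 * Tokens are discarded until a ';' is found.
 * Returns the index of the first token after that ';', where normal
 * parsing would resume, or n if no ';' remains in the input.
 * *discarded receives the number of tokens thrown away before the ';'.
 */
size_t skip_to_semicolon(const int *tokens, size_t n, size_t err_pos,
                         size_t *discarded)
{
    size_t i = err_pos;

    *discarded = 0;
    while (i < n && tokens[i] != ';') {
        (*discarded)++;
        i++;
    }
    return (i < n) ? i + 1 : n;
}
```

For the token sequence a = + b ; c with the error detected at '+', two tokens are discarded and parsing resumes at c.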

Another form of error rule arises in interactive applications, where it may be desirable to permit a line to be reentered after an error. A possible error rule might be

        input   :       error  '\n'  {  printf( "Reenter last line: " );  }  input
                                {       $$  =  $4;  }

There is one potential difficulty with this approach; the parser must correctly process three input tokens before it admits that it has correctly resynchronized after the error. If the reentered line contains an error in the first two tokens, the parser deletes the offending tokens, and gives no message; this is clearly unacceptable. For this reason, there is a mechanism that can be used to force the parser to believe that an error has been fully recovered from. The statement

        yyerrok ;

in an action resets the parser to its normal mode. The last example is better written

        input   :       error  '\n'
                                {       yyerrok;
                                        printf( "Reenter last line: " );   }
                        input
                                {       $$  =  $4;  }
                ;
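The three-token rule and the effect of yyerrok can be modeled with a single counter. The C sketch below is a simplified model of the behavior described in the text, not the generated parser's code; the struct, its field names, and the functions on_shift, on_error, and errok are all assumptions made for illustration.

```c
struct recovery {
    int errflag;    /* tokens still to be shifted before leaving error state */
    int messages;   /* number of error messages that would be printed */
    int deleted;    /* number of tokens quietly deleted */
};

void recovery_init(struct recovery *r)
{
    r->errflag = 0;
    r->messages = 0;
    r->deleted = 0;
}

/* A token was successfully read and shifted. */
void on_shift(struct recovery *r)
{
    if (r->errflag > 0)
        r->errflag--;
}

/* A syntax error was detected at the current token. */
void on_error(struct recovery *r)
{
    if (r->errflag == 0)
        r->messages++;      /* normal mode: report the error */
    else
        r->deleted++;       /* already in error state: stay quiet */
    r->errflag = 3;         /* three clean shifts are required again */
}

/* Models the effect of yyerrok: return to normal mode at once. */
void errok(struct recovery *r)
{
    r->errflag = 0;
}
```

In this model, a second error that arrives after only two shifted tokens produces no message, whereas the same error after a call to errok is reported; this is the difference between the two versions of the input rule above.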

As mentioned above, the token seen immediately after the ``error'' symbol is the input token at which the error was discovered. Sometimes, this is inappropriate; for example, an error recovery action might take upon itself the job of finding the correct place to resume input. In this case, the previous lookahead token must be cleared. The statement

        yyclearin ;

in an action will have this effect. For example, suppose the action after error were to call some sophisticated resynchronization routine, supplied by the user, that attempted to advance the input to the beginning of the next valid statement. After this routine was called, the next token returned by yylex would presumably be the first token in a legal statement; the old, illegal token must be discarded, and the error state reset. This could be done by a rule like

        stat    :       error
                                {       resynch();
                                        yyerrok ;
                                        yyclearin ;   }
                ;

These mechanisms are admittedly crude, but do allow for a simple, fairly effective recovery of the parser from many errors; moreover, the user can get control to deal with the error actions required by other portions of the program.

8: The Yacc Environment

When the user inputs a specification to Yacc, the output is a file of C programs, called y.tab.c on most systems (due to local file system conventions, the names may differ from installation to installation). The function produced by Yacc is called yyparse; it is an integer valued function. When it is called, it in turn repeatedly calls yylex, the lexical analyzer supplied by the user (see Section 3) to obtain input tokens. Eventually, either an error is detected, in which case (if no error recovery is possible) yyparse returns the value 1, or the lexical analyzer returns the endmarker token and the parser accepts. In this case, yyparse returns the value 0.

The user must provide a certain amount of environment for this parser in order to obtain a working program. For example, as with every C program, a program called main must be defined, that eventually calls yyparse. In addition, a routine called yyerror prints a message when a syntax error is detected.

These two routines must be supplied in one form or another by the user. To ease the initial effort of using Yacc, a library has been provided with default versions of main and yyerror. The name of this library is system dependent; on many systems the library is accessed by a -ly argument to the loader. To show the triviality of these default programs, the source is given below:

        main(){
                return( yyparse() );
                }

and

        # include <stdio.h>

        yyerror(s) char *s; {
                fprintf( stderr, "%s\n", s );
                }

The argument to yyerror is a string containing an error message, usually the string ``syntax error''. The average application will want to do better than this. Ordinarily, the program should keep track of the input line number, and print it along with the message when a syntax error is detected. The external integer variable yychar contains the lookahead token number at the time the error was detected; this may be of some interest in giving better diagnostics. Since the main program is probably supplied by the user (to read arguments, etc.) the Yacc library is useful only in small projects, or in the earliest stages of larger ones.
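A yyerror that reports the line number might look like the following sketch. It is written in present-day C rather than the style of the library source above; the variable lineno and the helper format_error are hypothetical, and the sketch assumes that the user's yylex increments lineno each time it reads a newline.

```c
#include <stdio.h>

/* Hypothetical line counter; yylex is assumed to increment it on '\n'. */
int lineno = 1;

/*
 * Build a message of the form "line N: text" in buf.
 * Returns the value returned by snprintf.
 */
int format_error(char *buf, size_t size, const char *msg, int line)
{
    return snprintf(buf, size, "line %d: %s", line, msg);
}

/* User-supplied replacement for the library version of yyerror. */
void yyerror(const char *s)
{
    char buf[256];

    format_error(buf, sizeof buf, s, lineno);
    fprintf(stderr, "%s\n", buf);
}
```

Keeping the formatting in a separate function makes it possible to check the message text without capturing the standard error stream.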

The external integer variable yydebug is normally set to 0. If it is set to a nonzero value, the parser will output a verbose description of its actions, including a discussion of which input symbols have been read, and what the parser actions are. Depending on the operating environment, it may be possible to set this variable by using a debugging system.

9: Hints for Preparing Specifications

This section contains miscellaneous hints on preparing efficient, easy to change, and clear specifications. The individual subsections are more or less independent.

Input Style

It is difficult to provide rules with substantial actions and still have a readable specification file. The following style hints owe much to Brian Kernighan.

a. Use all capital letters for token names, all lower case letters for nonterminal names. This rule comes under the heading of ``knowing who to blame when things go wrong.''

b. Put grammar rules and actions on separate lines. This allows either to be changed without an automatic need to change the other.

c. Put all rules with the same left hand side together. Put the left hand side in only once, and let all following rules begin with a vertical bar.

d. Put a semicolon only after the last rule with a given left hand side, and put the semicolon on a separate line. This allows new rules to be easily added.

e. Indent rule bodies by two tab stops, and action bodies by three tab stops.

The example in Appendix A is written following this style, as are the examples in the text of this paper (where space permits). The user must make up his own mind about these stylistic questions; the central problem, however, is to make the rules visible through the morass of action code.

Left Recursion

The algorithm used by the Yacc parser encourages so called ``left recursive'' grammar rules: rules of the form

        name    :       name  rest_of_rule  ;

These rules frequently arise when writing specifications of sequences and lists:

        list    :       item
                |       list  ','  item
                ;

and

        seq     :       item
                |       seq  item
                ;

In each of these cases, the first rule will be reduced for the first item only, and the second rule will be reduced for the second and all succeeding items.

With right recursive rules, such as

        seq     :       item
                |       item  seq
                ;

the parser would be a bit bigger, and the items would be seen, and reduced, from right to left. More seriously, an internal stack in the parser would be in danger of overflowing if a very long sequence were read. Thus, the user should use left recursion wherever reasonable.
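The difference in stack use can be estimated by counting how many symbols are on the parser stack at once. The C sketch below simulates the shifts and reductions for the two forms of seq; it is a counting model only, with hypothetical function names, and it ignores the states that a real parser also keeps on its stack.

```c
#include <stddef.h>

/* seq : item | seq item ;   a reduction follows every shifted item. */
size_t max_depth_left(size_t n)
{
    size_t depth = 0, max = 0, i;

    for (i = 0; i < n; i++) {
        depth++;                    /* shift item */
        if (depth > max)
            max = depth;
        if (depth == 2)
            depth = 1;              /* reduce "seq item" to seq */
        /* with depth 1, "item" is reduced to seq; depth is unchanged */
    }
    return max;
}

/* seq : item | item seq ;   nothing can be reduced until the last item. */
size_t max_depth_right(size_t n)
{
    size_t depth = 0, max = 0, i;

    for (i = 0; i < n; i++) {
        depth++;                    /* shift item */
        if (depth > max)
            max = depth;
    }
    while (depth > 1)               /* reduce "item seq" to seq, repeatedly */
        depth--;
    return max;
}
```

For a sequence of 1000 items the left recursive form never holds more than 2 symbols, while the right recursive form holds all 1000 before the first reduction of item seq.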

It is worth considering whether a sequence with zero elements has any meaning, and if so, consider writing the sequence specification with an empty rule:

        seq     :       /* empty */
                |       seq  item
                ;

Once again, the first rule would always be reduced exactly once, before the first item was read, and then the second rule would be reduced once for each item read. Permitting empty sequences often leads to increased generality. However, conflicts might arise if Yacc is asked to decide which empty sequence it has seen, when it hasn't seen enough to know!

Lexical Tie-ins

Some lexical decisions depend on context. For example, the lexical analyzer might want to delete blanks normally, but not within quoted strings. Or names might be entered into a symbol table in declarations, but not in expressions.

One way of handling this situation is to create a global flag that is examined by the lexical analyzer, and set by actions. For example, suppose a program consists of 0 or more declarations, followed by 0 or more statements. Consider:

        %{
                int dflag;
        %}
          ...  other declarations ...

        %%

        prog    :       decls  stats
                ;

        decls   :       /* empty */
                                {       dflag = 1;  }
                |       decls  declaration
                ;

        stats   :       /* empty */
                                {       dflag = 0;  }
                |       stats  statement
                ;

            ...  other rules ...

The flag dflag is now 0 when reading statements, and 1 when reading declarations, except for the first token in the first statement. This token must be seen by the parser before it can tell that the declaration section has ended and the statements have begun. In many cases, this single token exception does not affect the lexical scan.
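On the lexical analyzer's side, the flag might be consulted as in the following C sketch, which enters names into a symbol table only while declarations are being read. Only dflag comes from the example above; the table, its size limits, and the function lookup_name are hypothetical.

```c
#include <string.h>

#define MAX_NAMES 32    /* arbitrary table size for this sketch */
#define NAME_LEN  16    /* arbitrary maximum name length, including '\0' */

int dflag = 1;          /* set by the parser's actions, as in the example */

static char names[MAX_NAMES][NAME_LEN];
static int n_names = 0;

/*
 * Return the symbol table index of name.
 * An unknown name is entered only when dflag is nonzero, that is, while
 * declarations are being read; otherwise -1 is returned.
 */
int lookup_name(const char *name)
{
    int i;

    for (i = 0; i < n_names; i++)
        if (strcmp(names[i], name) == 0)
            return i;

    if (dflag && n_names < MAX_NAMES && strlen(name) < NAME_LEN) {
        strcpy(names[n_names], name);
        return n_names++;
    }
    return -1;
}
```

A name first met in a declaration is entered and later found again in statements, while a name first met in a statement is rejected.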

This kind of ``backdoor'' approach can be elaborated to a noxious degree. Nevertheless, it represents a way of doing some things that are difficult, if not impossible, to do otherwise.

Reserved Words

Some programming languages permit the user to use words like ``if'', which are normally reserved, as label or variable names, provided that such use does not conflict with the legal use of these names in the programming language. This is extremely hard to do in the framework of Yacc; it is difficult to pass information to the lexical analyzer telling it ``this instance of `if' is a keyword, and that instance is a variable''. The user can make a stab at it, using the mechanism described in the last subsection, but it is difficult.

A number of ways of making this easier are under advisement. Until then, it is better that the keywords be reserved; that is, be forbidden for use as variable names. There are powerful stylistic reasons for preferring this, anyway.

10: Advanced Topics

This section discusses a number of advanced features of Yacc.

Simulating Error and Accept in Actions

The parsing actions of error and accept can be simulated in an action by use of macros YYACCEPT and YYERROR. YYACCEPT causes yyparse to return the value 0; YYERROR causes the parser to behave as if the current input symbol had been a syntax error; yyerror is called, and error recovery takes place. These mechanisms can be used to simulate parsers with multiple endmarkers or context-sensitive syntax checking.

Accessing Values in Enclosing Rules

An action may refer to values returned by actions to the left of the current rule. The mechanism is simply the same as with ordinary actions, a dollar sign followed by a digit, but in this case the digit may be 0 or negative. Consider

        sent    :       adj  noun  verb  adj  noun
                                {  look at the sentence . . .  }
                ;

        adj     :       THE             {       $$ = THE;  }
                |       YOUNG   {       $$ = YOUNG;  }
                . . .
                ;

        noun    :       DOG
                                {       $$ = DOG;  }
                |       CRONE
                                {       if( $0 == YOUNG ){
                                                printf( "what?\n" );
                                                }
                                        $$ = CRONE;
                                        }
                ;

                . . .

In the action following the word CRONE, a check is made that the preceding token shifted was not YOUNG. Obviously, this is only possible when a great deal is known about what might precede the symbol noun in the input. There is also a distinctly unstructured flavor about this. Nevertheless, at times this mechanism will save a great deal of trouble, especially when a few combinations are to be excluded from an otherwise regular structure.
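One way to picture $0 is as an index into the value stack that falls just below the first symbol of the rule being reduced. The C sketch below assumes the value stack is a plain array of ints; the token numbers, the helper dollar, and the parameter first_rhs are hypothetical and do not reflect the layout of any particular Yacc implementation.

```c
/* Hypothetical token numbers, chosen only for this sketch. */
enum { THE = 257, YOUNG, DOG, CRONE };

/*
 * vs        value stack, with older values at lower indices
 * first_rhs index in vs of the value for $1 of the rule being reduced
 * n         the number written after the dollar sign; may be 0 or negative
 */
int dollar(const int *vs, int first_rhs, int n)
{
    return vs[first_rhs + n - 1];
}

/* The test made in the CRONE action: is the value to the left YOUNG? */
int preceded_by_young(const int *vs, int first_rhs)
{
    return dollar(vs, first_rhs, 0) == YOUNG;
}
```

When noun is reduced from CRONE after the adjective YOUNG, the stack holds the value for adj followed by the value for CRONE, so $1 is CRONE and $0 is YOUNG.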

Support for Arbitrary Value Types

By default, the values returned by actions and the lexical analyzer are integers. Yacc can also support values of other types, including structures. In addition, Yacc keeps track of the types, and inserts appropriate union member names so that the resulting parser will be strictly type checked. The Yacc value stack (see Section 4) is declared to be a union of the various types of values desired. The user declares the union, and associates union member names to each token and nonterminal symbol having a value. When the value is referenced through a $$ or $n construction, Yacc will automatically insert the appropriate union name, so that no unwanted conversions will take place. In addition, type checking commands such as Lint [5] will be far more silent.

There are three mechanisms used to provide for this typing. First, there is a way of defining the union; this must be done by the user since other programs, notably the lexical analyzer, must know about the union member names. Second, there is a way of associating a union member name with tokens and nonterminals. Finally, there is a mechanism for describing the type of those few values where Yacc can not easily determine the type.

To declare the union, the user includes in the declaration section:

        %union  {
                body of union ...
                }

This declares the Yacc value stack, and the external variables yylval and yyval, to have type equal to this union. If Yacc was invoked with the -d option, the union declaration is copied onto the y.tab.h file. Alternatively, the union may be declared in a header file, and a typedef used to define the variable YYSTYPE to represent this union. Thus, the header file might also have said:

        typedef union {
                body of union ...
                } YYSTYPE;

The header file must be included in the declarations section, by use of %{ and %}.

Once YYSTYPE is defined, the union member names must be associated with the various terminal and nonterminal names. The construction

        < name >

is used to indicate a union member name. If this follows one of the keywords %token, %left, %right, and %nonassoc, the union member name is associated with the tokens listed. Thus, saying

        %left  <optype>  '+'  '-'

will cause any reference to values returned by these two tokens to be tagged with the union member name optype. Another keyword, %type, is used similarly to associate union member names with nonterminals. Thus, one might say

        %type  <nodetype>  expr  stat

There remain a couple of cases where these mechanisms are insufficient. If there is an action within a rule, the value returned by this action has no a priori type. Similarly, reference to left context values (such as $0 - see the previous subsection) leaves Yacc with no easy way of knowing the type. In this case, a type can be imposed on the reference by inserting a union member name, between < and >, immediately after the first $. An example of this usage is

        rule    :       aaa  {  $<intval>$  =  3;  } bbb
                                {       fun( $<intval>2, $<other>0 );  }
                ;

This syntax has little to recommend it, but the situation arises rarely.

A sample specification is given in Appendix C. The facilities in this subsection are not triggered until they are used: in particular, the use of %type will turn on these mechanisms. When they are used, there is a fairly strict level of checking. For example, use of $n or $$ to refer to something with no defined type is diagnosed. If these facilities are not triggered, the Yacc value stack is used to hold int's, as was true historically.
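In C terms, tagging a reference with a union member name amounts to selecting one member of the value union. The sketch below shows a possible YYSTYPE and the kind of access that a reference such as $<intval>$ stands for; the member names intval and dval and the helper functions are assumptions, and the code is hand-written rather than Yacc output.

```c
/* A possible value type; the member names are hypothetical. */
typedef union {
    int    intval;
    double dval;
} YYSTYPE;

/* Roughly what "$<intval>$ = v" stands for: a store through one member. */
void set_intval(YYSTYPE *slot, int v)
{
    slot->intval = v;
}

/* Roughly what "$<intval>2" stands for: a fetch through the same member. */
int get_intval(const YYSTYPE *slot)
{
    return slot->intval;
}

/* The same slot can hold a value of the other type at another time. */
void set_dval(YYSTYPE *slot, double v)
{
    slot->dval = v;
}
```

Because each access names its member, the compiler sees an int or a double as appropriate, and no unwanted conversion takes place.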

11: Acknowledgements

Yacc owes much to a most stimulating collection of users, who have goaded me beyond my inclination, and frequently beyond my ability, in their endless search for ``one more feature''. Their irritating unwillingness to learn how to do things my way has usually led to my doing things their way; most of the time, they have been right. B. W. Kernighan, P. J. Plauger, S. I. Feldman, C. Imagna, M. E. Lesk, and A. Snyder will recognize some of their ideas in the current version of Yacc. C. B. Haley contributed to the error recovery algorithm. D. M. Ritchie, B. W. Kernighan, and M. O. Harris helped translate this document into English. Al Aho also deserves special credit for bringing the mountain to Mohammed, and other favors.

References

1. B. W. Kernighan and D. M. Ritchie, The C Programming Language, Prentice-Hall, Englewood Cliffs, New Jersey, 1978.

2. A. V. Aho and S. C. Johnson, "LR Parsing," Comp. Surveys, vol. 6, no. 2, pp. 99-124, June 1974.

3. A. V. Aho, S. C. Johnson, and J. D. Ullman, "Deterministic Parsing of Ambiguous Grammars," Comm. Assoc. Comp. Mach., vol. 18, no. 8, pp. 441-452, August 1975.

4. A. V. Aho and J. D. Ullman, Principles of Compiler Design, Addison-Wesley, Reading, Mass., 1977.

5. S. C. Johnson, "Lint, a C Program Checker," Comp. Sci. Tech. Rep. No. 65, 1978. Updated version TM 78-1273-3.

6. S. C. Johnson, "A Portable Compiler: Theory and Practice," Proc. 5th ACM Symp. on Principles of Programming Languages, pp. 97-104, January 1978.

7. B. W. Kernighan and L. L. Cherry, "A System for Typesetting Mathematics," Comm. Assoc. Comp. Mach., vol. 18, pp. 151-157, Bell Laboratories, Murray Hill, New Jersey, March 1975.

8. M. E. Lesk, "Lex - A Lexical Analyzer Generator," Comp. Sci. Tech. Rep. No. 39, Bell Laboratories, Murray Hill, New Jersey, October 1975.

Appendix A: A Simple Example

This example gives the complete Yacc specification for asmall desk calculator; the desk calculator has 26 registers,labeled ``a'' through ``z'', and accepts arithmetic expressionsmade up of the operators +, -, *, /, % (mod operator), & (bitwiseand), | (bitwise or), and assignment. If an expression at thetop level is an assignment, the value is not printed; otherwiseit is. As in C, an integer that begins with 0 (zero) is assumedto be octal; otherwise, it is assumed to be decimal.

As an example of a Yacc specification, the desk calculator does a reasonable job of showing how precedences and ambiguities are used, and demonstrating simple error recovery. The major oversimplifications are that the lexical analysis phase is much simpler than for most applications, and the output is produced immediately, line by line. Note the way that decimal and octal integers are read in by the grammar rules; this job is probably better done by the lexical analyzer.

%{
# include <stdio.h>
# include <ctype.h>

int regs[26];
int base;

%}

%start list

%token DIGIT LETTER

%left '|'
%left '&'
%left '+' '-'
%left '*' '/' '%'
%left UMINUS      /* supplies precedence for unary minus */

%%      /* beginning of rules section */

list :    /* empty */
     |    list stat '\n'
     |    list error '\n'
               {    yyerrok;  }
     ;

stat :    expr
               {    printf( "%d\n", $1 );  }
     |    LETTER '=' expr
               {    regs[$1] = $3;  }
     ;

expr :    '(' expr ')'
               {    $$ = $2;  }
     |    expr '+' expr
               {    $$ = $1 + $3;  }
     |    expr '-' expr
               {    $$ = $1 - $3;  }
     |    expr '*' expr
               {    $$ = $1 * $3;  }
     |    expr '/' expr
               {    $$ = $1 / $3;  }
     |    expr '%' expr
               {    $$ = $1 % $3;  }
     |    expr '&' expr
               {    $$ = $1 & $3;  }
     |    expr '|' expr
               {    $$ = $1 | $3;  }
     |    '-' expr        %prec UMINUS
               {    $$ = - $2;  }
     |    LETTER
               {    $$ = regs[$1];  }
     |    number
     ;

number    :    DIGIT
               {    $$ = $1;    base = ($1==0) ? 8 : 10;  }
     |    number DIGIT
               {    $$ = base * $1 + $2;  }
     ;

%%      /* start of programs */

yylex() {      /* lexical analysis routine */
              /* returns LETTER for a lower case letter, yylval = 0 through 25 */
              /* return DIGIT for a digit, yylval = 0 through 9 */
              /* all other characters are returned immediately */

     int c;

     while( (c=getchar()) == ' ' ) {/* skip blanks */ }

     /* c is now nonblank */

     if( islower( c ) ) {
          yylval = c - 'a';
          return ( LETTER );
          }
     if( isdigit( c ) ) {
          yylval = c - '0';
          return( DIGIT );
          }
     return( c );
     }
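The digit-by-digit accumulation performed by the two ``number'' rules can also be traced outside the parser. The following is a minimal sketch in C, not part of the original specification; the function name read_number and its string argument are assumptions made for illustration. It mirrors the actions of the grammar: the first digit selects the base (8 if it is 0, otherwise 10), and each later digit is folded in as base * value + digit.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not in the original specification): accumulate the
   leading digits of s the same way the "number" rules do. */
int read_number(const char *s)
{
    int base, value;

    /* The sketch expects at least one leading digit. */
    assert(s != NULL && isdigit((unsigned char)s[0]));

    /* number : DIGIT  -- the first digit fixes the base */
    value = s[0] - '0';
    base = (value == 0) ? 8 : 10;

    /* number : number DIGIT  -- fold in each following digit */
    for (s++; isdigit((unsigned char)*s); s++)
        value = base * value + (*s - '0');

    return value;
}
```

Under these assumptions, read_number("017") yields 15 and read_number("17") yields 17. Like the grammar, the sketch does not reject the digits 8 and 9 in an octal constant.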

Appendix B: Yacc Input Syntax

This Appendix has a description of the Yacc input syntax, as a Yacc specification. Context dependencies, etc., are not considered. Ironically, the Yacc input specification language is most naturally specified as an LR(2) grammar; the sticky part comes when an identifier is seen in a rule, immediately following an action. If this identifier is followed by a colon, it is the start of the next rule; otherwise it is a continuation of the current rule, which just happens to have an action embedded in it. As implemented, the lexical analyzer looks ahead after seeing an identifier, and decides whether the next token (skipping blanks, newlines, comments, etc.) is a colon. If so, it returns the token C_IDENTIFIER. Otherwise, it returns IDENTIFIER. Literals (quoted strings) are also returned as IDENTIFIERs, but never as part of C_IDENTIFIERs.
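The lookahead described above can be sketched in C. This is an illustrative sketch, not the code Yacc itself uses: the function name classify_identifier, the token values, and the convention of passing the text that follows the identifier as a string are all assumptions. It skips blanks, newlines, and C-style comments, then reports C_IDENTIFIER if the next character is a colon and IDENTIFIER otherwise.

```c
#include <assert.h>
#include <stddef.h>

/* Token codes chosen for this sketch only; a real parser would take them
   from the definitions that Yacc generates. */
enum { IDENTIFIER = 257, C_IDENTIFIER = 258 };

/* rest points just past an identifier that has already been read. */
int classify_identifier(const char *rest)
{
    assert(rest != NULL);

    for (;;) {
        if (*rest == ' ' || *rest == '\t' || *rest == '\n') {
            rest++;                       /* skip blanks and newlines */
        } else if (rest[0] == '/' && rest[1] == '*') {
            rest += 2;                    /* skip a comment */
            while (*rest != '\0' && !(rest[0] == '*' && rest[1] == '/'))
                rest++;
            if (*rest != '\0')
                rest += 2;                /* step over the closing marker */
        } else {
            break;                        /* first significant character */
        }
    }
    return (*rest == ':') ? C_IDENTIFIER : IDENTIFIER;
}
```

Deciding this in the lexical analyzer is what lets the grammar below remain LALR(1): the parser itself never needs the second token of lookahead.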

            /* grammar for the input to Yacc */

      /* basic entities */
%token      IDENTIFIER        /* includes identifiers and literals */
%token      C_IDENTIFIER      /* identifier (but not literal) followed by colon */
%token      NUMBER            /* [0-9]+ */

      /* reserved words:    %type => TYPE, %left => LEFT, etc. */

%token      LEFT RIGHT NONASSOC TOKEN PREC TYPE START UNION

%token      MARK  /* the %% mark */
%token      LCURL /* the %{ mark */
%token      RCURL /* the %} mark */

      /* ascii character literals stand for themselves */

%start      spec

%%

spec  :     defs MARK rules tail
      ;

tail  :     MARK  {    In this action, eat up the rest of the file    }
      |     /* empty: the second MARK is optional */
      ;

defs  :     /* empty */
      |     defs def
      ;

def   :     START IDENTIFIER
      |     UNION  { Copy union definition to output }
      |     LCURL  { Copy C code to output file }  RCURL
      |     rword tag nlist
      ;

rword :     TOKEN
      |     LEFT
      |     RIGHT
      |     NONASSOC
      |     TYPE
      ;

tag   :     /* empty: union tag is optional */
      |     '<' IDENTIFIER '>'
      ;

nlist :     nmno
      |     nlist nmno
      |     nlist ',' nmno
      ;

nmno  :     IDENTIFIER              /* NOTE: literal illegal with %type */
      |     IDENTIFIER NUMBER       /* NOTE: illegal with %type */
      ;

      /* rules section */

rules :     C_IDENTIFIER rbody prec
      |     rules rule
      ;

rule  :     C_IDENTIFIER rbody prec
      |     '|' rbody prec
      ;

rbody :     /* empty */
      |     rbody IDENTIFIER
      |     rbody act
      ;

act   :     '{'  { Copy action, translate $$, etc. }  '}'
      ;

prec  :     /* empty */
      |     PREC IDENTIFIER
      |     PREC IDENTIFIER act
      |     prec ';'
      ;

Appendix C: An Advanced Example

This Appendix gives an example of a grammar using some of the advanced features discussed in Section 10. The desk calculator example in Appendix A is modified to provide a desk calculator that does floating point interval arithmetic. The calculator understands floating point constants, the arithmetic operations +, -, *, /, unary -, and = (assignment), and has 26 floating point variables, ``a'' through ``z''. Moreover, it also understands intervals, written

                ( x , y )

where x is less than or equal to y. There are 26 interval valued variables ``A'' through ``Z'' that may also be used. The usage is similar to that in Appendix A; assignments return no value, and print nothing, while expressions print the (floating or interval) value.

This example explores a number of interesting features of Yacc and C. Intervals are represented by a structure, consisting of the left and right endpoint values, stored as doubles. This structure is given a type name, INTERVAL, by using typedef. The Yacc value stack can also contain floating point scalars, and integers (used to index into the arrays holding the variable values). Notice that this entire strategy depends strongly on being able to assign structures and unions in C. In fact, many of the actions call functions that return structures as well.

It is also worth noting the use of YYERROR to handle error conditions: division by an interval containing 0, and an interval presented in the wrong order. In effect, the error recovery mechanism of Yacc is used to throw away the rest of the offending line.

In addition to the mixing of types on the value stack, this grammar also demonstrates an interesting use of syntax to keep track of the type (e.g. scalar or interval) of intermediate expressions. Note that a scalar can be automatically promoted to an interval if the context demands an interval value. This causes a large number of conflicts when the grammar is run through Yacc: 18 Shift/Reduce and 26 Reduce/Reduce. The problem can be seen by looking at the two input lines:

                2.5 + ( 3.5 - 4. )

and

                2.5 + ( 3.5 , 4. )

Notice that the 2.5 is to be used in an interval valued expression in the second example, but this fact is not known until the ``,'' is read; by this time, 2.5 is finished, and the parser cannot go back and change its mind. More generally, it might be necessary to look ahead an arbitrary number of tokens to decide whether to convert a scalar to an interval. This problem is evaded by having two rules for each binary interval valued operator: one when the left operand is a scalar, and one when the left operand is an interval. In the second case, the right operand must be an interval, so the conversion will be applied automatically. Despite this evasion, there are still many cases where the conversion may be applied or not, leading to the above conflicts. They are resolved by listing the rules that yield scalars first in the specification file; in this way, the conflicts will be resolved in the direction of keeping scalar valued expressions scalar valued until they are forced to become intervals.

This way of handling multiple types is very instructive, but not very general. If there were many kinds of expression types, instead of just two, the number of rules needed would increase dramatically, and the conflicts even more dramatically. Thus, while this example is instructive, it is better practice in a more normal programming language environment to keep the type information as part of the value, and not as part of the grammar.
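The alternative recommended here, carrying the type along with the value, might look like the following in C. This is a sketch under stated assumptions rather than code from the example: the names VALUE, VKIND, scalar, interval, and vadd, and the tag constants, are all hypothetical. A single addition routine inspects the tags at run time and yields an interval only when at least one operand is an interval, so the grammar would need just one rule per operator.

```c
#include <assert.h>

typedef enum { V_SCALAR, V_INTERVAL } VKIND;   /* hypothetical type tags */

typedef struct value {
    VKIND kind;        /* which kind of value this is */
    double lo, hi;     /* for a scalar, lo == hi */
} VALUE;

/* Build a scalar value. */
VALUE scalar(double d)
{
    VALUE v;
    v.kind = V_SCALAR;
    v.lo = v.hi = d;
    return v;
}

/* Build an interval value; the caller guarantees lo <= hi. */
VALUE interval(double lo, double hi)
{
    VALUE v;
    assert(lo <= hi);
    v.kind = V_INTERVAL;
    v.lo = lo;
    v.hi = hi;
    return v;
}

/* Addition: the result is an interval if either operand is one. */
VALUE vadd(VALUE a, VALUE b)
{
    VALUE r;
    r.kind = (a.kind == V_INTERVAL || b.kind == V_INTERVAL)
                 ? V_INTERVAL : V_SCALAR;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi;
    return r;
}
```

With such a representation, the value stack would hold VALUE objects throughout, and a single action of the form $$ = vadd( $1, $3 ) could serve both scalar and interval operands.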

Finally, a word about the lexical analysis. The only unusual feature is the treatment of floating point constants. The C library routine atof is used to do the actual conversion from a character string to a double precision value. If the lexical analyzer detects an error, it responds by returning a token that is illegal in the grammar, provoking a syntax error in the parser, and thence error recovery.

%{

# include <stdio.h>
# include <ctype.h>

typedef struct interval {
        double lo, hi;
        } INTERVAL;

INTERVAL vmul(), vdiv();

double atof();

double dreg[ 26 ];
INTERVAL vreg[ 26 ];

%}

%start    lines

%union    {
        int ival;
        double dval;
        INTERVAL vval;
        }

%token <ival> DREG VREG      /* indices into dreg, vreg arrays */

%token <dval> CONST          /* floating point constant */

%type <dval> dexp            /* expression */

%type <vval> vexp            /* interval expression */

        /* precedence information about the operators */

%left   '+' '-'
%left   '*' '/'
%left   UMINUS        /* precedence for unary minus */

%%

lines   :       /* empty */
        |       lines line
        ;

line    :       dexp '\n'
                        {       printf( "%15.8f\n", $1 );  }
        |       vexp '\n'
                        {       printf( "(%15.8f , %15.8f )\n", $1.lo, $1.hi );  }
        |       DREG '=' dexp '\n'
                        {       dreg[$1] = $3;  }
        |       VREG '=' vexp '\n'
                        {       vreg[$1] = $3;  }
        |       error '\n'
                        {       yyerrok;  }
        ;

dexp    :       CONST
        |       DREG
                        {       $$ = dreg[$1];  }
        |       dexp '+' dexp
                        {       $$ = $1 + $3;  }
        |       dexp '-' dexp
                        {       $$ = $1 - $3;  }
        |       dexp '*' dexp
                        {       $$ = $1 * $3;  }
        |       dexp '/' dexp
                        {       $$ = $1 / $3;  }
        |       '-' dexp       %prec UMINUS
                        {       $$ = - $2;  }
        |       '(' dexp ')'
                        {       $$ = $2;  }
        ;

vexp    :       dexp
                        {       $$.hi = $$.lo = $1;  }
        |       '(' dexp ',' dexp ')'
                        {
                        $$.lo = $2;
                        $$.hi = $4;
                        if( $$.lo > $$.hi ){
                                printf( "interval out of order\n" );
                                YYERROR;
                                }
                        }
        |       VREG
                        {       $$ = vreg[$1];    }
        |       vexp '+' vexp
                        {       $$.hi = $1.hi + $3.hi;
                                $$.lo = $1.lo + $3.lo;    }
        |       dexp '+' vexp
                        {       $$.hi = $1 + $3.hi;
                                $$.lo = $1 + $3.lo;    }
        |       vexp '-' vexp
                        {       $$.hi = $1.hi - $3.lo;
                                $$.lo = $1.lo - $3.hi;    }
        |       dexp '-' vexp
                        {       $$.hi = $1 - $3.lo;
                                $$.lo = $1 - $3.hi;    }
        |       vexp '*' vexp
                        {       $$ = vmul( $1.lo, $1.hi, $3 );  }
        |       dexp '*' vexp
                        {       $$ = vmul( $1, $1, $3 );  }
        |       vexp '/' vexp
                        {       if( dcheck( $3 ) ) YYERROR;
                                $$ = vdiv( $1.lo, $1.hi, $3 );  }
        |       dexp '/' vexp
                        {       if( dcheck( $3 ) ) YYERROR;
                                $$ = vdiv( $1, $1, $3 );  }
        |       '-' vexp       %prec UMINUS
                        {       $$.hi = -$2.lo;    $$.lo = -$2.hi;    }
        |       '(' vexp ')'
                        {       $$ = $2;  }
        ;

%%

# define BSZ 50        /* buffer size for floating point numbers */

        /* lexical analysis */

yylex(){
        register c;

        while( (c=getchar()) == ' ' ){ /* skip over blanks */ }

        if( isupper( c ) ){
                yylval.ival = c - 'A';
                return( VREG );
                }
        if( islower( c ) ){
                yylval.ival = c - 'a';
                return( DREG );
                }

        if( isdigit( c ) || c=='.' ){
                /* gobble up digits, points, exponents */

                char buf[BSZ+1], *cp = buf;
                int dot = 0, exp = 0;

                for( ; (cp-buf)<BSZ ; ++cp,c=getchar() ){

                        *cp = c;
                        if( isdigit( c ) ) continue;
                        if( c == '.' ){
                                if( dot++ || exp ) return( '.' );    /* will cause syntax error */
                                continue;
                                }

                        if( c == 'e' ){
                                if( exp++ ) return( 'e' );    /* will cause syntax error */
                                continue;
                                }

                        /* end of number */
                        break;
                        }
                *cp = '\0';
                if( (cp-buf) >= BSZ ) printf( "constant too long: truncated\n" );
                else ungetc( c, stdin );    /* push back last char read */
                yylval.dval = atof( buf );
                return( CONST );
                }
        return( c );
        }

INTERVAL hilo( a, b, c, d ) double a, b, c, d; {
        /* returns the smallest interval containing a, b, c, and d */
        /* used by *, / routines */
        INTERVAL v;

        if( a>b ) { v.hi = a;    v.lo = b; }
        else { v.hi = b;    v.lo = a; }

        if( c>d ) {
                if( c>v.hi ) v.hi = c;
                if( d<v.lo ) v.lo = d;
                }
        else {
                if( d>v.hi ) v.hi = d;
                if( c<v.lo ) v.lo = c;
                }
        return( v );
        }

INTERVAL vmul( a, b, v ) double a, b;    INTERVAL v; {
        return( hilo( a*v.hi, a*v.lo, b*v.hi, b*v.lo ) );
        }

dcheck( v ) INTERVAL v; {
        if( v.hi >= 0. && v.lo <= 0. ){
                printf( "divisor interval contains 0.\n" );
                return( 1 );
                }
        return( 0 );
        }

INTERVAL vdiv( a, b, v ) double a, b;    INTERVAL v; {
        return( hilo( a/v.hi, a/v.lo, b/v.hi, b/v.lo ) );
        }
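For readers using a current C compiler, the interval helper routines of this example can be restated with function prototypes. The following is a sketch in ISO C and not the original text: it keeps the names hilo, vmul, dcheck, and vdiv and the same arithmetic, but the prototype-style definitions, the explicit int return type of dcheck, and the assertion in vdiv are changes made here for illustration.

```c
#include <assert.h>
#include <stdio.h>

typedef struct interval {
    double lo, hi;
} INTERVAL;

/* Smallest interval containing a, b, c, and d. */
INTERVAL hilo(double a, double b, double c, double d)
{
    INTERVAL v;

    if (a > b) { v.hi = a; v.lo = b; }
    else       { v.hi = b; v.lo = a; }

    if (c > d) {
        if (c > v.hi) v.hi = c;
        if (d < v.lo) v.lo = d;
    } else {
        if (d > v.hi) v.hi = d;
        if (c < v.lo) v.lo = c;
    }
    return v;
}

/* Product of the interval (a, b) and the interval v. */
INTERVAL vmul(double a, double b, INTERVAL v)
{
    return hilo(a * v.hi, a * v.lo, b * v.hi, b * v.lo);
}

/* Returns 1, after printing a message, if v contains 0. */
int dcheck(INTERVAL v)
{
    if (v.hi >= 0. && v.lo <= 0.) {
        printf("divisor interval contains 0.\n");
        return 1;
    }
    return 0;
}

/* Quotient of (a, b) by v; the caller must first reject v with dcheck. */
INTERVAL vdiv(double a, double b, INTERVAL v)
{
    assert(!(v.hi >= 0. && v.lo <= 0.));   /* added guard, not in the original */
    return hilo(a / v.hi, a / v.lo, b / v.hi, b / v.lo);
}
```

For example, multiplying the interval (-1, 2) by (3, 4) through vmul gives the four endpoint products -4, -3, 8, and 6, so the result is the interval (-4, 8).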

Appendix D: Old Features Supported but not Encouraged

This Appendix mentions synonyms and features which are supported for historical continuity, but, for various reasons, are not encouraged.

1. Literals may also be delimited by double quotes ``"''.

2. Literals may be more than one character long. If all the characters are alphabetic, numeric, or _, the type number of the literal is defined, just as if the literal did not have the quotes around it. Otherwise, it is difficult to find the value for such literals.

The use of multi-character literals is likely to mislead those unfamiliar with Yacc, since it suggests that Yacc is doing a job which must be actually done by the lexical analyzer.

3. Most places where % is legal, backslash ``\'' may be used. In particular, \\ is the same as %%, \left the same as %left, etc.

4. There are a number of other synonyms:

             %< is the same as %left
             %> is the same as %right
             %binary and %2 are the same as %nonassoc
             %0 and %term are the same as %token
             %= is the same as %prec

5. Actions may also have the form

             ={ . . . }

and the curly braces can be dropped if the action is a single C statement.

6. C code between %{ and %} used to be permitted at the head of the rules section, as well as in the declaration section.