Lex can handle ambiguous specifications. When more than one expression can match the current input, Lex chooses as follows:
Thus, suppose the rules
center;
l l.
integer keyword action ...;
[a-z]+ identifier action ...;
to be given in that order. If the input is
integers,
it is taken as an identifier, because
[a-z]+
matches 8 characters while
integer
matches only 7.
If the input is
integer,
both rules match 7 characters, and
the keyword rule is selected because it was given first.
Anything shorter (e.g. int) will
not match the expression
integer
and so the identifier interpretation is used.
The principle of preferring the longest
match makes rules containing
expressions like
.*
dangerous.
For example,
center;
l.
'.*'
might seem a good way of recognizing
a string in single quotes.
But it is an invitation for the program to read far
ahead, looking for a distant
single quote.
Presented with the input
center;
l l.
'first' quoted string here, 'second' here
the above expression will match
center;
l l.
'first' quoted string here, 'second'
which is probably not what was wanted.
A better rule is of the form
center;
l.
'[^'\n]*'
which, on the above input, will stop
after
'first'.
The consequences
of errors like this are mitigated by the fact
that the
.
operator will not match newline.
Thus expressions like
.*
stop on the
current line.
Don't try to defeat this with expressions like
[.\n]+
or
equivalents;
the Lex generated program will try to read
the entire input file, causing
internal buffer overflows.
Note that Lex is normally partitioning
the input stream, not searching for all possible matches
of each expression.
This means that each character is accounted for
once and only once.
For example, suppose it is desired to
count occurrences of both she and he in an input text.
Some Lex rules to do this might be
center;
l l.
she s++;
he h++;
\n |
. ;
where the last two rules ignore everything besides he and she.
Remember that . does not include newline.
Since she includes he, Lex will normally
not
recognize
the instances of he included in she,
since once it has passed a she those characters are gone.
Sometimes the user would like to override this choice. The action
REJECT
means ``go do the next alternative.''
It causes whatever rule was second choice after the current
rule to be executed.
The position of the input pointer is adjusted accordingly.
Suppose the user really wants to count the included instances of he:
center;
l l.
she {s++; REJECT;}
he {h++; REJECT;}
\n |
. ;
these rules are one way of changing the previous example
to do just that.
After counting each expression, it is rejected; whenever appropriate,
the other expression will then be counted. In this example, of course,
the user could note that she includes he but not
vice versa, and omit the REJECT action on he;
in other cases, however, it
would not be possible a priori to tell
which input characters
were in both classes.
Consider the two rules
center;
l l.
a[bc]+ { ... ; REJECT;}
a[cd]+ { ... ; REJECT;}
If the input is
ab,
only the first rule matches,
and on
ad
only the second matches.
The input string
accb
matches the first rule for four characters
and then the second rule for three characters.
In contrast, the input
accd
agrees with
the second rule for four characters and then the first
rule for three.
In general, REJECT is useful whenever
the purpose of Lex is not to partition the input
stream but to detect all examples of some items
in the input, and the instances of these items
may overlap or include each other.
Suppose a digram table of the input is desired;
normally the digrams overlap, that is the word
the
is considered to contain
both
th
and
he.
Assuming a two-dimensional array named
digram
to be incremented, the appropriate
source is
center;
l l.
%%
[a-z][a-z] {digram[yytext[0]][yytext[1]]++; REJECT;}
\n ;
where the REJECT is necessary to pick up
a letter pair beginning at every character, rather than at every
other character.