When an expression written as above is matched, Lex executes the corresponding action. This section describes some features of Lex which aid in writing actions. Note that there is a default action, which consists of copying the input to the output. This is performed on all strings not otherwise matched. Thus the Lex user who wishes to absorb the entire input, without producing any output, must provide rules to match everything. When Lex is being used with Yacc, this is the normal situation. One may consider that actions are what is done instead of copying the input to the output; thus, in general, a rule which merely copies can be omitted. Also, a character combination which is omitted from the rules and which appears as input is likely to be printed on the output, thus calling attention to the gap in the rules.
One of the simplest things that can be done is to ignore
the input. Specifying a C null statement, ; as an action
causes this result. A frequent rule is
[ \t\n] ;
which causes the three spacing characters (blank, tab, and newline)
to be ignored.
Another easy way to avoid writing actions is the action character
|, which indicates that the action for this rule is the action
for the next rule.
The previous example could also have been written
" " |
"\t" |
"\n" ;
with the same result, although in different style.
The quotes around \n and \t are not required.
In more complex actions, the user will often want to know the
actual text that matched some expression like [a-z]+.
Lex leaves this text in an external character array named yytext.
Thus, to print the name found, a rule like
[a-z]+ printf("%s", yytext);
will print the string in yytext.
The C function printf accepts a format argument and data to be printed;
in this case, the format is ``print string'' (% indicating
data conversion, and s indicating string type),
and the data are the characters in yytext.
So this just places the matched string on the output.
This action is so common that it may be written as ECHO:
[a-z]+ ECHO;
is the same as the above.
Since the default action is just to print the characters found,
one might ask why anyone would give a rule, like this one,
which merely specifies the default action.
Such rules are often required to avoid matching some other rule
which is not desired. For example, if there is a rule
which matches read, it will normally match the instances of read
contained in bread or readjust; to avoid this,
a rule of the form [a-z]+ is needed.
This is explained further below.
Sometimes it is more convenient to know the end of what
has been found; hence Lex also provides a count yyleng
of the number of characters matched.
To count both the number of words and the number of characters
in words in the input, the user might write
[a-zA-Z]+ {words++; chars += yyleng;}
which accumulates in words the number of words and in chars
the number of characters in the words recognized.
The last character in the string matched can be accessed by
yytext[yyleng-1]
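What the counting rule computes can be modeled in plain C without Lex. The following is only a sketch: the names struct counts and count_words are hypothetical, and isalpha() is assumed to behave as [a-zA-Z], which holds in the default C locale.

```c
#include <ctype.h>
#include <stddef.h>

/* Totals that the rule's action would update. */
struct counts {
    long words;   /* number of [a-zA-Z]+ matches        */
    long chars;   /* sum of yyleng over those matches   */
};

/* Hypothetical helper: walk over text the way the rule would,
   treating each maximal run of letters as one match. */
struct counts count_words(const char *text)
{
    struct counts c = {0, 0};
    size_t i = 0;

    while (text[i] != '\0') {
        if (isalpha((unsigned char)text[i])) {
            size_t len = 0;               /* plays the role of yyleng */
            while (isalpha((unsigned char)text[i + len]))
                len++;
            c.words++;                    /* words++;        */
            c.chars += (long)len;         /* chars += yyleng; */
            i += len;
        } else {
            i++;   /* not matched by the rule; Lex would copy it to the output */
        }
    }
    return c;
}
```

Applied to the text "the quick fox 42", this model finds three words containing eleven characters in all; the digits and blanks fall to the default action.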
Occasionally, a Lex action may decide that a rule has not recognized the correct span of characters. Two routines are provided to aid with this situation. First, yymore() can be called to indicate that the next input expression recognized is to be tacked on to the end of this input. Normally, the next input string would overwrite the current entry in yytext. Second, yyless(n) may be called to indicate that not all the characters matched by the currently successful expression are wanted right now. The argument n indicates the number of characters in yytext to be retained. Further characters previously matched are returned to the input. This provides the same sort of look-ahead offered by the / operator, but in a different form.
Example:
Consider a language which defines
a string as a set of characters between quotation (") marks, and provides that
to include a " in a string it must be preceded by a \. The
regular expression which matches that is somewhat confusing,
so that it might be preferable to write
\"[^"]* {
     if (yytext[yyleng-1] == '\\')
          yymore();
     else
          ... normal user processing
     }
which will, when faced with a string such as
"abc\"def"
first match the five characters
"abc\;
then the call to yymore() will cause the next part of the string,
"def,
to be tacked on the end.
Note that the final quote terminating the string should be picked
up in the code labeled ``normal user processing''.
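The way the match grows under yymore() can be traced with a small C model. This is a sketch only, not the code Lex generates; the function match_string and its parameters are hypothetical, and the input is assumed to be held in memory as a C string.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model of the rule  \"[^"]*  combined with yymore().
   Starting at a quotation mark in in, accumulate the match in text
   (capacity cap) and return its length, i.e. the final yyleng.
   Returns 0 if in does not begin with a quotation mark or if text
   is too small. The closing quote is not part of the match. */
size_t match_string(const char *in, char *text, size_t cap)
{
    size_t len = 0;                       /* plays the role of yyleng */

    if (in[0] != '"')
        return 0;

    for (;;) {
        /* One application of the rule: a quote, then everything
           up to (but not including) the next quote. */
        size_t n = 1 + strcspn(in + 1, "\"");
        if (len + n + 1 > cap)
            return 0;
        memcpy(text + len, in, n);        /* yymore(): append, do not overwrite */
        len += n;
        in += n;
        text[len] = '\0';

        /* Continue only if the piece ended in a backslash
           and another quote follows. */
        if (text[len - 1] != '\\' || in[0] != '"')
            break;
    }
    return len;
}
```

For the input "abc\"def" the first pass collects the five characters "abc\ and the second pass appends "def, giving a match of nine characters; the terminating quote remains unread.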
The function yyless() might be used to reprocess
text in various circumstances. Consider the C problem of resolving
the ambiguity of ``=-a''.
Suppose it is desired to treat this as ``=- a''
but print a message. A rule might be
=-[a-zA-Z] {
     printf("Operator (=-) ambiguous\n");
     yyless(yyleng-1);
     ... action for =- ...
     }
which prints a message, returns the letter after the
operator to the input stream, and treats the operator as ``=-''.
Alternatively it might be desired to treat this as ``= -a''.
To do this, just return the minus
sign as well as the letter to the input:
=-[a-zA-Z] {
     printf("Operator (=-) ambiguous\n");
     yyless(yyleng-2);
     ... action for = ...
     }
This performs the other interpretation.
Note that the expressions for the two cases might more easily
be written
=-/[A-Za-z]
in the first case and
=/-[A-Za-z]
in the second;
no backup would be required in the rule action.
It is not necessary to recognize the whole identifier
to observe the ambiguity.
The possibility of ``=-3'', however, makes
=-/[^ \t\n]
a still better rule.
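The effect of yyless(n) on the matched text and on the input position can be modeled as follows. This is a sketch under stated assumptions: struct scan_model, model_match and model_yyless are hypothetical names, and the whole input is assumed to sit in memory, which is not how Lex itself reads its input.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical scanner state; Lex keeps the equivalent internally. */
struct scan_model {
    const char *input;    /* the whole input, held in memory     */
    size_t      pos;      /* index of the next unread character  */
    char        text[64]; /* stands in for yytext                */
    size_t      leng;     /* stands in for yyleng                */
};

/* Pretend that a rule has just matched the next n characters. */
void model_match(struct scan_model *s, size_t n)
{
    assert(n < sizeof s->text && n <= strlen(s->input + s->pos));
    memcpy(s->text, s->input + s->pos, n);
    s->text[n] = '\0';
    s->leng = n;
    s->pos += n;
}

/* Keep the first n characters of the match and
   return the rest to the input. */
void model_yyless(struct scan_model *s, size_t n)
{
    assert(n <= s->leng);
    s->pos -= s->leng - n;   /* the surplus characters become unread again */
    s->leng = n;
    s->text[n] = '\0';
}
```

With the input =-a+b and a three-character match, model_yyless(&s, s.leng-1) leaves =- as the match with a as the next character to be read, while model_yyless(&s, s.leng-2) leaves = with - next.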
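In addition to these routines, Lex also permits access to the I/O routines it uses. They are:
1) input(), which returns the next input character;
2) output(c), which writes the character c on the output; and
3) unput(c), which pushes the character c back onto the input stream to be read later by input().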
By default these routines are provided as macro definitions, but the user can override them and supply private versions. These routines define the relationship between external files and internal characters, and must all be retained or modified consistently. They may be redefined, to cause input or output to be transmitted to or from strange places, including other programs or internal memory; but the character set used must be consistent in all routines; a value of zero returned by input must mean end of file; and the relationship between unput and input must be retained or the Lex look-ahead will not work. Lex does not look ahead at all if it does not have to, but every rule ending in + * ? or $ or containing / implies look-ahead. Look-ahead is also necessary to match an expression that is a prefix of another expression. See below for a discussion of the character set used by Lex. The standard Lex library imposes a 100 character limit on backup.
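As an illustration of private versions that keep these constraints, the sketch below reads from internal memory. The names set_source, mem_input and mem_unput are hypothetical; in a Lex source file one would presumably remove the default macro definitions and define input and unput in terms of routines like these. The 100 character limit is copied from the figure quoted above and is otherwise an arbitrary choice.

```c
#include <assert.h>
#include <stddef.h>

#define BACKUP_MAX 100                 /* same limit as quoted for the standard library */

static const char *source = "";        /* hypothetical in-memory input        */
static char        backup[BACKUP_MAX]; /* characters returned by mem_unput()  */
static int         nbackup = 0;        /* how many of them are waiting        */

/* Select the string to be scanned and discard any backed-up characters. */
void set_source(const char *s)
{
    source  = s;
    nbackup = 0;
}

/* Return the next input character, or 0 at end of input. */
int mem_input(void)
{
    if (nbackup > 0)
        return (unsigned char)backup[--nbackup];   /* reread pushed-back text first */
    if (*source == '\0')
        return 0;                                  /* zero must mean end of file */
    return (unsigned char)*source++;
}

/* Push c back so that the next call of mem_input() returns it. */
void mem_unput(int c)
{
    assert(nbackup < BACKUP_MAX);
    backup[nbackup++] = (char)c;
}
```

Because pushed-back characters are returned in last-in, first-out order, a scanner that returns several characters must push the last one first; this is the relationship between unput and input that look-ahead depends on.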
Another Lex library routine that the user will sometimes want to redefine is yywrap() which is called whenever Lex reaches an end-of-file. If yywrap returns a 1, Lex continues with the normal wrapup on end of input. Sometimes, however, it is convenient to arrange for more input to arrive from a new source. In this case, the user should provide a yywrap which arranges for new input and returns 0. This instructs Lex to continue processing. The default yywrap always returns 1.
This routine is also a convenient place to print tables, summaries, etc. at the end of a program. Note that it is not possible to write a normal rule which recognizes end-of-file; the only access to this condition is through yywrap. In fact, unless a private version of input() is supplied, a file containing nulls cannot be handled, since a value of 0 returned by input is taken to be end-of-file.
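A private yywrap that arranges for more input might be built on a helper such as the one sketched here. The names set_pending and next_source are hypothetical, and the use of a FILE pointer named yyin as the scanner's input stream is an assumption about the implementation, not something stated above.

```c
#include <stdio.h>

/* Hypothetical NULL-terminated list of file names still to be read. */
static const char *const *pending = NULL;

/* Install the list of remaining input files. */
void set_pending(const char *const *names)
{
    pending = names;
}

/* Point *stream at the next file that can be opened.
   Returns 0 if more input is available and 1 if the list is
   exhausted, matching the convention described for yywrap. */
int next_source(FILE **stream)
{
    while (pending != NULL && *pending != NULL) {
        FILE *f = fopen(*pending++, "r");
        if (f != NULL) {
            if (*stream != NULL && *stream != stdin)
                fclose(*stream);          /* finished with the old source */
            *stream = f;
            return 0;
        }
        /* A file that cannot be opened is skipped. */
    }
    return 1;
}

/* In a Lex program the private routine might then read
       int yywrap() { return next_source(&yyin); }
   assuming yyin is the stream from which the scanner reads. */
```

Any summary printing would go in the branch that returns 1, since that is the point at which no further input will arrive.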