Lex is a program generator designed for lexical processing of character input streams. It accepts a high-level, problem-oriented specification for character string matching, and produces a program in a general-purpose language which recognizes regular expressions. The regular expressions are specified by the user in the source specifications given to Lex. The Lex-written code recognizes these expressions in an input stream and partitions the input stream into strings matching the expressions. At the boundaries between strings, program sections provided by the user are executed. The Lex source file associates the regular expressions and the program fragments. As each expression appears in the input to the program written by Lex, the corresponding fragment is executed.
The user supplies the additional code beyond expression matching needed to complete his tasks, possibly including code written by other generators. The program that recognizes the expressions is generated in the general-purpose programming language employed for the user's program fragments. Thus, a high-level expression language is provided to write the string expressions to be matched, while the user's freedom to write actions is unimpaired. This avoids forcing the user who wishes to use a string manipulation language for input analysis to write his processing programs in the same, often inappropriate, string handling language.
Lex is not a complete language, but rather a generator representing a new language feature which can be added to different programming languages, called ``host languages.'' Just as general-purpose languages can produce code to run on different computer hardware, Lex can write code in different host languages. The host language is used for the output code generated by Lex and also for the program fragments added by the user. Compatible run-time libraries for the different host languages are also provided. This makes Lex adaptable to different environments and different users. Each application may be directed to the combination of hardware and host language appropriate to the task, the user's background, and the properties of local implementations. At present, the only supported host language is C, although Fortran (in the form of Ratfor [2]) has been available in the past. Lex itself exists on UNIX, GCOS, and OS/370; but the code generated by Lex may be taken anywhere the appropriate compilers exist.
Lex turns the user's expressions and actions (called source in this memo) into the host general-purpose language; the generated program is named yylex. The yylex program will recognize expressions in a stream (called input in this memo) and perform the specified actions for each expression as it is detected. See Figure 1.
    Source -> Lex   -> yylex
    Input  -> yylex -> Output

    Figure 1.  An overview of Lex
For a trivial example, consider a program to delete from the input all blanks or tabs at the ends of lines.

    %%
    [ \t]+$ ;

is all that is required. The program contains a %% delimiter to mark the beginning of the rules, and one rule.
This rule contains a regular expression which matches one or more instances of the characters blank or tab (written \t for visibility, in accordance with the C language convention) just prior to the end of a line. The brackets indicate the character class made of blank and tab; the + indicates ``one or more ...''; and the $ indicates ``end of line,'' as in QED. No action is specified, so the program generated by Lex (yylex) will ignore these characters. Everything else will be copied.
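A complete, runnable version of this program is only slightly longer. The following is a minimal sketch; the explicit main and yywrap routines (normally supplied by the Lex library when one links with -ll) and the file name trim.l used below are illustrative assumptions, not part of the rule itself.

    %%
    [ \t]+$	;
    %%
    /* Normally supplied by the Lex library (-ll); defined here
       only to make the sketch self-contained. */
    int yywrap() { return 1; }	/* report no further input files */
    int main()   { yylex(); return 0; }

With such a file one might say lex trim.l; cc lex.yy.c; ./a.out <text, and receive the text with trailing blanks and tabs removed.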
To change any remaining string of blanks or tabs to a single blank, add another rule:

    %%
    [ \t]+$ ;
    [ \t]+  printf(" ");
The finite automaton generated for this source will scan for both rules at once, observing at the termination of the string of blanks or tabs whether or not there is a newline character, and executing the desired rule action. The first rule matches all strings of blanks or tabs at the end of lines, and the second rule all remaining strings of blanks or tabs.
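As a concrete illustration (the file name squeeze.l and the session below are hypothetical), running the two-rule program over a line containing both interior and trailing white space might look like this, on a system where the Lex library is installed as -ll:

    $ lex squeeze.l
    $ cc lex.yy.c -ll
    $ printf 'a  b \t\nend\n' | ./a.out
    a b
    end

The interior run of blanks is squeezed to a single blank by the second rule, while the run of a blank and a tab just before the newline is deleted by the first.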
Lex can be used alone for simple transformations, or for analysis and statistics gathering on a lexical level. Lex can also be used with a parser generator to perform the lexical analysis phase; it is particularly easy to interface Lex and Yacc [3]. Lex programs recognize only regular expressions; Yacc writes parsers that accept a large class of context-free grammars, but require a lower-level analyzer to recognize input tokens. Thus, a combination of Lex and Yacc is often appropriate. When used as a preprocessor for a later parser generator, Lex is used to partition the input stream, and the parser generator assigns structure to the resulting pieces. The flow of control in such a case (which might be the first half of a compiler, for example) is shown in Figure 2. Additional programs, written by other generators or by hand, can be added easily to programs written by Lex.
    lexical rules          grammar rules
         |                      |
         v                      v
        Lex                   Yacc
         |                      |
         v                      v
    Input -> yylex -> yyparse -> Parsed input

    Figure 2.  Lex with Yacc
Yacc users will realize that the name yylex is what Yacc expects its lexical analyzer to be named, so that the use of this name by Lex simplifies interfacing.
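A sketch of such a lexical analyzer follows; the NUMBER token and the header file y.tab.h (the token-definition file conventionally produced by Yacc under its -d option) are illustrative assumptions, and the grammar is assumed to use the default integer semantic value yylval.

    %{
    #include <stdlib.h>
    #include "y.tab.h"		/* token codes from the Yacc grammar (assumed) */
    extern int yylval;		/* declared here in case the header does not */
    %}
    %%
    [0-9]+	{ yylval = atoi(yytext); return NUMBER; }
    [ \t\n]	;			/* skip white space */
    .		{ return yytext[0]; }	/* pass other characters through */

Each rule returns a token code to yyparse, which calls yylex whenever it needs the next input token.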
Lex generates a deterministic finite automaton from the regular expressions in the source [4]. The automaton is interpreted, rather than compiled, in order to save space. The result is still a fast analyzer. In particular, the time taken by a Lex program to recognize and partition an input stream is proportional to the length of the input. The number of Lex rules or the complexity of the rules is not important in determining speed, unless rules which include forward context require a significant amount of rescanning. What does increase with the number and complexity of rules is the size of the finite automaton, and therefore the size of the program generated by Lex.
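Forward context is written with the / operator in Lex. A rule such as the following (the action shown is illustrative) matches ab only when cd follows; the characters cd are then returned to the input and scanned again, which is the source of the rescanning cost mentioned above.

    ab/cd	printf("ab seen before cd\n");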
In the program written by Lex, the user's fragments (representing the actions to be performed as each regular expression is found) are gathered as cases of a switch. The automaton interpreter directs the control flow. Opportunity is provided for the user to insert either declarations or additional statements in the routine containing the actions, or to add subroutines outside this action routine.
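Schematically, and with much simplified detail (the variable name below is invented; the actual code emitted by Lex differs), the action routine for the two-rule example above reduces to something like:

    switch (rule_matched) {	/* set by the automaton interpreter */
    case 1:			/* [ \t]+$ : no action, matched text discarded */
    	break;
    case 2:			/* [ \t]+ */
    	printf(" ");
    	break;
    }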
Lex is not limited to source which can be interpreted on the basis of one-character lookahead. For example, if there are two rules, one looking for ab and another for abcdefg, and the input stream is abcdefh, Lex will recognize ab and leave the input pointer just before cd. Such backup is more costly than the processing of simpler languages.
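The pair of rules in question might be written (actions illustrative):

    %%
    ab		printf("ab\n");
    abcdefg	printf("abcdefg\n");

On the input abcdefh, the automaton reads ahead through abcdefh hoping to complete the second rule, fails at the h, announces the match of ab, and backs the input pointer up so that cdefh is scanned afresh.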