A popular lexical analyzer

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Abstract: Lex which is a popular lexical analyzer generator provides support for writing programs which have a control flow directed by instances of regular expressions in a given input stream. It is most suitable for editor-script type translations and for segmenting input in order to prepare for a parsing routine. The programs that perform lexical analysis written with Lex are able to accept ambiguous specifications and select the longest match possible at each input point.

Index terms: Phases of a compiler, Structure of a Lex file, Lex regular expressions, Operators, Lex predefined variables, Lex library routines, Conclusion.


Lex is a program generator which is designed for the purpose of lexical processing of character input streams. It is able to accept a high level, problem oriented specification for matching character string, and generates a program in a general purpose language that can recognize regular expressions. The regular expressions are specified by the user in the source specifications given to Lex. The Lex written code recognizes these expressions in an input stream and partitions the input stream into strings matching the expressions.


Basic functions of a Lexical Analyzer:-

  • To group sequence of characters into lexemes - smallest meaningful entity in a language (keywords, identifiers, constants)
  • To read characters from a file and store them in a buffer which helps to decrease the latency that occurred due to i/o. Lexical analyzer manages this buffer
  • To makes use of the theory of regular languages and finite state machines
  • Lex is an instrument that constructs lexical analyzers from regular expression specifications.


Lex source is a table of regular expressions and comparable corresponding fragments of the program. This table is then transformed to a program which reads a stream that is input, copying it to an output stream and then partitioning the input into strings which then match the given expressions. As every such string gets recognized, the corresponding program fragment gets executed. The recognition of the expressions is executed by a deterministic finite automation that is generated by Lex.

The program fragments that are written by the user are then executed in the same order in which the corresponding regular expressions occur in the input stream.

Lex turns the user's expressions and actions (also called source) into the host general-purpose language; the generated program is called as yylex. The yylex program will recognize expressions in a stream (which is called input) and perform the specified actions for each expression as it is detected.

Á Where the definitions and the user subroutines can often be omitted. The second %% is also optional, but the first one is required to mark the beginning of the rules. So, we can say that the minimum Lex program can be as:

%% (no definitions, no rules)

Á It can be translated into a program which can copy the input to the output without any changes.

Á In the outline of Lex programs shown above, the user's control decisions are represented by the rules. Rules are in the form of a table whose left column contains regular expressions and the right column contains actions which are the program fragments to be executed when these expressions are recognized by the analyzer.


  • An individual rule can be as
  • Integer printf("found keyword INT"); look for the string integer in the input stream and print the message ''found keyword INT'' whenever it appears.

    In this example the host procedural language C and the C library function printf is used to print the string. The end of the expression is represented by the first blank or tab character. If the action is merely a single C expression, it can just be given on the right side of the line; if it is compound, or takes more than a line, it should be enclosed in braces.

  • Suppose it is desired to change a number of words from British to American spelling. Lex rules such as colour printf("color"); mechanise printf("mechanize"); petrol printf("gas"); would be a start.

But the problem is that these rules are not quite enough, since the word petroleum would become gaseum, which is incorrect.


Some of the predefined variables used by the Lex analyzer are:

  • yytext -- a string containing the lexeme
  • yyleng -- the length of the lexeme
  • yyin -- the input stream pointer
  • the default input of default main() is stdin
  • yyout -- the output stream pointer
  • the default output of default main() is stdout.
  • cs20: %./a.out < inputfile > outfile


A regular expression specifies the strings that need to be matched. It contains text characters (which match the corresponding characters in the strings being compared) and operator characters (which specify repetitions, choices, and other features). The letters of the alphabet and the digits are always text characters; thus the regular expression integer matches the string integer wherever it appears and the expression a89F looks for the string a89F.

  • The quotation mark operator (") indicates that whatever is contained between a pair of quotes is to be taken as text characters. Thus xyz"++" matches the string xyz++ when it appears.
  • Thus by quoting every non-alphanumeric character being used as a text character, the user can avoid remembering the list above of current operator characters, and is safe should further extensions to Lex lengthen the list.
  • An operator character can also be converted into a text character by preceding it with \ as in xyz\+\+ which is another, less readable, equivalent of the above expressions.
  • Any blank character not contained within [ ] must be quoted. Several normal C escapes with \ are recognized: \n is newline, \t is tab, and \b is backspace. To enter \ itself, use \\. Since newline is illegal in an expression, \n must be used; it is not required to escape and backspace. Every character but blank, tab, newline and the list above is always a text character.

Character classes:

  • Classes of characters can be specified using the operator pair [ ]. The construction [abc] matches a single character, which may be a, b, or c (Within square brackets, most operator meanings are ignored).
  • The three characters that are special are \ - and ˆ. The - character indicates ranges. For example: [a-z0-9<>] '-' indicates the character class containing all the lower case letters, the digits, the angle brackets, and underline. Ranges may be given in either order. Using - between any pair of characters which are not both upper case letters, both lower case letters, and both digits is implementation dependent and will get a warning message.
  • If it is desired to include the character - in a character class, it should be first or last; thus [-+0-9] matches all the digits and the two signs. In character classes, the ˆ operator must appear as the first character after the left bracket; it indicates that the resulting string is to be complemented with respect to the computer character set. Thus [ˆabc] matches all characters except a, b, or c, including all special or control characters; or
  • [ˆa-zA-Z] is any character which is not a letter. The \ character provides the usual escapes within character class brackets.

Arbitrary character:

  • It is used to match almost any character which contains the operator character - the class of all characters except newline.
  • Escaping into octal is possible although non-portable.

E.g. [\40-\176] matches all printable characters in the ASCII character set, from octal 40 (blank) to octal 176 (tilde).

Optional expressions:

The operator '?' indicates an optional element of an expression. E.g. - ab?c matches either ac or abc.

Repeated expressions:

  • Repetitions of classes are indicated by the operators * and a+ which represents any number of consecutive a characters, including zero; while a+ is one or more instances of a.
  • For example: [a-z]+ is all strings of lower case letters. And [A-Za-z][A-Za-z0-9]* indicates all alphanumeric strings with a leading alphabetic character. This is a typical expression for recognizing identifiers in computer languages

Alternation and Grouping:

  • The operator | indicates alternation.
  • E.g. - (ab | cd) will match either ab or cd.

  • The parentheses are used for grouping, although they are not necessary on the outside level


The following library routines are used by Lex analyzer:

  • yylex()-The default main() contains a call of yylex()
  • yymore()-return the next token
  • yyless(n)-retain the first n characters in yytext
  • yywarp()-is called whenever Lex reaches an end-of-file The default yywarp() always returns 1.


There are two steps in compiling a Lex source program. First, the Lex source must be turned into a generated program in the host general purpose language. Then this program must be compiled and loaded, usually with a library of Lex subroutines. The I/O library is defined in terms of the C standard library.

The C programs generated by Lex are slightly different on OS/370, because the OS compiler is less powerful than the UNIX or GCOS compilers, and does less at compile time.

The resulting program is placed on the usual file a.out for later execution. Although the default Lex I/O routines use the C standard library, the Lex automata themselves do not do so; if private versions of input, output and unput are given, the library can be avoided.


  • www.wikipedia.com
  • Lex- A Lexical Analyzer Generator, by-M. E. Lesk and E. Schmidt
  • epcc.sjtu.edu.cn/cs308compilers/lex.pdf
  • www.google.com
  • vip.cs.nctu.edu.tw/course_data/08_fall/PL/Ch04.pdf