LEX, a lexical analyzer generator




Lexical analysis is the process of converting a sequence of characters into a sequence of tokens. The programs that perform lexical analysis are called lexical analyzers or lexers. A lexer is often organized as separate scanner and tokenizer functions, though the boundary between them may not be clearly defined. Whitespace characters are often ignored during lexical analysis.

Terminology used:


A token is a categorized block of text. A lexical analyzer divides a sequence of characters into tokens and categorizes them according to their function, giving them meaning. A token is a useful unit of structured text; the lexical analyzer supplies tokens to the parser.

Consider the statement: Difference=3-2; Here Difference is an identifier, = is an assignment operator, 3 and 2 are numbers, - is a subtraction operator, and ; marks the end of the statement. Tokens are defined by regular expressions, which are understood by a lexical analyzer generator such as LEX.


A lexical grammar has information on the possible sequences of characters that are contained within the tokens it handles.


Tokenization is the process of classifying sections of a string of input characters. This process can be considered a subtask of parsing the input.

Introduction to LEX:

LEX helps with the lexical processing of character input streams. It accepts a high-level, problem-oriented specification written by the user and produces a program in a general-purpose language. The user specifies regular expressions, and the generated program recognizes those expressions in the input stream, partitioning the input into strings that match them. The LEX source file associates each regular expression with a program fragment; whenever an expression appears in the input of the program written by LEX, the corresponding fragment is executed. The generated recognizers can be in C or in RATFOR, a language that can be translated automatically into portable FORTRAN. LEX is also designed to simplify interfacing with YACC, for access to that compiler-compiler system.

If the text stream is:

"if (x>2.3) { printf("bye %f"); ..."

a lexer generated by LEX emits a corresponding token stream, for example: IF LPAREN IDENTIFIER(x) GREATER NUMBER(2.3) RPAREN LBRACE IDENTIFIER(printf) ...

Lex Source:

Basically, the LEX source is a table of regular expressions and the program fragments corresponding to those expressions. The table is translated into a program that reads an input stream, copies it to the output stream, and partitions the input into strings that match the regular expressions. When a string is recognized, the corresponding program fragment is executed. Recognition of the expressions is performed by a finite automaton generated by LEX. The fragments are executed in the order in which the corresponding expressions occur in the input stream.
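As a sketch, a minimal LEX source table might look like this; the patterns and the printed labels are illustrative, not taken from the text:

```
%%
[0-9]+      { printf("NUMBER "); }
[a-zA-Z]+   { printf("WORD ");   }
[ \t\n]     ;
```

Each line pairs a regular expression with the C fragment to execute when that expression matches; the final rule discards spacing characters by giving them an empty action.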

Function of host language used by LEX:

LEX is not a complete language; rather, it represents a new language feature that can be added to different programming languages, which are called host languages.

With the help of LEX, code can be written in different host languages. The host language is used for the output code generated by LEX and for the program fragments added by the user. Since compatible run-time libraries are provided for the various host languages, LEX is adaptable to different environments and different users.

How does LEX convert the source into the host language?

LEX converts the source into a program in the host general-purpose language; the generated program is called yylex. This program recognizes expressions in the input stream and performs the specified actions as each expression is recognized:


Source -> | Lex | -> yylex



Input -> | yylex | -> Output


LEX can be used for simple transformations, or for analysis and the collection of statistics at the lexical level. It can also be used with a parser generator to perform the lexical analysis phase; it is particularly easy to combine LEX with YACC.

How LEX and YACC work together:

LEX programs recognize only regular expressions, while parsers written with YACC accept a large class of context-free grammars but require a low-level analyzer to recognize the input tokens. When LEX is used as a preprocessor for the parser generator, LEX partitions the input stream, and the parser generator assigns structure to the resulting pieces. This is shown as:

          Lexical Rules   Grammar Rules
                |               |
                v               v
            +-------+       +-------+
            |  Lex  |       | Yacc  |
            +-------+       +-------+
                |               |
                v               v
            +-------+       +---------+
  Input --> | yylex | ----> | yyparse | --> Parsed input
            +-------+       +---------+

This is the working of LEX with YACC.

LEX Regular Expressions: a regular expression specifies the set of strings to be matched. It consists of text characters and operator characters. For example, the expression integer matches the string integer wherever it appears in the input.

Operators: The operator characters are

" \ [ ] ^ - ? . * + | ( ) $ / { } % < >

If we want to use these operators as text characters then an escape character can be used.

The quotation mark operator (") indicates that whatever is contained between a pair of quotes is to be taken as text characters. Thus xyz"++" matches the string xyz++ when it appears.

The user can avoid remembering the list of current operator characters by quoting every non-alphanumeric character being used as the text character.

An operator character can also be turned into a text character by preceding it with a backslash (\). Consider the example:

xyz\+\+

which is an equivalent, though less readable, way of matching the string xyz++.
Whenever any expression is matched, lex executes the action. There is a default action, which consists of copying the input to the output. This is performed on all strings not otherwise matched.

When Lex is being used with Yacc, this is the normal situation. One may consider that actions are what is done instead of copying the input to the output; thus, in general, a rule which merely copies can be omitted. Also, a character combination which is omitted from the rules and which appears as input is likely to be printed on the output, thus calling attention to the gap in the rules.

One of the simplest things that can be done is to ignore the input. Specifying a C null statement ; as the action causes this result. A frequent rule is

[ \t\n]   ;

which causes the three spacing characters (blank, tab, and newline) to be ignored.

Another easy way to avoid writing actions is the action character |, which indicates that the action for this rule is the action for the next rule. The previous example could also have been written

" "



with the same result, although in different style. The quotes around \n and \t are not required.

MACRO DEFINITIONS: Lex also permits access to the I/O routines it uses. They are:

1) input() which returns the next input character;

2) output(c) which writes the character c on the output; and

3) unput(c) which pushes the character c back onto the input stream, to be read later by input().

By default these routines are provided as macro definitions, but the user can override them and supply private versions. These routines define the relationship between external files and internal characters, and must all be retained or modified consistently. They may be redefined to cause input or output to be transmitted to or from strange places, including other programs or internal memory. However, the character set used must be consistent in all routines; a value of zero returned by input must mean end of file; and the relationship between unput and input must be retained, or the Lex lookahead will not work.

Another Lex library routine that the user will sometimes want to redefine is yywrap() which is called whenever Lex reaches an end-of-file. If yywrap returns a 1, Lex continues with the normal wrap up on end of input. Sometimes, however, it is convenient to arrange for more input to arrive from a new source. In this case, the user should provide a yywrap which arranges for new input and returns 0. This instructs Lex to continue processing. The default yywrap always returns 1.


There are two steps in compiling a Lex source program. First, the Lex source must be turned into a generated program in the host general purpose language. Then this program must be compiled and loaded, usually with a library of Lex subroutines. The generated program is on a file named lex.yy.c. The I/O library is defined in terms of the C standard library.



