Explaining the use of scanner compiler

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.


This chapter explains the use of scanner in the context of compiler. Scanner helps the compilation process in checking the grammar of the programming language. Lexical analyzer and syntactic analyzer use scanner's output to check the language syntactically. The analysis proceeds with the checking of the language semantics once the process complete.

Scanner is a tool to tokenize and analyse programming language based on its grammar. It is a complex process where it can be solved by separating the analysis into two parts. The first part is to identify the token and the second part is to determine the syntactic organization.

The scanner becomes an interface between the source program and the parser. The parser generates a syntax tree out of the program based on its grammar. The syntax tree has leaves which represent the terminal symbols which the scanner extracts from the program.

The interaction between the scanner and the parser is in two ways. The first one is; the scanner processes the source program before parsing and stores the token in a big table or file. The second one is; the interaction between parser and the scanner where the scanner is called by the parser whenever the next token in the source program is required. If there is no needs of internal form before parsing begin, the programmer will use the later approach as the preferred method.

Token is created when the scanner breaks the program into pieces. The token is represented in the form of a unique internal representation number or integer. The string token is stored in a table. Then, the values of constants are stored in a constant table while variable names are stored in a symbol table for variables. The scanner then returns the internal type of the token and sometimes the location in the table where the token was stored.

If we have this statement total = num1 + num1; the following items returned by the scanner :


Internal representation number




















Reading Assignment

Write a summary on following paper : Grosch, J. and Emmelmann, H.: A tool box for compiler construction. Compiler Compilers, 106-116. (1991).

Scanner Design

The main purpose of the scanner is to return the next input token to the parser. To be able to return a token, the scanner must isolate the next sequence of characters in the source text which designates a valid token.

To do this, the scanner must also edit the source program, removing information such as comments, blanks, line boundaries, and whatever else is not important such as comments, blanks, line boundaries, and whatever else is not important to be parsing and code generating phases of the compiler.

The scanner must identify the complete token and sometimes differentiate between keywords and identifiers. Furthermore, the scanner may perform symbol-table maintenance, inserting identifiers, literals, and constants into the tables after they have been converted to an internal form.

Tokens can be described in several ways. One way of describing tokens is by using a regular grammar. Using this method of specification, generative rules are given for producing the desired tokens. For example, the regular grammar

<unsigned integer> ::= 0|1|2|3|..|9|

0<unsigned integer> | 1<unsigned integer> |

… | 9 <unsigned integer>

contains the rules for generating the set of natural numbers.

However, the scanner must recognize tokens. To recognize a mathematical model called a finite-state acceptor (FSA) is used.

FSA is a finite state machine with no outputs. The user of a FSA cares only about the final state: if the machine ends in an accepting state after processing a series of inputs, the machine is said to have accepted the input; otherwise, it is said to have rejected the input (Group 2004).

Whenever an FSA begins reading a tape, it is always in a certain state designated as the starting state. Some of the states the acceptor may be in are called final states. If the acceptor attempts to read beyond the end of the tape while in a final state, the string which was on the tape is said to be accepted by the FSA. Below figure shows the example of FSA:











Read head

Finite state acceptor

FSA is always represented by using finite state diagram. As below figure:




Assume that the tape of FSA is reading contains the string 12.75. The FSA begins in the state S, the starting state, and it remains in this state as each digit in the string preceding the decimal point is read. When the decimal point is read, the FSA changes to state A and then it changes to state B when the first digit (5) and remains in state B. As the end of the tape has now been reached, the FSA terminates in the final state B and accepts the given decimal number as being valid.

If the FSA had not been in state B after a string had been read, the string would have been rejected. The string 12, is such a case. Rejection may also occur if no transition is designated at a certain state for some character that has been read. In the FSA of above figure, this would occur if the FSA was in state S, state A or state B and, say, an alphabetic character had been read. Examples of strings which would be rejected for this reason are A15.1 and 12.B75.

Regular expressions

A regular expression r denotes a language L(r), also called a

regular set .

Operations on regular expressions:

 : empty set

 : empty string

a : single symbol a that is {a}

if e1 and e2 are regular expressions denoting the languages L1 and L2, then;

(e1) | (e2) is L1

Regular grammar is defined to consist only of productions of the form   , where || <= ||, and  has the form aB or a where and .

A more compact way of representing regular languages is with the use of regular expressions. Regular expressions make use of three operators: concatenation, alternation, and closure.

Assume that the two expressions and generate the languages and . Concatenation is then defined as . Alternation is the union of the languages denoted by two expressions using | or +.

Hence, .

Closure is represented using braces { }, denotes the repetition of the expression zero or more times. Thus, where .

For example, the expression 110 consists of the digits 1, 1, and 0 concatenated together and denotes the language L={110}. The expression 0|1 denotes the language L={0,1} while the expression {1} denotes the language .

For the remainder of this section, a method for converting regular grammars to regular expressions is presented. At this point, however, it should be noted that the regular-expression operators obey some algebraic rules.

For example, both alternation and concatenation are associative, with alternation also being commutative. Thus we can write (ab) = a(bc), (a|b)|c = a|(b|c), and a|b=b|a. Distributivity also applies, with concatenation distributing over alternation; that is, a(b|c) = ab|ac.

To represent a method of converting a regular grammar to a regular expression, consider the following regular grammar which is .

S  aS

S  aB

B  bC

C  aC

C  a

Replacing the production operator  with an equal sign and combining all possible productions from a given nonterminal into one expression by using the alternation operator, the grammar can be written as the set of equations

S = as | aB

B = bC

C = aC | a

By solving this set of equations, a regular expression with only terminal symbols is derived which generates the same language as the original regular grammar. Solutions for equations which are defined only in terms of themselves should be done first. In this example, the equation for C has a solution of C = {a}a. This can be shown by substituting this solution into the equation for C, giving

{a} a = a{a}a|a

Factoring yields

{a}a = ({a}a|)a

Now, {a}a| = {a}, since  is in both sides of the equation, a is in both sides, and any string is in both sides. Thus we have

{a}a = {a}a

The solution for C can be substituted into the second equation, and the third equation can be dropped as it is no longer needed. This results in the set of equations

S = a(S|B)

B = b{a}a

We immediately have a solution for B which can be substituted into the first equation. This results in the equation

S = a(S|b{a}a)

It can easily be shown that a solution for S is S = {a}ab{a}a, which is a regular expression generating the same language as the original regular grammar.

Finite State Acceptors (FSA)

There are three types of finite state acceptors which are deterministic, nondeterministic and nondeterministic with -transitions. In this chapter we only discuss deterministic FSA.

A deterministic finite state acceptor (DFA) which is also known as a deterministic finite automaton, is an acceptor which for any state and input character has at most one transition state that the acceptor changes to. If no transition state is specified, the input string is rejected.

Definition of DFA as 5-tuple (K, , M, S, Z) where:

K is a finite, nonempty set of elements called states

is an alphabet called the input alphabet

is a mapping from K x into K

is called the initial state or starting state

is a nonempty set of final states

The DFA is initially in state S and reads an input string from left to right. The mapping function M of the DFA defines the state transitions and is denoted by M(Q,T)=R, where Q and R are states of K and T is a character from the input alphabet. The mapping function indicates that when the acceptor is in state Q and T is the next input character, the acceptor changes to state R. The following two definitions complete the specification of the mapping function:

M(Q,) = Q for all

M(Q,Tt) = M(M(Q,T),t) for all and

The first definition implies that a DFA cannot change state without reading a character from the input alphabet, while the second definition is a recursive definition showing that when the DFA is in state Q with some input string x=Tt, the mapping M(P,t) can be applied. This definition extends the applicability of the mapping function to strings over rather than just elements of .

A sentences is said to accepted by a DFA if M(S,t)=P for some DFA F=(K, , M, S, Z) such that and . The string t is accepted by the DFA if, after reading the entire string, the DFA terminates in a final state. The set of all which is accepted by the DFA F is specified by L(F). Thus L(F) is defined as

L(F) = {t|M(S,t) Z and }










0 0

0 0

The following DFA is an example of an acceptor which accepts strings consisting only of an even number of 0s and an even number of 1s.

F = ({S,A,B,C},{0,1},M,S,{S})

Where M is defined as

M(S,0) = B M(S,1) = A

M(A,0) = C M(A,1) = S

M(B,0) = S M(B,1) = C

M(C,0) = A M(C,1) = B

The transition diagram for this acceptor is given above.

An example of a string which is accepted by this DFA is 110101, and a string not accepted is 11101.