Optical Character Recognition System Using Diagonal Features Computer Science Essay

Feature extraction and recognition are the key phases of a typical OCR system, and recognition accuracy depends mainly on the feature extraction method. Several feature extraction methods exist, each suited to different scripts; most are based on character templates, structural features or statistical features. Similarly, several classifiers are available for character recognition, including kNN, neural networks and SVM. Telugu is one of the most popular South Indian languages and has a large and complex character set. Neural networks are generally used to solve complex problems such as pattern recognition, speech processing and character recognition. In this work we propose a neural network based optical character recognition system for Telugu characters that uses diagonal features. The experimental results show that the proposed system achieves a recognition rate of 98.9%.

Index Terms: OCR (Optical Character Recognition), diagonal features, neural networks, SVM.


Many historic documents exist only in machine printed or handwritten form. To make them available on the web they need to be digitized. Once the document images have been converted by an OCR system, they can be accessed from the web, so there is a great need for OCR system development. OCR is a computer program that converts handwritten or machine printed document images into editable text documents. Once an image has been translated into a text document, it can be stored in ASCII or Unicode format. OCR systems have several other applications, including reading aids for the blind, automatic text entry for DTP (Desk Top Publishing), multimedia systems, language processing and storage of image documents in electronic format [1, 2, 3].

Telugu is one of the most widely spoken South Indian languages and has its own script. The Telugu character set contains 16 vowels, 36 consonants, vowel modifiers (maatras) and consonant modifiers (vaththus). These orthographic units are combined to represent several frequently used syllables. We refer to these basic orthographic units as glyphs (each represented as a single connected component). The characters vary in size (i.e. in width and height). Several compound characters can be formed from this basic character set; it is estimated that 5000 to 10000 compound characters are possible. Fig. 1.1 shows the Telugu alphabet, Fig. 1.1(e) shows the Telugu digits and Fig. 1.1(f) shows a sample text containing compound characters [4].

The processing steps of a typical OCR system include image acquisition, preprocessing, segmentation, feature extraction and recognition or classification as shown in Figure 1.2.

Fig. 1.1 Telugu character set: (a) Consonants, (b) Vowels, (c) Vowel modifiers, (d) Consonant modifiers, (e) Telugu digits, (f) Sample compound characters

Fig. 1.2 Processing steps of a typical OCR system

Image Acquisition

In image acquisition, the recognition system acquires a document image as input. The image should be in a specific format such as JPEG or BMP. It is acquired through a scanner, digital camera or any other suitable digital input device.


Preprocessing is a series of operations performed on the scanned input image. Several tasks are performed in this stage, including binarization, noise removal, thinning, discourse and segmentation, as described below:

Binarization: This process converts a gray scale or color image into a binary image. Several methods are available for binarization. In this work the global threshold based Otsu method is adopted.

Noise removal: This removes isolated pixels from the image.

Thinning: Thinning is a morphological operation that removes selected foreground pixels from binary images.

Discourse: This is defined as the smallest matrix that fits the entire character skeleton.

Segmentation: This extracts lines, then words and finally characters from the noise free and skew free document image [5].
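Three of the steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the MATLAB code used in this work: the function names are hypothetical, dark ink on a light background is assumed, and an "isolated pixel" is assumed to mean a foreground pixel with no foreground 8-neighbour. Thinning and segmentation are left out.

```python
import numpy as np


def otsu_threshold(gray):
    """Return the Otsu threshold for a uint8 grayscale image."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                 # probability of levels <= t
    mu = np.cumsum(prob * np.arange(256))   # cumulative mean up to level t
    mu_total = mu[-1]
    denom = omega * (1.0 - omega)
    # Between-class variance for every candidate threshold t.
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_total * omega - mu) ** 2 / denom
    sigma_b[denom == 0] = 0.0
    return int(np.argmax(sigma_b))


def binarize(gray):
    """Binarization: dark ink becomes 1 (foreground), paper becomes 0."""
    return (gray <= otsu_threshold(gray)).astype(np.uint8)


def remove_isolated_pixels(binary):
    """Noise removal: clear foreground pixels with no foreground 8-neighbour."""
    h, w = binary.shape
    padded = np.pad(binary, 1)              # zero border simplifies edge handling
    neighbours = np.zeros((h, w), dtype=int)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            neighbours += padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
    isolated = (binary == 1) & (neighbours == 0)
    return np.where(isolated, 0, binary).astype(np.uint8)


def crop_to_foreground(binary):
    """Discourse: smallest sub-matrix that contains every foreground pixel."""
    rows = np.flatnonzero(binary.any(axis=1))
    cols = np.flatnonzero(binary.any(axis=0))
    if rows.size == 0:
        return binary
    return binary[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
```

Applied in the order binarize, remove_isolated_pixels, crop_to_foreground, these functions turn a grayscale character image into a tightly cropped binary matrix ready for resizing.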


The recognition phase includes the feature extraction, feature selection and classification steps. Feature extraction is a special form of dimensionality reduction. When the input data to an algorithm is too large to be processed and is redundant, it is transformed into a reduced representation, a set of features (also called a feature vector). Transforming the input data into this set of features is called feature extraction. Some of the extracted features are themselves redundant. Feature selection is the process of selecting the essential features that are sufficient for classification. Several feature extraction methods are available for character recognition; they are broadly classified as template based, structural and statistical.

For classification we can use character templates, neural networks, Support Vector Machines (SVM) or k-Nearest Neighbor (kNN) approaches. In this work neural networks are used as the classifier.

Diagonal based Feature Extraction Approach

This method is based on the diagonal pixels [6]. In this approach, the given image is resized to 90x60 pixels and divided into 54 equal zones, each of size 10x10 pixels. The features are extracted from each zone by moving along its diagonals and counting the black (foreground) pixels. Since each zone is 10x10 pixels, it has 19 diagonal lines. The foreground pixels along each diagonal line are summed to give a single sub-feature, so 19 sub-features are obtained from each zone. A single feature for the zone is then computed as the average of these 19 sub-features. This procedure is repeated sequentially for all the zones. Some zones may have diagonals that contain no foreground pixels; the feature values corresponding to these zones are zero. In this way 54 features are extracted for each character. In addition to these 54 features, 15 more are obtained by averaging the zone values row wise and column wise, as there are 9 rows and 6 columns of zones. As a result, every character is represented by 69 features, that is, 54 + 15. The logical representation of this method is given in Figure 2.1 and the complete algorithm is as follows:
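The arithmetic in the paragraph above can be written compactly. The notation is introduced here for illustration only: b_z(i, j) is 1 if pixel (i, j) of zone z is foreground and 0 otherwise, and the diagonals are assumed to run parallel to the main diagonal, indexed by k = i - j (the text does not state the direction).

```latex
s_{z,k} = \sum_{\substack{0 \le i,\, j \le 9 \\ i - j = k}} b_z(i, j),
\qquad k = -9, \dots, 9 \quad (19 \text{ diagonals})

f_z = \frac{1}{19} \sum_{k=-9}^{9} s_{z,k},
\qquad z = 1, \dots, 54

\frac{90}{10} \times \frac{60}{10} = 9 \times 6 = 54 \text{ zones},
\qquad 54 + 9 + 6 = 69 \text{ features}
```

Because every pixel of a zone lies on exactly one diagonal, f_z equals the number of foreground pixels in the zone divided by 19.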

Fig. 2.1 Logical representation of the diagonal approach

Algorithm: Diagonal (Image, Feature_Vector)

Image: Input character image of size MxN

Feature_Vector: The set of extracted features

1. Divide the image into 54 zones, each of size 10x10.
2. For each zone:
   a. If the zone is empty, set the feature value of that zone to 0.
   b. Starting from the center diagonal of the zone matrix, sum the pixels along each diagonal, moving upward and then downward (i.e. sum the pixels diagonal wise).
   c. Compute the average by dividing the pixel sum by the number of diagonals (i.e. 19).
   d. Store this value in Feature_Vector.
3. Sum all the diagonal features row wise and column wise.
4. Add these row and column features to Feature_Vector.
5. Return Feature_Vector.
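The algorithm can be sketched in NumPy as follows. This is a minimal illustration rather than the implementation used in this work. It assumes the image has already been binarized (1 = foreground) and resized to 90 rows by 60 columns, it takes diagonals parallel to the main diagonal, and it averages the zone values row wise and column wise as described in the text (the listing above says "sum"). The names zone_feature and diagonal_features are hypothetical.

```python
import numpy as np

ZONE = 10            # zone side length in pixels
ROWS, COLS = 9, 6    # a 90x60 image gives 9 rows and 6 columns of zones


def zone_feature(zone):
    """Average of the 19 diagonal sums of one 10x10 binary zone."""
    n = zone.shape[0]
    # np.trace(zone, offset=k) sums the k-th diagonal; k runs over -(n-1)..(n-1).
    sums = [np.trace(zone, offset=k) for k in range(-(n - 1), n)]
    return sum(sums) / len(sums)


def diagonal_features(image):
    """Return the 69-element feature vector of a 90x60 binary image."""
    image = np.asarray(image)
    if image.shape != (ROWS * ZONE, COLS * ZONE):
        raise ValueError("expected a 90x60 image")
    grid = np.zeros((ROWS, COLS))
    for r in range(ROWS):
        for c in range(COLS):
            zone = image[r * ZONE:(r + 1) * ZONE, c * ZONE:(c + 1) * ZONE]
            grid[r, c] = zone_feature(zone)   # an empty zone yields 0
    row_means = grid.mean(axis=1)   # 9 row-wise features
    col_means = grid.mean(axis=0)   # 6 column-wise features
    # 54 zone features followed by 9 row and 6 column features.
    return np.concatenate([grid.ravel(), row_means, col_means])
```

Empty zones need no special case in this sketch, since all of their diagonal sums are already zero.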

Neural Network Topology and Parameters

For recognition, a Multilayer Perceptron neural network [7, 8] is employed; its topology is shown in Figure 3.1. The network is trained with the supervised back-propagation algorithm. It contains three layers. Since the feature vector size is 69, the input layer contains 69 nodes, and the output layer is implemented with 16 nodes. The number of hidden neurons is determined by trial and error using the Telugu character set of 364 characters shown in Figure 5.1.

Fig. 3.1 Neural network topology
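The layer sizes described above can be illustrated with a short NumPy forward pass. This is a sketch under assumptions: the sigmoid activation, the random initialization and the function names are not taken from this work, and the hidden layer size of 125 is the value reported in the results section. The text does not say how the 364 characters are encoded on the 16 output nodes, so the sketch simply returns the 16 output activations.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_mlp(n_in=69, n_hidden=125, n_out=16, seed=0):
    """Small random weights for a three-layer perceptron (assumed init)."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.1, (n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.1, (n_hidden, n_out)),
        "b2": np.zeros(n_out),
    }


def forward(params, x):
    """Map a batch of feature vectors (n, 69) to output activations (n, 16)."""
    hidden = sigmoid(x @ params["W1"] + params["b1"])
    output = sigmoid(hidden @ params["W2"] + params["b2"])
    return hidden, output
```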

Table 1: No. of hidden neurons [table data missing in source]

Graph 1: Hidden neurons vs. epochs

Graph 2: Hidden neurons vs. error

Results and Discussion

The proposed system is implemented [8] using MATLAB 7.6. The network is trained and tested using gradient descent with mean square error as the performance function. The number of epochs and hidden neurons are determined experimentally by setting the training goal to 10^-6, the maximum number of epochs to 10^6 and the momentum constant to 0.9. Table 1 and Graph 2 show the variation of error against the number of hidden neurons. Up to 100 hidden neurons the error is relatively high. At 125 hidden neurons the error is lower and the algorithm can be assumed to have converged. Beyond 125 hidden neurons the error starts to increase, i.e. the algorithm starts to diverge. With 125 hidden neurons the network successfully recognized 360 characters out of 364, an accuracy of 98.9%.
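The training set-up described above (gradient descent with momentum, mean square error, an error goal and an epoch limit) can be sketched as follows. This is a hedged NumPy illustration, not the MATLAB code used in this work. It uses the classical momentum update, which may differ in detail from the MATLAB toolbox rule, and the learning rate, sigmoid activation, initialization and function name train are assumptions.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def train(x, t, n_hidden, lr=0.1, momentum=0.9, goal=1e-6,
          max_epochs=10**6, seed=0):
    """Batch gradient descent with momentum on mean squared error.

    x: inputs (n, n_in); t: targets in [0, 1], shape (n, n_out).
    Returns (weights, final_mse, epochs_run).
    """
    rng = np.random.default_rng(seed)
    w = [rng.normal(0.0, 0.5, (x.shape[1], n_hidden)), np.zeros(n_hidden),
         rng.normal(0.0, 0.5, (n_hidden, t.shape[1])), np.zeros(t.shape[1])]
    velocity = [np.zeros_like(p) for p in w]
    mse, epoch = np.inf, 0
    for epoch in range(1, max_epochs + 1):
        # Forward pass.
        h = sigmoid(x @ w[0] + w[1])
        y = sigmoid(h @ w[2] + w[3])
        err = y - t
        mse = float(np.mean(err ** 2))
        if mse <= goal:              # training goal reached
            break
        # Backward pass: gradients of the MSE with respect to each parameter.
        d_out = 2.0 * err / err.size * y * (1.0 - y)
        d_hid = (d_out @ w[2].T) * h * (1.0 - h)
        grads = [x.T @ d_hid, d_hid.sum(axis=0),
                 h.T @ d_out, d_out.sum(axis=0)]
        # Classical momentum update, applied in place.
        for p, v, g in zip(w, velocity, grads):
            v *= momentum
            v -= lr * g
            p += v
    return w, mse, epoch
```

Calling train with different n_hidden values and recording the returned error and epoch count is one way to reproduce the kind of trial and error comparison summarized in Table 1.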

Conclusions and Future Scope

The proposed system is implemented with diagonal features and a neural network as the classifier. A three layer MLP network is trained and tested with the Telugu characters shown in Fig. 5.1. It successfully recognized 360 out of 364 Telugu orthographic symbols, a recognition accuracy of 98.9%. The neural network parameters were determined experimentally. The proposed system can recognize only the trained orthographic symbols in a specified font. It can be extended to include all the possible characters in different fonts and sizes. The recognition accuracy can be further improved by adding more features or by using a better feature extraction method.