In electrical engineering and computer science, image processing is any form of signal processing for which the input is an image, such as a photograph or video frame. The output of image processing may be either an image or a set of characteristics or parameters related to the image. Most image processing techniques treat the image as a two-dimensional signal and apply standard signal-processing techniques to it. Digital image processing is the use of computer algorithms to perform image processing on digital images. As a subfield of digital signal processing, digital image processing has many advantages over analog image processing: it allows a much wider range of algorithms to be applied to the input data and can avoid problems such as the build-up of noise and signal distortion during processing. Since images are defined over two dimensions (perhaps more), digital image processing may be modeled in the form of multidimensional systems.
Pattern recognition is the process of examining a pattern and assigning it a class using a classifier (e.g., a rule based on the location of a graphical representation of the given sample with respect to other samples of the known class). Pattern recognition is used in diverse applications: handwriting recognition, financial analysis, gene expression, biometrics, and so on. It aims to classify data based either on a priori knowledge or on statistical information extracted from the patterns. The patterns to be classified are usually groups of measurements or observations, defining points in an appropriate multidimensional space. This is in contrast to pattern matching, where the pattern is rigidly specified.
Optical character recognition (OCR):
Optical character recognition (OCR) is a system used to convert scanned images of printed or handwritten text into a machine-readable, editable format such as a text document. OCR software receives its input as an image, processes it, and compares its characters with a set of OCR fonts stored in its database. Character recognition, one of the applications of pattern recognition, is of great importance today. Character recognition systems can be used in:
Financial business applications: for sorting bank checks, since the number of checks per day is far too large for manual sorting.
Commercial data processing: for entering data into commercial data processing files (e.g., entering the names and addresses of mail-order customers into a database). It can also be used as a worksheet reader for payroll accounting.
Postal departments: for postal address reading and sorting, and as a reader for handwritten and printed postal codes.
The newspaper industry: high-quality typescript may be read by recognition equipment into a computer typesetting system, avoiding the typing errors that would be introduced by re-punching the text on computer peripheral equipment.
Aids for the blind: as a reading aid using photo sensors and tactile simulators, and as a sensory aid with sound output. It can also be used for reading text sheets and reproducing Braille originals.
Facsimile transmission: this involves the transmission of pictorial data over communication channels. In practice, the pictorial data is mainly text. Instead of transmitting characters in their pictorial representation, a character recognition system could recognize each character and then transmit its text code.
Finally, it is worth noting that the biggest potential application for character recognition is general data entry, automating the work of an ordinary office typist.
OCR can be of two types: (1) on-line character recognition; and (2) off-line character recognition. The off-line type deals with both printed and handwritten text, while the on-line type deals with handwritten text only (see Figure 1).
Fig. 1 The pattern recognition and character recognition system
If the OCR system can trace the points generated by moving a special pen on a special screen, it belongs to the on-line type; it belongs to the off-line type when it accepts only pre-scanned text images for recognition.
Background of Study:
History and Characteristics of the Sorani alphabet:
Off-line OCR system:
The offline OCR system can be divided into predefined processes that together yield the recognized text. Figure 2 illustrates the four standard processes, which apply to any offline OCR system: (1) preprocessing; (2) segmentation; (3) feature extraction; and (4) recognition. These four processes are standard, even though some offline OCR systems divide the work differently.
Fig. 2 The standard offline OCR system
The preprocessing step is the most important because it directly affects the reliability and quality of the output. It involves several operations on the digitized raw image that minimize noise and make feature extraction easier by thinning and cleaning the image. These operations are binarization, smoothing, thinning, alignment, normalization, and baseline detection (Figure 3).
Fig. 3 Preprocessing stage
Binarization: converts a gray-scale image into a bi-level image. A reliable binarization method computes the histogram of the gray values of the image and then finds a cutoff point.
Filtering and Smoothing: conditioning steps that remove unwanted variations in the input image.
Skeletonization (Thinning): a very important preprocessing step for the analysis and recognition of Sorani text, because it simplifies the character shape in an image from many pixels wide to just one pixel, reducing the amount of data to be handled. Thinning algorithms can be classified into two types, sequential and parallel (Figure 4). The main difference between them is that a sequential algorithm operates on one pixel at a time, with each operation depending on the previously processed results, while a parallel algorithm operates on all pixels simultaneously.
Fig. 4 Skeletonization (Thinning)
Normalization: Sorani character sizes vary enormously. A normalization method should therefore be applied to scale characters to a fixed size and to center each character before recognition.
Slant Correction: one of the most obvious measurable factors of different handwriting styles is the angle between the longer strokes in a word and the vertical direction, referred to as the word slant. The aim of this stage is to detect any slanted strokes. This can be achieved in two steps: slope detection and slant correction.
Baseline Detection and Skew Detection: the baseline is defined as the line on which the letters lie. It contains useful information about the orientation of the character. The horizontal projection histogram is one of the methods for fixing the baseline.
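Two of the operations above, histogram-based binarization and baseline detection by horizontal projection, can be sketched as follows. This is a minimal illustration in plain Python rather than the project's MATLAB code. It assumes Otsu's method as the way of "finding a cutoff point" and takes the row with the largest projection value as the baseline; the text prescribes neither choice. The function names are hypothetical.

```python
def otsu_threshold(gray):
    """Return the cutoff that maximizes between-class variance (Otsu's method).

    gray is a list of rows of integer gray values in 0..255.
    """
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of gray values <= t
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(gray, t):
    """Dark pixels (<= t) become foreground 1, light pixels background 0."""
    return [[1 if value <= t else 0 for value in row] for row in gray]


def baseline_row(binary):
    """Index of the row with the largest horizontal projection.

    Assumption: the baseline is the row holding the most foreground pixels.
    """
    profile = [sum(row) for row in binary]
    return max(range(len(profile)), key=profile.__getitem__)


# Tiny synthetic image: light paper (250) with dark ink (20 and 30).
gray = [
    [250, 250, 250, 250, 250],
    [250,  20, 250,  30, 250],
    [ 20,  20,  30,  30,  20],
    [250, 250, 250, 250, 250],
]
t = otsu_threshold(gray)
binary = binarize(gray, t)
print(t, baseline_row(binary))
```

On real scans a smoothed projection profile would probably be more robust than a single maximum, but the idea is the same.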
After the preprocessing stage, most OCR systems isolate the individual characters or strokes before recognizing them. Segmenting a page of text can be divided into two levels: page decomposition and word segmentation. Page decomposition is used to separate the different page elements, producing text blocks, lines, and sub-words when the page contains different object types such as graphics, headings, and text blocks. Word segmentation, on the other hand, is used to separate the characters of a word or sub-word. The performance of the system depends on how accurately the characters are isolated. While this is generally true for cursive text recognition, it is especially pertinent to Arabic and similar alphabets, where characters connect within a word.
The next step after segmentation is feature extraction, in which the output of the segmentation step is used to extract features that are in turn passed to the next stage, the classifier. Features can be categorized into global transformations, structural features, statistical features, and template matching and correlation. Feature extraction can be controlled in two ways:
Interleaved control, in which the OCR system alternates between feature extraction and classification: it extracts a set of features from a pattern, passes them on to the classifier, then extracts another feature, and so on.
One-step control, in which the OCR system extracts all the required features from a primitive and then performs the classification.
The recognition stage is also called the classification step. Classification is the main decision-making stage, in which the extracted features of a test set are compared to those of the model set. Based on the features extracted from a pattern, classification attempts to identify the pattern as a member of a certain class. When classifying a pattern, classification often produces a set of hypothesized solutions instead of a unique solution. Classification follows three main models: syntactic (or structural), statistical (or decision-theoretic), and neural network classification. More generally, there are five main paradigms for performing pattern recognition: (1) template matching; (2) geometrical classification; (3) statistical classification; (4) syntactic or structural matching; and (5) artificial neural networks.
Problem statement of the research:
The OCR system is very important because it improves the interaction between humans and computers and has many practical applications that are independent of the language treated. So far, no convenient OCR system is available for the modern Turkish alphabet. For this reason, this research focuses on the features of the modern Turkish alphabet in order to produce a successful off-line OCR system. The system consists of several stages: preparing the database of the modern Turkish alphabet, entering the database into the computer with a scanner, reading the input (an image file), processing it, and then converting it to an editable format. The OCR system can then be integrated into devices such as mobile phones to convert any image file (captured by a camera or mobile phone, or scanned by a scanner) into a machine-readable, editable format.
The aim of this project is to develop a simple and easy-to-use off-line OCR system for the modern Turkish alphabet. To achieve this aim, the following objectives are set:
To introduce and highlight the characteristics of the modern Turkish alphabet.
To provide the Sorani alphabet database.
To convert any image file into a readable/editable format.
To improve the interaction between humans and computers.
To investigate the available pattern recognition and image processing approaches and find a suitable one for OCR.
The fundamental idea behind this project is to develop a simple and handy OCR system that can be integrated into devices such as mobile phones and laptops. The developed OCR system converts input image files containing modern Turkish text into an editable format.
Scope of research:
This research falls under computer vision and pattern recognition. An OCR algorithm will be developed to convert a scanned text image into an editable text document. The algorithm is programmed in MATLAB®, as it provides features such as efficient matrix and vector computations, application development including graphical user interface building, and string processing. The template set used in the recognition process was prepared with Paint and imported into the OCR algorithm. Users can import their text images using a scanner or digital camera, or they can create them with Paint; the latter was used during the implementation and testing phase of the project. The output text document can be printed or viewed on the computer screen. The initial setup of the project, with a scanner or smartphone for image digitization, a personal computer for image processing, and a printer for output, is shown in Figure 5.
Fig. 5 Required tools and equipment.
The proposed method will be implemented in MATLAB®, which has the powerful features mentioned earlier. Template matching is used as the OCR approach. Unlike the neural network approach, template matching takes less time and does not require training on samples. The main OCR steps are depicted in Figure 6.
Fig. 6 OCR project main steps.
First, the template is prepared and preprocessed. The preprocessing involves digitization, binarization, and noise removal. Next, the image is processed by identifying the lines and then the characters using a template-matching scheme. Finally, once recognition succeeds, the recognized patterns are written to a text document.
The following conditions are assumed during the implementation of the proposed OCR:
The font will be Arial (as it is widely used), black, bold, and 12 points in size. The input image resolution will range from 196 dpi (compressed fine-mode fax quality) up to 400 dpi, and the template size will be 24x42.
The image will be in black and white, free of noise or with little noise.
The input image consists of text only, divided into lines with one word per line.
The characters to be recognized are the uppercase letters of the modern Turkish alphabet only.
The templates used by the OCR will be prepared using Paint and MATLAB®. The templates will be drawn with the "Text" tool in Paint and then saved in the MATLAB® current directory. A short script will then crop each image and resize it to the desired size. This script also ensures that all templates are binarized and pixel-inverted, so that the template background is black and the letter is white.
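The template preparation described above (binarize, invert, crop, resize) can be sketched as follows. This is an illustration in plain Python, not the project's MATLAB script. It assumes that "24x42" means 24 columns by 42 rows, that a fixed gray threshold of 128 is adequate for clean Paint output, and that nearest-neighbour sampling is acceptable for resizing. The name `prepare_template` is hypothetical.

```python
def prepare_template(gray, rows=42, cols=24, threshold=128):
    """Binarize, invert, crop to the glyph, and resize to rows x cols.

    gray is a list of rows of gray values (0 = black ink, 255 = white paper).
    The result has a black (0) background and a white (1) letter.
    """
    # Binarize and invert in one step: dark ink -> 1, light paper -> 0.
    binary = [[1 if value < threshold else 0 for value in row] for row in gray]

    # Crop to the bounding box of the foreground pixels.
    ink_rows = [r for r, row in enumerate(binary) if any(row)]
    if not ink_rows:
        raise ValueError("image contains no foreground pixels")
    width = len(binary[0])
    ink_cols = [c for c in range(width) if any(row[c] for row in binary)]
    top, bottom = ink_rows[0], ink_rows[-1]
    left, right = ink_cols[0], ink_cols[-1]
    cropped = [row[left:right + 1] for row in binary[top:bottom + 1]]

    # Nearest-neighbour resize to the fixed template size.
    h, w = len(cropped), len(cropped[0])
    return [
        [cropped[r * h // rows][c * w // cols] for c in range(cols)]
        for r in range(rows)
    ]


# A 6x6 white image with three dark pixels forming a small corner shape.
image = [[255] * 6 for _ in range(6)]
image[2][1] = image[2][2] = image[3][1] = 0
template = prepare_template(image)
print(len(template), len(template[0]))
```

Every template produced this way has 42 x 24 = 1008 pixels, which matches the per-image pixel count used later in the matching stage.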
Before implementing the OCR, some preprocessing is required to convert the image into a valid format, which is ready for recognition. As shown in Figure 4, the preprocessing includes digitization, binarization, noise removal and skew correction.
As the input will be on physical paper, it must be converted into a digital format so that the system can manipulate it. This conversion from a printed page to a digital image involves specialized hardware such as an optical scanner or digital camera. The converted digital image is then saved for further processing. In this project, the input images used for testing will be created in Paint: the text is typed in Paint, saved in a specific format (e.g., JPEG), and then processed in MATLAB.
A binary image is a digital image that has only two possible values for each pixel. Typically the two colors used for a binary image are black (0) and white (1), though any other pair of colors can be used. In a binary image, the color used for the object(s) is the foreground color, while the color of the rest of the image is the background color. As OCR deals with text, which is usually black and white, input images should be converted to a binary format. Images in binary representation need very little storage space but usually suffer from information loss. After binarization, the pixels of the image will be inverted to give a black background and a white foreground. This color inversion simplifies the calculations, especially during line identification, in which many calculations involve the image background: with a background value of zero (i.e., black), background pixels contribute nothing to the sums.
During the scanning process, differences between the digital image and the original input (beyond those due to quantization when stored on the computer) can occur. Hardware or software defects, dust particles on the scanning surface, improper scanner use, etc. can change the expected pixel values. Such unwanted marks and inaccurate pixel values constitute noise that can reduce character recognition accuracy. There are usually two types of noise. The first type is additive noise, where background pixels are assigned a foreground value instead of a background value. This type of noise can be reduced or eliminated by removing any groups of foreground pixels that are smaller than some threshold, without removing small parts of characters such as the dots of 'i' and 'j' or some punctuation marks. For example, in the OCR code used in this project, the following MATLAB function removes all objects containing fewer than 10 pixels:
imagen = bwareaopen(imagen, 10) … (1)
The second type of noise occurs when pixels are assigned a noisy background value instead of the foreground value they should have been given. A linear or non-linear filter can be used to smooth such noisy images. These filters must be used carefully, because excessive smoothing can cause problems such as discontinuous character edges becoming joined or multiple characters merging together.
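To make the small-object removal of Equation 1 concrete, the following plain-Python sketch mimics what the `bwareaopen` call does on a binary image: it finds connected groups of foreground pixels and erases those with fewer than a given number of pixels. It is an illustrative stand-in, not MATLAB's implementation. It assumes 8-connectivity, and the function name is hypothetical.

```python
from collections import deque


def remove_small_objects(binary, min_pixels=10):
    """Return a copy of binary without components smaller than min_pixels.

    binary is a list of rows of 0/1 values; components are 8-connected.
    """
    h, w = len(binary), len(binary[0])
    out = [row[:] for row in binary]
    seen = [[False] * w for _ in range(h)]

    for r0 in range(h):
        for c0 in range(w):
            if not binary[r0][c0] or seen[r0][c0]:
                continue
            # Breadth-first flood fill collects one connected component.
            component = []
            queue = deque([(r0, c0)])
            seen[r0][c0] = True
            while queue:
                r, c = queue.popleft()
                component.append((r, c))
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = r + dr, c + dc
                        if (0 <= nr < h and 0 <= nc < w
                                and binary[nr][nc] and not seen[nr][nc]):
                            seen[nr][nc] = True
                            queue.append((nr, nc))
            # Erase the component if it is too small to be part of a letter.
            if len(component) < min_pixels:
                for r, c in component:
                    out[r][c] = 0
    return out


# A 3x4 block of ink (12 pixels) plus one isolated speck of noise.
page = [[0] * 8 for _ in range(6)]
for r in range(1, 4):
    for c in range(1, 5):
        page[r][c] = 1
page[5][7] = 1
cleaned = remove_small_objects(page, 10)
print(sum(map(sum, page)), sum(map(sum, cleaned)))
```

The threshold matters: a value large enough to remove specks may also remove legitimate small marks, which is the risk the text notes for dots and punctuation.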
Segmentation and Clustering:
In this stage, the clean digital page will be segmented so that the individual characters of interest can be extracted and subsequently recognized. A top-down approach is used. First, the lines in the page are identified using horizontal projection: the page is scanned horizontally to locate the first and last lines and to segment the page into lines. Next, the individual characters of each line are isolated using vertical projection: each line is scanned vertically to find groups (clusters) of connected pixels, where each group represents one character. Finally, the characters composing the lines are compared with the templates to obtain the best match.
Line identification and word extraction:
In general, any image consists of rows and columns of pixels; a group of rows and columns forms one line or, here, one word. Therefore, to identify the lines of an image, the rows are the main concern. First, the borders (i.e., the first and last rows of the text) have to be identified. This is done by scanning the image from top to bottom and removing the leading rows in which every column is labeled 0. The scan pauses when a row containing a column labeled 1 is found, and this row is recorded as the first row. The scan then resumes until another cluster of rows whose pixels are all zero is detected and deleted. The row just before this cluster is the last row. Second, the rows between the first and last rows are processed. Horizontal projection is used for line identification, as stated earlier. Its goal is to group the horizontally connected pixels and give them their row number as a label. If rows are connected, their labels are consecutive (this applies only to uppercase letters). The total number of rows constituting one word is just 12 because of the font size (i.e., 12-pt font). Lines are then extracted and stored in an array, and so on (Figure 7).
Fig. 7 line identification
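The line identification procedure can be sketched in plain Python as follows. This is an illustration rather than the project's MATLAB code. It assumes an inverted binary image (background 0, text 1) and treats every maximal run of consecutive non-empty rows as one text line; the function names are hypothetical.

```python
def find_text_lines(binary):
    """Return (first_row, last_row) pairs, inclusive, one per text line.

    The horizontal projection of a row is its number of foreground pixels;
    rows with a projection of zero separate consecutive lines.
    """
    profile = [sum(row) for row in binary]
    lines = []
    start = None
    for r, count in enumerate(profile):
        if count and start is None:
            start = r                      # first row of a new line
        elif not count and start is not None:
            lines.append((start, r - 1))   # row before the blank row
            start = None
    if start is not None:                  # text runs to the bottom edge
        lines.append((start, len(profile) - 1))
    return lines


def extract_lines(binary):
    """Cut the page into one sub-image per detected text line."""
    return [binary[first:last + 1] for first, last in find_text_lines(binary)]


page = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
print(find_text_lines(page))
```

Here the blank rows at the top and bottom are skipped, and the blank row in the middle splits the page into two lines.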
Connected components analysis and character extraction:
With this approach, vertical projection is performed and connected pixels are grouped together to create a set of connected components. There are two schemes: (1) the 8-connected scheme; and (2) the 4-connected scheme, as illustrated in Figure 8. In this OCR project, the 8-connected scheme is used because it is more accurate, as it considers all the surrounding pixels, including the diagonal ones.
Fig. 8 A single pixel with its 4-connected neighborhood (left) and 8-connected neighborhood (right).
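The difference between the two schemes can be seen in a short plain-Python sketch (an illustration, not the project's MATLAB code). It labels components with either neighborhood and scans column by column, so that components come out in left-to-right order of their leftmost pixel. The function names are hypothetical.

```python
def label_components(binary, connectivity=8):
    """Return the connected components as lists of (row, col) pixels."""
    if connectivity == 8:
        offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                   if (dr, dc) != (0, 0)]
    elif connectivity == 4:
        offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    else:
        raise ValueError("connectivity must be 4 or 8")

    h, w = len(binary), len(binary[0])
    seen = set()
    components = []
    # Column-major scan: each component is discovered at its leftmost column.
    for c0 in range(w):
        for r0 in range(h):
            if not binary[r0][c0] or (r0, c0) in seen:
                continue
            stack = [(r0, c0)]
            seen.add((r0, c0))
            pixels = []
            while stack:
                r, c = stack.pop()
                pixels.append((r, c))
                for dr, dc in offsets:
                    nr, nc = r + dr, c + dc
                    if (0 <= nr < h and 0 <= nc < w
                            and binary[nr][nc] and (nr, nc) not in seen):
                        seen.add((nr, nc))
                        stack.append((nr, nc))
            components.append(pixels)
    return components


def bounding_box(pixels):
    """Return (top, left, bottom, right) of a component, inclusive."""
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    return min(rows), min(cols), max(rows), max(cols)


# A diagonal stroke on the left and a vertical stroke on the right.
line = [
    [1, 0, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
]
print(len(label_components(line, 8)), len(label_components(line, 4)))
```

With the 8-connected scheme the diagonal stroke stays in one piece; with the 4-connected scheme it breaks into separate single-pixel components, which is why diagonal strokes of letters favor the 8-connected choice.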
OCR template matching:
The resized character images from the previous stage are compared with the templates stored in the database to obtain the best match. For this purpose, the MATLAB function in Equation 2 can be used to compute the correlation coefficient between A and B, where A and B are matrices or vectors of the same size. Matching is therefore based on the correlation output. The correlation output is problematic for letters with similar shapes (such as R and P, or O and Q) and for two letters that merge into one component, as in AZ, AA, TT, etc. These problems are overcome by considering the number of pixels in conjunction with the correlation output, so that the best matching character can be found. The output, a series of characters, can be written to a text document.
r = corr2(A, B) … (2)
The correlation coefficient computed in Equation 2 is defined in Equation 3:
$$ r = \frac{\sum_m \sum_n (A_{mn} - \bar{A})(B_{mn} - \bar{B})}{\sqrt{\left(\sum_m \sum_n (A_{mn} - \bar{A})^2\right)\left(\sum_m \sum_n (B_{mn} - \bar{B})^2\right)}} \quad (3) $$
where $\bar{A}$ = mean2(A) and $\bar{B}$ = mean2(B). Each image contains 1008 pixels in total, some black and some white. Using the pixel count as an additional feature, the characters can be compared and the problems described earlier can be overcome. For example, with the extraction mechanism used, the two letters AA are treated as one character because some of their pixels are connected at the bottom. The correlation output then returns the best match, which in this case is the letter M. This problem can be solved by using the pixel count: the number of foreground pixels of AA (>500) is greater than that of M (<500).
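The matching step can be sketched in plain Python as follows. This is an illustration, not the project's MATLAB code: `corr2` below re-implements the two-dimensional correlation coefficient that the MATLAB function of Equation 2 computes, and `classify` is a hypothetical helper showing one way to combine it with the foreground pixel count. The limit of 500 foreground pixels is taken from the AA versus M example; returning `None` for a suspected merged pair is an assumption.

```python
import math


def corr2(a, b):
    """Two-dimensional correlation coefficient of equally sized matrices."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("matrices must have the same size")
    flat_a = [v for row in a for v in row]
    flat_b = [v for row in b for v in row]
    mean_a = sum(flat_a) / len(flat_a)
    mean_b = sum(flat_b) / len(flat_b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(flat_a, flat_b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in flat_a)
                    * sum((y - mean_b) ** 2 for y in flat_b))
    # Assumption: a constant image has no defined correlation; report 0.0.
    return num / den if den else 0.0


def classify(char, templates, merged_limit=500):
    """Return the best-matching label, or None for a suspected merged pair.

    templates maps a label to a binary matrix of the same size as char.
    A character with more than merged_limit foreground pixels is assumed
    to be two touching letters (such as AA) and is not matched.
    """
    foreground = sum(sum(row) for row in char)
    if foreground > merged_limit:
        return None
    return max(templates, key=lambda label: corr2(char, templates[label]))


# Toy 3x3 "templates" keep the example readable; real ones are 42 x 24.
templates = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
}
print(classify([[0, 1, 0], [0, 1, 0], [0, 1, 0]], templates))
```

A character flagged as merged would then need to be split and matched again; how the split is done is outside the scope of this sketch.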