An Efficient Approach to Detect and Localise Text In Natural Scene Images



Abstract: Text in natural scene images can provide important information, depending on the application. Detecting text in natural scenes must be effective, so segmenting text from such images requires a high-performance method. In this paper, an efficient segmentation and classification technique is used. The system takes natural scene images as input. After converting the colour image to a gray-scale image, HOG features are used to find the edge values. The image is segmented using Niblack's local binarization, which identifies edges while suppressing the image background. The image is then classified using a CRF, which marks out the text blocks in the natural scene image. The system provides better segmentation of text and classifies with high detection accuracy.

Keywords: Image Processing, Text Detection, Image Segmentation, CRF


I. INTRODUCTION

Image segmentation plays an important role in image-processing applications. The main aim of image segmentation is to divide an image into regions that are meaningful with respect to the requirements of an application. Segmentation may be affected by metrics taken from the image such as gray level, texture, depth and colour. The use of various types of digital imaging devices has led to a need for advanced content-based image-analysis techniques. Text information in images is required for various applications, so developing better systems to detect text in images has become a research topic. In this paper we suggest an efficient approach to extract text from natural scenes. Jung et al. [1] describe a text-extraction system that integrates four processes: text detection, text localization, text extraction and enhancement, and recognition. The critical steps are text detection and text localization. The difficulties of existing systems are overcome by the proposed work: a CRF model classifies components based on unary component properties and binary contextual component relationships. The rest of the paper is organized as follows. Section II reports on related work. Section III describes the proposed work. Section IV gives implementation details. Section V presents the results. Concluding remarks are given in Section VI.


II. RELATED WORK

Zhang K et al. [2] present an enhanced method to detect traffic signs. It takes a colour image as input and produces a segmented binary image by adaptive colour segmentation based on pixel vectors. The shape feature vectors of the candidate regions are computed by central projection transformation. These shape features are used to train a neural network, which after training detects traffic signs among the candidates.

Yi-Feng Pan et al. [3] present an improved method for text detection. Stroke segmentation is performed by scale-adaptive segmentation, and stroke verification uses a CRF model with pairwise weights obtained by local line fitting. The ICDAR 2005 competition dataset is used for experimentation.

Maryam Darab et al. [4] describe a hybrid approach for detecting Farsi text in natural scene images. They choose two types of artificial neural network, i.e., a single-layer perceptron and a multi-layer perceptron. An SVM is used as the base classifier to exploit its superior generalization capability, although its computational complexity is high and its parameters cannot be estimated jointly with the CRF parameters.

ZHU Kai-hua et al. [5] propose a non-linear Niblack method to decompose the input into candidate connected components (CCs). The CCs are then fed to a cascade of classifiers trained by the AdaBoost algorithm.

Jonghyun Park et al. [6] propose a segmentation technique that segments the input image into chromatic and achromatic regions according to the RGB components. Objects are transformed into the wavelet domain for multiresolution analysis, and moment features of the wavelet coefficients are used in an SVM for classification of text objects.

Weinman et al. [7] use a conditional random field model for text detection, combining contextual information with local detection.


III. PROPOSED WORK

Our proposed work is to develop a robust system that detects and localizes text in natural scene images. We combine the advantages of region-based methods and connected-component methods. The system has two stages. In the first stage, text is detected by classifying local image regions to estimate their text confidence. In the second stage, text is localized by clustering local text regions into text blocks, and the text is verified to remove non-text regions from further processing. Connected-component (CC) methods generally have three stages: CC extraction, which segments candidate text components from the image; CC analysis, which filters out non-text components using heuristic rules or classifiers; and post-processing, in which text components are grouped into text blocks. Fig. 1 depicts the workflow of the proposed work.


A CRF [8] is a probabilistic graphical model. It is used in various applications such as natural language processing [9], computer vision [10] and face detection [11].

Let A be the random variable over the data to be labeled and B the random variable over the corresponding labels. Let G = (V, E) be a graph such that B = (B_v), v ∈ V, so that B is indexed by the vertices of G. Then (A, B) is a conditional random field if, when conditioned on A, the random variables B_v obey the Markov property with respect to the graph:

p(B_v | A, B_w, w ≠ v) = p(B_v | A, B_w, w ∼ v),

where w ∼ v means that w and v are neighbors in G. In other words, a CRF is a random field globally conditioned on A. For a simple data set, the joint distribution over the label set B given A has the form

p(b | a) ∝ exp( Σ_{e∈E,k} λ_k f_k(e, b|_e, a) + Σ_{v∈V,k} μ_k g_k(v, b|_v, a) ).

Here a is the observed data, b is a labeling, b|_S denotes the components of b associated with the vertices of the subgraph S, f_k and g_k are feature functions, and λ_k and μ_k are their weights.
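As a toy illustration of this distribution (the potentials and chain structure below are assumptions for the sketch, not the paper's model), a tiny chain CRF over three candidate components can be normalized by brute-force enumeration:

```python
import itertools
import math

# Illustrative 3-node chain CRF: each component is labeled non-text (0) or
# text (1).  The unary scores play the role of the g_k terms above and the
# pairwise scores the role of the f_k terms; all values are assumed.
unary = [
    {0: 0.2, 1: 1.5},
    {0: 0.1, 1: 1.2},
    {0: 2.3, 1: 0.3},
]
pairwise = {(0, 0): 0.8, (0, 1): -0.4, (1, 0): -0.4, (1, 1): 0.9}

def score(b):
    """Unnormalized log-score of a labeling b = (b0, b1, b2)."""
    s = sum(unary[v][b[v]] for v in range(3))
    s += sum(pairwise[(b[v], b[v + 1])] for v in range(2))
    return s

# Normalize by enumerating all labelings (feasible only for tiny graphs).
labelings = list(itertools.product([0, 1], repeat=3))
Z = sum(math.exp(score(b)) for b in labelings)
prob = {b: math.exp(score(b)) / Z for b in labelings}

best = max(prob, key=prob.get)
print("most likely labeling:", best)
```

In practice exact enumeration is intractable for images, and inference is done with graph-cut or message-passing methods; the sketch only shows how unary and pairwise terms combine.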


This step decomposes the image into connected components. If it yields poor results, the whole system is affected, so additional care is taken here. The method, proposed by Winger et al. [9], is an efficient thresholding method.
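As a sketch of the underlying idea (the window size and k value below are illustrative assumptions, not the paper's settings), Niblack-style local binarization thresholds each pixel at the local mean plus k times the local standard deviation:

```python
import numpy as np

def niblack_binarize(gray, window=15, k=-0.2):
    """Sketch of Niblack-style local binarization: threshold each pixel at
    T = local_mean + k * local_std over a (window x window) neighbourhood.
    window and k are illustrative choices, not the paper's settings."""
    gray = gray.astype(np.float64)
    pad = window // 2
    padded = np.pad(gray, pad, mode="reflect")
    h, w = gray.shape
    out = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            patch = padded[y:y + window, x:x + window]
            out[y, x] = gray[y, x] > patch.mean() + k * patch.std()
    return out
```

Efficient implementations compute the local mean and standard deviation with integral images instead of the per-pixel loop shown here.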

Fig. 1 Workflow of the proposed work


In this module, text regions are detected. First the text confidence is estimated, then the information is scaled by scale-adaptive binarization [15]. The module takes the image as input and produces candidate text components as output.

Fig. 2 Text Detector module


Here a CRF [13][14] model is used to filter out non-text components by combining unary component properties with binary contextual component relationships.


Adaptive clustering [11][12] is used to group text components into lines or words. Adaptive clustering is an unsupervised learning method that optimizes explicit and implicit criteria over the image; it also supports memorizing clusters so that good clusterings can be reused.
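As an illustrative sketch only (the distance threshold and single-link rule are assumptions, not the paper's exact clustering criteria), nearby component centroids can be grouped into candidate text lines by transitive single-link grouping:

```python
import math

def group_components(centroids, max_dist=30.0):
    """Single-link grouping sketch: components whose centroids lie within
    max_dist of each other (directly or transitively) share a group.
    max_dist is an illustrative threshold, not the paper's value."""
    n = len(centroids)
    parent = list(range(n))  # union-find forest over component indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(centroids[i], centroids[j]) <= max_dist:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(g) for g in groups.values())
```

For example, centroids spaced 20 px apart along a line merge into one group, while an isolated component 160 px away stays in its own group.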


In our method, 17 features are used to discriminate text CCs from non-text CCs. The features fall into five categories: geometric, shape regularity, edge, stroke, and spatial coherence. After the connected components are formed, the segmentation problem is posed as classification: it is then enough to categorize components into text and non-text blocks.
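As a sketch of the geometric category only (these particular measures are illustrative assumptions; the paper's full 17-feature set is not reproduced here), simple geometric features of a connected component can be computed from its pixel coordinates:

```python
def geometric_features(pixels):
    """Illustrative geometric features of a connected component, given its
    pixel coordinates as (row, col) pairs: bounding-box aspect ratio,
    occupancy (fill) ratio, and area in pixels."""
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    h = max(rows) - min(rows) + 1
    w = max(cols) - min(cols) + 1
    return {
        "aspect_ratio": w / h,           # elongated CCs are often non-text
        "occupancy": len(pixels) / (w * h),  # fraction of the box filled
        "area": len(pixels),
    }
```

Such per-component values serve as the unary evidence that the CRF combines with pairwise contextual relationships between neighbouring components.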



Step 1: Convert color image to gray-level image

Step 2: Generate the histogram of oriented gradients (HOG).

Step 3: Find the edge values using HOG features.


Step 4: Segment the image by suppressing the image's background.

Step 5: Classify text and non-text blocks using CRF.

Step 6: Text Region is grouped using adaptive clustering.


Step 7: Remove noise by region dilation.

Step 8: Determine the angle of the text block using the Radon transform.

Step 9: Crop the text block.

Step 10: Perform quantization, equalization, binarization and normalization.

Step 11: Remove horizontal contours.
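Steps 2-3 can be sketched as follows (the cell size and bin count are illustrative assumptions; full HOG as used in practice also applies block normalization, which is omitted here):

```python
import numpy as np

def hog_cell_histograms(gray, cell=8, bins=9):
    """Minimal HOG sketch: per-cell histograms of gradient orientations,
    weighted by gradient magnitude.  cell and bins are illustrative, and
    the block normalization of full HOG is omitted."""
    gray = gray.astype(np.float64)
    gy, gx = np.gradient(gray)           # gradients along rows, columns
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0  # unsigned orientation

    h, w = gray.shape
    ch, cw = h // cell, w // cell
    hist = np.zeros((ch, cw, bins))
    bin_idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
    for i in range(ch):
        for j in range(cw):
            sl = np.s_[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell]
            hist[i, j] = np.bincount(bin_idx[sl].ravel(),
                                     weights=mag[sl].ravel(),
                                     minlength=bins)[:bins]
    return hist
```

Cells dominated by strong, consistently oriented gradients (typical of character strokes) produce peaked histograms, which is what makes these features useful as edge evidence in Step 3.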


IV. IMPLEMENTATION

The proposed work is implemented in MATLAB. A natural scene image containing text is taken as input; the system accepts either BMP or JPEG format. The results are shown in Figs. 3-17.


V. RESULTS

Fig. 3 shows the complex natural scene image taken as input by our system. Fig. 4 shows the input gray image, to which a feature descriptor, the histogram of oriented gradients (HOG), is applied to produce the HOG features shown in Fig. 5; HOG helps in finding the edge values. Fig. 6 shows the image after classification, with the CRF model used as the classifier.


Fig. 3 Input image


Fig. 4 Input gray image


Fig. 5 HOG features: text confidence


Fig. 6 After classification

  1. Segmentation:

Niblack's local binarization algorithm [27] is applied and produces high-quality output. Fig. 7 shows the image after background suppression, and Fig. 8 shows it after classification.


Fig. 7 After background suppression


Fig.8 After classification

  2. Adaptive clustering:

Adaptive clustering is applied to group text regions and separate them from non-text regions. Fig. 9 shows the image after region grouping.


Fig. 9 Region grouping

  3. Post processing:

Noise in the image is removed by region dilation; Fig. 10 shows the dilated image. The angle of the text is determined using the Radon transform; Fig. 11 shows the identified text angle. Fig. 12 shows the cropped image containing the text. Figs. 13-16 show the remaining post-processing steps.
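The dilation used in this stage can be sketched as follows (the structuring-element size is an illustrative assumption): binary region dilation grows each foreground region by the shape of a structuring element.

```python
import numpy as np

def dilate(binary, size=3):
    """Binary dilation sketch with a (size x size) square structuring
    element: a pixel becomes foreground if any pixel in its neighbourhood
    is foreground.  size is an illustrative choice."""
    pad = size // 2
    padded = np.pad(binary.astype(bool), pad, mode="constant")
    h, w = binary.shape
    out = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            out[y, x] = padded[y:y + size, x:x + size].any()
    return out
```

Dilation closes small gaps between character fragments so that each text block forms a single connected region before cropping.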


Fig. 10 Dilated Region


Fig. 11 Angle of text


Fig. 12 Text crop


Fig. 13 Gray Scale text


Fig.14 Text Quantization and Equalization


Fig. 15 Binary Text


Fig. 16 Normalized Text


Fig. 17 Text with horizontal contours adjusted

The overall performance of the system on the existing dataset is shown in Table 1, which reports the precision, recall and F1 of the system for text detection and text localization, together with the average speed in seconds.

VI. CONCLUSION

In this paper we present an efficient approach to detect and localize text in natural scene images. Region information is integrated with a reliable connected-components method, and binary contextual component relationships are combined with unary component properties in the CRF model, which effectively separates text from non-text regions. Experimental results show that the proposed work is effective for unconstrained scene-text localization in many respects. Although the system is efficient, it does not withstand hard images, i.e., images from which text is difficult to extract; a solution is to consider colour information. The false-positive rate should also be treated as an issue, depending on the applications that use this system.


REFERENCES

[1] K. Jung, K. I. Kim, and A. K. Jain, "Text information extraction in images and video: A survey," Pattern Recognition, vol. 37, no. 5, pp. 977-997, 2004.

[2] K. Zhang, Y. Sheng, and J. Li, "Automatic detection of road traffic signs from natural scene images based on pixel vector and central projected shape feature," IET Intelligent Transport Systems, vol. 6, no. 3, pp. 282-291, 2012.

[3] Yi-Feng Pan, Yuanping Zhu, Jun Sun, and Satoshi Naoi, "Improving scene text detection by scale-adaptive segmentation and weighted CRF verification," in Proc. Int. Conf. on Document Analysis and Recognition, 2011, pp. 759-763.

[4] Maryam Darab and Mohammad Rahmati, "A hybrid approach to localize Farsi text in natural scene images," Procedia Computer Science, vol. 13, pp. 171-184, 2012.

[5] ZHU Kai-hua, QI Fei-hu, JIANG Ren-jie, and XU Li, "Automatic character detection and segmentation in natural scene images," Journal of Zhejiang University SCIENCE A, vol. 8, no. 1, pp. 63-71, 2007.

[6] Jonghyun Park and Gueesang Lee, "A robust algorithm for text region detection in natural scene images," Can. J. Elect. Comput. Eng., vol. 33, no. 3/4, Summer/Fall 2008.

[7] J. Weinman, E. Learned-Miller, and A. Hanson, "Scene text recognition using similarity and a lexicon with sparse belief propagation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 10, pp. 1733-1746, 2009.

[8] J. Lafferty, A. McCallum, and F. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," in Proc. 18th Int. Conf. Machine Learning (ICML'01), San Francisco, CA, 2001, pp. 282-289.

[9] L. Winger, J. A. Robinson, and M. E. Jernigan, "Low-complexity character extraction in low-contrast scene images," Int. J. Pattern Recognition and Artificial Intelligence, vol. 14, no. 2, pp. 113-135, 2000.

[10] Keechul Jung, Kwang In Kim, and Anil K. Jain, "Text information extraction in images and video: A survey," Pattern Recognition, vol. 37, no. 5, pp. 977-997, 2004.

[11], "Efficient automatic text location method and content-based indexing and structuring of video database," Journal of Visual Communication and Image Representation, vol. 7, no. 4, pp. 336-344, 1996.

[12] Mohieddin Moradi, Saeed Mozaffari, and Ali Asghar Orouji, "Farsi/Arabic text localization from video images by corner detection," in Proc. 6th IEEE Iranian Conf. on Machine Vision and Image Processing, Isfahan, Iran, 2010.

[13] Chung-Wei Liang and Po-Yueh Chen, "DWT based text localization," International Journal of Applied Science and Engineering, pp. 105-116, 2004.

[14] Nikolaos G. Bourbakis, "A methodology for document processing: separating text from images," Engineering Applications of Artificial Intelligence, vol. 14, pp. 35-41, 2001.

[15] C. Strouthopoulos, N. Papamarkos, and A. E. Atsalakis, "Text localization in complex color documents," Pattern Recognition, vol. 35, pp. 1743-1758, 2002.