Current State of Art and Writer Identification

Writer identification has become increasingly important in recent years. It can be applied in a wide range of areas, such as digital rights management in the financial sphere and forensic expert decision-making systems in criminology, where the writer identification system provides a narrowed-down list of candidate writers. Combined with writer verification as an authentication system, it can be used to monitor and regulate access to confidential sites or data. Where large amounts of documents, forms, notes and meeting minutes are constantly being processed and managed, knowing the identity of the writer provides additional value. It can also be used for historical document analysis [1], handwriting recognition system enhancement [2], and hand-held and mobile devices [3]. Its recent development and performance have led it to be considered, to a certain extent, alongside strong physiological modalities of identification such as DNA and fingerprints [4]. As a result of these opportunities, the number of researchers working on this challenging problem is growing.

Handwriting-based writer identification is an active research area. It is one of the most difficult problems in computer vision and pattern recognition, and it involves a number of sub-problems: a) designing algorithms to identify the handwriting of different individuals, b) identifying relevant features of the handwriting, c) finding basic methods for representing the features, d) identifying complex features from the basic features developed, and e) evaluating the performance of automatic methods.

A comprehensive review of automatic writer identification up to 1989 is given in [5], and an extension covering the work from 1989 to 1993 was published in [6]. The approaches proposed in the last several years have renewed the interest of the scientific community in this research topic. Figure 1 shows the standard framework of writer identification [7].

Fig. 1 Writer Identification framework [7]

Based on the input method of writing, automated writer identification is classified into on-line and off-line. The on-line task is considered less difficult than the off-line one because it contains more information about the writing style of a person, such as speed, angle or pressure, which is not available off-line [8, 9]. Writer identification can also be classified by the features of the writing on which it is based, such as character, word, line, paragraph and document. Figure 2 shows the taxonomy of this classification.

Fig. 2 Taxonomy of writer identification w.r.t features of the writing

Automated writer identification is also classified into text-dependent and text-independent methods. Text-dependent methods match only the same characters and therefore require the writer to write the same text. They are very similar to signature verification techniques and compare individual characters or words of known semantic (ASCII) content. Because they require the same writing content, they are not suitable for many practical situations. Text-independent methods identify writers regardless of the text content and do not require comparison of the same characters; they use the global style of the handwritten text as the metric for comparison. Although they have wider applicability, text-independent methods do not achieve the same high accuracy as text-dependent methods.

The following sections describe the various approaches to writer identification in different languages.

1.1 Chinese, English and other languages

A text-independent approach to writer identification, which derives writer-specific texture features using multichannel Gabor filtering and gray-scale co-occurrence matrices, was proposed in the late nineties. The framework needed uniform blocks of text, which were created by word deskewing, setting a predefined distance between text lines and words, and text padding. Two sets of twenty writers, with 25 samples per writer, were used in the experiment. Nearest centroid classification using weighted Euclidean distance and Gabor features achieved 96 percent writer identification accuracy, and the two-dimensional Gabor model outperformed the gray-scale co-occurrence matrix. A similar approach has also been used on machine-printed documents for script [16] and font [17] identification.
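
The nearest centroid classification with weighted Euclidean distance mentioned above can be illustrated with a minimal sketch. It assumes that each writer is summarised by a mean feature vector and a per-feature standard deviation, and that the weights are the inverse standard deviations; the function names and the two-feature data are hypothetical and are not taken from the cited work.

```python
import math

def weighted_euclidean(x, centroid, std):
    """Distance between feature vector x and a writer centroid, with each
    feature difference scaled by that writer's standard deviation (assumed > 0)."""
    return math.sqrt(sum(((xi - ci) / si) ** 2
                         for xi, ci, si in zip(x, centroid, std)))

def nearest_centroid(x, writers):
    """writers maps a writer id to a (centroid, std) pair of equal-length lists.
    Returns the id whose centroid is closest to x under the weighted distance."""
    return min(writers, key=lambda w: weighted_euclidean(x, *writers[w]))

# Hypothetical two-feature example: writer B varies a lot in feature 0,
# so a query far from B's mean in that feature is still plausible for B.
writers = {
    "A": ([0.0, 0.0], [1.0, 1.0]),
    "B": ([10.0, 0.0], [10.0, 1.0]),
}
print(nearest_centroid([4.0, 0.0], writers))  # prints B
```

With an unweighted Euclidean distance the same query would be assigned to writer A, which is why the weighting matters when feature spreads differ between writers.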

In 2000, Zois and Anastassopoulos implemented writer identification and verification using single words. The experiment was performed on a data set of 50 writers, in both English and Greek, in which each writer wrote the word "characteristic" 45 times. After image thresholding and curve thinning, the horizontal projection profiles were resampled, divided into 10 segments, and processed using morphological operators at two scales to obtain 20-dimensional feature vectors. Classification was performed using either a Bayesian classifier or a multilayer perceptron, and accuracy was 95% for both English and Greek words. In the approach of Marti, Hertel and Bunke [31], text lines were the basic input unit, from which text-independent features were computed using the height of the three main writing zones, slant and character width, the distances between connected components, the blobs enclosed inside ink loops, the upper and lower contours, and the thinned trace processed using dilation operations. Using a k-nearest-neighbour classifier, identification rates exceeded 92 percent in tests on a subset of the IAM database with fifty writers and five handwritten pages per writer. The IAM data set will also be used in the current study.
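
As a rough illustration of the projection-profile idea, the sketch below computes a projection profile from a binary word image and summarises it in a fixed number of segments. It assumes the profile is the count of ink pixels in each column (conventions vary) and uses simple segment averages; the morphological processing at two scales is not reproduced, and all names are hypothetical.

```python
def horizontal_projection(image):
    """Count ink pixels (value 1) in each column of a binary image,
    given as a list of equal-length rows."""
    if not image:
        return []
    width = len(image[0])
    return [sum(row[x] for row in image) for x in range(width)]

def segment_means(profile, n_segments=10):
    """Split the profile into n_segments nearly equal parts and
    return the mean of each part (0.0 for an empty part)."""
    n = len(profile)
    means = []
    for i in range(n_segments):
        start = i * n // n_segments
        end = (i + 1) * n // n_segments
        chunk = profile[start:end]
        means.append(sum(chunk) / len(chunk) if chunk else 0.0)
    return means

# Tiny hypothetical 2x4 binary image.
image = [
    [1, 0, 1, 1],
    [1, 0, 0, 1],
]
profile = horizontal_projection(image)
print(profile)                    # prints [2, 0, 1, 2]
print(segment_means(profile, 2))  # prints [1.0, 1.5]
```

Fixing the number of segments makes the feature vector length independent of the word width, so words written at different sizes can be compared.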

A methodology to identify the writer of numerals was proposed by Graham Leedham, using feature parameters such as height, width, area, center of gravity, slant and number of loops. It was tested on fifteen people with an accuracy of 95%, although this accuracy should be verified on larger databases. A large number of features were proposed by Srihari, which can be classified into two categories, macrofeatures and microfeatures. Macrofeatures operate at document, paragraph and word level. The parameters used are gray-level entropy and threshold, number of ink pixels, number of interior and exterior contours, number of four-direction slope components, average height and slant, paragraph aspect ratio and indentation, word length, and upper/lower zone ratio. Microfeatures operate at word and character level. They comprise gradient, structural, and concavity (GSC) attributes, which were originally applied to handwritten digit recognition. Text-dependent statistical evaluations were performed on a data set containing a thousand writers who copied a fixed text of 156 words (the CEDAR letter) three times. This is the largest data set used so far in writer identification. Microfeatures outperformed macrofeatures, with an accuracy exceeding 80 percent in identification tests. For writer verification, a multilayer perceptron or parametric distributions were used, with an accuracy of about 96 percent. Writer discrimination was also performed using individual characters in [22], [23] and using words in [24], [25].

To encode the individual characteristics of handwriting independently of the text content, Bensefia used graphemes generated by a handwriting segmentation method, which is very similar to our allograph-level approach. Grapheme clustering was used to define a feature space common to all documents in the data set. Experiments were carried out on three data sets containing 88 writers, 39 writers (historical documents), and 150 writers, with two samples (text blocks) per writer. Writer identification was performed in an information retrieval framework, while writer verification was based on the mutual information between the grapheme distributions in the two handwritings being compared. Concatenations of graphemes are also analyzed in the mentioned papers. About 90 percent accuracy was reported on the different test data sets, and a feature selection study was also performed.

Ameur Bensefia developed a probability-based approach using a codebook of graphemes; the system accuracy was 95% on the IAM database and 86% on the PSI database. Laurens van der Maaten also used a combination of simple directional features and a codebook of graphemes [41]; the system accuracy was 97% when the method was tested on 150 writers. For English writer identification, Vladimir Pervouchine focused only on the letters "t" and "h", whose skeletons were extracted after these shapes had been detected in the image. The writer is identified by the similarity of cost functions calculated along the curve [42]. This method clearly cannot be extended to other languages. Schomaker introduced a method based on fragmented connected-component contours (FCO3) [35, 36]. In the classification phase the χ² (chi-square) measure was used to calculate distance, and the method was tested on an English data set of 150 writers, with 72% accuracy in top-1 and 93% in top-10. The top-10 results were therefore satisfactory, but the top-1 results were not.
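
Top-1 and top-10 figures such as those above come from ranking all enrolled writers by their distance to the query sample. The sketch below shows one way this could look, assuming each writer is represented by a normalised feature histogram and that histograms are compared with a chi-square distance in its common symmetric form; the writer ids and histogram values are hypothetical.

```python
def chi_square_distance(p, q):
    """Chi-square distance between two histograms of equal length.
    Bins that are empty in both histograms contribute nothing."""
    total = 0.0
    for a, b in zip(p, q):
        if a + b > 0:
            total += (a - b) ** 2 / (a + b)
    return total

def rank_writers(query, references):
    """references maps a writer id to a histogram.
    Returns writer ids sorted from most to least similar to the query."""
    return sorted(references,
                  key=lambda w: chi_square_distance(query, references[w]))

def in_top_k(true_writer, ranking, k):
    """True if the correct writer appears among the first k candidates."""
    return true_writer in ranking[:k]

# Hypothetical three-bin histograms for three enrolled writers.
references = {
    "A": [0.5, 0.5, 0.0],
    "B": [0.1, 0.1, 0.8],
    "C": [0.4, 0.4, 0.2],
}
ranking = rank_writers([0.45, 0.45, 0.1], references)
print(ranking)  # prints ['C', 'A', 'B']
```

If the true writer of the query were A, this sample would count as a miss in top-1 but a hit in top-2, which is how a method can score modestly in top-1 and well in top-10.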

Schlapbach presented an HMM-based writer identification and verification method [37, 38]. An individual HMM was designed and trained for each writer's handwriting. To determine which writer has written an unknown text, the text is given to all the HMMs, and the one with the highest score is assumed to be the writer. This identification method was tested on documents gathered from 650 writers, with an accuracy of 97%. The method was also tested for writer verification, using writings collected from 100 people together with twenty unskilled and twenty skilled impostors who forged the originals; the results showed about 96% overall accuracy in verification. With some changes to the feature extraction phase, this method can also be extended to other languages. The two writer identification schemes in [39] and [40] differ in that the former was applied to English handwriting and achieved about 80% accuracy in top-1 and about 92% in top-10, while the latter supported Arabic handwriting and achieved 88% in top-1 and 99% in top-10.

In 2007, Vladimir Pervouchine et al. [34] introduced a writer identification scheme based on high-frequency characters. In this method, the frequently occurring characters ('f', 'd', 'y', 'th') are identified first, and the writer is then selected according to the similarity of those characters, calculated with respect to features associated with each character (such as height, width and slant). The number of features differs from character to character (e.g. 'f' had 7 features while 'th' had 10). A simple Manhattan distance was used in the classification phase. To select the best subset of features, a genetic algorithm (GA) was used, which evaluated about 5000 of the 2^31 possible subsets. The system was tested on a database of 165 writers (between 15 and 30 patterns per writer), and accuracy exceeded 95%. The method is simple and gives good results. Its main weakness is that a writer who knows the procedure can write the test text so that its characters differ completely from the trained ones, and the system then cannot identify him or her.
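
To make the role of the feature subset concrete, here is a small sketch of Manhattan-distance matching restricted to selected feature indices. It only illustrates how the chosen subset can change the decision; the genetic search itself is not shown, and the feature values and names are hypothetical.

```python
def manhattan(a, b, selected=None):
    """Manhattan (city-block) distance between two feature vectors.
    If selected is given, only those feature indices are compared."""
    indices = range(len(a)) if selected is None else selected
    return sum(abs(a[i] - b[i]) for i in indices)

def nearest_writer(query, references, selected=None):
    """references maps a writer id to a feature vector.
    Returns the id with the smallest Manhattan distance to the query."""
    return min(references,
               key=lambda w: manhattan(query, references[w], selected))

# Hypothetical character features: (height, width, slant).
references = {"w1": [1, 2, 3], "w2": [4, 4, 4]}
query = [1, 2, 10]

print(nearest_writer(query, references))                # prints w1
print(nearest_writer(query, references, selected=[2]))  # prints w2
```

The two calls disagree because the third feature alone favours a different writer, which is the kind of effect a feature selection step is meant to evaluate.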

Also in 2007, Bangy Li et al. [10] used a feature vector of hierarchical structure in shape primitives, along with dynamic and static features, for writer identification on 242 writers from the NLPR online database, and attained above 90% for Chinese and about 93% for English. The explanation given is that English text contains more orientation information than Chinese text. In 2008, Zhenyu He proposed an offline Chinese writer identification scheme which used a Gabor filter to extract features from the text and incorporated a Hidden Markov Tree (HMT) in the wavelet domain [12]. The system was tested on a database of 1000 documents written by 500 writers, each sample containing 64 Chinese characters. The top-1, top-15, and top-30 results were 40%, 82.4%, and 100% accuracy, respectively. The same authors also used a combination of the generalized Gaussian density (GGD) model and the wavelet transform on Chinese handwriting in Ref. [13]. They tested the method on a database gathered from 500 people, with 2 handwriting images per person. In the experiments, the top-1, top-15 and top-30 results were 39.2%, 84.8% and 100% accuracy, respectively. The authors reported that the accuracy of the proposed methods was low, especially in top-1.

In 2009, YuChen Yan et al. [11] introduced a spectral feature extraction method based on the Fast Fourier Transform. It was tested on 200 Chinese handwriting texts collected from 100 writers and showed 98% accuracy for top-10 and 64% for top-1 using Euclidean and WED classifiers. Its stable features reduce the randomness in Chinese characters. Although it has higher computational costs, it is feasible for large data sets.

1.2 Arabic

Bulacu et al. proposed text-independent Arabic writer identification by combining textural and allographic features [40, 45]. After extracting the textural features (mostly relations between different angles in each written pixel), a probability distribution function was generated, and a nearest-neighbor classifier with χ² as the distance measure was used. For the allographic features, a codebook of 400 allographs was generated from the handwriting of 61 writers, and the similarity of these allographs was used as another feature. The database in these experiments consisted of 350 writers with 5 samples per writer, each sample consisting of 2 lines (about 9 words). The accuracy was 88% in top-1 and 99% in top-10. A simpler version of this method was presented earlier by M. Bulacu et al. in [46].

Ayman Al-Dmour et al. designed an Arabic writer identification system in 2007 [47], using different feature extraction methods such as hybrid spectral-statistical measures (SSMs), multiple-channel (Gabor) filters, and the grey-level co-occurrence matrix (GLCM), in order to find the best subset of features. A support vector machine (SVM) was used to rank the features, and a GA (whose fitness function was a linear discriminant classifier (LDC)) then chose the best subset. Classification methods such as LDC, SVM, weighted Euclidean distance (WED), and K nearest neighbors (KNN) were considered. The KNN-5, WED, SVM, and LDC results after feature selection on sub-images were reported as 57.0%, 47.0%, 69.0% and 90.0%, respectively. The results were better when the whole image was used; for instance, the LDC result reached 100% (with no rotation). The database came from 20 writers, each of whom was asked to copy 2 A4 documents, one for training and the other for testing. The documents differed from writer to writer, and the sub-images were obtained by dividing each document into 3x3 = 9 non-overlapping images. The test database and the number of samples per writer appear small, and the method should be tested on a more widely used data set. Nevertheless, it achieved good accuracy when LDC was used. Faddaoui and Hamrouni adopted a set of 16 Gabor filters [48] for handwriting texture analysis. Gazzah and Ben Amara applied spatial-temporal textural analysis in the form of lifting scheme wavelet transforms. Angular features have also been considered for Arabic writer identification [49].
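
The division of each document into 3x3 = 9 non-overlapping sub-images described above can be sketched as follows. This is only an illustration: the image is assumed to be a list of equal-length pixel rows, any remainder rows or columns that do not fill a block are dropped, and the function name is hypothetical.

```python
def split_into_blocks(image, rows=3, cols=3):
    """Split a 2-D image (list of equal-length rows) into rows x cols
    non-overlapping blocks, returned in row-major order.
    Pixels beyond the last full block are ignored."""
    height, width = len(image), len(image[0])
    block_h, block_w = height // rows, width // cols
    blocks = []
    for r in range(rows):
        for c in range(cols):
            block = [line[c * block_w:(c + 1) * block_w]
                     for line in image[r * block_h:(r + 1) * block_h]]
            blocks.append(block)
    return blocks

# Hypothetical 6x6 image whose pixel values encode their position.
image = [[r * 6 + c for c in range(6)] for r in range(6)]
blocks = split_into_blocks(image)
print(len(blocks))  # prints 9
print(blocks[0])    # prints [[0, 1], [6, 7]]
```

Each block can then be passed through the same feature extraction as a full page, which multiplies the number of samples per writer at the cost of less text per sample.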

Somaya Al-Ma'adeed et al. introduced a text-dependent writer identification method for Arabic using only 16 words [44]. WED was used as the classifier, with features such as height, area, length, and three edge-direction distributions with different sizes. The test data comprised 32 000 Arabic text images from 100 people; the system was trained on 75% of the data and tested on the remaining 25%. The authors did not report the top-1 accuracy, but the best top-10 result, obtained when 3 words were used, was 90%. The main concerns with this method are its dependency on the text and the small data set used in the experiments. The method employs edge-based directional probability distributions combined with moment invariants and structural word features such as area, length, height, length from the baseline to the upper edge and length from the baseline to the lower edge. For their writer identification scheme, Abdi et al. used stroke measurements of Arabic words, such as length, ratio and curvature, in the form of PDFs and a cross-correlation transform of features [50].

Although Arabic is similar to Persian in its character set and some writing styles, Arabic methods may not extend completely to Persian because of some special symbols that exist in Arabic.

1.3 Persian

A Gabor-based system for Persian writer identification was proposed in 2006 by Shahabi et al., with a reported accuracy of about 92% in top-3 and 88% in top-1 [51]. In their test there was only one page per person, 3/4 of which was used for training and the rest for testing, so the testing was not adequate. To verify these results in a more general way, we implemented and tested their method using 5 pages per writer for training and a separate page for testing; the result was 60% accuracy on 80 people. Soleymani Baghshah et al. introduced a fuzzy approach for Persian writer identification [57]. Fuzzy directional features were used in this method, and a fuzzy learning vector quantization (FLVQ) was trained to recognize the writers. However, it works only on disjoint Persian characters, which are not conventional in Persian. The system was tested on 128 writers, and results were around 90%-95% under different test conditions.

A Persian handwriting identification system based on a new generation of Gabor filter, called the XGabor filter, was proposed in 2008 [52]. Feature extraction was done using Gabor and XGabor filters, and a weighted Euclidean distance (WED) classifier was used in the classification phase. To test the system, we organized a data set of 100 people's handwriting, which has also been used in other works. This data set is referred to as PD100 in the present paper. The method [52] achieved 77% accuracy on PD100. Rafiee and Motavalli [58] designed a Persian writer identification method that uses baseline and width structural features and relies on a feed-forward neural network for classification.

In another recent work, we proposed an LCS (longest common subsequence) based classifier to classify features extracted by Gabor and XGabor filters [53, 54]. This classifier achieved accuracy of up to 95% on PD100. The accuracy of these methods was not satisfactory because of problems in data classification and representation. However, the features extracted by the XGabor filter could model the characteristics of written documents. In the present paper we use the XGabor filter together with the Gabor filter, with different data representation, classification, and identification schemes. A mixture of different methods was used in other research by Sadeghi Ram et al. Grapheme-based features are clustered using a fuzzy clustering method and, after some clusters are selected, the final decision is made based on gradient features. This method achieved 90% average accuracy on 50 people selected randomly from PD100 [55]. They also used a three-layer MLP (multi-layer perceptron) to classify the gradient-based features and obtained 94% average accuracy on the same data set [56]. To the best of our knowledge, there is no other reported method for Persian writer identification.
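
For readers unfamiliar with the longest common subsequence, the sketch below shows the standard dynamic-programming computation of its length and a simple way it could be used to pick the closest writer. It assumes features have already been quantised into sequences of discrete symbols; how the cited work encodes Gabor and XGabor features is not specified here, so the sequences and names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b,
    computed with dynamic programming using one row of memory at a time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def best_match(query, references):
    """references maps a writer id to a symbol sequence.
    Returns the id whose sequence shares the longest subsequence with the query."""
    return max(references, key=lambda w: lcs_length(query, references[w]))

print(lcs_length("ABCBDAB", "BDCABA"))  # prints 4

# Hypothetical quantised feature sequences for two writers.
references = {"w1": [1, 2, 3, 4], "w2": [4, 3, 2, 1]}
print(best_match([1, 2, 4], references))  # prints w1
```

Because the subsequence need not be contiguous, this measure tolerates insertions and deletions between two feature sequences while still rewarding symbols that appear in the same order.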

Table 1 summarises the Writer Identification Methods on Multiple Languages.

Table 1. Writer Identification Methods on Multiple Languages

| Authors | Sample space | Features | Classifier | Accuracy | Language |
|---|---|---|---|---|---|
| Text-dependent | | | | | |
| Srihari et al. [19, 59] | 1000 writers (CEDAR letter / paragraph / word) | Two levels of features: macro level and micro level | Multi-layer perceptron | | |
| Zois et al. [18] | 50 writers (45 samples of the same word) | Horizontal projection profiles resampled into 10 segments and processed using morphological operators | Bayesian classifiers and neural networks | 95% for both English and Greek | English and Greek |
| Tomai et al. [25] | 1000 writers | Character level, word level features | Euclidean distances | | |
| Zuo et al. [60] | 40 writers | Offline PCA based method | Squared Euclidean distance | | |
| Zhang et al. [22] | 1000 writers | Gradient (192 bits), structural (192 bits), and concavity (128 bits) features | | | |
| Somaya Al-Ma'adeed et al. [44] | 100 writers (320 words (16 different types)) | Height, area, length and edge-direction distribution | WED classifier | Top-10: 90% | |
| Schlapbach et al. [8] | 200 writers (8 paragraphs of about 8 lines) | Point-based (speed, acceleration, vicinity linearity, vicinity slope), stroke-based (duration, time to next stroke, number of points, number of up strokes, etc.) | Gaussian mixture model (GMM) | | |
| Pitak et al. [61] | 81 writers | Velocities of the barycenter of the pen movements | Fourier transformation approach | | |
| Schlapbach et al. [62] | 100 writers | X-Y coordinates | Hidden Markov Models | | |
| Said et al. [15], T. Tan [16], Y. Zhu [17] | Two sets of 20 writers, 25 samples per writer (few lines of handwritten text) | Texture features using multichannel Gabor filtering and gray-scale co-occurrence matrices | Nearest centroid classification using weighted Euclidean distance | | |
| Bensefia et al. [26], [27], [28], [29] | 88 writers (French), 150 writers (English) | A textual based Information Retrieval model, local features such as graphemes extracted from the segmentation of cursive handwriting | Cosine similarity | 95% on 88 writers, 86% on 150 writers | |
| S. K. Chan [62] | 82 writers | x-y coordinates, direction, curvature of x-coordinates and the status of pen up or pen down | Discrete character prototype distribution approach (Euclidean distance) | | |
| Marti et al. [30] and Hertel and Bunke [31] | 20 writers (5 samples of the same text) | Height of the three main writing zones, the distances between connected components | A k-nearest neighbor and a feed forward neural network classifier | | |
| M. Bulacu [46], [63], [64], [65] | 650 writers | Edge-based directional PDFs as features (textural and allograph prototype approach) | neighbor and a feed forward neural network classifier | | |
| Guo Xian Tan Christian [66] | 120 writers | Continuous character prototype distribution approach | distance classifier | | |
| Niels et al. [67] | 43 writers | Allograph prototype matching approach using the dynamic time warping (DTW) distance function | af-iwf (allograph frequency - inverse writer frequency) measure | | |
| B. Helli et al. [45], [53], [54] | 100 writers (PD100 dataset), 50 writers [46] | Point-based (speed, acceleration, vicinity linearity, vicinity slope), stroke-based (duration, time to next stroke, number of points, number of up strokes, etc.) | They proposed an LCS (longest common subsequence) based classifier | | |
| Bangy Li et al. [10] | 242 writers (NLPR online handwriting database and 50 Chinese and English words in one page) | Structure in primitives + dynamic and static features | Nearest neighbor classifier | | English and Chinese text |
| Yan et al. [11] | 200 handwritings from 100 writers | Spectral feature based on Fast Fourier Transformation | Euclidean and WED classifiers | -top 10 | |