Improvements for scalable and accurate plagiarism detection in digital documents

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

1.0 Motivation

The steady increase in accessibility to digital documents via the internet has led to a perplexing problem concerning intellectual property security. While document authors want their original work to be available to others online, many are reluctant to expose their works to plagiarism, which has become easier with the digitization of documents and the availability of duplication tools. Strategies for detecting plagiarism are useful not only for authors wishing to protect their intellectual property, but also for course instructors hoping to ensure the integrity of assignments, authors attempting to eliminate multiple versions of the same document, and publishers trying to identify recycled works.

Varying definitions of plagiarism exist; for the purposes of this research, plagiarism will Easy to Detect Exact Document Copy be defined within the limits described in Paragraph Copy

1. The most widely acknowledged form is exact Sentence Copy document copy. If a document is an exact copy, Single-Word Changes flagging plagiarism is easy; however, more Sentence Structure Changes examination may be required to identify the Difficult to Detect whole document as copied. Other common of sophistication sentence copy, in which the copied units may be interspersed with other writing. More subtle forms of plagiarism, which are often difficult to detect and therefore neglected by copy detection mechanisms, include single-word changesin which entire sentences may be copied, but synonyms are substituted to prevent an exact matchand sentence structure changeswhere the majority of the words in a copied sentence are preserved, but the grammatical ordering may be altered.

Copy detection research attempts to define strategies to identify zero false-negative instances of plagiarism (in other words, to catch all plagiarism cases), and to minimize the number of false-positive instances (in other words, limit the number of cases needlessly requiring human inspection). Effective detection mechanisms must do this for plagiarism instances along the entire spectrum defined above: from exact document copy to word changes within single sentences. In addition, since the comparison space of online documents is growing rapidly, efficient computational strategies are a priority; both accuracy and performance must be optimized.

This research aimed to address two problems in improving plagiarism detection. First, with the huge number of original documents available for comparison, the corpus must be reduced to a manageable number of documents for comparison. Optimally, the subset selection strategy should predict which documents contain plagiarism instances, in order to catch all instances, while avoiding wasteful search. Two strategies for corpus subset selection are defined and compared.

The second question to be answered involves the strategy used for detecting plagiarism when comparing the test document against the narrowed group of target documents. This research proposes a similarity metric as efficient as existing metrics, but with improved ability to detect minor word changes. The proposed metric is evaluated against the cosine similarity metric.

Finally, an interface has been developed which allows the user to load a test document for comparison, and adjust threshold levels for performance and accuracy tradeoffs. For human decision-making, the interface displays portions of text suspected of plagiarism.

2.0 Related Work

Several algorithms for plagiarism detection have been proposed. One line of thinking emphasizes sequence, in which units, or chunks, of a test document are matched pair-wise against an original. Much testing has been done to determine optimal chunking granularity, from entire document sections, to paragraphs, to sentences; more subtle copy instances can be detected at finer granularities, but in return many more comparisons are required. Documents can even be chunked into somewhat random phrases using hashed-breakpoint chunking [6]. However, strategies like COPS, a system that utilizes various chunking strategies for exact comparison, cannot detect single-word changing [1]. The culling of uninteresting chunks, as in [2], can reduce comparison space. Another matching strategy involves the construction of structure-independent suffix trees [4].

A different approach emphasizes word frequency; the idea of chunking by words, in which relative word frequencies evaluate document similarity, is proposed by the SCAM mechanism [5]. The relative frequency concept is improved upon with the CHECK system, which recursively rejects dissimilar portions at refined granularities in an attempt to reduce the number of required comparisons [7]. CHECK is limited, however, by the assumption that plagiarism only occurs between documents of similar content. The techniques of using stoplists (very common words are not included in frequency lists) and stemming (words are converted to root form) can improve the relative frequency strategy by reducing the total number of words considered [3].

3.0 Plagiarism Detection Process

Because no open-source plagiarism tool was available for experimentation, a tool had to be constructed from scratch. For background, the overall plagiarism detection process is described here. As shown in 2, the process is composed of two main parts: 1) corpus building, and 2) test document comparison against the corpus. Corpus building occurs only once, at the instantiation of the system. First, the WebCrawler module searches the Citeseer website for documents in PDF format. Those documents are then converted to text units. Next, preprocessing occurs: common words are removed, words are stemmed to root words, and a word list is created, containing data about the frequency and number of occurrences of each word in the document. Word list data is then saved to a corpus repository. At this time, hierarchical clustering of documents is also performed, and information about the resulting dendogram is also stored.

Test document comparison is initiated within a user interface, where the test document is loaded and a selection is made about the time allowed for search, as shown in 3. This search time option allows the user to evaluate the tradeoff between time available for search and desired level of accuracy. Document comparison against the stored corpus begins by converting the test document to a word list using the same preprocessing strategy as described above. Then, based on the word attributes of the test document, a subset of the corpus is selected against which to compare the test document (the strategies for corpus subset selection are described in Section Four). The test document is then compared against each corpus document in the subset. Comparison follows a method similar to that used by the CHECK system [7]; if the computed similarity for two documents is above a specified threshold, the documents are recursively divided into smaller portions, and the portions are compared for similarity. This continues until the portions are sentence-length; two sentences with similarity above a specified threshold are deemed a match. This method makes the assumption that plagiarism occurs most often among texts with similar terminology (subject matter), but allows for valuable time-saving by rejecting dissimilar portions of text. Finally, as in 4, matches are reported to the user interface, which shows the alleged match in document context for human inspection.

4.0 Corpus Subset Selection

Two methods of corpus subset selection were explored. In the first method, agglomerative clustering was performed during corpus building on all documents within the corpus. In order to reduce computational complexity, summary information or centroids of clusters were used instead of single linkage or complex linkage. Cluster information was then maintained for three levels of the tree: high (a few large clusters, each requiring long search time), medium, and low (several small clusters, each requiring short search time), as shown in 5. During test document comparison, the cluster most similar to the test document was identified, at the level determined based on user specification for allowed search time.

High med low

The second method of corpus subset selection involved prioritization of documents based on similarity to the test document ( 6). All documents in the corpus were ranked, then searched in order of similarity. The number of documents searched was based on the users specification for allowed search time.

Test doc

These two methods were compared against a third, brute force approach, in which documents in the entire corpus were examined in random order. Again, the number of documents searched was based on the users specification for allowed search time.

An experiment was conducted to compare the accuracy and performance of each corpus subset selection method. A mini-corpus of 31 documents was used (attempts to build a larger corpus resulted in memory errors). Seven plagiarized test documents, which were exact copies of corpus documents, and seven innocent test documents. For time allotments from 5,000ms to 30,000ms, each of the three subset selection methods was used. The number of documents found by the plagiarism detection algorithm were counted, and the results are shown in 7.

Because of the initial computation required of the clustering and prioritization methods, the brute force method identifies matches most quickly for very short search times. However, once search times are increased to allow the clustering method to identify the correct cluster, the clustering method proves to be the best method for corpus reduction. The prioritization method is not as effective as the clustering method, primarily because of the extensive time required to rank all documents.

5.0 Similarity Metric

Classification of plagiarism instances requires that two textual excerpts meet an assigned minimum similarity threshold. Similarity can be measured in several ways; a quality metric should be robust enough to detect as plagiarism sentences with minor word reordering or substitutions, in addition to exact sentence matches. Two similarity metrics were evaluated for performance and accuracy. The first was a simple, on-the-fly metric generated by the authors, measured as the average reciprocals of differences in normalized word frequencies. The first metric in 8 shows the corresponding equation. The Simple metric was evaluated against the Cosine Similarity metric, as given by the second equation in 8.


Sim(Di, Dj) =1 dik1 .djknk=01+

t (dik.qk) CosSim(Di,Q) =




dik.qkk=1 k=1

In order to compare the two similarity metrics, ten innocent (non-plagiarized) pairs of sentences and ten plagiarized pairs of sentences were classified by each metric (see 9 for sentences used). Although the metrics were normalized to provide similarity values between zero and one, the different equations can yield difference values for the same sentence pairs. Therefore, plagiarism classification for each metric was performed using several thresholds (increments of 0.1 from zero to one), and the best performing (most accurately-classifying) threshold for each was used for the comparison. The Simple metric performed best with a threshold of 0.05, while the Cosine metric performed best with a threshold of 0.55. Confusion matrices were generated for each metric at its optimal threshold, and an ROC chart for each was generated, as shown in 10. As the shows, the Simple similarity metric classified correctly more often than the Cosine similarity metric under their respective optimal thresholds. Difference in computation time between the two metrics was negligible, which was as expected, since the computational complexity of each is O(n).

Innocent Sentence Pairs

1a.) "I love her." 1b.) "I like him."

2a.) "The dog ran to the big fat store." 2b.) "Did you see the dog running to the big store?"

3a.) "The shows the result."3b.) " 4 shows the result."

4a.) "The dog ran to the big fat store." 4b.) "The dog ran."

5a.) "See the top 10 most fuel efficient cars." 5b.) "Did you see the car running at 10 mph?"

6a.) "I saw a man in the room."6b.) "Did you see the room?"

7a.) "You have a confirmed reservation." 7b.) "He reserved it."

8a.) "Did you find him?" 8b.) "You did it to him."

9a.) "If you did not submit a change please call our customer service center." 9b.) "If you call our customer service center, please submit a change."

10a.) "Can I take your order?"10b.) "I ordered your can."

Plagiarized Sentence Pairs

1a.) "The web has greatly improved access to scientific literature." 1b.) "Access to scientific literature has been greatly improved by the web."

2a.) "However, scientific articles on the web are largely disorganized, with research articles being spread across archive sites, institution sites, journal sites, and researcher homepages." 2b.) "However, scientific articles on the web are largely disorganized, with research articles being spread across institution sites,researcher homepages, journal sites, and archive sites."

3a.) "We study long-term coalitions of customer and vendor software agents based on trust relationships among the agents."3b.) "We explore long-term coalitions of customer and vendor software agents considering trust relationships."

4a.) "There are only a few studies on coalition formation for the electronic marketplace." 4b.) "There are a few researches on coalition formation for the electronic commerce."

5a.) "A referral system is a multiagent system whose member agents are capable of giving and following referrals."5b.) "A referral systems are multiagent systems whose members are capable of giving and following referrals."

6a.) "Agent-based systems require flexibility to perform effectively in complex and dynamic environments." 6b.) "Agent-based systems requires flexibility to perform effectively in dynamic and complex situations."

7a.) "The second method of corpus subset selection involved prioritization of documents based on similarity to the test documents."7b.) "The third method of subset selection involves prioritization of documents based on similarity."

8a.) "As she approached me on the path, she looked about 9 years old, all dressed up in her Sunday best, and her freshly scrubbed face, just gleaming with cutsiness." 8b.) "As he approached me on the path, he looked about 11 years old, all dressed up in his Sunday best, and hisfreshly scrubbed face, just gleaming with cutsiness."

9a.) "The Information Service Agents Lab (ISA Lab) is a software research and development laboratory located atSimon Fraser University in Burnaby, BC, Canada." 9b.) "The Intelligent Processes and Systems Lab (LIPS Lab) is a software research and development laboratory located at University of Texas, Austin, Texas."

10a.) "Microsoft Agent can be a powerful extension and enhancement of the existing interactive modalities of theMicrosoft Windows interface."10b.) "Lips sensible agent can be a powerful extension and enhancement of the existing interactive modalities of anyinterface."

11a.) Receiver Operating Characteristic Curve Cosine Similarity Simple Similarity 1-specificity.

6.0 Conclusions

The goals of this project were to 1) assess corpus reduction methods for accuracy and performance, 2) assess similarity metrics for accuracy and performance, and 3) create a tool for detecting plagiarism in documents. This research has shown that the proposed clustering method for corpus subset selection significantly reduces computation time as compared to existing algorithms, which perform no corpus reduction. While clustering of corpus documents requires extensive computational overhead, that computation is an ahead-of-time procedure which occurs before documents are tested. The clustering algorithm used for this project also required a large amount of memory; however, techniques to be explored in the future, such as word list indexing, could greatly reduce the memory needed for clustering data. The limitation in corpus size meant the corpus reduction algorithms could not be tested on a larger, more realistic corpus; the results obtained in these experiments might not be consistent with a larger corpus.

The Simple similarity metric developed by the authors empirically shows comparable performance to the Cosine similarity metric. In addition, it requires less computation from implementation point of view, although both show the same computational complexity of O(n). For the sample sentence pairs tested, the Simple metric classified more accurately. However, a disadvantage of this metric is that it has one dimensional interpretation of similarity, since it averages all scalar values of word frequency in each dimension. Also, the theoretical implications of the Simple metric have not been completely evaluated. In addition, this experiment was somewhat subjective, in that the authors relied upon their own judgment in labeling sentence pairs as innocent or plagiarism. To provide conclusive evidence of better accuracy, more example sentence pairs should be tested.

Plagiarism detection poses numerous problems which, due to this projects time limitations, we were unable to explore. The plagiarism detection algorithm was limited to a recursive word frequency comparison strategy similar to that used by CHECK, which assumes that plagiarism only occurs in documents with similar topics and vocabularies. This assumption is necessary for any kind of corpus subset selection. By reducing the number of corpus documents examined, time saved can be spent on more detailed comparison of individual documents. Future work would include exploration of metrics other than just word frequency, such as grammatical patterns and word locations.

While the improvements proposed in this project allow for more scalable plagiarism detection, the time-accuracy tradeoff cannot be ignored. It is impossible to capture every single case of plagiarism when performing corpus subset selection, no matter how good the selection method or similarity metrics utilized. One-hundred-percent accuracy can only be obtained by searching every sentence of every document in the entire corpus, which is an impossible task as the corpus of available digital documents continues to grow. In addition, while the tool developed in this project can identify potential plagiarism instances, human judgment is still required to provide a definitive classification.

7.0 References

[1] Brin, S., Davis, J., and Garcia-Molina, H. Copy Detection Mechanisms for Digital Documents. In Proceedings of ACM SIGMOD Annual Conference, San Jose, CA, May 1995.

[2] Finkel, R., Zaslavsky, A., Monostori, K., and Schmidt, H. Signature Extraction for Overlap Detection in Documents. In Proceedings of Australasian Computer Science Conference, 2002.

[3] Hoad, T., and Zobel, J. Methods for Identifying Versioned and Plagiarised Documents. Journal of the American Society for Information Science and Technology. To appear.

[4] Monostori, K., Zaslavsky, A., and Bia, A. Using the MatchDetectReveal System for Comparative Analysis of Texts. In Proceedings of Australasian Document Computing Symposium, 2001.

[5] Shivakumar, N. and Garcia-Molina, H. SCAM: A Copy Detection Mechanisms for Digital Documents. In Proceedings of International Conference on Theory and Practice of Digital Libraries, Austin, Texas, June 1995.

[6] Shivakumar, N. and Garcia-Molina, H. Building a Scalable and Accurate Copy Detection Mechanism. In Proceedings of ACM Conference on Digital Libraries, Bethesda, Maryland, March 1996.

[7] Si, A., Leong, H., and Lau, R. CHECK: A Document Plagiarism Detection System. In Proceedings of ACM Symposium for Applied Computing, pp.70-77, Feb. 1997.