Mixed Method Approach For Efficient Component Retrieval Computer Science Essay

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

A continuing challenge for software designers is to develop efficient and cost-effective software implementations. Many see software reuse as a potential solution; however, the cost of reuse tends to outweigh the potential benefits. The costs of software reuse include establishing and maintaining a library of reusable components, searching for applicable components to be reused in a design, as well as adapting components toward a proper implementation. In this context, a new method is suggested here for component classification and retrieval which consists of K-nearest Neighbor (KNN) algorithm and Vector space Model Approach. We found that this new approach gives a higher accuracy and precision in component selection and retrieval process compared to the existing formal approaches.

Keywords: Software reuse, Component Retrieval, Vector Space Model algorithm


Many software organizations realized that developing the software using reusable components could dramatically reduce development effort, cost and accelerate delivery. But the non-existence of a standard searching technique for finding the suitable component and also the lack of appropriate tool in this field contributed towards in large-scale failures in their approach. From the past studies on this field, it is found that researchers are tried with different approaches to improve the adaptability of the component but very few studied had taken place in improving the efficiency of component retrieval. Fuzzy linguistic approach is familiar in the information retrieval process [5]. In this paper we have used an algebraic model namely Vector Space model in which text documents are represented as vectors of identifiers, such as, index terms which is used in information filtering, information retrieval, indexing and relevancy rankings along with K-Nearest Neighbor(KNN) algorithm for classification of documents. The paper is structured as follows. Section 1 starts with a discussion of what is meant by software reuse and a reusable component. Section 2 talks about present scenario in the component retrieval process. Section 3 presents methodology adopted. Section 4 mentions some concluding remarks and the relevance of related models which can be extended from the vector space model and various combination algorithms which can contribute towards further extension of this work are pointed out.

1.1 What Is Software Reuse?

Software reuse at its most basic level consists of making use of any existing information, component or product when designing and implementing a new system or product.

There are differing opinions as to which activities constitute genuine software reuse. Replication of an entire software program does not count as reuse. Reuse of assets is dependent upon both similarities and differences between the applications in which the component is being used [2].

Many organizations already practice a limited form of reuse, for example, most developers have libraries of components that they have developed in previous projects, or they use standard libraries, which are available with many programming languages [5]. About 30% of the cases, it is a very ad-hoc method of reuse, and it will work very well on a small scale and it will not be suitable for entire organizations [4]. Instead, businesses need to implement a systematic reuse program in order to gain the full advantages of reuse.

1.2 What Can Be Reused?

The definition of a reusable component is "any component that is specifically developed to be used, and is actually used, in more than one context" [4]. This does not just include code; other products from the system lifecycle can also be reused, such as specifications and designs, and even requirements on occasion [3]. 'Components' in this case can be taken to include all potentially reusable products of the system lifecycle, including code, documentation, design, requirements etc.

There are various criteria that should be satisfied in order for an asset to be successfully reusable. These are grouped into General, Functional and Technical requirements [13]. General requirements focus on aspects such as compliance with relevant standards, completeness, modularity and simplicity. All components should conform to the General requirements. Functional requirements include such concerns as which business processes it will simulate or automate, and how well it does this. Functional requirements mainly concern Vertical or Domain-specific assets and tend to be very specific to each information domain. Lastly, Technical requirements refer to criteria such as interoperability, portability, communication, security etc [2].

There are different levels of reuse, which can be considered [3]. At the highest level, entire applications can be reused on different platforms provided they are portable. Sub-systems can be reused within different applications, possibly within different domains. Reusable assets can be also being built in-house, retrieved from legacy systems or can be bought from an external source [4].

2. Present scenario in the component retrieval process

Existing approaches to software component retrieval process cover a wide spectrum of component encoding methods and search or matching algorithms. The encoding methods differ with respect to their soundness, completeness, and the extent to which they support an estimate of the effort it takes to modify a component. Text-based encoding and retrieval is neither sound nor complete. Its disadvantages have been thoroughly in the information retrieval literature [11][13]. Lexical descriptor-based encoding approach also suffers from a number of problems about developing and using classification vocabulary [9]. Software specific challenges include the fact that one-word or one-phrase abstractions are hard to come by in the software domain [12]. From the user's point of view, lack of familiarity with the vocabulary is also pointed out as draw back in using a component retrieval system effectively [14]. In this context Vector space Model will be a promising solution for component retrieval process [10][15].

3. Methods Used

3.1. Vector Space Model Approach

It is an algebraic model in which documents and queries are represented as vectors [16] as follows:

dj = (w1,j,w2,j,...,wt,j)

q = (w1,q,w2,q,...,wt,q)

Each dimension corresponds to a separate term. If a term occurs in the document, its value in the vector is non-zero. An indexed collection of documents is represented as a term table which has documents as fields and words as primary key for row. The (D)i(Word)j-th entry of this table records how many times the j-th search term appeared in the i-th document.

Figure 1. A Sample Vector Space Model

The first major component of a vector space search model is the concept of a term space. A term space consists of every unique word that appears in a collection of documents. The second major component of a vector space search model is term counts. Term counts are simply records of how many times each term occurs in an individual document. This is represented as a table. By using the term space as a coordinate space, and the term counts as coordinates within that space, we can create a vector for each document. As the number of terms increases, the dimensionality of VSM also increases.

Figure 2. Structure of Word database

Figure 3. Structure of Rank table

For these words documents and corresponding ranks will be stored in the rank table.

Based on the ranking terms are compared as 'ranked higher than', 'ranked lower than' or 'ranked equal to' the second, making it possible to evaluate complex information according to query criteria. Here search vector space search model ranks the documents it finds according to the estimation of their relevance, making it possible for the user quickly to select the components according to their requirements[9] [15].

Relevancy rankings of documents in a keyword search can be calculated, using the assumptions of document similarities theory, by comparing the deviation of angles between each document vector and the original query vector where the query is represented as same kind of vector as the documents.

It is easier to calculate the cosine of the angle between the vectors instead of the angle:

A cosine value of zero means that the query and document vector are orthogonal and have no match (i.e. the query term does not exist in the document being considered).

3.2 The Document Classification Algorithm Employed

KNN classifier is an instance-based learning algorithm that is based on a distance function for pairs of observations, such as the Euclidean distance or Cosine. The k-Nearest Neighbour (kNN) classifier algorithm has been studied extensively for text categorization by Yang and Liu [21]. In this classification paradigm, k nearest neighbors of a training data are computed first. Then the similarities of one sample from testing data to the k nearest neighbors are aggregated according to the class of the neighbors, and the testing sample is assigned to the most similar class. The similarity in score of each neighbour document to the test document is used as the weight of the categories of the neighbour document [19]. If there are several training documents in the k nearest neighbour, which share a category, the category gets a higher weight. In this work, we used the Cosine distance to calculate the similarity score for the document representation.

One of advantages of KNN is that it is well suited for multi-modal classes as its classification decision is based on a small neighborhood of similar objects (i.e., the major class). So, even if the target class is multi-modal (i.e., consists of objects whose independent variables have different characteristics for different subsets), it can still lead to good accuracy. A major drawback of the similarity measure used in KNN is that it uses all features equally in computing similarities. This can lead to poor similarity measures and classification errors, when only a small subset of the features is useful for classification [22].

3.1.1 Steps for KNN using average cosine

Step 1: Select k nearest training documents, where the similarity is measured by the cosine between a given testing document and a training document.

Step 2: Using cosine values of k nearest neighbors and frequency of documents of each class i in k nearest neighbors, compute average cosine value for each class i, Avg_Cosine (i).

Step 3: Classify the testing document a class label which has largest average cosine.

In order to reduce the dimensionality of VSM and keep useful information, we first compute concept vectors for given categories. Then, using the concept vectors as projection matrix, projection of both training and testing data is done. Finally, we apply KNN algorithm on the projected VSM model that has reduced dimensionality.

3.1.2 Steps of Combined Approach for Vector based algorithm and K-Nearest Neighbor algorithm

Step 1: Compute a concept vector for each category using true label information of training documents and then construct concept vector matrix C (w-by-c), where c is the number of categories.

Step 2: Do projection of VSM model A (w-by-d) using concept vector matrix C(w-by-c) (i.e., C^T * A )

Step 3: Apply KNN with the projected VSM model (i.e., c-by-d matrix).


A novel KNN classification algorithm combining model and evidence theory is proposed in this paper. The new method not only overcomes the main shortage of lazy learning in traditional KNN, but also takes the distances between samples to be recognized and samples in k-neighbors into account. At the same time the method resolves the unrecognizable cases of unknown samples. Applying the classification algorithm into the document recognition, experimental results show its satisfied recognition rate and fast categorization speed.

There are also models based on and extending the vector space model such as generalized vector space model, Topic-based Vector Space Model and latent semantic indexing etc and also combination algorithms which consists of clustering, Singular Value Decomposition(SVD)-based Algorithm, Naive Bayesian Algorithm and variations of KNN algorithm. Future work can be aimed in these directions.