A research paper presents the findings of an investigation on a scholarly topic. Researchers present their papers at conferences, and research papers are also published in journals and magazines. Researchers use the findings of other papers as a foundation for their own investigations and acknowledge these sources with citations. Once published, researchers would like to know whether anyone has cited their findings. This is possible, to some extent, with currently available search engines such as Google Scholar and CiteSeerX. It is common for different researchers to work in the same field, completely unaware of each other's existence. It would be tremendously useful if researchers were informed of the latest findings in their area of research: they could supplement each other's work, save the time otherwise spent reinventing the wheel, and advance their area of interest more quickly and easily.
1.1 Brief topic area overview
There are several freely available search engines that help researchers find material relevant to their work, the most popular being Google Scholar, CiteSeerX and CS Structure. Each of these systems has its own pros and cons. For instance, Google Scholar has an excellent repository and a very good mail-alert feature that notifies subscribed users about new work by selected authors or matching selected keywords. However, since it does not categorize papers by area of research, it cannot send notifications based on area of research. CS Structure, on the other hand, has a very good categorization schema, but it does not support notifications.
Our proposal is to combine the mail-alert feature of Google Scholar with the strong categorization schema of CS Structure, providing the user with an efficient search feature along with timely updates. In addition, the proposed search engine would incorporate some level of intelligence and suggest relevant research papers even when they belong to other areas of research. This would enhance the user experience and considerably cut down the time spent searching.
1.2 Research Questions or problems
Before we propose a solution, it is imperative to take a look at some of the concerns or factors that would play a significant role in shaping our solution. These are summarized below:
Data Sources: We need to have a clear indication of the sources that we are going to refer to in order to generate our results. This involves identifying sources which are both popular as well as comprehensive.
Categorizing the text: We need to find an effective approach which would work efficiently for the automatic categorization of a large number of documents.
Classification framework: Deciding the classification structure is critically important. The information should be available in a way in which the user can easily access the data by cognitively anticipating the categories and subcategories.
Size and scalability of the knowledge database: A limit must be fixed for the size of the knowledge database and the number of users that it can service concurrently owing to the hardware overheads and costs of queries. Also, a scalability plan must be devised to ensure that the proposed model can be extended into a larger system.
Keeping the information current: One of the most critical factors for this proposal is to keep the information up to date, which involves ensuring that if any new paper is published in a category, it is appropriately added to the knowledge database and the interested parties are notified of its addition by e-mail or text message.
1.3 Expected contribution/significance to knowledge
Our solution would combine two functionalities which are available today, but not together. While Google Scholar provides detailed search and alert functionality, it does not organize research papers by category, which is a major drawback for serious researchers. CS Structure provides categorization, but its scope is quite limited: it lacks a detailed hierarchical structure that would enable precise searches, offering only a very broad classification based on the subject area of interest. For example, a research paper entitled "Nanorobotic Surgery" would be categorized under "Computer Science", whereas in our proposed solution a hierarchy of research classifications would lead down to "Nanorobotics". This yields more precise search results and avoids presenting the user with papers that do not pertain to their field of interest.
One of the most obvious advantages of this scheme is that researchers would quickly find papers in their area of interest and spend less time browsing papers of little or no interest to them. The dynamic alert functionality makes it even easier to keep track of the most recent progress.
2.1 Previous research
Google Scholar provides the ability to find matching articles for a subject by locating articles that refer to prior published articles. It helps establish links among the authors who cite articles in a specific area, and also identifies patterns in how others cite a specific article. The research conducted involved comparing the citation counts provided by Web of Science and Google Scholar for articles in the field of "Webometrics". Based on the results, the following shortcomings of Google Scholar were highlighted:
There is no subject indexing or classification access: searching is by keyword in the journal title, article title, abstract, or text.
It can also provide an incorrect count of citations, because Google Scholar inherently lacks a classification taxonomy.
CiteSeer, on the other hand, capitalizes on two broad lines of prior research. The first is work done in the areas of web technologies, user interfaces and assistant software agents. The second investigates the semantic difference between text documents so that "agents can simulate a user's concept of document similarity". Citation indexing, which "records published research-citations of and by other publications", is a good example of identifying the semantic difference between text documents. However, there are certain shortcomings which are candidates for future work:
If a new paper is similar enough to a chosen paper of interest, CiteSeer could notify the user of potentially interesting new research by e-mail.
Another direction for future work is the collection of database statistics. For example, the number of times a paper, author, or journal is cited may give some indication of its influence in the academic community.
Web Science is designed on the model of cognitive science. It can be defined as a "science of decentralized information systems", utilizing a plethora of emerging technologies such as the semantic web, ontologies, web services, and web-scale computing. In such a scenario, traditional computer science topics like graph-theoretic models, network structure analyses, and search algorithms become far more important. However, web science advocates make clear that understanding Google's technology and business success requires more than a discussion of web crawling and distributed search algorithms. This approach involves extracting the 'socially embedded' data out of research, broadening the scope to take into consideration factors such as trust, reputation, privacy, governance, copyright, and network communication standards. However, this design is still at a nascent stage and the technique has long been debated. It has the following shortcomings:
Its dependence on the semantic web has slowed its evolution, as there are still open issues around dynamic discovery in the semantic web.
E-mail alerts are still far from being implemented, as there are more fundamental issues to work on first.
Our approach would involve text categorization algorithms, which determine the set of categories to apply to a text. In statistics, logistic regression is used to predict the probability of occurrence of an event. To accomplish this, it makes use of a number of predictor variables, which may be numerical or categorical. For example, the probability that a person has a heart attack within a specified time period might be predicted from the person's age, sex and body mass index. Bayesian inference is a method of statistical inference in which the probability that a hypothesis is true is continually updated with new observations; the term "Bayesian" comes from its use of Bayes' theorem in the calculation. We intend to use Bayesian logistic regression to tackle text categorization, treating research papers as atomic units converted to feature vectors. A feature vector is an n-dimensional vector of numerical features that represents some object; feature vectors are used throughout machine learning and pattern recognition.
Bayesian logistic regression has been widely used to solve a number of problems ranging from sorting email to cancer classification and prediction.
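As a minimal sketch of the approach just described, the following Python fragment converts a handful of made-up paper titles into bag-of-words feature vectors and fits a two-class logistic regression by gradient descent with an L1 penalty, the optimization counterpart of the Laplace prior. The corpus, class labels and parameter values are illustrative assumptions, not data from the proposed repository.

```python
import math

def featurize(docs):
    """Build a shared vocabulary and bag-of-words feature vectors."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for d in docs:
        v = [0.0] * len(vocab)
        for w in d.lower().split():
            v[index[w]] += 1.0
        vectors.append(v)
    return vocab, vectors

def train_logistic_l1(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit logistic regression by stochastic gradient descent with an
    L1 penalty -- the MAP estimate under a Laplace prior on the weights."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))
            for j, xj in enumerate(xi):
                # Gradient of the log-loss plus the L1 subgradient.
                l1 = lam * (1.0 if w[j] > 0 else -1.0 if w[j] < 0 else 0.0)
                w[j] -= lr * ((p - yi) * xj + l1)
    return w

def predict(w, x):
    """Class 1 if the decision function is positive, else class 0."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) > 0 else 0

# Toy corpus (made-up titles): class 1 = robotics, class 0 = databases.
docs = ["nanorobotic surgery robots", "robot motion planning",
        "sql query optimization", "relational database indexing"]
labels = [1, 1, 0, 0]
vocab, X = featurize(docs)
weights = train_logistic_l1(X, labels)
```

A production system would use the optimized BBR/BMR implementations cited later rather than this didactic loop, but the sparsity-inducing effect of the L1 term is the same idea at both scales.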
2.2 Unanswered questions
Bayesian logistic regression addresses the difficulties of analyzing high-dimensional data such as natural language text, which poses computational and statistical challenges. The Bayesian approach can efficiently categorize text by applying a Laplace prior to produce sparse predictive models for text data. However, using the lasso as a feature-selection algorithm has one limitation: scalability. Scaling the algorithm to a larger application remains a challenge, so the size of the repository of research papers that this system can handle effectively is an open question.
The process of building a system that meets the requirements stated above can be divided into three levels. The first level is the knowledge-gathering process, which involves identifying our data sources. The second is developing the classification schema for our repository. The final level is creating the interface, through which users can register their e-mail addresses to receive notifications of events.
From the tasks mentioned above, we can formulate a four-step approach as follows:
Knowledge gathering: We need to build a repository of articles (papers, journals, patents etc.) from amongst a selected set of trusted sources of information.
Classification schema: Once we have our data repository set up, we need a means to classify it. This can be done using WEKA, a popular data-mining tool.
Interface creation: A means for the user to browse through the various artefacts which have been classified appropriately to facilitate the search process.
Mail alerts: Provide the users with the ability to subscribe to e-mail alerts which would notify them of any new events.
3.2 Data Sources
Data collection is the first step discussed in the previous section. We need to identify sources and create a repository based on information mined from these sources. The sources of information can be identified as follows:
Since these two conferences are among the most popular and respected in the industry today, we can use them as data sources for the tool we propose to design. All collected data would be stored in a data store, which could range from a full-fledged database to a CSV file depending on the application and the number of users.
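As a sketch of the simplest end of that range, the repository could start life as a CSV file managed with Python's standard csv module. The field names and the sample record below are illustrative assumptions, not a fixed schema.

```python
import csv
import io

# Assumed record schema for the paper repository (illustrative only).
FIELDS = ["title", "authors", "year", "category"]

def save_repository(records, fp):
    """Write repository records to a CSV data store."""
    writer = csv.DictWriter(fp, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)

def load_repository(fp):
    """Read records back from the CSV data store."""
    return list(csv.DictReader(fp))

# Round-trip through an in-memory buffer instead of a file on disk.
papers = [
    {"title": "Nanorobotic Surgery", "authors": "A. Author",
     "year": "2007", "category": "Nanorobotics"},
]
buf = io.StringIO()
save_repository(papers, buf)
buf.seek(0)
restored = load_repository(buf)
```

Swapping the CSV backend for a relational database later would only require reimplementing these two functions, which is the point of keeping the storage choice behind a narrow interface.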
3.3 Plan for interpreting results
a) Once data has been collected, it needs to be interpreted by applying a classification algorithm and categorizing the data based on its informational value. With WEKA, we can import the data set into the WEKA Explorer and apply a classification algorithm from among a variety of choices, such as:
LVQ (Learning Vector Quantization)
SOM (Self organizing Map)
AIRS (Artificial Immune Recognition System)
CLONALG (Clonal Selection Algorithm)
Among these algorithms, LVQ is the best suited to pattern classification, which is our objective. According to Morteza, "The network has three layers: an input layer, a Kohonen classification layer, and a competitive output layer. The network is given by prototypes W = (w(i),...,w(n)). It changes the weights of the network in order to classify the data correctly. For each data point, the prototype (neuron) that is closest to it is determined (called the winner neuron). The weights of the connections to this neuron are then adapted, i.e. made closer if it correctly classifies the data point or made less similar if it incorrectly classifies it". LVQ creates prototypes that are easy for users to understand.
LVQ can be very effective in classifying text documents, and WEKA has a plug-in for the LVQ algorithm. Hence we will classify the dataset into categories and subcategories using LVQ.
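To make the update rule quoted above concrete, here is a bare-bones LVQ1 sketch in Python: the winning prototype is pulled toward a correctly classified point and pushed away from a misclassified one. The two-dimensional points, labels and learning rate are made up for the example; the actual system would run WEKA's LVQ plug-in on high-dimensional text features.

```python
import math

def nearest(prototypes, x):
    """Index of the winner neuron: the prototype closest to x."""
    dists = [math.dist(p["w"], x) for p in prototypes]
    return dists.index(min(dists))

def lvq1_step(prototypes, x, label, lr=0.2):
    """One LVQ1 update: attract the winner if labels match, repel otherwise."""
    k = nearest(prototypes, x)
    p = prototypes[k]
    sign = 1.0 if p["label"] == label else -1.0
    p["w"] = [wi + sign * lr * (xi - wi) for wi, xi in zip(p["w"], x)]

def train(prototypes, data, epochs=30):
    """Repeatedly apply the LVQ1 step over the labelled data set."""
    for _ in range(epochs):
        for x, label in data:
            lvq1_step(prototypes, x, label)
    return prototypes

def classify(prototypes, x):
    """Assign x the label of its nearest prototype."""
    return prototypes[nearest(prototypes, x)]["label"]

# Toy example: one prototype per class, two small clusters of points.
prototypes = [{"w": [0.0, 0.0], "label": "A"},
              {"w": [1.0, 1.0], "label": "B"}]
data = [([0.1, 0.2], "A"), ([0.2, 0.1], "A"),
        ([0.9, 0.8], "B"), ([0.8, 0.9], "B")]
train(prototypes, data)
```

After training, each prototype sits near the centre of its cluster, which is what makes LVQ prototypes "easy to understand": each one is a representative example of its category.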
b) After the data has been categorized, the next step is interface creation, so that the user can view the data in a lean and efficient way. This can be achieved by creating a simple web interface with WEKA functionality embedded in it. The GUI Chooser (WEKA's graphical starting point) is a very good interfacing tool for displaying results to the user. It gives the user the opportunity to conduct two types of searches: query-based and graphics-based. The techniques that handle each are explained below; Figures 1 and 2 provide a representation of the two approaches.
SQL viewer: allows user-entered SQL to be run against a database and the results previewed. This user interface is also used in the Explorer to extract data from a database when the "Open DB" button is pressed.
Bayes network editor: provides a graphical environment for constructing, editing and visualizing Bayesian network classifiers.
Fig 1. The SQL viewer 
Fig 2. The Bayesian network editor 
The expected end result is that users can easily access and search the items in the repository. As explained above, searching can be done through either the SQL-based or the graphical approach.
For the e-mail alerts, we can embed the WEKA tool in a Java-based application that lets the user select favourite categories and register for alerts on those categories. Once the user is registered, the application will alert the user by mail whenever a new event occurs (entries being added to or removed from the dataset). Figure 3 shows a mock-up of how the proposed system would look once implemented, with both the organized search results and the option to sign up for mail alerts.
Fig 3. A mock-up of the proposed system
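A minimal sketch of the subscription logic behind the mail alerts follows. The class and method names are illustrative assumptions, and a real implementation would deliver mail via SMTP rather than collect messages in a list.

```python
class AlertService:
    """Tracks which users follow which categories and queues notifications."""

    def __init__(self):
        self.subscribers = {}   # category -> set of e-mail addresses
        self.outbox = []        # (address, message) pairs awaiting delivery

    def subscribe(self, email, category):
        """Register a user for alerts on one category."""
        self.subscribers.setdefault(category, set()).add(email)

    def paper_added(self, title, category):
        """Notify every subscriber of the category when a new paper appears."""
        for email in sorted(self.subscribers.get(category, ())):
            self.outbox.append((email, f"New paper in {category}: {title}"))

# Illustrative use: one subscriber, one new paper in their category.
svc = AlertService()
svc.subscribe("alice@example.org", "Nanorobotics")
svc.paper_added("Nanorobotic Surgery", "Nanorobotics")
```

Because notifications are keyed by category rather than by author or keyword, this is exactly the alert granularity that the proposal argues Google Scholar lacks.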
Schedule and Budget
Proposed Schedule (days)
Initial research and setup: purchase required software licences and hardware, and recruit an appropriate project team.
Build a data store: involves repository creation and organization.
Set up and configure WEKA: configure WEKA to use the proposed classification algorithm (LVQ).
Create the GUI using Java and integrate the functionality provided by WEKA.
Performance and regression testing for 10,000 simultaneous users: must be performed frequently during the initial stages of launch for optimal results.
Proposed amounts are volatile.
Table 1. Proposed Schedule and Budget
Primary Research paper
Authors: Alexander Genkin, David Lewis and David Madigan
Title: Large-scale Bayesian logistic regression for text categorization
ISSN: 0040-1706
Pages: 291 - 304
Logistic regression analysis of high-dimensional data, such as natural language text, poses computational and statistical challenges. Maximum likelihood estimation often fails in these applications. The paper presents a simple Bayesian logistic regression approach that uses a Laplace prior to avoid overfitting and produces sparse predictive models for text data. The authors apply this approach to a range of document classification problems and show that it produces compact predictive models at least as effective as those produced by support vector machine classifiers or by ridge logistic regression combined with feature selection. They describe their model-fitting algorithm, their open-source implementations (BBR and BMR), and experimental results.
This paper describes an application of Bayesian logistic regression to text categorization. In particular, it examines so-called "sparse" Bayesian models that simultaneously select features and provide shrinkage, presents an optimization algorithm for efficiently fitting these models with tens of thousands of predictors, and provides empirical evidence that they give good predictive performance while offering significant computational advantages.
Number of citations according to Google Scholar
According to Google Scholar, this paper has 166 citations.
Conference or journal
The paper 'Large-scale Bayesian logistic regression for text categorization' by Alexander Genkin, David Lewis and David Madigan is a peer-reviewed article published in the journal Technometrics in August 2007.
Primary or secondary
The paper presents an approach that uses the Laplace prior to produce predictive models for text data. It also describes the application of Bayesian logistic regression to text categorization and presents an optimization algorithm for efficiently fitting these models. Hence, this paper is a 'primary' (or 'field') research paper, as opposed to a 'secondary' (or 'desk') research paper.