Data Mining News Headlines Classification Tool Computer Science Essay



Headline Classification Tool is a web application that classifies news headlines into different categories. Many existing news classification systems rely on the entire article content to classify a news article into a specific category. We intend to build an automatic classification and analysis tool that uses only the content of the news headline for classification. Headlines from different news sources will be classified into various pre-defined categories using supervised learning algorithms. We also intend to evaluate the effectiveness of our tool by measuring its accuracy.

The Headline Classification Tool classifies headlines into different categories. To perform this task, we first use a training data set covering four classes: technology, health, sports, and politics. J48 and Naive Bayes are the two supervised learning algorithms used to build the classifier model.

After the model is trained using either of these algorithms, another set of data is used to test the model and generate the classification. The tool is also used to evaluate the test results and measure its accuracy. For evaluation, we can upload a single headline or a file with multiple headlines, each of which may belong to a different category. The system classifies these headlines into one of the four categories and provides a comparison between the class predicted by the user and the one predicted by the classifier. The class distribution of the classification is represented by a pie chart generated using the Google Charts API. Finally, we evaluate the classifier based on its accuracy, precision, and recall.

It is assumed that the user uploads relevant training data to build the model and assigns the class for the training data appropriately. The tool provides very basic error handling mechanisms. Detailed error handling is beyond the scope of this project.

3. Requirement Specification

The following section describes the functional requirements of the system.

Store Training Data

The system shall allow users to upload training data stored in a file. A training data file can contain multiple headlines with each line in the file representing a single news headline. The system shall allow the users to select a class corresponding to the data in the file. The system shall allow the user to select only from one of the four pre-defined classes, namely, Politics, Sports, Technology, and Health. The system will store the training data and its corresponding class in the database. This training data will be used to build the model that will be used for classifying the unseen test data.

Generate Model

The system shall allow generation of the classifier model based on one of two supervised learning algorithms: J48 and Naive Bayes. Once the user selects the desired algorithm, the system retrieves the training data from the database and builds the model. The model and the training data upon which it is built are then stored in the database. It is important to store the training data corresponding to a model, since any change in the training data will affect the model.

Classify Test Data

The system shall allow users to classify a single news headline (test data) into one of the four predefined categories. When the user inputs the news headline, the system retrieves the latest generated model and its corresponding training data from the database. The system uses this information to classify the unseen test instance. The system also calculates the class distribution to give the user an understanding of how the test data is classified. This class distribution shall be represented in the form of a pie-chart indicating the percentage distribution per class.

Model Evaluation

The system shall allow users to evaluate the model by determining its accuracy, precision, and recall. This evaluation is performed using bulk test data: the user uploads a file containing multiple news headlines, with each line of the file representing a single headline (test instance). The system reads this file and then asks the user to predict a class for each headline. Subsequently, the system determines the class of each test instance itself. The system then compares the class predictions of the user and the classifier to determine the accuracy of the model. Predictions are also recorded per class to calculate the precision and recall for each class.

4. Classification Algorithms and Pre-defined Classes

Supervised Learning Algorithms

Applications do not have 'experiences' of their own; they learn from data, which represents past experience in the application domain. This task is normally called supervised learning or inductive learning: the system is supervised, or trained, beforehand in order to classify data. Supervised learning is a machine learning task in which training data is used to produce a classifier. The training data consists of input objects and their desired class values. The supervised learning algorithm uses the information contained in the training data to produce an "inferred function", also known as a "classifier", which is then used to predict the class value of unseen test instances.

In this project, we use two supervised learning algorithms: J48 and Naive Bayes. J48 is an implementation of the C4.5 decision tree learner, which produces decision tree models. To separate the possible predictions, it recursively splits the data set according to tests on attribute values, using a greedy approach to induce the tree. The tree is constructed in a top-down, recursive manner: initially all the training data is at the root, and the examples are then partitioned recursively based on selected attributes. Like most supervised learning algorithms, J48 builds its model by analyzing training data, and the model is then used to classify unseen data. Each node in the resulting decision tree evaluates the existence and significance of an individual feature.
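The greedy, recursive splitting performed by C4.5-style learners can be illustrated with a small sketch. The Python snippet below is a simplified illustration, not the actual J48/WEKA implementation: it scores a single "does the headline contain this word?" test by information gain, the criterion a decision tree inducer uses to pick each node's split. The sample headlines and labels are invented.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(headlines, labels, word):
    """Gain from splitting on whether a headline contains `word`."""
    with_word = [lab for h, lab in zip(headlines, labels) if word in h.split()]
    without = [lab for h, lab in zip(headlines, labels) if word not in h.split()]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part)
                    for part in (with_word, without) if part)
    return entropy(labels) - remainder

headlines = ["senate passes budget bill", "team wins league final",
             "new phone chip unveiled", "election results announced"]
labels = ["politics", "sports", "technology", "politics"]

# Greedy step: the word with the highest gain becomes the root test.
vocab = {w for h in headlines for w in h.split()}
best_word = max(vocab, key=lambda w: information_gain(headlines, labels, w))
```

A real C4.5 learner repeats this choice recursively on each resulting partition until the leaves are (nearly) pure, then prunes the tree.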

The Naive Bayes algorithm is based on Bayes' theorem and is sometimes termed a probabilistic learner. It uses all the attributes contained in the data and weighs them equally, assuming each attribute is independent of the others. Naive Bayes computes the conditional probability of each class given the instance and picks the class with the highest posterior probability. Naive Bayes classifiers can be trained very efficiently in a supervised learning setting. For text, Naive Bayes classification treats each document as a "bag of words": the model assumes the words of a document are generated independently of context given the class label, so the probability of a word is independent of its position in the document.
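As a concrete illustration of the bag-of-words computation described above, here is a minimal multinomial Naive Bayes sketch in Python with Laplace (add-one) smoothing. This is not the WEKA implementation the project uses, and the sample headlines are invented.

```python
import math
from collections import Counter, defaultdict

def train_nb(headlines, labels):
    """Count headlines per class and word occurrences per class (bag of words)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(headlines, labels):
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def classify_nb(text, class_counts, word_counts, vocab):
    """Return the class with the highest log-posterior, using add-one smoothing."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, doc_count in class_counts.items():
        score = math.log(doc_count / total_docs)               # log prior
        denom = sum(word_counts[label].values()) + len(vocab)  # smoothing denominator
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

headlines = ["senate passes budget bill", "president vetoes tax bill",
             "team wins league final", "star player injured in final"]
labels = ["politics", "politics", "sports", "sports"]
model = train_nb(headlines, labels)
```

Because each word contributes an independent factor, a headline sharing more words with one class's training bag gets a higher posterior for that class, which is exactly the behaviour observed later in the test cases.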

Pre-defined Categories for Classification

Any newspaper, whether print or online, has a lot of categories. For this application we picked four of the most common: health, politics, technology, and sports. News websites sometimes add sections for the most trending topics; for example, in the first week of November 2010, the most trending section was the election, and several top news websites added such a section. Our project does not cover these trending topics. The chosen classes are interesting to work with because some of them are related to each other; for instance, if you open the health section of a news website, the page may list 'Related Topics' that mention technology or healthcare. Even though these categories are related, the tool is able to classify them correctly to a certain extent, which is why they help evaluate the model better.

5. System Architecture

[Figure 1: Headline Classification Application - System Architecture. The diagram shows a browser-based client, a Tomcat servlet container hosting the classification algorithm, the classifier (model), and the model evaluation component, along with the training data and unseen instances flowing through the system.]

The system uses a three-tier architecture. The browser serves as the user interface: through it, the user can upload training data, choose the classification algorithm, and upload test data. The web application, deployed in a servlet container, forms the core of the system and contains the implementation of the classification algorithms and the parsing of training and test data. The data tier consists of the database used to store the training data and the model.

We use Apache Tomcat as the servlet container and MySQL as the database in the system.

6. Database Schema

The database for the application consists of two tables: TrainingData and Model.

TrainingData table schema

Sample data in TrainingData table

Model table schema

7. Implementation

The headline classification system is implemented as a web application: users access it from a web browser, and the application itself is deployed in a servlet container. The application is developed using Apache Struts, an open-source web application development framework, and mainly uses the WEKA Java APIs for all tasks related to model generation and classification. Specifically, the WEKA APIs are used to:

Transform the raw training and test data into a standard format that can be used for classification.

Clean the training data by removing stop words and converting case.

Implement the classification algorithms.

Evaluate the model.

The main modules of the web application's implementation are explained below, and the corresponding classes in the source code are indicated in bold text.

Training Data Storage (

Raw training data in the form of a plain text file is uploaded by the user, who also provides the class to which the training data belongs. The file contains multiple news headlines, one per line. The application parses each line and stores it as a separate record in the database along with the associated class name.
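This parsing step can be sketched as follows (in Python for illustration; the actual application does it in Java inside the web tier). Each non-empty line becomes one (headline, class) record destined for the TrainingData table.

```python
def parse_training_file(text, class_name):
    """Split an uploaded file into (headline, class) records, one per line.

    Illustrative sketch only: the real application performs this in Java
    before inserting the rows into the TrainingData table.
    """
    records = []
    for line in text.splitlines():
        headline = line.strip()
        if headline:                      # skip blank lines
            records.append((headline, class_name))
    return records

records = parse_training_file(
    "Phillies win, regain NL East lead\n\nTiger added to Ryder Cup roster\n",
    "sports")
```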

Classification Model Generation (

Generating the model involves two major steps:

Pre-process the training data

The raw training data cannot be used as-is for classification. It has to first be transformed from raw text (nominal attributes) to the String data type, which is done using WEKA's NominalToString filter. Next, the String data has to be converted into a vector of numeric attributes using WEKA's StringToWordVector filter. Before conversion, the String data is converted to lowercase and stop words are removed to improve the quality of the training data.
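The effect of this pre-processing pipeline can be sketched in a few lines of Python. This is only an approximation of what the lowercase conversion, stop-word removal, and StringToWordVector steps do in WEKA, and the stop-word list here is a made-up fragment.

```python
from collections import Counter

# Made-up fragment of a stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "to", "and"}

def to_word_vector(headline):
    """Lowercase the headline, drop stop words, and count the remaining words --
    a rough approximation of what the StringToWordVector step produces."""
    words = [w for w in headline.lower().split() if w not in STOP_WORDS]
    return Counter(words)

vector = to_word_vector("The Senate Passes the Budget Bill")
```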

Generate the model

The model is generated using WEKA APIs based on the classification algorithm specified by the user. WEKA provides implementations of the J48 and Naive Bayes algorithms, and we have used these to generate the model. Once the model is generated, the byte stream of the model, as well as the training data corresponding to this model, is stored in the database. This information is required during the classification and evaluation steps.

Test Data Classification (

For classifying the unseen test instance provided by the user, the model's byte stream is first retrieved from the database and cast back into a Classifier object. Similarly, the byte stream of the training data used to build the model is converted back into an Instances object. These two objects are then used to classify the test instance.
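The store-and-restore round trip described above can be illustrated with Python's pickle as a stand-in for Java object serialization (the real application serializes WEKA objects and keeps the bytes in MySQL; the Model class below is a hypothetical stand-in for the trained classifier):

```python
import pickle

class Model:
    """Hypothetical stand-in for the trained classifier object."""
    def __init__(self, algorithm):
        self.algorithm = algorithm

# Serialize to a byte stream, as done before storing the model row in the database.
blob = pickle.dumps(Model("J48"))

# Later: fetch the bytes from the database and cast them back into a usable object.
restored = pickle.loads(blob)
```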

Model Evaluation (

To evaluate the model, test data is uploaded in bulk; that is, the user uploads a file containing multiple test instances. It is necessary that the user be aware of the correct class of each test instance when prompted to predict a class for it. The system takes the input from the user and classifies all the test instances. A comparison is then drawn between the class predictions made by the user and those made by the classifier. The system records the number of matching predictions (i.e., the classifier predicted the same class as the user) as well as non-matching predictions. These two statistics are used to calculate the accuracy of the model. Similarly, the system records prediction data for each class to determine the precision and recall per class.

8. External Source Code Used


We reused the code to preprocess and train the model in the class

We also reused code to serialize and deserialize the Model and Training Data objects in, and

WEKA API (weka.jar)

Struts API (struts.jar)

9. Input Data Collection

Two major types of data are used for classification - Training Data and Test Data.

In order to achieve maximum accuracy while classifying the test data, it is necessary that the test data distribution be identical to that of the training data.

Training Data Collection

News headlines across a span of two months were collected from  to train the model. Four major categories of classification namely Health, Politics, Sports and Technology were identified and news headlines related to all these fields were stored in separate text files in the system. For training the classifier, the text files were uploaded individually.

Test Data Collection

Current day news headlines were collected as a part of the test data.

10. Challenges Faced

Initially we proposed an 'Automatic Twitter Feed Classification and Analysis Tool' as our project. The objective of that project was to extract information from Twitter feeds and categorize them into pre-defined classes using supervised learning algorithms. There are no tools available to categorize tweets into high-level topics such as health, politics, and sports. We intended to build an automatic classification and analysis tool that would provide insight into usage patterns on Twitter. The features proposed for this project were:

Classify tweets into various pre-defined categories using a supervised learning algorithm

Determine the most tweeted category

Determine the most active users tweeting about a particular topic

Present a visual output of the observed patterns in a human understandable form.

Problems Encountered:

To create the model, tweets from different categories were first collected as training data. This data set was converted to ARFF (attribute-relation file format) so that it could be used with the open-source Java-based tool Weka. The ARFF file was fed to Weka using the 'StringToWordVector' filter with the following properties:

IDFTransform = True

lowerCaseTokens = True

Stopwords = (need to mention it for Windows OS)

Tokenizer = AlphabeticTokenizer

For the training set, the model was generated successfully with correct categorization of classes. But when a data set of similar format was used as the test set, the classification was not correct. Initially, when feeding test data into Weka, a 'test data and training data incompatible' error was always encountered. This problem was resolved by pre-processing the test data before using it for classification.

For classification, different classifiers were tried, such as Naive Bayes and NaiveBayesMultinomial, but our major effort went into the Naive Bayes classifier. When using the test data, the classification was not correct: every time, the predicted class was different from the actual class.

We observed that the reason for this was noisy data: tweets are just 140 characters long and consist of special characters like hashtags (for trending topics) and many stop words. In many cases, a URL was provided as part of a tweet without much meaningful information. We tried to remove as much noise from the data as possible, but in the end we were left with very little data that made sense. As a result, the generated model was not capable of accurately classifying the test instances. We spent a lot of time debugging this issue. We then came up with the idea of using news headlines, which are also short sentences like tweets but, most of the time, are complete sentences that make sense on their own. We decided to reuse the application we had built and change only the training and test data. While testing headline classification, we observed that the classification quality was up to the mark, as headlines were more meaningful than tweets.

Below is the sample tweet data used for classification:

Training Data:

@relation _Users_prachi_Documents_SJSU_Fall2010_CMPE296M_twitter_train_dataset

@attribute text string

@attribute class {business,sports}

@data
'Hoyer clyburn fight appears to be over.',business

'Colorado Sen. Michael Bennet turns down DSCC chairmanship.',business

'Sarah Palin\'s unfavorable rating reaches an all time high.',business

'Sens. John Cornyn and Mark Warner, Reps. James Clyburn and Heath Shuler, Anita Dunn and Tom Davis.',business

'Tiger one of four added to U.S.\' Ryder Cup roster.',sports

'Turkish team paid UK recruit over $100K.',sports

'\'s predictions for the 2010 season.',sports

'Major league saves leader Hoffman gets No. 600.',sports

Test Data:


@attribute text string

@attribute class {business,sports}

@data
'Sell-off on Wall Street - A sell-off in U.S. stocks picked up steam Friday afternoon, following a volatile trading s',business

'Does Ford have a Cadillac strategy for Lincoln? - Ford Motor Company CEO Alan Mulally is carefully sidestepping the',business

'Phillies win, regain NL East lead',sports

11. Classifier Evaluation

Classifier Evaluation helps us check the correctness of the model built. The measures precision, recall and accuracy are used to evaluate the classifier.

Accuracy = (number of correctly classified instances) / (total number of test instances)

Precision = (true positives) / (true positives + false positives), computed per class

Recall = (true positives) / (true positives + false negatives), computed per class

These metrics are displayed in a tabular form for the user to check the accuracy of the system. The classifier was evaluated with different training and test sets.
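A minimal sketch of how these three metrics can be computed from the user's predictions (taken as ground truth) and the classifier's predictions; the sample labels below are invented:

```python
def evaluate(user_classes, predicted_classes, target):
    """Overall accuracy, plus precision and recall for one target class,
    treating the user's predictions as the ground truth."""
    pairs = list(zip(user_classes, predicted_classes))
    accuracy = sum(u == p for u, p in pairs) / len(pairs)
    tp = sum(u == target and p == target for u, p in pairs)  # true positives
    fp = sum(u != target and p == target for u, p in pairs)  # false positives
    fn = sum(u == target and p != target for u, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Invented sample: the user's classes vs. the classifier's predictions.
acc, prec, rec = evaluate(
    ["sports", "health", "sports", "politics"],
    ["sports", "health", "politics", "politics"],
    target="sports")
```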

Test Cases

Test Case 1: Small training data set. A training set with ten headlines in each category (Sports, Health, Technology, and Politics) was chosen, and the test set had two headlines under each category. The classifier evaluation was less than 50% accurate, with 5 out of 8 test instances incorrectly classified. Since the system learns only from the training data, the training data needs to be vast.

Test Case 2: Training data set with a varying number of headlines per category (Health - 55, Technology - 40). While testing the application, headlines from unknown categories such as Business or Education were classified under the category with the maximum number of training headlines (in this case, Health). Since classification is based on the bag of words from the training set, the category with the most training headlines is chosen by default. This is a limitation of the classifier.

Test Case 3: News headlines from similar categories. Headlines from 'Science' and 'Elections' were added to the test set. Headlines from 'Science' were classified under 'Technology', and headlines from 'Elections' were classified under 'Politics'. Choosing similar categories like 'Science' and 'Technology' or 'Politics' and 'Elections' may result in wrong predictions by the classifier, as the headlines in these categories overlap.

Test Case 4: Larger training and test sets. Each category in the training set had up to a hundred headlines, and the test data was increased to ten per category. The results were better than with few training headlines, and with the increase in training data there was a proportional increase in the accuracy, precision, and recall of the classifier. With enough training data, all four categories (Health, Politics, Technology, and Sports), which differ vastly from one another, can be classified effectively.

Test Case 5: A headline that combines several categories. "Obama Gets Injured: Friendly Game Of Basketball Turns Into 12 Stitches." Will the system classify it under Politics (President), Sports (basketball), or Health (stitches)? The system classifies it into one of the three categories depending on the training data set: the number of times each word appears in the training data for a particular category determines the classification.

12. Conclusion

The main idea in the beginning was to create an "Automatic Twitter Feed Classification And Analysis Tool", where we aimed at collecting tweets from the Twitter website and classifying them into different categories. After weeks of experimenting, we found that the training data is the core of building any classification model. As the tweets in the Twitter data were very noisy, we were unable to achieve good accuracy in the test results.

A slight variation to the "Twitter Feed Tool" is the "News Headlines Classification Tool" where instead of collecting noisy tweets, we collected headlines from different newspapers and aimed at classifying them into their respective categories.

Even though a hundred percent accuracy was not obtained, this project aimed at classifying the top stories in newspapers into their respective categories based on headlines alone. It gave us an understanding of two algorithms (J48 and Naive Bayes), and experimenting with them using the same training and test sets helped us understand how differently they work. The accuracy of the model was displayed mathematically by calculating its precision and recall. The project also gave us a very good understanding of text mining.

Testing with various training and test sets helped us understand that automatic classification worked better with diverse categories whose training data (headlines) contained distinct words, making it easier for the model to predict the category and achieve a higher accuracy.

As future work, more extensive testing can be performed with different training and test data to understand the behaviour of the different models and to achieve better accuracy in classifying the headlines.

13. Application Screenshots

Upload training data

Train Model

Classify Test Instance

Evaluate Classifier

User Prediction

Classifier Prediction