Data Mining In The Technology Sector Computer Science Essay

Data mining is where the tech industry is heading. Companies hold billions of data points and are looking for ways to convert them into revenue. Data mining covers dozens of techniques for collecting data and turning it into something from which everyone can benefit. This chapter introduces the report, what is presented in it, and what results were derived.

A Network Intrusion Detection System (NIDS) is an intrusion detection system that tries to detect malicious activity such as denial-of-service attacks, port scans or even attempts to hack into computers by monitoring network traffic. A NIDS reads all incoming packets and tries to classify them based on rules or signatures. For example, if users who always log into their systems during the day suddenly access them late at night, this is considered suspicious and needs to be checked. This work is done by the NIDS. Another common example is the port scan. If there is a large number of TCP connection requests to many different ports, it can reasonably be assumed that someone is trying to port-scan the network. A NIDS is also used to detect incoming shellcode.
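The port-scan example can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular NIDS works: the `(source, port)` record layout, the function name `find_port_scanners` and the threshold of three distinct ports are all assumptions I made for the example.

```python
from collections import defaultdict


def find_port_scanners(connections, max_ports=3):
    """Return the sources that contacted more than max_ports distinct ports.

    connections is an iterable of (source_address, destination_port) pairs.
    Both the record layout and the threshold are hypothetical.
    """
    ports_by_source = defaultdict(set)
    for source, port in connections:
        ports_by_source[source].add(port)
    return sorted(
        source
        for source, ports in ports_by_source.items()
        if len(ports) > max_ports
    )


# Made-up traffic: one host probes many ports, another uses a single service.
sample = [
    ("10.0.0.5", 21), ("10.0.0.5", 22), ("10.0.0.5", 23),
    ("10.0.0.5", 25), ("10.0.0.5", 80),
    ("10.0.0.9", 80), ("10.0.0.9", 80),
]
suspects = find_port_scanners(sample)
print(suspects)
```

A real system would also look at time windows and at the rules or signatures mentioned above; the sketch only counts distinct ports per source.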

A NIDS not only inspects incoming traffic but can also monitor outgoing traffic, which can yield valuable intrusion information. Some attacks are staged from a computer inside the network and therefore never show up as incoming traffic. To catch such attacks we need the NIDS to monitor traffic from inside the network as well, not only incoming traffic.

Network intrusion detection systems are mostly used alongside other systems. They can be used in conjunction with firewalls, spam filters, anti-virus software etc. They can update the IP blacklists of some firewalls, and they can log information into a database of the user's choice. They can raise alerts at the time of intrusion detection via email, sound etc. All of this can be programmed by the administrator of the network.

This report mainly covers the topic of data mining: what data mining is, how we can do it, how we can use it, and other such aspects. It also focuses on Network Intrusion Detection Systems and Snort: how to install it, how to configure it to get it running, and how to get the output of Snort into the Kiwi Syslog Server.

2. Data mining with WEKA, Part 1: Introduction and Regression

2.1 Introduction

First of all we should understand the concept of data-mining in layman's language. Anyone today may wonder what companies like Google, Yahoo! etc. are doing with the billions and billions of data points they have generated about their users, and what their plans are for all this information. It is no surprise that Wal-Mart is one of the most advanced companies in applying data-mining to get significant results for its business. Today virtually every company in the world is using data-mining to advance its business, and those that don't will soon find themselves at a great disadvantage.

So, how can one use the power of data-mining to enhance their business?

This chapter will answer the initial doubts or questions you might have about data-mining. It will give a significant amount of exposure to the Waikato Environment for Knowledge Analysis (WEKA), which is free and open source software. This software can be used to mine data and turn the numbers you have into information that increases your revenue. One might think that expert systems are required to do data-mining, but this is not so. After this chapter you will see for yourself that you can do a pretty good job of data-mining.

This chapter will discuss the first and easiest technique for data-mining, which is regression. It transforms existing data into a numerical prediction for future data. It is so simple that you might already have encountered it in popular spreadsheet software, although WEKA can do much more complex calculations to help you. In future chapters, other methods like clustering, nearest neighbor, classification trees etc. will be touched upon. (If these terms frighten you, don't worry; all of them will be discussed as we progress.)

2.2 What is data mining?

Now, let's shift our focus to the core concepts of data-mining. What is data-mining? It is the conversion of large chunks of data into meaningful patterns. Data-mining is of two types: directed and undirected. In directed data-mining we have to predict a particular data point; in our example this is the price of the house we need to sell, given the prices of other houses in nearby areas. In undirected data-mining we try to create different groups of data, or find patterns among them. Examples include mining census information or country populations to uncover trends in living style, food habits etc.

Data-mining as we know it started in the mid-nineties. This happened because computing power increased and the cost of computation and storage reached a point where companies no longer had to hire outside computing powerhouses. They could buy the equipment easily and do data-mining in house.

The term data-mining refers to dozens of techniques and procedures for examining data and turning it into something useful. So, this report will only touch the surface of these techniques. Experts in the field spend 20-30 years mastering them, and they might give you the impression that data-mining is something only big companies can afford.

This report hopes to throw light on many of the misconceptions about data-mining, and I will try to make things as clear as possible. It is not as simple as using a formula in an Excel sheet, but it is not so difficult that you cannot manage it yourself. This brings me to the 90/10 model. You can reach the 80/20 model easily, but to push yourself to the 90/10 model you have to get into the depths of the subject. Bridging the distance between the two models will take you about 20 years. So, unless you have decided to make this your career, "good enough" is what you need. It will also be better than what you are using right now, so there is no harm in the good-enough way!

The end result of data-mining is a model: one that can improve and suggest new ways to interpret your existing data and the data you have yet to come across. Because there are many methods of data-mining, you first have to choose which model to use for your data, that is, which model fulfills your data-mining needs best. This comes with guidance, experience and practice. After reading this report you should be able to look at your data and say "Yes, this is the right model for my data set." You will be able to make "good enough" models out of your data.

2.3 WEKA

Data-mining isn't solely the domain of big companies and expensive software systems. In fact there is open source, freely available software that does the same things as that expensive software. It is known as WEKA. WEKA was developed by the University of Waikato (New Zealand) and was first used in its modern form in 1997. It is released under the GNU General Public License (GPL). The software is written in Java and contains a graphical user interface for interacting with data files, and it gives tables, curves or graphs as visual output. It also has an API, so it can easily be embedded into other applications, such as automated server-side data-mining tasks.

Now please go ahead and install WEKA on your system. It is Java based, so if you don't have a JRE installed on your machine, please download the WEKA version that includes a JRE as well. When you start it up you will see the window shown below.

Figure1. WEKA startup screen

When WEKA is started, the graphical user interface window pops up and offers you four ways to work with WEKA and your data set. For the examples discussed in this chapter choose only the Explorer option. This option is more than sufficient for what is discussed here.

Figure2. WEKA Explorer

Now that you have learned how to install and start up WEKA, we will start on our first data-mining model: regression.

2.4 Regression

Regression is most probably the least powerful technique for mining your data, but it is also the easiest one to use; no wonder these two things go hand in hand. In its simplest form it has one input variable and one output variable (called a scatter diagram in Microsoft Excel, and also known as an X-Y diagram). It can easily be made more complicated by introducing a number of input variables. All regression models fit approximately the same general pattern: there are a number of independent variables, and using them the model produces a dependent output variable. The model is then used to predict the result given the values of all the independent variables.

The regression model is new to no one. Everyone has probably seen or even used the model before, and may even have created one in their mind. We will be discussing an example of house pricing in similar locations. The price of the house here is the dependent variable, which depends on a number of independent variables. These independent variables include the size of the lot, the square footage of the house, whether the bathrooms are upgraded, whether there is granite in the kitchen etc. If you have ever bought or sold a house, it is likely that you created such a model in your mind before doing so, by comparing your house to other houses in the same locality and the prices they sold for. You create a model from the houses already sold, then put the parameters of your house into the same model and get its likely price.
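To make the idea of "build the model from known points, then plug in a new one" concrete, here is a minimal sketch of the one-input case in Python. It uses the standard least-squares formulas for a straight line; the function names and the numbers are made up for illustration and have nothing to do with the house data or with how WEKA computes its models.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Apply the fitted line to a new input value."""
    return slope * x + intercept


# Made-up points that lie exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept, predict(slope, intercept, 5.0))
```

With several independent variables the same principle applies, but the arithmetic is more involved, which is where a tool like WEKA helps.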

Let us now create a real model based on some assumed data. Let us assume the data below is the real data for my neighborhood, and I am trying to find the price of my house. This output can also be used for property tax assessment.

The regression model barely scratches the surface of data-mining; this can be good or bad news for you depending on your usage and perspective. There are complete college semesters dedicated to it, and they might also teach you things you don't wish to know. But the scratches we are learning are good enough for the use of WEKA within this chapter. If you have a continued interest in WEKA, data-mining or statistical models, you can search for terms like "normal distribution", "R-squared" and "P values" in your favorite search engine.

2.5 Building the data set for WEKA

Loading data into WEKA requires putting the data into a format WEKA can understand. The preferred format is ARFF (Attribute-Relation File Format), in which the user declares the type of the data and then supplies the data itself. In this file one has to give the name of each column used in the data and the type of data each column will contain. This can be an integer value, a float value, a date or a string, but in the case of regression these types are limited to numeric or date values. After this the real data is supplied, as rows in a comma-delimited format. The ARFF file we will be discussing is shown below. Notice that the data set does not include a row for my house, the one whose price we want to know. Since this file is the input for building the model, we cannot enter the house whose selling price is unknown.

Table2: WEKA File format
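The layout described above can be sketched with a small Python helper that writes ARFF text. Please treat it as a rough illustration only: the helper `to_arff` is my own, it handles numeric columns only, and apart from houseSize and sellingPrice the attribute names and every number in the rows are placeholders, not the real data set.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text from a relation name, numeric attribute names, and rows."""
    lines = [f"@RELATION {relation}", ""]
    lines += [f"@ATTRIBUTE {name} NUMERIC" for name in attributes]
    lines += ["", "@DATA"]
    for row in rows:
        if len(row) != len(attributes):
            raise ValueError("each row needs one value per attribute")
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# houseSize and sellingPrice are named in the text; the other attribute
# names and every value below are placeholders, not the real data set.
attributes = ["houseSize", "lotSize", "bedrooms", "granite", "bathroom", "sellingPrice"]
rows = [
    [3000, 9000, 5, 0, 0, 200000],
    [2500, 8000, 4, 1, 0, 180000],
]
arff_text = to_arff("house", attributes, rows)
print(arff_text)
```

The header section declares the columns, and everything after `@DATA` is one comma-delimited row per house.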

2.6 Loading the data into WEKA

After we have created the data file, we have to create the model we will use; in this case it is the regression model. Start WEKA and choose the "Explorer" interface. You will be taken to the Explorer screen, with the Preprocess tab selected. Click on Open File and select the ARFF file you created earlier. After selecting it you should see something similar to what is shown below.

Figure3. WEKA with house data loaded

In the Explorer view, WEKA allows the user to review the data being worked upon. The left section of the window shows all the columns (attributes) present in our data and also the number of rows supplied. When you select a column, the right section shows information about the data in that column. For example, click on the houseSize column in the left part of the window, and the right part will show additional statistical information about the size of the houses. The maximum value, which is 4032 square feet, is shown along with the minimum and average values. The standard deviation of 655 square feet is also calculated and shown. If you don't know what standard deviation is, don't worry; it is a statistical measure of how spread out the values are. There is also a visual tool available to examine the data you have entered: click on the Visualize All button. (Because our data set is small, the visualization is not as powerful as it would be for big data sets with hundreds of thousands of rows.)
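If you want to see where such summary numbers come from, the following Python sketch computes the same kind of statistics with the standard library. The house sizes here are hypothetical values I chose for the example, so the results will not match the 4032 and 655 quoted above, and I am not claiming this is exactly the formula WEKA uses.

```python
import statistics

# Hypothetical house sizes in square feet; not the values from the data set.
house_sizes = [2000, 2500, 3000, 3500, 4000]

summary = {
    "min": min(house_sizes),
    "max": max(house_sizes),
    "mean": statistics.mean(house_sizes),
    "stdev": statistics.stdev(house_sizes),  # sample standard deviation
}
print(summary)
```

The larger the standard deviation, the more the house sizes differ from their average.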

Enough looking at the data; now let's move on to creating the data model and finding a price for my house!

2.7 Creating the regression model with WEKA

Click on the Classify tab to start creating the data model. First select the model you want to build, so that WEKA knows what type of data it has to work with and how to create the appropriate model.

Click the Choose button and expand the functions branch.

Select the LinearRegression leaf.

By now WEKA knows that we are building a regression model. You can clearly see that there are a number of other options as well; lots of other models! This tells you that we are really only touching the surface. Please note that there is also a SimpleLinearRegression option in the same branch. Make sure you do not choose this leaf, as that model looks at only one variable, and in our data set we have six variables. When you have done all this, you get a screen like the one shown in Figure 4.

Figure4. Linear regression model in WEKA

Is all this possible in a spreadsheet?

The answer is both no and yes: the short answer is no and the long answer is yes!

Most of the popular spreadsheet software currently on the market cannot easily do what we just did. However, if you are not doing data-mining on multiple variables and are concerned with only one variable, that is, the SimpleLinearRegression case, it can be done. Don't feel too brave at this point: a spreadsheet can do regression with multiple variables, but it would be difficult and confusing, and definitely not as easy as doing it with WEKA.

At this point our desired data model has been chosen. Now we have to tell WEKA where the data for building this model is. It might seem obvious that we want the ARFF file we have already provided, but there are actually different options, some of them more advanced than the one we will be using. The three other options are:

Supplied test set: Here we can supply a separate set of data on which the model is tested.

Cross-validation: This option lets WEKA build models out of subsets of the supplied data and then average the results.

Percentage split: In this option WEKA takes a percentage of the supplied data set and builds the model from it.

Actually these three choices are useful with other data models, which we will see in future chapters. With the regression model we can simply choose Use training set. This tells WEKA that we want to use the data we supplied in the ARFF file to build our data model.
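To give a feel for what "subsets of the supplied data" means in cross-validation, here is a small Python sketch that splits row numbers into folds. It only shows the general idea; the function `k_fold_indices` is hypothetical, and WEKA's own procedure (for example how it orders or shuffles the rows) may well differ.

```python
def k_fold_indices(n_rows, k):
    """Split row indices 0..n_rows-1 into k (train, test) pairs."""
    indices = list(range(n_rows))
    # Fold i takes every k-th row starting at row i.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        splits.append((train, test))
    return splits


splits = k_fold_indices(n_rows=6, k=3)
for train, test in splits:
    print("train:", train, "test:", test)
```

Each row ends up in the test part exactly once, and in the training part for all the other rounds.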

The last step in creating our data model is to select the dependent variable, which is the column we want to predict. We know that this column is the selling price, but we have to tell WEKA too. There is a combo box right below the test options which lets us choose the dependent variable for our data model. The column sellingPrice should be selected by default; if it is not, please select it.

After all this, click on Start. The figure below shows what the output should look like.

Figure5. House price regression model in WEKA

2.8 Interpreting the regression model

WEKA does not mess around: it shows the regression model right in the output.

This is shown clearly in Table 3.

Table 4 displays the result, which is the selling price of my house, after putting in the values of the independent variables for my house.

However, if we look back to the beginning of the topic, we notice that data-mining is not about giving a single number as the output; it is about identifying patterns in the data and the rules that can be formulated from them. The aim is not to get a number but to create a data model that helps predict patterns, detect other parameters and come up with definitive results. So, apart from looking at the selling price, let us interpret the results shown in the output window. Let us look at the formula used for getting this selling price.

Granite does not matter

WEKA only uses the columns that statistically contribute to the accuracy of the model. The columns that reduce this accuracy are not used. This regression model tells us that granite in the kitchen, whether present or not, does not contribute to the selling price of the house.

Bathrooms do matter

In this column we have used a simple value of zero or one. We can use the coefficient from the regression formula to see how an upgraded bathroom affects the overall value of the house. The model being discussed tells us that it adds $42,292 to the house value.

The bigger the house, the lower the value

Our model tells us that a house with a larger area will have a lower selling price. This is clearly visible from the negative sign before the houseSize variable. The formula says that $26 is taken off the house value for each additional square foot of area. But this makes no sense at all. So, what is the correct interpretation? The size of the house is not an independent variable in the strict sense. It is related to the number of bedrooms in the house, because the bigger the house, the more bedrooms it tends to have. This clearly indicates that our model is not perfect. But this is something that can be fixed. In the Preprocess tab we can easily remove any column from the data set which we do not want to contribute to our data model.

Now let us consider another example from the official WEKA website. It is more complex than our example with a small number of houses. This example tries to predict the miles per gallon of a car given various other parameters. These parameters range from the displacement of the engine to the horsepower it produces, and also include how many cylinders the engine has, how much the car weighs, its acceleration, the model and make of the car, its production year, country of origin etc. Not only that; to complicate our model, there are about four hundred rows in the data set. In theory this all looks complex, but WEKA has no problem handling such data.

To produce the data model for this data set you follow the same steps as shown above for the house example. So I will not repeat those steps, and will directly discuss the output you get after creating the model.

The output table is shown below.

When you run the model for the above example you will notice that WEKA does not take even a second to compute it. So computationally it is not a problem for WEKA to create a powerful and useful regression model for a large amount of data. This model might also seem too complex compared to the house example, but it is not. Let us see how to interpret it. Take the first line of the model, -2.2744 * cylinders=6,3,5,4: it means that if the vehicle has three, four, five or six cylinders you place a 1 in this column, and for any other value a 0. An example makes this clearer. Take row number 10 of the data set and put the numbers from this row into the data model. You will see that the output of the regression model approximately matches the output given to us in the data set!
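The way the first line of the model works can be written out as a tiny Python function. The coefficient -2.2744 and the cylinder values 6, 3, 5, 4 come from the model line quoted above; the function name `cylinders_term` is just something I picked for this sketch, and it covers only this one term, not the whole formula.

```python
def cylinders_term(cylinders, coefficient=-2.2744, members=(6, 3, 5, 4)):
    """Contribution of the 'cylinders=6,3,5,4' term for one car."""
    # The column holds 1 when the cylinder count is in the listed set,
    # otherwise 0, so the term is either the coefficient or nothing.
    return coefficient if cylinders in members else 0.0


print(cylinders_term(4))   # car with four cylinders
print(cylinders_term(8))   # car with eight cylinders
```

The full prediction is the sum of all such terms in the model, each worked out the same way.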

Table6. Example MPG data

You can try the same thing with any other row of the data set. So, what does this mean? It means that our data model is performing well: it predicts 14.2 miles per gallon when the actual value is 15. We can be fairly confident of getting an approximately correct value for data whose dependent variable is not known.

3. Data mining with WEKA, Part 2: Classification and clustering

3.1 Introduction

In the previous chapter, the concept of data-mining was introduced. I also made you familiar with the WEKA software, which is open source and free to use. It helps you mine your own data without the help of an outsider. I also discussed the first and probably the easiest model of data-mining: regression. This allows you to predict a numerical value based on the values of the independent variables. It is the least powerful data-mining algorithm, but it is a good example of how raw data can be converted into useful information for future purposes.

In this chapter we will discuss two more data-mining algorithms that are a bit more complex than the method discussed in the previous chapter. This follows directly from the fact that they are more powerful and help you interpret your data in different ways. As I said earlier, the key to using the power of data-mining is to know which model to use for the data you have. If the right model is not used, the result will be nothing more than garbage! We all see on various sites that customers who bought or viewed one item also bought or viewed certain other items. There is no numerical value associated with this kind of data. So now let's dig into the other models you can use for your data.

In this chapter I have also included a portion on the nearest neighbor method, but we will not go into the details of this method. I have included it to complete the comparisons I want to highlight.

3.2 Classification vs. Clustering vs. nearest neighbor

I think we should try to understand what each model strives to accomplish, and what type of data and goals each model addresses, before going into the intricacies of any model and running it in WEKA. Let us get back to our first data model, regression, so you can relate the new models to the one we already know. Here we will use a real-world example to show how the models differ from each other and how each can be of use to us. All of my examples will be about a local BMW dealership that wants to increase its sales. The dealership has all the information about people who have bought a BMW, or have even just had a look at one and walked through the showroom. By mining the available data, the dealership wants to increase its current business.

3.2.1 Regression

The question here is how much the dealer should charge for a new BMW M6. The regression model that we have already studied can easily answer this question and give a numerical output based on the formula derived in the model. It uses information on past sales of the M6 to determine how much the dealer charged for previous cars and what features were available on those cars. The dealer then puts in the details of the new car he wants to sell, and the model gives him the selling price.

For example: Selling price is $25k + $2.9k multiplied by liters in engine + $9k if it is a sedan + $11k if it is a convertible + $100 multiplied by the length of the car in inches + $22k if the car is convertible

3.2.2 Classification

But now the question of concern is "What are the chances that any given person X will buy the BMW M6?" Such questions can be answered by creating a classification tree, which tells us how likely a person is to buy a BMW M6. There can be various nodes on the classification tree we are talking about. Such nodes may be the age of the person in question, his annual income, his gender, which cars he currently has, his number of kids, whether he owns a home or rents a place etc. These attributes can be used in a classification to find the likelihood of him buying the new car.

3.2.3 Clustering

The question here is which age groups like to buy a BMW M6. Again data-mining can be applied to get the answer. We already have data on past customers, including their age. From this data we can infer, using our data model, whether any particular age group has a higher probability of buying a new BMW M6, and whether they are likely to order a blue BMW. It can also be determined which colors are likely to be ordered by people of other age groups. In short, the data when mined will cluster by age group and by car color, helping you to easily see the patterns between them.

3.2.4 Nearest neighbor

The question here is: when people purchase a new BMW M6, what other features or optional items do they like to buy with it? Data-mining can be applied here to find the trends in other items purchased, which might include matching hand luggage or a matching colored wrist watch etc. Using this information the car dealer can make specialized promotional packages of the items people tend to buy together. This will help the dealership increase its sales. The dealer can also offer discounts on these "other" items.

3.3 Classification

Classification is a data-mining algorithm that creates a step-by-step guide for determining the output of the model. It is also known as decision trees or classification trees. The created tree has nodes, each of which represents a decision spot, i.e. a decision has to be made at a node before moving further. This continues until you reach a leaf node, which is an end node of the tree and has no children. It might sound confusing but it is actually simple and straightforward, as shown in the table below.

Now let us see what this example actually means. At the root node, the first node, there is a question which asks whether you will read this section, and you move on based on your answer. If you chose yes, you are next asked whether you will understand it; if you answered no, you reach a leaf node which says you will not learn it. The main advantage of a classification tree is that you don't need a lot of information about the data to build a tree structure that is usually correct and informative.
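The same little tree can be written as nested decisions in Python. This is only my reading of the example: the leaf reached by answering no at the root is described above, while the two leaves under the "understand it" node are assumptions on my part, as is the function name `will_learn`.

```python
def will_learn(reads_section, understands_it):
    """Walk the two-node example tree and return the leaf label."""
    if not reads_section:            # root node: will you read this section?
        return "won't learn it"
    if not understands_it:           # second node, reached only after "yes"
        return "won't learn it"      # assumed leaf
    return "will learn it"           # assumed leaf


print(will_learn(reads_section=True, understands_it=True))
print(will_learn(reads_section=False, understands_it=True))
```

Every call follows exactly one path from the root to a leaf, which is how a classification tree produces its output.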

The regression model and the classification model share the concept of a "training set" used to produce the data model. A data set with known output values is taken and the model is built from it. Using this model, the expected output is obtained for input variables whose output we don't know. This is all similar to what we have done and seen with the regression model. But the classification model needs an extra step to make it more helpful and precise. It is recommended that you put about 60 to 80 percent of the data rows into the training set, which is then used for model building. The remaining rows are used as the test set. We then immediately use this test set to check the accuracy of the model we have just created.

Now, you might be wondering why we are doing this extra step. It is done to overcome the problem of overfitting. If we fit the model too closely to one data set, it will be exactly right for that data, but only for that data. We will be using the model to make predictions in future too, and we want it to work well for those as well. To overcome overfitting and to make sure that the accuracy of our model is not restricted to the training data, we divide the data set into two parts. We will see this in practice further on.
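Dividing the rows into a training set and a test set can be sketched as follows. The 70 percent figure is just one value inside the 60 to 80 percent range mentioned above, and the function `train_test_split`, its parameters and the fixed seed are my own assumptions for the illustration; WEKA does this for you through its test options.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of rows and cut it into training and test parts."""
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(rows)
    # A seeded generator keeps the split repeatable from run to run.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(10))          # stand-in for ten data rows
train, test = train_test_split(rows, train_fraction=0.7)
print(len(train), len(test))
```

The model is built from `train` only, and `test` is kept aside to check how well it does on rows it has never seen.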

We also have to discuss one more major concept of classification trees, known as pruning. Pruning, as the name suggests, is the process of removing branches from the classification tree. You might wonder why we would want to remove branches. Again, the reason is overfitting. Trees become complex when the number of rows and columns in our data is very large. In theory the number of leaves in a tree can be as large as the number of rows multiplied by the number of columns in our data. But such a tree is of no use to us: it fits the present data perfectly and is not useful for future predictions. So, we want to strike a balance. The simplest tree, with the fewest nodes, is preferred, but we have to carefully manage the trade-off between simplicity and accuracy. We will see this further on.

Before we start using WEKA for this model, there is one last thing I want to put before you: the concept of false positives and false negatives. A false positive is a data point where our model has predicted a positive value, but it is actually a negative value. Similarly, a false negative is a data point where our model has predicted a negative value, but it is actually a positive value.
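Counting the two kinds of error is straightforward, as this short Python sketch shows. The labels are made up for the example, with True standing for a positive value, and the function name `count_errors` is hypothetical.

```python
def count_errors(actual, predicted):
    """Count false positives and false negatives for boolean labels."""
    # False positive: predicted positive, actually negative.
    false_positives = sum(1 for a, p in zip(actual, predicted) if p and not a)
    # False negative: predicted negative, actually positive.
    false_negatives = sum(1 for a, p in zip(actual, predicted) if a and not p)
    return false_positives, false_negatives


# Made-up labels: True means "positive".
actual = [True, False, True, False, False]
predicted = [True, True, False, False, True]
fp, fn = count_errors(actual, predicted)
print(fp, fn)
```

Comparing the two counts is what lets a designer judge whether the balance between the two error types is acceptable.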

Both kinds of error mean that our model is classifying the data incorrectly. The designer of the model has to decide what percentage of errors is acceptable, because errors are always going to be there. The acceptable percentage depends on how the model will be used. If the model is going to be used for monitoring heart rate in a hospital, the percentage of error obviously has to be very low. On the other hand, if you are creating the model to learn about data-mining (as you are doing now), the acceptable percentage of errors can be relatively high. The designer also needs to define what ratio of false negatives to false positives can be accepted. Consider an e-mail system. A real e-mail marked as spam (a false positive) can be extremely harmful compared to a false negative, that is, a spam message reaching your inbox. Here a ratio of 1000:1 may be acceptable, again depending on the needs.

We have looked enough at the background and other technical details of classification trees; now let's jump to a real-world problem, using a real-world data set. Let us now put all this into WEKA.