The term data mining was introduced during the 1990s. Today data mining is widely used in business, marketing and other fields. But how did the term evolve? Why has it become so popular in business? How can it be applied in day-to-day life? This report answers these questions concisely, with the help of references.
Historically, data mining grew out of three disciplines:
Classical statistics was the first to evolve. Without it there would be no data mining. Classical statistics covers regression analysis, standard distributions, standard deviation, variance, discriminant analysis and cluster analysis, all of which play an important role in statistical analysis.
The second is artificial intelligence (AI). AI is about applying human-like thinking to statistical problems. At first it was of little practical use, because it required a great deal of computer processing power and computers were very expensive at the time. It became practical only in the 1980s, when computer prices became reasonable, and even then it was used mainly in the scientific and government markets. AI concepts were adopted in very few products, such as query optimization modules for Relational Database Management Systems (RDBMS).
The last is machine learning, which combines statistics and artificial intelligence. Since AI saw little use because of its cost, machine learning came into the picture during the 1990s, when computers had become cheaper. It was used more widely than AI because it blends AI techniques with statistical analysis: machine learning applies fundamental statistical concepts together with AI algorithms to the business problems where it is used.
Thus data mining is the union of three things: statistics, AI and machine learning. Together these techniques are used to find hidden patterns in large volumes of data. This is how data mining evolved. The diagram below gives a clear picture of this evolution.
Figure 1.1: history of data mining (data mining as a combination of multiple disciplines)
1.2 WHAT IS DATA MINING?
Data mining is the process of extracting patterns from large amounts of data. It is a powerful technology that helps companies extract important information from their huge datasets. Data mining extracts information from a company's past records, helps correct any weaknesses they reveal, and uses the discovered patterns to improve future business. Because the process can be run on fast computers, it can produce output in seconds. It therefore saves a great deal of time, helping a company reach its goals sooner.
Many companies use software and hardware to refine their data and then apply a data mining algorithm to extract patterns. Broadly, this is done in two steps. First, information is extracted from the historical dataset and patterns are formed from it. Then the same data mining algorithm is applied to recent data, and the two sets of patterns are compared to see how the company is currently behaving and how much it has improved.
Since the data mining process is fast, the size of the dataset is not a major concern. In fact, the more information we have (more columns or more rows in the database), the higher the accuracy is likely to be and the lower the chance of error. It also allows a company to draw conclusions from even small details in the available data.
Data mining is used in a wide range of applications. Some of them are listed below:
A pharmacy can use data mining techniques to analyze how many customers visit the shop, which locations most of them come from, and who its competitors are. All of this information can be obtained from the company's historical records: the number of customers per day, the times of day at which the pharmacy is busiest, and the locations from which the majority of customers come to buy medicines. Using this data, the pharmacy can increase efficiency by putting more employees on duty during the busy hours, and it can distribute pamphlets announcing changes at the pharmacy in the locations most of its customers come from. Finally, it can compare past data with present data to see how trends in the business are changing.
A credit card company can use data mining techniques on its previous data to find out which customers are likely to respond to a new offer. When announcing the offer, it can then send mail only to the customers likely to be interested rather than to its entire customer list. In this way the company avoids annoying customers who are irritated by such mail, and it does not risk losing valued customers.
In hospitals, when diagnosing patients for blood pressure, doctors compare a patient's symptoms with a previously defined set of symptoms. For example, if a patient has headache, dizziness, blurred vision, nausea and vomiting, chest pain and difficulty breathing, the person is more likely to have high blood pressure (BP). Doctors can deduce this because these symptoms were defined earlier by someone who derived them by applying data mining techniques.
The above examples all emphasize the same thing: improving customer relationships and making the business more profitable.
Data mining has some advantages as well as disadvantages:
Data mining can give an accurate picture of customers' purchasing behavior. For example, the marketers of a software company can introduce a new scheme only to those customers who tend to respond to new products. In this way the marketers can improve the customer relationship.
Similarly, in retail stores, the store manager can arrange items conveniently in the shop (e.g. discounted items in one area and full-price items in another) so that it is easy for customers to shop.
Data mining can be used in banks, where it provides information about customers' previous history. With this data the bank can judge whether or not to grant a person a loan.
Data mining is used to track criminals by drawing on previous records of locations, types of crime and other behavior.
Data mining helps researchers spend less time on a particular piece of work, so that they can concentrate on other work in the time available.
In the internet age, a great deal of private information has been stolen by hackers. In online shopping, when customers send their credit card information, there is a chance that the information will be leaked over the network.
Companies hold personal information online. This includes consumers' social security numbers, addresses, account numbers, payment histories and so on, all of which can be stolen by hackers. Such companies should adopt methods to protect this information.
Misuse of information
Information obtained through data mining can be misused, either to take advantage of innocent people or to conduct unethical business.
Data mining techniques are not 100% accurate. Mistakes can happen, and they can lead to serious problems.
1.3 ARCHITECTURE OF DATA MINING
Data mining architecture has three main layers:
Database layer, with data and metadata sub-layers.
Data mining application layer, which performs data management and runs the algorithms.
Front end layer, for administration, input parameter settings and results.
The following diagram shows the data mining architecture.
Figure 1.2: 3-tier architecture of data mining (diagram labels: data mining results, metadata; parameters, data mining queries; data mining applications)
1.3.1 Database layer
The database layer stores huge amounts of data in the form of flat files or tables in a relational database management system (RDBMS). It has several sub-layers. The following diagram shows the layers of the database tier.
Figure 1.3: various layers of database (diagram labels: output from data mining application; data mining results; prepared input data; transformation, cleansing and consolidation; data and metadata for data mining; metadata and data extracted from source systems)
The metadata layer is the backbone of the entire data mining architecture. It stores information extracted from the various sources, as well as the transformation and cleansing rules.
The data layer consists of the staging area, the prepared input data and the data mining results.
The staging area holds the data in the form of flat files or tables in an RDBMS. The data there is cleaned and transformed to produce the prepared input data. This layer consolidates, summarizes and derives data according to the information the company needs and passes it on to the next tier, the data mining applications. The data mining results hold the output in a condensed, user-friendly format, so that the user can see exactly what form the data takes at this stage.
1.3.2 Data mining applications
The data mining application layer has two main components:
Data manager
Data mining tools/Algorithms
The following diagram shows the two primary components of the data mining application layer.
Figure 1.4: data mining applications (diagram labels: data mining results; input data from database; data mining tools/algorithms)
1.3.2.1 Data manager
The data manager manages the data and controls the flow of input and output. It performs the following functions:
Manage data sets: the bulk data is divided into multiple sets and stored for further processing. The data manager knows which data to extract and which to ignore according to the business specifications.
Input data flow: it takes data from the database layer, applies transformation rules and sends the data on for further processing. The data is extracted from the database in the required format. It also controls the flow of data (for example, how the data is passed to the next level).
Output data flow: it transfers the data to the next level, the front end, where the user views the output. Again it sends the data in a specific format and controls its flow.
1.3.2.2 Data mining tools/Algorithms
This is the most important step in the data mining architecture. Different algorithms are applied to different types of data according to the business specification. Many algorithms are available, such as k-nearest neighbours (KNN) and decision tree algorithms. The choice of algorithm depends on the data available, as explained later in the report. The algorithm analyzes the data and generates the result.
1.3.3 Front end
The front end is the user interaction layer. It has the following functions:
Administration
Input parameter settings
Data mining results/Visualization
Administration covers the following tasks:
Data flow processes
Data mining routines
Error reporting and correction
User security settings
Input parameter settings
To help the user understand the output, parameters are set to fine-tune it. The parameters are adjusted, the change in the result is observed, and the settings are revised accordingly until the result is easier to interpret and understand.
Data mining results
The result is finally displayed to the user in a user-friendly form. The user can then understand the result and use it in further data mining applications. The user can also analyze the result and present it as small reports, charts and so on.
1.4 STEPS INVOLVED IN DATA MINING
The various stages in data mining are shown in the picture below.
Figure 1.5: steps in data mining
The above diagram outlines the following steps involved in data mining:
Data integration: a large amount of data on the subject is collected from different sources and stored in the data warehouse.
Data selection: from the data collected in the previous step, only the data required for the business purpose is selected.
Data cleaning: the selected data is cleaned to remove errors, missing values, and noisy or inconsistent data.
Data transformation: the cleaned data is still not ready for data mining algorithms, so it is transformed using techniques such as smoothing, aggregation and normalization.
Data mining: after transformation the data is ready for the data mining algorithms. Different algorithms are used for different types of data; algorithms such as clustering and decision trees are applied to the data types they suit. The choice of algorithm for specific data is explained later in the report.
Pattern evaluation and knowledge presentation: the patterns produced by the data mining algorithms are condensed through steps such as visualization, transformation and the removal of redundant patterns.
Decision/use of discovered knowledge: the fully processed data is now available to the user, who can use it in further applications.
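The cleaning and transformation steps can be illustrated with a minimal Python sketch. It assumes that records are dictionaries and that a missing value is stored as None; the sample data and the function names clean and min_max_normalize are hypothetical, chosen only for illustration. Min-max scaling is one common form of normalization, not necessarily the one a particular tool would use.

```python
def clean(records):
    """Drop records that contain a missing (None) value."""
    return [r for r in records if None not in r.values()]


def min_max_normalize(values):
    """Rescale numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical sample data, for illustration only.
raw = [
    {"customer": "A", "spend": 20.0},
    {"customer": "B", "spend": None},  # missing value, removed by clean()
    {"customer": "C", "spend": 60.0},
    {"customer": "D", "spend": 100.0},
]

cleaned = clean(raw)
normalized = min_max_normalize([r["spend"] for r in cleaned])
print(normalized)  # [0.0, 0.5, 1.0]
```

After these two steps every remaining value lies on the same scale, which is what many data mining algorithms expect as input.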
The data is basically divided into two sets.
Within the huge set of data, the target data has to be set according to the company's goal. The target data contains all the information important to the project and leaves out everything else. It is cleaned in the pre-processing step by removing stop words such as and, or and at, and by stemming, that is, reducing a word that exists in many forms (e.g. study, studied, studying) to a single form.
Once the data has been pre-processed, we get the training data, which is used to build the model. A separate portion, the test data, is used to check that what was learned from the training data works without errors on data it has not seen before.
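One common way to obtain training and test data is to hold back a fraction of the pre-processed records. The sketch below is an illustration under assumptions: the function name split_train_test, the 25% test fraction and the fixed seed are all hypothetical choices, not taken from the report.

```python
import random


def split_train_test(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and hold back a fraction as test data."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed gives a repeatable split
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical document names, for illustration only.
documents = [f"doc{i}" for i in range(8)]
train, test = split_train_test(documents)
print(len(train), len(test))  # 6 2
```

Keeping the two sets disjoint matters: results measured on records that were also used for training would look better than they really are.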
Applying the algorithms is the core of data mining. Without this step the data prepared so far is of no use. Let us look in detail at how the algorithms are applied.
Data mining tasks are commonly divided into four types:
Clustering: the task of grouping similar items of data together within a huge dataset.
Classification: the task of assigning data to predefined classes. For example, incoming mail is classified as inbox or spam: important messages are kept in the inbox and unwanted ones go to spam. Algorithms used for this task include decision tree algorithms and naïve Bayesian classification.
Regression: the task of finding a function that models the data with the least error.
Association rule learning: the task of finding relationships between items in the available data. For example, a supermarket manager can determine which items customers purchase more often, which they purchase less often, and which are bought together. This can help the supermarket improve its business strategy.
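As an illustration of the association task, the following Python sketch counts how often pairs of items appear in the same transaction. The fraction of transactions containing a pair is usually called its support. The baskets and the function name pair_support are hypothetical; a full association rule algorithm does considerably more than this.

```python
from collections import Counter
from itertools import combinations


def pair_support(transactions):
    """Return the fraction of transactions in which each pair of items occurs."""
    counts = Counter()
    for basket in transactions:
        # sorted(set(...)) removes duplicates and gives each pair a fixed order.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n / len(transactions) for pair, n in counts.items()}


# Hypothetical supermarket baskets, for illustration only.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "eggs"],
    ["bread", "milk", "eggs"],
]

support = pair_support(baskets)
print(support[("bread", "milk")])  # 0.75
```

Pairs with high support, such as bread and milk here, are candidates for being placed together or promoted together.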
We should know which algorithm to use for each business task. Different algorithms applied to the same type of task will produce different results, and some algorithms can produce more than one type of result. For example, the Microsoft Decision Trees algorithm can be used not only for prediction but also to reduce the number of columns, by identifying those that are not useful for the final result.
It is not compulsory to use only one data mining algorithm; algorithms can be combined. One algorithm can be used to explore the data and another to produce the result. For example, a clustering algorithm can be used to form groups from similar data, and a decision tree algorithm can then be used to produce the result. Multiple algorithms can also be used within one solution, for example a regression tree algorithm for financial forecasting and a rule-based algorithm for market analysis.
To help choose the algorithm that best suits a given situation, the following table summarizes the options:
Task | Example | Algorithms to use
Predicting a discrete attribute | Predicting whether a customer will respond to the mail sent by a credit card company to introduce a new offer | Decision tree algorithm; naïve Bayes algorithm
Predicting a continuous attribute | Predicting next year's sales | Decision tree algorithm; time series algorithm
Predicting a sequence | Analysis of a company's web site | Sequence clustering algorithm
Finding groups of common items in transactions | In a supermarket, similar items are kept together on the same shelves | Decision tree algorithm
Finding groups of similar items | Finding similarities in demographic data | Sequence clustering algorithm
Table 1: different types of algorithm
The following diagram shows how to use data mining tools.
Figure 1.6: data mining tools (diagram labels: understand available data; understand business problem; prepare the data)
For a medical analysis of blood pressure levels, we collected the frequency of words repeated across different patients' reports (an unstructured dataset) and formed a table from them (a structured dataset).
Table 2: Patients' report
The table can be summarized by the following rules:
If BP = high then drug A
If BP = low then drug B
If BP = normal and age >= 40 then drug A, or else drug B
With the help of data mining it is easy to deduce such information from a large dataset.
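The three rules above can be written as a small decision function. This Python sketch assumes one reading of the third rule, namely that a patient with normal BP gets drug A when aged 40 or over and drug B otherwise. The function name recommend_drug and the string labels are hypothetical.

```python
def recommend_drug(bp, age):
    """Apply the three BP rules; bp is 'high', 'low' or 'normal'."""
    if bp == "high":
        return "A"
    if bp == "low":
        return "B"
    if bp == "normal":
        # Assumed reading of the third rule: age decides between A and B.
        return "A" if age >= 40 else "B"
    raise ValueError(f"unknown BP level: {bp!r}")


print(recommend_drug("high", 30))    # A
print(recommend_drug("normal", 45))  # A
print(recommend_drug("normal", 25))  # B
```

Rules of this shape are what a decision tree algorithm produces: each path from the root to a leaf corresponds to one if-then rule.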
2. RELATED WORK
Now we come to the main part of the project. The topic of the project is to design an automatic word frequency counter for web pages. The input is a set of pre-processed web pages, and the output is the frequency of the words repeated in those pages. In this section we discuss the related work, other than the main task of designing the program.
As mentioned before, the input of the program is pre-processed web pages. We used the WebKB software to download and pre-process these web pages. How exactly this is done is outlined below.
Pre-processing is the removal of unwanted elements such as full stops and commas. It also includes removing stop words (is, at, an) and stemming, that is, reducing the different forms of a word (such as study, studying, studied) to a single form.
The WebKB algorithm is used to filter unwanted and irrelevant words out of the text document. This increases the accuracy of searching for words.
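To make the pre-processing idea concrete, here is a minimal Python sketch of punctuation removal, stop word removal and a very crude form of stemming. It is not the WebKB implementation: the stop word list, the suffix list and the function names stem and preprocess are assumptions made for illustration, and real stemmers (for example the Porter stemmer) use far more careful rules.

```python
import string

# Tiny illustrative lists; real pre-processing tools use much longer ones.
STOP_WORDS = {"is", "at", "an", "and", "or"}
SUFFIXES = ("ying", "ied", "ing", "ed", "y", "s")  # longest suffixes first


def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, remove punctuation, drop stop words and stem the rest."""
    table = str.maketrans("", "", string.punctuation)
    words = text.lower().translate(table).split()
    return [stem(w) for w in words if w not in STOP_WORDS]


print(preprocess("Study is fun, and studying at home is studied."))
# ['stud', 'fun', 'stud', 'home', 'stud']
```

Note how study, studying and studied all collapse to the same stem, so a later frequency count treats them as one word.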
Pre-processing of web pages
Today, because of the sheer volume of information, datasets contain a lot of irrelevant, inconsistent and noisy data. It is therefore necessary to pre-process the data in order to maintain accuracy. There are many pre-processing techniques; the one we used for our project is the pre-processing of web pages.
Block diagram for data pre-processing
Figure 2.1: data processing
The log file is where the web pages that the user wants pre-processed are recorded. Each time the user accesses a web page, the access is stored in the log, along with information such as the client's IP address or the URL of the website.
Unnecessary records are removed to obtain the defined pattern. In the case of web pages, two main kinds of record are removed:
Records of graphics, video and format information.
Records with failed HTTP status codes.
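The two removal rules can be sketched in Python as a simple filter over log entries. The log format used here (client IP, URL, HTTP status) and the list of ignored file extensions are assumptions for illustration. Treating status codes of 400 and above as failed follows the HTTP convention that 4xx and 5xx codes signal errors.

```python
# Assumed file types that count as graphics, video or format information.
IGNORED_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".css", ".mp4")


def clean_log(entries):
    """Keep only successful requests for actual pages."""
    kept = []
    for ip, url, status in entries:
        if url.lower().endswith(IGNORED_EXTENSIONS):
            continue  # graphics, video or format record
        if status >= 400:
            continue  # failed HTTP status code
        kept.append((ip, url, status))
    return kept


# Hypothetical log entries: (client IP, URL, HTTP status).
log = [
    ("10.0.0.1", "/index.html", 200),
    ("10.0.0.1", "/logo.gif", 200),
    ("10.0.0.2", "/missing.html", 404),
    ("10.0.0.2", "/courses.html", 200),
]

print(clean_log(log))
# [('10.0.0.1', '/index.html', 200), ('10.0.0.2', '/courses.html', 200)]
```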
User and session identification
The system has to know which user accessed which web pages and record this in the log, even when many users access the same web pages. Users are identified with the help of the client's IP address.
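Identification by IP address can be sketched as grouping the log entries by client. In this Python illustration the entry format (client IP, URL) and the function name pages_by_user are assumptions. In practice one IP address can be shared by several people, so this is only a first approximation of a user.

```python
from collections import defaultdict


def pages_by_user(entries):
    """Group visited URLs by client IP address, preserving visit order."""
    users = defaultdict(list)
    for ip, url in entries:
        users[ip].append(url)
    return dict(users)


# Hypothetical visits: two clients requesting the same page.
visits = [
    ("10.0.0.1", "/index.html"),
    ("10.0.0.2", "/index.html"),
    ("10.0.0.1", "/courses.html"),
]

print(pages_by_user(visits))
# {'10.0.0.1': ['/index.html', '/courses.html'], '10.0.0.2': ['/index.html']}
```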
This is the final stage of pre-processing. Once it is confirmed that the right web page has been requested by the right browser, the pre-processing methods are applied to produce the output. After this whole process we have the pre-processed web pages.
The following diagram gives a clear picture of what a pre-processed text document looks like.
Figure 2.2: screen shot showing the image of the pre-processed document
3. PROPOSED WORK
Now we come to the main part, the programming. As mentioned before, we have to design an automated frequency counter that accepts n text documents from the user and outputs the frequency of the words repeated in those n documents.
We wrote this program in the C language. A detailed description of the program is given below.
There are 10 functions in total, including the main function. Before the program starts, we define the structure node and specify the size of the linked list through that node. We then turn to the main function. The program offers four options:
Word counter
Write result to file
Display the current result
Quit
All of these cases are handled using a switch statement.
Case 1: word counter. This case uses five functions in total. First we input the file name and read the file using the filereader function, which checks whether the file is available. If it is not, we exit the function; if it is, we continue. Next comes the wordsplitter function, which splits the text into words, including white space, and stores them in an array. Next is the rtrim function, where the white space is replaced with the null character. Next is the frequencychecker function, where stop words are compared and removed and each word is checked to see whether it has already occurred. If it has, its count is increased; if it has not, the unique word is added with the help of the next function, add. After the unique words have been added and the repeated words counted for one text document, the user has the option of inputting more text documents. Once the function exits, we ask the user whether they want to input another text document. If the user says yes, the steps above are repeated; if no, we leave the word counter. The following diagram gives a better picture of the functions described above.
Flow chart labels: check the availability of file; want to enter more files
Figure 3.1: flow chart of
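The flow of the word counter can be sketched in a few lines of Python. This is only an illustration of the logic, not the report's program, which is written in C and stores words in a linked list. Here a dictionary plays that role, and the function names count_words and count_files as well as the stop word list are hypothetical.

```python
STOP_WORDS = {"is", "at", "an", "and", "or", "the"}  # assumed stop word list


def count_words(text, counts=None):
    """Split text on white space, skip stop words and count the rest."""
    counts = {} if counts is None else counts
    for word in text.lower().split():
        if word in STOP_WORDS:
            continue
        # A repeated word has its count increased; a new word starts at 1.
        counts[word] = counts.get(word, 0) + 1
    return counts


def count_files(paths):
    """Accumulate word counts over several files, reporting missing ones."""
    counts = {}
    for path in paths:
        try:
            with open(path, encoding="utf-8") as handle:
                count_words(handle.read(), counts)
        except FileNotFoundError:
            print(f"File not found: {path}")
    return counts


counts = count_words("data mining is the mining of data")
print(counts)  # {'data': 2, 'mining': 2, 'of': 1}
```

Passing the same dictionary to every call is what lets the counts accumulate over several input documents, mirroring the option of reading another file.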
Case 2: write result to file. For convenience, the result is written to a text document so that the output can be used for further processing in data mining. Only one function, writetofile, is used here. We ask the user whether to use the default result produced by case 1. If the answer is yes we go ahead; if it is no, we ask the user to input another text file.
Case 3: display the current result. This case uses only one function, display. It lets the user check that the output is correct before it is written to the text document.
Case 4: quit. This exits the switch statement.
The picture below gives a detailed explanation of the entire program.
Finally, what gives meaning to the entire program, and what the user can understand, is the output. As mentioned before, the output is written to a text document so that it can be used further for text mining.
Before running the program, we have to compile it to check for errors. The screen looks like this after compiling.
Figure 4.1: screen shot showing the compilation of the program
After compiling, we run the program.
Figure 4.2: screen shot showing the options displayed after running the program
The user has to choose one of the following options. We will see the result of each option when chosen.
If the user chooses 1, then:
Figure 4.3: screen shot showing what happens if the user chooses option1
If the answer to "Do you want to read another file?" is "y", the program again asks for the file name.
If the file name is entered incorrectly, the program displays "file not found" and asks the user to try again, as shown below.
Figure 4.4: screen shot showing the scenario of what if the user types incorrect file name
If the user chooses "no", the same word frequency counter options are displayed again; if the user chooses "yes", the program asks for the file name.
If the user chooses 2, then:
Figure 4.5: screen shot showing what if the user chooses option2
If the answer to "Do you want to use the default result?" is "no", the program asks for the name of the file to which the result should be written.
Now we check how the result has been written to the text document.
Figure 4.6: screen shot showing the output of the input pre-processed html file
The above screen shot shows a sample of the output for the input text document. It gives the frequency of the words repeated in the text document. This output can now be used for further processing.
If the user chooses 3, the current file's result is displayed on the screen. This option only serves to make sure the result is correct before we write it to the text document.
If the user chooses 4, the program quits. There is also a default case: if the option chosen is invalid, the program displays "wrong choice" and asks the user to choose one of the given options.
5. FUTURE WORK
The future of this project lies in applying data mining tools to the output obtained. For example, in a supermarket, the customers' previous purchasing history can be used to collect the frequency with which each item is purchased compared with the others. By applying data mining tools (e.g. decision trees) to these frequencies, the supermarket manager can offer slow-selling items at a discounted rate so that they sell faster. This is how our project can be applied in everyday life.