Data Mining is the process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions (Connolly, 2004). This report will explore the concept of data mining and give insight into the main operations associated with its techniques: predictive modelling, database segmentation, link analysis, and deviation detection.
The concept of Data Mining is growing in popularity across business activity in general. We are living in an information era, and more and more data is being generated in every aspect of life you can think of. Every time you swipe your grocery card to get a discount on whatever products you buy, data is being recorded in a database, and most transactions you make involve some sort of data capture. Organizations are storing, processing and analysing more data than at any time in history, and this trend will continue to grow.
Data Mining incorporates mathematical methods, which may include mathematical equations, algorithms, traditional logistic regression, neural networks, segmentation, classification, clustering, etc. All of these methods rely on mathematics. Data Mining is applicable across industry sectors. Generally, wherever we have processes and wherever we have data, it is the application of these powerful mathematical techniques that will extract trends and patterns.
What is Data Mining?
Data Mining Tasks
In this chapter I will describe some of the core Data Mining tasks. For each task I will give an example to illustrate its function. The tasks do not need to be used singly; rather, they can be combined to produce a more relevant output.
Data Mining tasks are divided into predictive tasks and descriptive tasks.
"Classification maps data into predefined groups or classes" (Dunham, 2002). Those groups or classes are defined before the actual data analysis. A classic example of a classification application is determining whether to authorize a credit card purchase.
The example below illustrates a classification problem:
Airport security screening points use pattern recognition systems to find potential criminals or terrorists. Those systems can scan any person crossing the airport hall to identify his or her distinctive patterns (eyes, shape of the head, mouth size, etc). Those patterns can then be compared with many other patterns in the database to see whether any of them matches the scanned person.
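To make the idea of mapping data into predefined classes more concrete, the credit card case can be sketched as a nearest-neighbour classifier: a new purchase receives the label of the most similar past purchase. This is only one possible classification method, and the features, the labels and the function name classify_nearest are assumptions made for illustration; they do not come from any of the cited sources.

```python
import math

def classify_nearest(sample, labelled_examples):
    """Return the label of the stored example closest to `sample`."""
    best_label, best_distance = None, math.inf
    for features, label in labelled_examples:
        distance = math.dist(sample, features)  # Euclidean distance
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label

# Hypothetical past purchases, already labelled with a predefined class.
# Features (assumed): (amount in pounds, distance from home in km).
history = [
    ((20.0, 2.0), "authorize"),
    ((45.0, 5.0), "authorize"),
    ((900.0, 4000.0), "decline"),
    ((1500.0, 6500.0), "decline"),
]

decision = classify_nearest((30.0, 3.0), history)
print(decision)
```

The classes ("authorize" and "decline") exist before any new purchase is analysed, which is what distinguishes classification from clustering. A real system would scale the features and use far more examples.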
Regression is a technique that fits an equation to a given dataset, assuming the data follows some kind of function, such as linear or logistic. Linear regression is the simplest form: it uses the straight-line formula (y = mx + b) and determines the values of m and b in order to predict the value of y for a given value of x.
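As a minimal sketch of how m and b in y = mx + b can be determined, the code below uses ordinary least squares, which is one common way of fitting a straight line. The data points and the function name fit_line are invented for illustration.

```python
def fit_line(xs, ys):
    """Fit y = m*x + b to the points by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    m = s_xy / s_xx
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - m * mean_x
    return m, b

# Made-up observations that lie exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

m, b = fit_line(xs, ys)
prediction = m * 5.0 + b  # predicted y for a new value x = 5
print(m, b, prediction)
```

With real data the points will not lie exactly on a line, and the fitted m and b are then the values that minimise the squared vertical distances between the points and the line.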
Time Series Analysis
With Time Series Analysis, an attribute is analysed as it changes over time. The data is usually recorded at evenly spaced intervals (every second, minute, hour, day, etc). An example of a time series would be the daily closing values of the Gold index price. Time series can be used to forecast events based on known past data. An example of time series forecasting would be predicting the stock price of a given company based on its past performance.
Time Series: Gold daily value over past 8 months (Boursorama.com)
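Forecasting from past values can be sketched with a simple moving average, where the next value is estimated as the mean of the most recent observations. This is only one of many forecasting methods, and a deliberately basic one; the prices below are made up and are not the Boursorama figures.

```python
def moving_average_forecast(values, window):
    """Forecast the next value as the mean of the last `window` values."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    recent = values[-window:]
    return sum(recent) / window

# Hypothetical daily closing prices, recorded at evenly spaced intervals.
closing_prices = [1210.0, 1215.0, 1220.0, 1218.0, 1224.0]

forecast = moving_average_forecast(closing_prices, window=3)
print(round(forecast, 2))
```

A larger window smooths out daily noise but reacts more slowly to a genuine change in trend.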
Clustering is a technique that seeks to divide cases into clusters that share similar qualities. The goal is simply to explore the structure of the data by sorting it into groups (clusters) whose members share similar characteristics. "The greater the similarity within a group and the greater the difference between groups, the better will be the clustering." (Tan, 2006)
Different ways of clustering the same set of points
(a) Original points
(b) two clusters
(c) four clusters
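One way to sort points into similar groups is the k-means algorithm, sketched below under simplifying assumptions: the starting centroids are supplied by hand and a fixed number of iterations is run. The essay does not name a specific clustering algorithm, so k-means is used here purely as an illustration, and the points are invented.

```python
import math

def k_means(points, centroids, iterations=10):
    """Group points around centroids by repeated assign-and-update steps."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Made-up 2-D points forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

centroids, clusters = k_means(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

Passing a different number of starting centroids yields a different number of clusters for the same points, which is the effect the figure above illustrates with two and four clusters.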
Association Analysis is very important and one of the most used tasks within the Data Mining domain. It is very useful for uncovering relationships among data and for identifying specific types of associations. A very common application of association rules is analysing supermarket baskets to find associations such as "customers who bought milk also bought cheese".
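The strength of a rule such as "milk implies cheese" is commonly measured with two numbers: support (how often the items appear together) and confidence (how often the rule holds when the first item is present). The sketch below computes both for a handful of invented baskets; the function names are illustrative only.

```python
def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing `antecedent`, the fraction that also contain `consequent`."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

# Hypothetical supermarket baskets.
baskets = [
    {"milk", "cheese", "bread"},
    {"milk", "cheese"},
    {"milk", "eggs"},
    {"bread", "eggs"},
    {"milk", "cheese", "eggs"},
]

rule_support = support({"milk", "cheese"}, baskets)
rule_confidence = confidence({"milk"}, {"cheese"}, baskets)
print(rule_support, rule_confidence)
```

In these baskets, milk and cheese appear together in three of the five baskets, and three of the four baskets containing milk also contain cheese. A rule is usually considered interesting only when both measures exceed thresholds chosen by the analyst.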
Applications for Data Mining
Data mining methodologies can be used in a number of different environments, such as manufacturing process control, fraud detection, risk factors in medical diagnosis, image recognition, and many others. Below are some of the common domains in which Data Mining can be applied:
Advertising: When we talk about advertising and data, we think about Google. The search engine company works with data at the petabyte scale, and it uses a non-traditional way of organizing its data. Google applies mathematical models to an enormous amount of data, and it is without doubt one of the most profitable companies in the world. "Google conquered the advertising world with nothing more than applied mathematics. It didn't pretend to know anything about the culture and conventions of advertising - it just assumed that better data, with better analytical tools, would win the day. And Google was right." (Anderson, 2008)
Shopping: Supermarkets have been keeping track of customer purchases for a long time, but only with the popularization of Data Mining could this data really be put to use. For example, Tesco Clubcard data can be analysed to predict what customers will buy, how they will pay, and even how many calories they will consume.
Education: Data Mining techniques can be applied in educational environments to analyse student learning behaviour and performance during the academic year, and even to predict how a student will perform in an exam.
Fraud Detection: Another relevant field of application, fraud detection affects different industries such as banking (credit card fraud detection, illegal transactions) and insurance (checking for false claims).
Risk Evaluation: Risk Evaluation estimates the risks connected with future decisions. For instance, a bank can develop a predictive model, based on past observations, to establish whether it is appropriate to give a mortgage to a customer.
Text Mining: Text Mining attempts to gather meaningful information from different kinds of text in order to classify documents, books, e-mails and web pages. An example of a text mining application is the creation of filters for e-mail messages and newsgroups.
Image Recognition: Useful for recognizing characters, identifying human faces, and uncovering associations and anomalies. An example application is detecting suspicious behaviour through surveillance video cameras.
Web Mining: Web Mining applications are made to analyse clickstreams, the sequences of visits that users make to websites. This is useful for e-commerce websites, as it allows them to offer customized pages to customers.
Limitations of Data Mining
Tools for Data Mining are very powerful, but they require very skilled specialists who can prepare the data and understand the output.
Data Mining brings out patterns and relationships, but the significance and validity of those patterns must be judged by the user.
Privacy and Ethics Concerns
Like any other technology, Data Mining has its pitfalls, notably privacy and ethical concerns. There are many arguments about how privacy should be addressed. Some believe that Data Mining is ethically neutral; however, the way Data Mining is being used nowadays is raising many concerns, as advertising companies are buying data on customer spending and behaviour at the cost of reduced privacy.
There are many ways in which data mining can compromise privacy. To start with, data mining requires extensive data preparation, which can uncover previously unknown information or patterns. For instance, many datasets from different sources can be put together for the purpose of analysis (called data aggregation). The threat arises when someone who has access to this data is able to identify or track down specific individuals.
Concerns have been raised about how much organisations know about our personal lives. For example, someone who aggregates datasets from many different sources, such as organisations, social networks, etc., would know everything about your life: your full address, telephone number, age, how many cars you have, which cars you have, what type of house you live in, what you do, what you eat, what you drink, where you go, how much money you spend, your religion and beliefs, your likes and dislikes, etc. The list is endless. What could happen if all this aggregated data fell into the wrong hands? The information we have been posting on the Internet could be used against us. For instance, the USA data mining industry has software that monitors social media on the internet (the so-called "Pre-Crimes"), gathering "information about individuals that may ultimately transform the American workplace into a hopeless escape" (Burghart, 2010).
According to Burghart, in his article in theSkyValleyChronicle.com, "Another company deploys an automation software that slogs through Facebook, Twitter, Flickr, YouTube, LinkedIn, blogs, and thousands of other sources, to develop a report on the 'real you' -- not the carefully crafted you in your resume."
Another recent problem arose when the personal details of 100 million Facebook user profiles were scanned and distributed over the internet. In my opinion this is just the beginning of a much greater problem that will arise with time.
"The Data Mining idea will grow in popularity, because data continue to grow. Think about social networking, such as Tweeter and Facebook. It is data that describe people, and what they do, what they are. Data is generated when you buy, sell, or even when you go to work data is being downloaded every time you swap your Oyster card into the underground system. More and more we having data gathering and data capturing, and it is the way it is in this information economy. The way to extract strategic information from that data. Those data resources. This is Data Mining." (Dalio, 2010) - http://www.telegraph.co.uk/technology/7963311/10-ways-data-is-changing-how-we-live.html