Data Gathering Exploration And Preparation Computer Science Essay

The first step is to gather data from a data warehouse or data marts. A data warehouse is a repository of information collected from multiple sources and stored under a single unified schema. It is usually modeled by a multidimensional database structure and contains data from several databases that span the entire organization (across branches, states, regions, or even countries). To facilitate decision making, the data in a data warehouse provide information from a historical perspective and are typically summarized. A data mart is a departmental subset of a data warehouse that focuses on selected subjects and is therefore narrower in scope. A large data warehouse allows decision support technologies such as data mining to help knowledge workers (managers, analysts, and executives) quickly and conveniently obtain an overview of the data and make strategic decisions based on the information in the warehouse. Data warehouse systems are valuable tools in today's competitive world; in the last several years many firms have spent millions of dollars building enterprise-wide data warehouses to support decision support technologies such as data mining.

2.1.2 Descriptive Data: Adding Metadata to Data

The second step is descriptive characterization of the data, such as adding metadata to the data warehouse. Descriptive data is a summarization of the general characteristics or features of a target class of data, which is essential for obtaining an overall picture of the data. Descriptive data summarization helps us study the general characteristics of the data and identify the presence of outliers or noise, which serves as a foundation for data processing and integration. These descriptive statistics help us better understand both quantitative and qualitative data. Knowing what kind of measure is being computed can help knowledge workers choose an efficient implementation for analysis.
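
As a minimal sketch of descriptive data summarization, the following Python snippet computes a few summary statistics and flags possible outliers. The sample values, the function name summarize, and the 1.5 x IQR rule for flagging outliers are assumptions chosen for illustration; the essay does not prescribe a specific rule.

```python
import statistics

def summarize(values):
    """Return basic descriptive statistics and values flagged as possible outliers."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr    # assumed outlier rule: 1.5 x interquartile range
    high_fence = q3 + 1.5 * iqr
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "outliers": [v for v in values if v < low_fence or v > high_fence],
    }

# Hypothetical customer ages; 95 sits far from the rest of the data.
ages = [21, 23, 24, 25, 26, 27, 29, 95]
summary = summarize(ages)
print(summary["median"], summary["outliers"])
```

The mean is pulled upward by the single extreme value while the median is not, which is one reason a summary like this is useful before any further processing.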

2.1.3 Assembling Target Data Set

The fourth step is assembling the target data sets for data mining. Data mining often requires integrating data from multiple sources into a single data warehouse. Several issues must be considered during data integration, such as schema integration and object matching. These give rise to the entity identification problem: deciding whether attributes from different sources refer to the same thing, using attribute metadata such as names, data types, and the handling of null values. Redundancy is another important issue: combining several data sources can introduce redundant attributes into the resulting data set, which waste time, processing, and storage capacity. Some redundancies can be detected by correlation analysis, which measures how strongly one attribute implies another based on the available data. Another important issue in data integration is the detection and resolution of data value conflicts, such as differences in representation, scaling, or encoding. Finally, data transformation can involve smoothing, aggregation, generalization, and normalization to consolidate the data into forms appropriate for data mining.
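
The correlation analysis and normalization mentioned above can be sketched as follows. This is an illustrative example rather than a prescribed method: the Pearson correlation coefficient and min-max normalization are common choices that the essay does not name explicitly, and the attribute names and values are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric attributes."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Two sources store the same price in different units, so one attribute is redundant.
price_dollars = [10, 20, 30, 40]
price_cents = [1000, 2000, 3000, 4000]
r = pearson(price_dollars, price_cents)
print(round(r, 6))            # a value near 1 (or -1) suggests redundancy
print(min_max(price_dollars))
```

A coefficient close to 1 or -1 indicates that one attribute can largely be derived from the other, so one of them may be dropped during integration.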

2.1.4 Cleansing Target Data Set

Real-world data tend to be incomplete, noisy, and inconsistent. The third step is to implement a data cleaning process. Analysis and decisions based on uncleaned data will be inaccurate, which can be very costly for a company. One part of data cleaning is smoothing noisy data and filling in missing values. Noise is a random error or variance in a measured variable, and it is addressed with data smoothing techniques such as smoothing by bin means, linear regression, or clustering, which can also expose troublesome outliers. Missing values can be handled by filling them in with a global constant. Another part of data cleaning is discrepancy detection. Discrepancies can be caused by several factors, including poorly designed data entry forms, human data entry errors, and data decay. There may also be inconsistencies due to data integration, such as different names in different databases for the same data. A number of tools can help with discrepancy detection. Data scrubbing tools use simple domain knowledge to detect errors and make corrections in the data. Data auditing tools find discrepancies by analyzing the data to discover rules and relationships and then detecting data that violate those conditions. Finding and resolving these discrepancies leads to more effective and efficient data mining.
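
Two of the cleaning techniques named above, filling missing values with a global constant and smoothing by bin means, can be sketched in a few lines of Python. The function names, the equal-size binning, and the sample values are assumptions made for illustration.

```python
def fill_missing(values, constant):
    """Replace missing entries (None) with a global constant."""
    return [constant if v is None else v for v in values]

def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into equal-size bins, and replace
    every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed

# Hypothetical data: an attribute with gaps, and a noisy price attribute.
occupations = ["clerk", None, "engineer", None]
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]

print(fill_missing(occupations, "Unknown"))
print(smooth_by_bin_means(prices, bin_size=3))
# -> [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Filling with a global constant is simple but can mislead a mining algorithm into treating the constant as a meaningful value, so it is best seen as a baseline.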

2.1.5 Data Reduction: Feature Extraction (Vectors)

Data mining on data selected from a data warehouse without prior data reduction would involve huge data sets. Complex data analysis and mining on huge amounts of data can take a very long time and consume most of the available processing capacity for a single data set. Companies deal with hundreds of data sets per day, which makes such analysis impractical or infeasible. The fifth step is to apply data reduction techniques, which reduce a data set to a much smaller volume while maintaining the integrity of the original data. Mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results. Strategies for data reduction include data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube. Another strategy is attribute subset selection, where irrelevant, weakly relevant, or redundant attributes are detected and removed. In discretization and concept hierarchy generation, raw data values for attributes are replaced by ranges or higher conceptual levels. In numerosity reduction, the data are replaced by smaller data representations, such as parametric models.
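
To make two of these strategies concrete, the sketch below rolls quarterly sales up to yearly totals (a simple form of aggregation) and replaces raw ages with higher-level labels (discretization). The figures, the age boundaries, and the label names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def aggregate_by_year(quarterly_sales):
    """Roll (year, quarter) -> amount up to year -> total amount."""
    totals = defaultdict(int)
    for (year, _quarter), amount in quarterly_sales.items():
        totals[year] += amount
    return dict(totals)

def discretize_age(age):
    """Replace a raw age with a higher-level concept (assumed boundaries)."""
    if age < 30:
        return "youth"
    if age < 60:
        return "middle_aged"
    return "senior"

quarterly_sales = {
    (2008, "Q1"): 224, (2008, "Q2"): 408, (2008, "Q3"): 350, (2008, "Q4"): 586,
    (2009, "Q1"): 300, (2009, "Q2"): 416,
}
print(aggregate_by_year(quarterly_sales))   # six rows reduced to two
print([discretize_age(a) for a in (22, 45, 71)])
```

Both operations shrink the data that a mining algorithm has to scan while keeping the information needed for analysis at the chosen level of detail.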

2.2 Stage 2: Data Mining, Pattern Discovery and Extraction

2.2.1 Itemset Mining Methods

Frequent patterns are patterns that appear frequently in a data set. For example, a set of items such as eggs and bread that appear together frequently in a transaction data set is a frequent itemset. Finding such frequent patterns plays an essential role in mining associations, correlations, and many other interesting relationships among data. Frequent pattern mining has therefore become an important data mining task, and it can guide further analysis toward interesting associations or correlations. In a business setting, itemset mining can reveal customer preferences and tastes that inform strategic management decisions. A related method is sequential pattern mining, which searches for frequent subsequences in a sequence data set, where a sequence records an ordering of events. Another related method is structured pattern mining, which searches for frequent substructures in a structured data set. In its simplest form, itemset mining is single-dimensional and Boolean: it considers only whether an item is present in or absent from a transaction.
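
The idea of a frequent itemset can be illustrated with a small brute-force counter. This is only a sketch: it enumerates every itemset up to a given size rather than using an efficient algorithm such as Apriori, and the transactions, the function name, and the minimum support threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets whose support (the fraction
    of transactions containing them) is at least min_support."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    n = len(transactions)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

transactions = [
    ["eggs", "bread", "milk"],
    ["eggs", "bread"],
    ["bread", "butter"],
    ["eggs", "bread", "butter"],
]
print(frequent_itemsets(transactions, min_support=0.75))
```

With a threshold of 0.75, only bread, eggs, and the pair (bread, eggs) qualify, since each appears in at least three of the four transactions.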

2.2.2 Bayesian Classification

Bayesian classifiers are statistical classifiers that use Bayes' theorem to predict the probability that a given tuple belongs to a particular class. They can exhibit high accuracy and speed when applied to large databases. Naive Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the values of the other attributes. A very simple example is evaluating the hypothesis that a customer will buy a high-end computer, given the customer's age and income. Bayesian classifiers are also useful in that they provide a theoretical justification for other classifiers that do not explicitly use Bayes' theorem; for example, under certain assumptions many neural network and curve-fitting algorithms can be shown to output the maximum a posteriori hypothesis. Bayesian classification algorithms can thus help analyze a large number of scenarios and find hidden trends in the data.
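
A naive Bayesian classifier for the age-and-income example can be sketched as below. The training tuples are invented for illustration, the function names are hypothetical, and the sketch omits refinements such as Laplace smoothing, so an attribute value never seen with a class drives that class's score to zero.

```python
from collections import Counter, defaultdict

def train(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (class, attribute index) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(label, i)][value] += 1
    return class_counts, value_counts

def predict(model, row):
    """Score each class as P(class) * product of P(value | class)."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total
        for i, value in enumerate(row):
            score *= value_counts[(label, i)][value] / count
        scores[label] = score
    return max(scores, key=scores.get), scores

# Hypothetical (age, income) tuples and whether the customer bought a computer.
rows = [("young", "low"), ("young", "high"), ("middle", "high"),
        ("senior", "high"), ("senior", "low"), ("middle", "low")]
labels = ["no", "no", "yes", "yes", "no", "yes"]

model = train(rows, labels)
label, scores = predict(model, ("senior", "high"))
print(label, scores)
```

The scores are proportional to the posterior probabilities, so the class with the larger score is the predicted one.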

2.2.3 Market Basket Analysis

Frequent itemset mining leads to the discovery of associations and correlations among items in large transactional or relational data sets. Discovering interesting correlations among huge numbers of business transaction records can support many business decision-making processes, such as catalog design, cross-marketing, and customer shopping behavior analysis. Market basket analysis is a typical tool used by companies to analyze customer buying habits by finding associations between the different items that customers place in their "shopping baskets". It can help retailers develop marketing strategies by showing which items are frequently purchased together. That insight can in turn inform store layouts, advertising, and inventory levels for popular products. The patterns found by market basket analysis can be represented as association rules, whose strength is measured by support and confidence. Support is the fraction of all transactions that contain every item in the rule, and confidence is the fraction of transactions containing the rule's antecedent that also contain its consequent. For example, a rule might state that a customer who purchases a computer is also likely to purchase antivirus software.
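
Support and confidence for the computer and antivirus rule can be computed directly from a list of baskets. The baskets below are made up for illustration, and the helper names are hypothetical.

```python
def count(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t))

def support(transactions, itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(transactions, itemset) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Fraction of transactions with the antecedent that also hold the consequent."""
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / count(transactions, antecedent)

baskets = [
    {"computer", "antivirus"},
    {"computer", "antivirus", "printer"},
    {"computer"},
    {"printer"},
    {"computer", "antivirus"},
]

# Rule: computer => antivirus
print(support(baskets, {"computer", "antivirus"}))       # 3 of 5 baskets
print(confidence(baskets, {"computer"}, {"antivirus"}))  # 3 of the 4 computer baskets
```

In practice a rule is usually reported only if both values exceed thresholds chosen by the analyst.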

2.2.4 Data Classification

Databases hold massive amounts of information that can be used for intelligent decision making. Classification is one form of data analysis that can be used to extract models describing important data classes. Data classification has numerous applications, including fraud detection, target marketing, performance prediction, manufacturing, and medical diagnosis. Data classification is a two-step process. The first step is the learning step, in which a classification algorithm builds a classifier by learning from a training set made up of tuples and their associated, predetermined class labels. Once the learning step is complete, a test set of labeled tuples that were not used in training is used to estimate the accuracy of the classifier. If the accuracy is acceptable, the process moves on to the second step, in which the model is used to classify unknown data. For example, the classification rules learned from previous loan applications can be used to approve or reject new or future loan applicants.
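
The two-step process can be sketched with a deliberately simple learner. The one-attribute majority-vote model below is a stand-in for a real classification algorithm, and the loan data, attribute values, and function names are all hypothetical.

```python
from collections import Counter, defaultdict

def learn(training_set):
    """Learning step: map each attribute value to its most common class label."""
    by_value = defaultdict(Counter)
    for value, label in training_set:
        by_value[value][label] += 1
    return {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}

def classify(model, value, default="reject"):
    """Classification step: look up the label for a new, unlabeled tuple."""
    return model.get(value, default)

def accuracy(model, test_set):
    """Fraction of held-out labeled tuples that the model classifies correctly."""
    correct = sum(classify(model, value) == label for value, label in test_set)
    return correct / len(test_set)

# Hypothetical (income level, decision) pairs from past loan applications.
training_set = [("high", "approve"), ("high", "approve"), ("low", "reject"),
                ("medium", "approve"), ("low", "reject"), ("medium", "reject"),
                ("medium", "approve")]
test_set = [("high", "approve"), ("low", "reject"),
            ("medium", "reject"), ("low", "reject")]

model = learn(training_set)
print(accuracy(model, test_set))   # estimated on data not used for learning
print(classify(model, "high"))     # decision for a new applicant
```

Keeping the test set separate from the training set matters because accuracy measured on the training data itself tends to be optimistic.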

2.2.5 Classification by Decision Tree Induction

A decision tree is a flowchart-like tree structure in which each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label. The topmost node is called the root node. Decision trees can easily be converted to classification rules. Decision tree induction is one of the most popular data mining tools that companies use. The construction of decision tree classifiers does not require any domain knowledge or parameter setting, and is therefore appropriate for exploratory knowledge discovery. The learning and classification steps of decision tree induction are simple and fast. Decision tree classifiers generally have good accuracy, although how well they perform depends on the data at hand. One constraint is that the target data must be fixed and rigidly structured before decision tree induction can be applied.
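
The structure described above, and its conversion to classification rules, can be sketched with a hand-built tree. The tree is not induced from data here; its attributes, outcomes, and class labels are hypothetical, and the tuple-and-dict representation is just one convenient assumption.

```python
# Internal node: (attribute, {outcome: subtree}); leaf node: a class label string.
tree = ("age", {
    "youth": ("student", {"yes": "buys", "no": "does_not_buy"}),
    "middle_aged": "buys",
    "senior": ("credit", {"fair": "does_not_buy", "excellent": "buys"}),
})

def classify(node, record):
    """Follow the branch matching each attribute test until a leaf is reached."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[record[attribute]]
    return node

def to_rules(node, conditions=()):
    """Turn every root-to-leaf path into one IF-THEN rule."""
    if isinstance(node, str):
        antecedent = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {antecedent} THEN {node}"]
    attribute, branches = node
    rules = []
    for outcome, subtree in branches.items():
        rules.extend(to_rules(subtree, conditions + ((attribute, outcome),)))
    return rules

print(classify(tree, {"age": "youth", "student": "yes"}))
for rule in to_rules(tree):
    print(rule)
```

Each leaf yields exactly one rule, so a tree with five leaves converts to five rules whose antecedents are the tests along the path from the root.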

2.2.6 Association of Rule Learning

Rules are a good way of representing information or bits of knowledge. A rule-based classifier uses a set of IF-THEN rules for classification. An IF-THEN rule is a conditional expression whose condition evaluates to true or false. The "IF" part of a rule is known as the rule antecedent, and the "THEN" part is the rule consequent. The rule antecedent consists of one or more attribute tests that are logically ANDed together. If the condition in the rule antecedent holds true for a given tuple, the rule consequent applies. If the condition does not hold, the rule consequent does not apply and another rule (or a default outcome) is used instead. For example, consider purchasing concert tickets at the park. The rule antecedent ("IF") is that the person's age is over twenty, and the rule consequent ("THEN") is that the person is allowed to purchase tickets. Any person purchasing a ticket must show ID indicating their age. If the condition in the rule antecedent holds true (age > 20), the rule consequent applies (the person is allowed to purchase tickets). If the condition holds false, the rule consequent does not apply (the person cannot purchase tickets). IF-THEN rules can be extracted directly from the training data using a sequential covering algorithm, in which rules are learned one at a time in sequential order. Sequential covering algorithms are the most widely used approach to mining classification rules. While a rule is being learned, rule quality measures are used to check whether adding a given attribute test to the current rule's condition would result in an improved rule. For example, FOIL (First Order Inductive Learner) is a sequential covering algorithm that learns first-order logic rules, which are more complex because they contain variables. As a loose analogy, just as the order of operations determines which part of a mathematical expression (multiplication, division, powers, fractions, logarithms) is evaluated next, a rule quality measure determines which attribute test is added to a rule next.
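
The concert ticket rule can be written as a tiny rule-based classifier. The representation of an antecedent as a list of (attribute, operator, value) tests, the label strings, and the function names are assumptions made for this sketch.

```python
import operator

# Comparison operators allowed in an attribute test.
OPS = {">": operator.gt, "<": operator.lt, "==": operator.eq}

def antecedent_holds(tests, record):
    """An antecedent holds only if all of its attribute tests are true (logical AND)."""
    return all(OPS[op](record[attribute], value) for attribute, op, value in tests)

def classify(rules, record, default):
    """Return the consequent of the first rule whose antecedent holds,
    or the default outcome if no rule applies."""
    for tests, consequent in rules:
        if antecedent_holds(tests, record):
            return consequent
    return default

# IF age > 20 THEN may_purchase_ticket
rules = [([("age", ">", 20)], "may_purchase_ticket")]

print(classify(rules, {"age": 25}, default="may_not_purchase_ticket"))
print(classify(rules, {"age": 19}, default="may_not_purchase_ticket"))
```

A sequential covering algorithm would learn the entries of such a rule list from training data one at a time; here the single rule is written by hand.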

2.3 Stage 3: Validating Data Mining Results, Evaluation and Forecasting

2.3.1 Data Visualization and Reporting of Data Mining Results

To use a data mining tool efficiently, people must be able to understand the data being presented. If users cannot understand what the data mining results are or what they imply, the effort has effectively failed. The patterns and connections discovered through data mining are not always straightforward, so the hardest challenge is to present the information in a creative way that is easy to understand. Data visualization therefore plays a key role in data mining, and presenting the data well requires creativity and imagination. The best way to present a model in a report is graphically, so that trends and forecasts can be visualized. Models that orient the user with familiar devices such as landmarks and maps are easy to use. An important part of understanding data mining models is the context in which they are applied: once users understand the connection between the models and the business strategy, they can use the models to help solve real-world problems.

2.3.2 Linear & Non Linear Regression

Regression analysis can be used to model the relationship between one or more independent (predictor) variables and a dependent (response) variable. The values of the predictor variables are known, and the value of the response variable is what needs to be predicted. Many problems can be solved by linear regression, and even more can be tackled by transforming nonlinear problems into linear ones. Straight-line regression analysis involves a response variable, y, and a single predictor variable, x. It is the simplest form of regression, and models y as a linear function of x (y = mx + b), where the slope m and the intercept b are regression coefficients estimated from the data. When a straight line is fitted to plotted data, the points do not all fall on the line, but if the overall pattern suggests a linear relationship between x and y, the fitted line can be used to make predictions. Multiple linear regression is an extension of straight-line regression that involves more than one predictor variable. For nonlinear regression, polynomial regression is often of interest when there is just one predictor variable; it can be modeled by adding polynomial terms to the basic linear model. Other models include generalized linear models, which represent the theoretical foundation on which linear regression can be applied to categorical response variables. Another model is log-linear regression, which approximates discrete multidimensional probability distributions.
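
Straight-line regression can be sketched with the method of least squares, a standard way to estimate m and b that the essay does not name explicitly. The data points and function names below are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimates of slope m and intercept b for y = m*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # m = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
    m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    # The fitted line always passes through the point (mean_x, mean_y).
    b = mean_y - m * mean_x
    return m, b

def predict(m, b, x):
    """Predict the response for a new predictor value."""
    return m * x + b

# Points that lie close to, but not exactly on, a straight line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

m, b = fit_line(xs, ys)
print(round(m, 2), round(b, 2))
print(round(predict(m, b, 6), 2))   # prediction for an unseen x
```

Polynomial regression can reuse the same idea by treating x, x squared, and higher powers as separate predictor variables in a multiple linear regression.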

2.3.3 Cluster Analysis

Clustering is the process of grouping data into clusters so that objects within a cluster are highly similar to one another. A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. Dissimilar objects appear far apart, while the objects within a cluster lie close to one another. Cluster analysis has been widely used in numerous applications, including market research, pattern recognition, data analysis, and image processing. In business, clustering can help marketers discover distinct groups in their customer bases and characterize customer groups based on purchasing patterns. Clustering is also called data segmentation in some applications because it partitions large data sets into groups according to their similarity. Clustering can also be used for outlier detection, identifying values that lie far from any cluster. Data clustering is under vigorous development and has been studied extensively for many years. Clustering is a form of learning by observation rather than learning by examples, because it does not rely on predefined class labels. In data mining, efforts have focused on finding methods for efficient and effective cluster analysis in large databases. To be effective, a clustering method needs scalability, the ability to deal with different types of attributes and with high dimensionality, and the ability to discover clusters of arbitrary shape.
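
As an illustration of grouping by similarity, the sketch below uses k-means, one common partitioning method that the essay does not name. The points, the starting centroids, the fixed iteration count, and the function names are assumptions made for the example; k-means itself finds only roughly spherical clusters, not clusters of arbitrary shape.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, centroids, iterations=10):
    """Repeatedly assign each point to its nearest centroid, then move each
    centroid to the mean of the points assigned to it."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical customers described by two numeric attributes.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=[(1, 1), (8, 8)])
print(clusters)
print(centroids)
```

No class labels are supplied anywhere in this example, which is what distinguishes clustering from the classification methods discussed earlier.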