Loading The Computer Software Computer Science Essay


Users may think that data mining happens simply by loading computer software. In fact, before moving forward with data mining, many issues must be considered, such as data pre-processing. Some users equate data mining with data warehousing, SQL queries and reporting, software agents, or online analytical processing (OLAP); these are not data mining. Data mining builds on increased computing power, improved data collection and management, and statistical and learning algorithms. It is also clear that decisions are not made by data mining itself; people must decide using their own knowledge and experience. In this paper, we present different data mining algorithms, their applications, issues and performance.

2- Clustering: It is a method by which a large set of data is grouped into clusters of smaller sets of similar data. It is concerned with finding structure in a large data set. Clustering can be used in pattern recognition, image analysis and bioinformatics. The main clustering algorithms are:

  • K-means
  • Self-organizing maps

3- Regression model: It is borrowed from statistics. It permits estimation of the linear or non-linear function of independent variables that best predicts a given dependent variable. It models the data with the least error.

4- Association rule learning: It is a search for relationships between variables. It is used in web mining, intrusion detection and bioinformatics.

Any statistical or data analysis technique may be useful for data mining. Figure 1 below shows the machine learning algorithms most commonly used in data mining.

The rest of the paper is organized as follows:

In section 2 we present an overview of Neural Networks (NNs). Section 3 is about Decision Trees. Section 4 presents the K-means Clustering Algorithm. Section 5 is about Genetic Algorithms (GAs). Section 6 is about Fuzzy Logic (FL). Section 7 is about Data Visualization. Section 8 is about K-Nearest Neighbor (K-NN). Section 9 is about Bayesian Classification. Section 10 is about Link Analysis. Section 11 is about Regression. In section 12 we draw a comparison between these Data Mining Algorithms. Section 13 concludes.

2- Neural Networks:

Neural networks are used in systems performing image and signal processing, pattern recognition, robotics, automatic navigation, prediction and forecasting, and simulation. [1]
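As an illustrative sketch (not drawn from [1]), the following toy example trains a single artificial neuron, a perceptron, to learn the logical AND function; the data, learning rate and epoch count are all made-up illustration values:

```python
# A single artificial neuron (perceptron) trained on the logical AND function.
# Purely illustrative; the data, learning rate and epoch count are toy values.

def train_perceptron(samples, epochs=20, lr=0.1):
    """Train a one-neuron classifier with a step activation."""
    w = [0.0, 0.0]   # input weights
    b = 0.0          # bias
    for _ in range(epochs):
        for (x1, x2), target in samples:
            out = 1 if (w[0] * x1 + w[1] * x2 + b) > 0 else 0
            err = target - out
            # perceptron learning rule: nudge weights toward the target
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b

def predict(w, b, x1, x2):
    return 1 if (w[0] * x1 + w[1] * x2 + b) > 0 else 0

and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(and_data)
```

Even in this tiny case, several passes over the data are needed before the weights settle, which hints at why training on large data sets takes so long.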

Issues in NNs:

  1. Learning/Training: A drawback of this process is that learning or training can take a large amount of time and resources to complete. Since the results or data being mined are time critical, this can pose a large problem for the end user. NNs are better suited to learning on small to medium sized data sets, as they become too time-inefficient on large data sets. [1]
  2. Explicitness: The process a NN goes through is considered by most to be hidden and therefore left unexplained. This lack of explicitness may lead to less confidence in the results and less willingness to apply them, since there is no understanding of how the results came about. It is obvious that as the number of variables in the data set increases, it becomes more difficult to understand how the NN came to its conclusion. [1]
  3. Lack of knowledge: The lack of knowledge or general review on which type of NN is best might result in the purchase of an application that will not make the best predictions or will in general work poorly. There is a limited tradition of experience on which to draw when choosing between the nets on offer. [1]
  4. Lack of problem solving: The computer will never be able to produce a solution that a human could not produce given enough time. As a result, the user has to program problems and solutions into the computer so that it can decide which solution is best. If the user has no answer, chances are the computer will not have one either. [1]

3- Decision tree:

Decision trees are an efficient method for producing classifiers from data. The goal of supervised learning is to create a classification model, known as a classifier, which predicts the class for some entity from the values of its available input attributes. In other words, classification is the process of dividing samples into pre-defined groups. The output takes the form of decision rules. In order to mine with decision trees, the attributes may have continuous or discrete values, the target attribute values must be provided in advance, and the data must be sufficient so that prediction of the results is possible. [1]

Decision trees are fast to use, easy to generate understandable rules from, and simple to explain, since any decision that is made can be understood by viewing the path of the decision. They also help to form an accurate, balanced picture of the risks and rewards that can result from a particular choice. The decision rules are obtained in if-then-else form, which can be used for decision support systems, classification and prediction. Figure 2 illustrates how decision rules are obtained from a decision tree algorithm. [1]
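As a toy illustration of such if-then-else rules (the task, attributes and thresholds are hypothetical, not taken from [1]), the rules read off a trained decision tree might look like:

```python
# Hypothetical decision rules, as they might be read off a trained decision
# tree for a loan-approval task. Attributes and thresholds are made up.

def classify_applicant(income, has_collateral):
    """Classify a loan applicant with if-then-else rules from a toy tree."""
    if income >= 50000:
        return "approve"
    else:
        if has_collateral:
            return "approve"
        else:
            return "reject"
```

Each call traces one root-to-leaf path, which is why any individual decision can be explained simply by reading the conditions along that path.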

Issues in DTs:

  1. Small trees have inherent problems with the representation. [1]
  2. They are good for small problems but quickly become cumbersome and hard to read for intermediate-sized problems. Special software is required to draw the tree. [1]
  3. If there is noise in the learning set, the algorithm may fail to find a tree. [1]
  4. The data must be interval or categorical. Any data not in this format has to be recoded into it, a process that can hide relationships. [1]
  5. Overfitting: with a large set of possible hypotheses, pruning of the tree is required. [1]
  6. DTs generally represent a finite number of classes or possibilities. It is difficult for decision makers to quantify a finite number of variables. This sometimes affects the accuracy of the output and can produce misleading answers. As the list of variables increases, the if-then statements created become more complex. [1]
  7. They are not good for estimation. [1]
  8. This method is not useful for all types of data mining, such as time series. [1]

4- K-means Clustering:

Unsupervised learning depends on the input data only and makes no demands on knowing the solution. It is used to recognize similarities between inputs or to identify features in the input data. K-means is used for finding similar patterns due to its simplicity and fast execution. It starts with a random initial partition and keeps re-assigning the samples to clusters, based on the similarity between samples and clusters, until a convergence criterion is met. The basic working of all clustering algorithms is represented in figure 3.
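The re-assignment loop described above can be sketched in a few lines of Python. This is a minimal illustration on toy 2-D points, not a production implementation; initialization is made deterministic here for reproducibility, whereas real k-means picks random seeds (which is exactly why it is seed-sensitive):

```python
# Minimal k-means sketch on toy 2-D points (not a production implementation).
# Initialization is deterministic here for reproducibility; real k-means uses
# random initial seeds, which is why results can vary between runs.

def kmeans(points, k, iters=100):
    """Assign samples to the nearest centroid, re-average, repeat to convergence."""
    centroids = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # nearest centroid by squared Euclidean distance
            i = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2
                                  + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        new_centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:   # convergence criterion met
            break
        centroids = new_centroids
    return centroids, clusters

pts = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
       (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, clusters = kmeans(pts, k=2)
```

On these points the loop separates the two obvious groups and the centroids settle on their means after a few passes.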

Issues in K-means:

  1. The algorithm is only applicable to data sets where the notion of the mean is defined. Thus, it is difficult to apply to categorical data sets. There is, however, a variation of the k-means algorithm called k-modes, which clusters categorical data. The algorithm uses the mode instead of the mean as the centroid. [2]
  2. The user needs to specify the number of clusters k in advance. In practice, several k values are tried and the one that gives the most desirable result is selected. [2]
  3. The algorithm is sensitive to outliers. Outliers are data points that are very far away from other data points. Outliers could be errors in the data recording or some special data points with very different values. [2]
  4. The algorithm is sensitive to initial seeds, which are the initially selected centroids. Different initial seeds may result in different clusters. Thus, if the sum of squared error is used as the stopping criterion, the algorithm only achieves a local optimum; finding the global optimum is computationally infeasible for large data sets. [2]
  5. The k-means algorithm is not suitable for discovering clusters that are not hyper-ellipsoids or hyper-spheres. [2]

5- Genetic Algorithms (GAs):

They are used in optimization problems and operate through selection, crossover and mutation. They are modeled on the natural selection and evolution of a problem's solutions. [1]

Issues in GAs:

  1. They are not used for large-scale problems. They require a significant computational effort compared to other methods when parallel processing is not employed. [1]

6- Fuzzy Logic (FL):

In contrast to binary logic, fuzzy logic is a multi-valued logic for dealing with imprecise or vague data. It is a new field and not yet widely used. Some consider FL a fad, and there is no guarantee that it will work under all circumstances. [1]
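A minimal sketch of the multi-valued idea: instead of a binary "warm"/"not warm", each temperature receives a degree of membership between 0 and 1. The breakpoints below are arbitrary, chosen only for illustration:

```python
# A triangular fuzzy membership function: each temperature gets a degree of
# membership in the set "warm" between 0 and 1. Breakpoints are arbitrary.

def warm_membership(temp_c):
    """0 below 15 degrees C, rising to 1.0 at 25, falling back to 0 at 35."""
    if temp_c <= 15 or temp_c >= 35:
        return 0.0
    if temp_c <= 25:
        return (temp_c - 15) / 10.0   # rising edge
    return (35 - temp_c) / 10.0       # falling edge
```

A temperature of 20 degrees C is thus "warm to degree 0.5" rather than simply warm or not, which is how FL handles imprecise data.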

Issues in FL:

  1. The language barrier is a major problem: the Japanese term connotes 'clever', but the American term 'fuzzy' does not. [1]
  2. This technology is still somewhat underdeveloped in the United States, and many American researchers seem to have shunned it. It is not yet popular in data mining, so it may not get due status because of its name. [1]

7- Data visualization:

This method gives the user a better understanding of the data. Graphics and visualization tools better illustrate the relationships among data, and their importance in data analysis cannot be overemphasized. Distributions of values can be displayed using histograms or box plots; 2D or 3D scatter plots can also be used. [3]

Visualization works because it provides broader information than text or numbers alone. Missing and exceptional values, and the relationships and patterns within the data, are easier to identify when graphically displayed. It allows the user to focus easily and see the patterns and trends in the data. [3]
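As a minimal sketch of displaying a distribution of values, here is a text-based stand-in for the histograms mentioned above; the data is made up for illustration:

```python
from collections import Counter

# A text-based stand-in for a histogram: bucket the values, print one bar
# per bucket. The data below is made up for illustration.

def text_histogram(values, bucket_size=10):
    buckets = Counter((v // bucket_size) * bucket_size for v in values)
    lines = []
    for start in sorted(buckets):
        bar = "#" * buckets[start]
        lines.append(f"{start:>3}-{start + bucket_size - 1:<3} {bar}")
    return "\n".join(lines)

ages = [23, 25, 27, 31, 34, 35, 38, 41, 62]
print(text_histogram(ages))
```

Even in this crude form, the exceptional value (62) stands out immediately as an isolated bar, which is the point made above about spotting exceptional values graphically.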

Issue in data visualization:

  1. As the volume of data increases, it becomes difficult to distinguish patterns in the data sets. [3]
  2. Another problem is displaying multi-dimensional or multi-variable models, because only two dimensions can be shown on a computer screen or on paper. [3]

8- K-Nearest Neighbor (K-NN)

It is a classification technique that looks to the solutions of similar problems that have been solved previously. It decides where to place a new case after examining the k most similar cases, its neighbors. For example, a new case "N" is assigned to class "x" because the algorithm assigns a new case on the basis of the cases most similar to "N", its neighbors. This is illustrated in figure 4. [2][3]

K-NN models can be used to model non-standard data types such as text, and they are suitable when there are few predictor variables, because the output is easy to understand. The first step is to define how the distance between attributes in the data is calculated. Once the distances between cases are calculated, select the set of already classified cases, decide the range of the neighborhood for the comparison, and count the neighbors themselves. [3]
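The distance-then-majority-vote procedure just described can be sketched as follows; the training points and class labels are toy values for illustration only:

```python
import math
from collections import Counter

# Minimal k-nearest-neighbor classifier: a new case takes the majority class
# among its k closest already-classified cases. Toy data, for illustration.

def knn_classify(train, new_point, k=3):
    """train is a list of ((x, y), label) pairs; returns the majority label."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], new_point))
    nearest = [label for _, label in by_distance[:k]]
    return Counter(nearest).most_common(1)[0][0]

train = [((1, 1), "x"), ((1, 2), "x"), ((2, 1), "x"),
         ((8, 8), "o"), ((8, 9), "o"), ((9, 8), "o")]
```

Note that nothing is learned in advance: all distances are computed when a new case arrives, which is why all the training data must be kept at hand.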

Issues in K-NN

  1. It places a large computational load on the computer; therefore all the data is kept in memory, which enhances the speed of K-NN. This variant is also known as memory-based reasoning. [3]
  2. It is slow at processing new cases compared to decision trees or neural networks. [3]
  3. It requires new calculations for each new case. [3]
  4. Numeric data is easily handled by this algorithm, but categorical variables need special handling. [3]

9- Bayesian Classification

As its name implies, Bayesian classification attempts to assign a sample x to one of the given classes using a probability model defined according to the Bayesian theorem. The latter calculates the posterior probability of an event, conditional on some other event. Basic prerequisites for the application of Bayesian classification are: [4]

  1. Knowledge of the prior probability for each class. [4]
  2. Knowledge of the conditional probability density function for each class. [4]

It is then possible to calculate the posterior probability using the Bayesian formula P(Ci | x) = P(x | Ci) P(Ci) / P(x), where P(Ci) is the prior probability of class Ci and P(x | Ci) is the class-conditional probability of x. [4]

Each new data tuple is classified in the class with the highest posterior probability.
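A toy numerical sketch of this rule follows; all probabilities below are invented for illustration and are not taken from [4]. Given priors P(class) and conditional probabilities P(word | class), the posterior follows directly from the Bayesian formula:

```python
# Toy numerical illustration of Bayes' theorem for classification.
# All probabilities here are invented; they are not from [4].

priors = {"spam": 0.4, "ham": 0.6}           # prior probability of each class
likelihood = {                               # P(word | class)
    "spam": {"offer": 0.7, "meeting": 0.1},
    "ham":  {"offer": 0.2, "meeting": 0.6},
}

def posterior(word):
    """P(class | word) for each class: prior * likelihood, normalized by evidence."""
    evidence = sum(priors[c] * likelihood[c][word] for c in priors)  # P(word)
    return {c: priors[c] * likelihood[c][word] / evidence for c in priors}

post = posterior("offer")
# the tuple is assigned to the class with the highest posterior probability
best_class = max(post, key=post.get)
```

The sketch also makes the prerequisites above concrete: both the priors and the conditional probabilities must be fully known before any posterior can be computed.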

Issues in Bayesian Classification

  1. Major drawbacks of Bayesian classification are the high computational complexity and the need for complete knowledge of prior and conditional probabilities. [4]

10- Link Analysis

It is a descriptive approach to identifying relationships among values in a database. It uses association and sequence discovery. Association discovery finds rules about items that appear together in an event; sequence discovery is similar, but the association is related across time. The frequency of appearance of an association in the database is called its support or prevalence. For example, if 15 transactions out of 1,000 contain an association "A", then the support for "A" is 1.5%. A low level of support may indicate either the presence of bad data or that the association is not important. [3]

The relative frequency of occurrence of items and their combinations is used to discover meaningful rules. For example, given the occurrence of item "A", how often does item "B" occur? This is called the conditional predictability of B given A, and confidence is the measure used for it: confidence is (frequency of A and B) / (frequency of A). Association can also be measured through lift: lift is (confidence of A and B) / (frequency of B). [3]
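These formulas can be checked on a small made-up set of market-basket transactions:

```python
# Support, confidence and lift for the rule A -> B, computed exactly as in
# the formulas above. The transactions are a made-up market-basket example.

transactions = [
    {"bread", "milk"}, {"bread", "butter"}, {"milk", "butter"},
    {"bread", "milk", "butter"}, {"milk"},
]

def freq(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

support = freq({"bread", "milk"})       # frequency of A and B together
confidence = support / freq({"bread"})  # divided by frequency of A
lift = confidence / freq({"milk"})      # confidence divided by frequency of B
```

Here the rule bread -> milk has support 0.4 and confidence 2/3, while the lift is just above 0.8, suggesting that buying bread does not actually make milk more likely in this toy data.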

Graphical methods can also be used to visualize the structure of links as shown in figure 5. Each circle represents a value or an event. The connecting lines show a link. The thicker lines represent stronger linkages which indicate more important relationships or associations. [3]

Issues in Link Analysis

  1. It is difficult to decide what to do with discovered association rules. [3]
  2. The association or sequence rules are not really rules but descriptions of relationships in a particular database. [3]
  3. There are no formal testing models which show the predictive power of these rules. [3]

11- Regression

It uses the standard statistical technique of linear regression, which forecasts future values from existing values, but many problems are not based on linear projections of previous values. Therefore, more complex techniques, e.g. logistic regression, decision trees, or neural nets, may be necessary to forecast future values. The same models are used for regression and classification: for example, a classification tree classifies categorical response variables, while a regression tree forecasts continuous response variables. [3]
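The standard linear regression referred to here fits a line y = a*x + b by least squares. A minimal sketch on made-up sample points:

```python
# Ordinary least-squares fit of y = a*x + b, the standard statistical linear
# regression mentioned above. The sample points are made up for illustration.

def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x); intercept from the means
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return a, b

a, b = fit_line([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
```

On these points the fitted slope is close to 2, the trend the data was generated around; data that does not follow a linear trend is where the more complex techniques above become necessary.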

Issues in Regression

  1. It uses the standard statistical technique of linear regression, but not all problems have a linear projection of previous values. [3]
  2. To work around this, decision tree algorithms or neural nets are used. [3]

12- Results and Discussion

In order to build a predictive model, an understanding of the data is a must. Data comes in different formats, such as continuous and categorical; categorical data can be either ordinal or nominal.

The table below compares the most commonly used and popular algorithms in data mining. It is clear from this comparison that all these data mining algorithms are suitable and produce useful results only on small to medium scale data sets; they are not appropriate for large data sets. This is the problem of scalability. There is another overhead: learning and training on these data sets is required by all the algorithms. If there is any noise in the data sets, the extracted knowledge may be misleading. The choice of algorithm depends on the intended use of the extracted knowledge: the data can be used either to predict future behavior or to describe patterns in an understandable form within the discovery process.

13- Conclusion

It is right to say that "There is no predictive method that is best for all applications" (International Knowledge Discovery Institute, 1999). In choosing among the different aspects of data mining, one should be clear about the business goals, the type of prediction, the type of model, the selection of the correct algorithm, and finally the product of the selected algorithm in a hierarchy. The conclusion of this report is that k-means clustering and decision trees are the best choices among the data mining algorithms for analysis and interpretation of the data. For interpretation, data visualization using 2D or 3D scatter plots is another good choice.

Distributed Computing

Distributed computing is also known as the client-server model. The goal of this computing is to solve large computational problems. It is popular for two main reasons: first, the nature of some problems requires using a communication network that connects several computers, and second, the use of a distributed system is beneficial for practical reasons. The following are the foundations of distributed computing:

  • Remote Communication.
  • Fault tolerance.
  • High availability.
  • Remote information access.
  • Security and privacy.

1- Peer-to-peer Computing:

It is a distributed network architecture in which central coordination, such as a server, is not required. Resources such as processing power, storage or bandwidth are directly available to the other participants. It is a completely decentralized network of peers, in which all clients provide resources such as bandwidth, storage space and computing power. As nodes increase, the total capacity of the system increases. The distributed nature of P2P networks increases their robustness: data can be found without relying on a centralized server. It is easy to maintain and configure, and is popular for file sharing and network storage. [6][8]

Issue in P2P:

The following are the main challenges in P2P:

  • Authenticity
  • Integrity
  • Availability
  • The major issue in P2P computing is security: unsecured and unsigned code may be allowed remote access to files on a computer or even on the entire network. [6][8]

2- Cloud Computing:

It is a new type of distributed computing and still an emerging field in computer science. In cloud computing, remote machines owned by another company run everything for the user. It may change the entire computer industry. The only thing the user has to run is the interface software for the cloud computing system. There is a significant workload shift: the user's computer no longer runs the applications, which decreases the demand for hardware and software. There is almost no limit to its applications. The major advantage is that the client can access his data anywhere at any time. It reduces the need for advanced hardware, which brings hardware costs down. The client can also take advantage of network processing power if the cloud is using a grid at its back end. In a sense, this is a step back toward early computers having only a keyboard and terminal.

Issues in Cloud Computing:

The following are the major issues in cloud computing:

  • Data governance: Enterprises have sensitive data that requires proper monitoring and protection; by moving data into the cloud, enterprises lose direct governance of their own data.
  • Manageability
  • Monitoring
  • Compliance
  • Cross-country data migration
  • Reliability, availability and recovery
  • Security and privacy: these are the major concerns and issues in cloud computing.

3- Shared Computing:

It is a network of computers that work together to complete a task by sharing processing power and other resources. A user can access the processing power of the entire network. It is used only for complex problems, not for others, and its administration and design are complicated.

Issues in Shared Computing:

The following are the main issues:

  • Safety and privacy are issues in shared computing.
  • A plan is needed for when a system goes offline or becomes unavailable.
  • Power consumption in shared computing is high, which produces heat.
  • The major concern about shared computing systems is that they are not comprehensive: they use only processing power, not other resources like storage. Grid computing is more widely applicable than shared computing because of its resource sharing.

4- Utility Computing:

It is a business model in which a company outsources its computer support to another company. This support can be in the form of processing power, storage, hardware or software applications. The major advantage of utility computing is convenience, because the client does not have to buy all the hardware and licensed software for his business; he relies on another party to provide these services. [5]

Issues in Utility Computing:

The following are the main issues:

  • This computing model is suitable for medium or large scale enterprises; it is not suitable for small businesses.
  • Another main disadvantage of utility computing is reliability: clients may hesitate to hand over duties to a smaller company where they fear the loss of data. It is an easy target for hackers.
  • The major challenge in utility computing is that consumers are not educated about its services; awareness is not very widespread. [5]

5- Grid Computing:

It is a type of distributed computing in which every computer can access the resources, such as processing power, memory and data storage, of the other computers on the network, turning the network into a powerful supercomputer. It is a form of high performance computing. The concept is not new, but it is not yet perfected: people are still working on creating, establishing and implementing standards and protocols. The applications of grid computing are limitless.

Issues in Grid Computing:

The following are the main challenges:

  • Coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations
  • Data protection
  • No clear standards
  • Keeping the system as simple to understand as possible
  • Difficult to develop
  • Lack of grid-enabled software
  • Centralized management
  • Only a limited number of users can be allowed full access to the network; otherwise the control node will be flooded with processing requests, which can create a deadlock situation.

6- Pervasive Computing:

It is the spread of small, inexpensive, robust networked processing devices, distributed at all scales throughout everyday life. [9]

Issues in Pervasive Computing:

The following are the main challenges in pervasive computing:

  • System design and engineering
  • System modeling
  • Human computer interaction models [9]

7- Mobile Computing:

It is the ability to use technology while moving, and it is a building block for pervasive computing. It is tomorrow's network technology and will revolutionize the way computers are used. Wireless communication, mobility and portability are the salient features of mobile computing. It is a paradigm shift in distributed computing. [7]

Issues in Mobile Computing:

The following are the limitations of mobile computing:

  • There are no standards for the security of data.
  • The duration of the power supply is limited.
  • There may be signal problems due to transmission interference.
  • Excessive use of mobile devices may cause potential health hazards.
  • Only a small and limited user interface is available on the device.
  • Small storage capacity
  • Risks to data
  • Mobility: moving from one coverage area to another.
  • Insufficient bandwidth [7]


When resources are shared in any type of distributed computing, the users have to compromise on the privacy and security of their data. These two are the major threats in all forms of distributed computing. It is clear from the above discussion that there are no standards for the design, implementation and management of these forms of computing. The migration of data across countries is another issue. A mountain of data is out there; the question arises, how will the correct data be brought from the clouds of data at a user's request? The answer to all these issues is the use of data mining techniques.


  1. Wang, John. "Data Mining Opportunities and Challenges" Idea Group Publishing ISBN: 1-59140-051-1, chapter IX page 235 and chapter XVI page 381
  2. Liu, Bing. "Web Data Mining Exploring Hyperlinks, Contents, and Usage Data", ISBN: 13 978-3-540-37881-5, Springer Berlin Heidelberg New York, chapter 3 and chapter 4.
  3. "Introduction to Data Mining and Knowledge Discovery", ISBN: 1-892095-02-5, Third Edition by Two Crows Corporation, page numbers: 11,12,13,15.
  4. Symeonids, Andreas. Pericles, Mitkas. "AGENT INTELLIGENCE THROUGH DATA MINING", ISBN 0-387-24352-6, chapter 1, 2, 3.
  5. Y. Chee, B. Rajkumar, D. Marcos, Y. Jia, S. Anthony, V. Srikumar, P. Martin "Utility Computing on Global Grids" Chapter 143, Hossein Bidgoli (ed.), "The Handbook of Computer Networks", ISBN: 978-0-471-78461-6, John Wiley & Sons, New York, USA, 2007.
  6. L. Jiangchuan, Rao. Sanjay, Li. Bo, Z. Hui "Opportunities and Challenges of Peer-to-Peer Internet Video Broadcast", Proceedings of the IEEE Volume 96, Issue 1, Jan. 2008 Page(s):11 - 24
  7. Satyanarayanan. M., "Fundamental Challenges in Mobile Computing", School of Computer Science, Carnegie Mellon University, Proceedings of the fifteenth annual ACM symposium on Principles of distributed computing Philadelphia, Pennsylvania, United States, Pages: 1 - 7, ISBN:0-89791-800-2
  8. Jimenez, Raul. Eriksson, Lars-Erik. Knutsson, Björn., "P2P-Next: Technical and Legal Challenges", 6th Swedish National Computer Networking Workshop (SNCNW'09) and 9th Scandinavian Workshop on Wireless Adhoc Networks (Adhoc'09), Sponsored by IEEE VT/COM Sweden
  9. Satyanarayanan. M., "Pervasive Computing: Vision and Challenges", School of Computer Science, Carnegie Mellon University, Personal Communications, IEEE Aug 2001, Volume: 8,Issue: 4 On page(s): 10-17, ISSN: 1070-9916