Data Streams Using Granular Computing Computer Science Essay


In recent times, numerous applications involve massive data handling, which imposes limitations on storage capacity and computational time. Traditional data mining techniques are not designed for such large-scale, real-time analysis, so they need to be reviewed and tuned to the requirements of data stream analysis. Numerous real-life examples illustrate this problem; some are listed here:

Telecommunication and Networks: Mobile networks, calling cards, network monitoring and analysis, sensor network analysis, web logs, and click-stream data analysis.

Business and banking sector: Credit cards, ATM machines, stock exchange and electricity power distribution and generation.

Health Sector: Discovering the rate, causes, and sources of spreading diseases such as swine flu. Such analysis is suitable for disease control and prevention.

Discovering the evolution of the workload on an e-commerce server, which can help to dynamically fine-tune the server to obtain better performance.

Analyzing meteorological data, such as temperatures throughout a region.

Market basket data: Analysis of the items bought together in a superstore to maintain the supply and demand chain.

Data streams have some unique characteristics that make them difficult to handle and process.

A data stream is like the flow of a river: one gets to view and analyze the data only once, with no second chance to visit the data items again. In some applications, stream elements are generated at a rapid rate and thus have to be processed in a timely manner to keep up with the stream rate. Usually a single look at each data stream object has to suffice in the analysis process.

Data generation is very fast; to handle such fast data, real-time processing algorithms and techniques are required, otherwise the chance of losing data increases, which results in errors in the mined results.

Data streams are potentially unbounded and can thus generate an infinite amount of data. This entails that data streams are typically not stored in their entirety. Rather, once a data stream element has been disregarded after processing, it cannot be recovered.

In most applications, the characteristics of the data stream as well as its elements evolve over time. This property is referred to as temporal locality and adds an inherent temporal component to the data stream mining process [5]. Stream elements should thus be analyzed in a time-aware manner to accommodate the changes in stream characteristics.

To handle (or mine) continuously generated data streams, time windows are commonly used [3,8,22,26,27,29]. Depending on the stream mining application, three different window models can be used: the landmark window model, the damped window model, and the sliding window model. In the landmark window model, mining is performed on all the data between a particular point of time, called the landmark, and the current time. In the damped window model (also referred to as the time-fading window model), different weights are assigned to the data depending on their order of arrival: new data receive higher weights than older data. In the sliding window model, only a fixed length of recently generated data is used in mining operations. For example, given a window of size |W| over a transactional database, only the latest |W| transactions, or all transactions in the last |W| time units, are utilized for mining. As new transactions arrive, the oldest transactions in the sliding window expire. The sliding window model is therefore widely used to find recent frequent patterns in data streams [4,22,26,27,29].
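
To make the sliding window model concrete, the following is a minimal sketch in Python (illustrative names of my own, not code from the cited papers) of a fixed-length window in which the oldest transaction expires as a new one arrives:

```python
from collections import deque

# A fixed-length sliding window over a transaction stream (illustrative
# sketch; here |W| counts transactions rather than time units).
class SlidingWindow:
    def __init__(self, size):
        self.size = size          # |W|: number of recent transactions kept
        self.window = deque()     # oldest transaction sits on the left

    def add(self, transaction):
        self.window.append(transaction)
        if len(self.window) > self.size:
            self.window.popleft() # the oldest transaction expires

    def contents(self):
        return list(self.window)  # transactions available for mining

# Only the latest |W| = 3 transactions are mined at any moment.
w = SlidingWindow(3)
for t in [{"b", "d", "e"}, {"a", "d"}, {"c", "d"}, {"b", "d"}]:
    w.add(t)
print(w.contents())  # the three most recent transactions
```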

Overview of Existing Methods for Sliding Window Handling

The variety of available data stream mining algorithms is too large to present an overview of all existing techniques. Numerous papers are available in which authors survey the recent state of the art in data stream mining. Algorithms, systems, and frameworks for data stream mining are discussed in [44]. A survey of association rule mining, and thus frequent itemset mining, techniques for data streams is presented in [64]. [14] presents a brief introduction to the requirements for clustering data streams. An overview of clustering techniques, not specific to the context of stream mining, is given by Berkhin in [16]. The book "Data Streams: Models and Algorithms" by Charu Aggarwal [5] contains a recent and fairly complete overview of existing methods for data stream mining.

Mining frequent patterns from data streams has become one of the most important and challenging problems [10] for a wide range of online applications, as discussed above. The results of data stream mining are approximate: to obtain exact results, an algorithm needs to scan the data at least twice, which is not possible for data streams. The first paper on mining frequent itemsets from basket data was published years ago by Agrawal et al. [1]. The problem has been actively and widely studied by the data mining and knowledge discovery research community, and this is perhaps the most cited paper in the field of data mining. The Apriori algorithm has the limitations of k scans of the database and a large number of generated candidates, many of which are determined to be infrequent after successive scans of the database. The Apriori algorithm assumes that the database is memory resident, and the maximum number of database scans is one more than the cardinality of the largest frequent itemset. To overcome these problems, Han et al. [11] proposed the frequent pattern tree (FP-tree) and the FP-growth algorithm, which reduce the number of database scans to two and eliminate the requirement for candidate generation. The introduction of this highly compact FP-tree structure led to a new avenue of research on mining frequent patterns with a prefix-tree structure. However, the static nature of the FP-tree and the requirement of two database scans limit the applicability of this algorithm to frequent pattern mining over a data stream.
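
As an illustration of the candidate-generation step that makes Apriori expensive, here is a small sketch (the function apriori_gen and all details below are my own illustration, not code from [1]):

```python
from itertools import combinations

# Sketch of Apriori's level-wise candidate generation: join frequent
# (k-1)-itemsets, then prune any candidate with an infrequent
# (k-1)-subset. The surviving candidates must then be counted in yet
# another scan of the database.
def apriori_gen(frequent_km1, k):
    candidates = set()
    for a in frequent_km1:
        for b in frequent_km1:
            union = a | b
            if len(union) == k and all(
                frozenset(s) in frequent_km1
                for s in combinations(union, k - 1)
            ):
                candidates.add(union)
    return candidates

f2 = {frozenset(p) for p in
      [("b","c"), ("b","d"), ("b","e"), ("c","d"), ("c","e"), ("d","e")]}
print(apriori_gen(f2, 3))  # {b,c,d}, {b,c,e}, {b,d,e}, {c,d,e}
```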

The next important algorithm, proposed by [100], is the Lossy Counting algorithm. Its main feature is the reduced memory usage of the mining process: the counts of frequent itemsets can be kept in secondary storage, and only a buffer for batch-processing transactions is kept in main memory. As the buffer is enlarged, more newly generated transactions can be batch-processed together, so the algorithm becomes more efficient. However, when the number of frequent itemsets is large, accessing their information on secondary disk takes more time; for this reason, the algorithm is not appropriate for an online data stream.
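
To illustrate the idea behind Lossy Counting, here is a simplified sketch for single items only; the algorithm of [100] extends this with batch processing of itemsets, and the names and details below are my simplification:

```python
import math

# Simplified Lossy Counting for single items: counts are pruned at
# bucket boundaries, which is what keeps the in-memory summary small,
# at the cost of an epsilon-bounded undercount.
def lossy_count(stream, epsilon):
    counts = {}                            # item -> (count, max undercount)
    width = math.ceil(1 / epsilon)         # bucket width
    for n, item in enumerate(stream, start=1):
        bucket = math.ceil(n / width)
        if item in counts:
            f, delta = counts[item]
            counts[item] = (f + 1, delta)
        else:
            counts[item] = (1, bucket - 1)
        if n % width == 0:                 # end of bucket: prune low counts
            counts = {i: (f, d) for i, (f, d) in counts.items()
                      if f + d > bucket}
    return counts

print(lossy_count(list("abcabcaab"), epsilon=0.2))
```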

The estDec method was proposed by [101] for frequent itemset mining over a data stream. It maintains the occurrence count of each itemset in the transactions generated so far in a monitoring tree whose structure is a prefix tree. This method produces false-positive results. Itemsets are inserted as soon as they are found, causing the prefix tree to grow large and occupy more memory; hence, little memory remains available for new itemsets.

Granular Computing

The term "Granular computing (GrC)" was first introduced in 1997 by T. Y. Lin [105]. The concept of granular computing was initially called information granularity or information granulation related to the research of fuzzy sets in Zadeh's early papers [102]. GrC has emerged as an important technique of problem solving. The term granular deals with the construction, interpretation and representation of the granules while the computing stands for the analysis using these granules. These granules are constructed with the help of information tables.

The objective of granular computing research is to build an efficient computational model for handling huge amounts of data, information, and knowledge. Informally, any computing theory or model that deals with granules may be called granular computing, or a softer version of granular computing.

Granular computing has been used for association rule mining by many researchers, but that research concerns traditional data mining. To the best of my knowledge, granular computing has never been used for frequent itemset mining over data streams.

Recently, in [107], information granules were defined as having two parts: intension and extension. The intension is an attribute-value pair, while the extension is the collection of matching records in the data set. These granules are generated by scanning the data and stored in granule tables. In a granule table, each granule is stored as a 3-tuple consisting of the number of objects, the intension, and a pointer to the linked list of extension (objects). Frequent k-itemsets are generated by combining two frequent (k-1)-itemsets from different nodes of the same linked list. The authors compare their results with Apriori and claim better performance, since the number of generated candidate itemsets is reduced with low computational time. They use the rough set computational model to define the elementary granules.

In another paper [108], the authors use rough set theory for finding association rules. First, they use an existing rough-set-based reduct generation algorithm to find attribute reducts from the original data set, and then generate a set of association rules for each reduct using the classic Apriori algorithm. The resulting rules have the form in which the antecedent of a rule comes from the values of the condition attributes in a reduct, and the consequent comes from the values of the decision attributes of the original data set. Since reducts contain the most representative and important condition attributes of a decision table, the authors assume that rules extracted from these reducts are representative of the original decision table and are therefore more important than rules generated without reducts. With this intuition, the rules generated from the reducts are used to construct a new decision table, with each individual rule serving as a condition attribute and the decision attributes kept the same. The reduct extracted from such a decision table contains representative and important attributes, which are the association rules. The reduct generation algorithm is, in turn, applied to this newly constructed decision table, and the result is a reduct consisting of the set of most important rules.

The two papers discussed above were published recently, but they deal with conventional data mining. Handling a data stream is a different scenario: one never has the full picture of the data set, because the data is unseen and arrives in batches, sometimes continuously. Below we show how a data stream can be handled from the perspective of granular computing.

A data stream DS is a six-tuple defined as

DS = (TGr, T, A, L, Va, Ia), where:

TGr is a time granule, i.e., a time section that could be set to one second, minute, hour, etc. The length of an information table is determined by TGr.

T is the set of transactions in TGr. Each transaction t is identified by its time stamp. The length of the stream is equal to the cardinality of the set T.

A is the finite nonempty set of attributes.

L is the language defined using the attributes in A.

Va is the nonempty set of values for each attribute a ∈ A.

Ia is the information function from T to Va.
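
To fix the roles of these components, the six-tuple could be rendered as a simple structure such as the following (a hypothetical illustration; the field names are mine, and the language L over the attributes is left implicit):

```python
from dataclasses import dataclass
from typing import List, Set

# Hypothetical rendering of DS = (TGr, T, A, L, Va, Ia) as a structure;
# field names are mine, and the language L over the attributes is left
# implicit. For binary market-basket data, Va = {0, 1}.
@dataclass
class DataStreamWindow:
    tgr_seconds: int              # TGr: length of the time granule
    transactions: List[Set[str]]  # T: transactions arriving within TGr
    attributes: Set[str]          # A: finite nonempty attribute set

    def information_function(self, t_index: int, a: str) -> int:
        # Ia: maps a transaction and an attribute a in A to a value in Va.
        return 1 if a in self.transactions[t_index] else 0
```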

Illustration of Data Stream Handling with Granular Computing

Consider I = {i1, i2, ..., im} to be a set of m binary attributes, called items, in the data stream, and T = {t1, t2, ..., tn} to be the n transactions in the stream. Each transaction t, where t ∈ T, is represented as a binary vector with t[k] = 1 if t contains the item ik, and t[k] = 0 if t does not contain ik.

Consider the following nine transactions in the data stream:

TID-001={b,d,e}, TID-002={a,d}, TID-003={c,d}, TID-004={b,d}, TID-005={b,c}, TID-006={c,d}, TID-007={b,c,e}, TID-008={b,c,d,e}, TID-009={b,c,d}

We can show these transactions in the form of a binary decision table as follows.

Table 1 Transaction dataset of the data stream

Transaction ID | I1 (a) | I2 (b) | I3 (c) | I4 (d) | I5 (e)
TID-001        | 0      | 1      | 0      | 1      | 1
TID-002        | 1      | 0      | 0      | 1      | 0
TID-003        | 0      | 0      | 1      | 1      | 0
TID-004        | 0      | 1      | 0      | 1      | 0
TID-005        | 0      | 1      | 1      | 0      | 0
TID-006        | 0      | 0      | 1      | 1      | 0
TID-007        | 0      | 1      | 1      | 0      | 1
TID-008        | 0      | 1      | 1      | 1      | 1
TID-009        | 0      | 1      | 1      | 1      | 0

From Table 1 it is evident that if an item is present in a transaction, its presence is shown by 1, and if it is not present in a particular transaction, the value of that attribute is 0. For example, the first transaction we receive in the stream consists of the three items b, d, and e, so the corresponding attribute values are 1, while the absence of the other attributes is shown by the value 0.

Let X ⊆ I. We say that a transaction t satisfies X if all the items of X appear in t. The support count of the itemset X is then defined as support(X) = |T_X|, where T_X = {t ∈ T : t satisfies X} and |·| denotes the cardinality of a set; simply put, the number of transactions in T_X equals the support of the itemset X. As in Apriori, a support threshold is chosen, here 2/9, meaning that an itemset present in at least two of the nine transactions is called frequent. The resulting basic information granules are listed in Table 2.

Table 2 Basic information granules (candidate 1-itemsets)

Itemsets | Transactions of granules      | Binary expression of granules | Size of granules
[a]      | {002}                         | 010000000                     | 1
[b]      | {001,004,005,007,008,009}     | 100110111                     | 6
[c]      | {003,005,006,007,008,009}     | 001011111                     | 6
[d]      | {001,002,003,004,006,008,009} | 111101011                     | 7
[e]      | {001,007,008}                 | 100000110                     | 3
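
As an illustration of how the granules of Table 2 could be computed, consider the following sketch (illustrative code of my own, not from [107]):

```python
# Building the basic information granules of Table 2: each item maps to
# the set of transactions containing it, encoded both as a TID list and
# as a bit string (leftmost bit = TID-001).
transactions = [
    {"b","d","e"}, {"a","d"}, {"c","d"}, {"b","d"}, {"b","c"},
    {"c","d"}, {"b","c","e"}, {"b","c","d","e"}, {"b","c","d"},
]

def granule(item):
    bits = [1 if item in t else 0 for t in transactions]
    tids = [i + 1 for i, b in enumerate(bits) if b]
    return tids, "".join(map(str, bits)), sum(bits)

for item in "abcde":
    print(item, granule(item))
# e.g. granule("d") -> ([1,2,3,4,6,8,9], "111101011", 7), as in Table 2
```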

Since the granule sizes of [b], [c], [d], and [e] satisfy the support threshold, those itemsets are frequent 1-itemsets ([a], with granule size 1, is infrequent). Computing with granules, i.e., applying the bitwise AND operation to the binary expressions of the corresponding granules, yields the candidate 2-item compound granules shown in Table 3.

Table 3 Candidate 2-item compound granules (candidate 2-itemsets)

Itemsets | Transactions of granules | Binary expression of granules | Size of granules
{a,b}    | { }                      | 000000000                     | 0
{a,c}    | { }                      | 000000000                     | 0
{a,d}    | {002}                    | 010000000                     | 1
{a,e}    | { }                      | 000000000                     | 0
{b,c}    | {005,007,008,009}        | 000010111                     | 4
{b,d}    | {001,004,008,009}        | 100100011                     | 4
{b,e}    | {001,007,008}            | 100000110                     | 3
{c,d}    | {003,006,008,009}        | 001001011                     | 4
{c,e}    | {007,008}                | 000000110                     | 2
{d,e}    | {001,008}                | 100000010                     | 2
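
The bitwise AND combination that produced Table 3, together with the threshold check that yields Table 4, can be sketched as follows (again an illustration; bit strings are read with TID-001 as the leftmost bit):

```python
# Combining 1-item granules into 2-item compound granules via bitwise
# AND on the binary expressions; a granule size below the threshold
# means the candidate itemset is infrequent.
g = {"b": "100110111", "c": "001011111", "d": "111101011", "e": "100000110"}

min_support = 2  # threshold 2/9 over nine transactions
for x, y in [("b","c"), ("b","d"), ("b","e"), ("c","d"), ("c","e"), ("d","e")]:
    combined = int(g[x], 2) & int(g[y], 2)
    size = bin(combined).count("1")          # granule size = support count
    if size >= min_support:
        print({x, y}, format(combined, "09b"), size)
# {b,c} -> 000010111 (4), {b,d} -> 100100011 (4), ..., as in Table 4
```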

From Table 3 we can see that the last six itemsets (candidate 2-itemsets) satisfy the support threshold, so Table 3 reduces to Table 4, which contains the frequent 2-itemsets.

Table 4 Frequent 2-item compound granules (frequent 2-itemsets)

Itemsets | Transactions of granules | Binary expression of granules | Size of granules
{b,c}    | {005,007,008,009}        | 000010111                     | 4
{b,d}    | {001,004,008,009}        | 100100011                     | 4
{b,e}    | {001,007,008}            | 100000110                     | 3
{c,d}    | {003,006,008,009}        | 001001011                     | 4
{c,e}    | {007,008}                | 000000110                     | 2
{d,e}    | {001,008}                | 100000010                     | 2

Based on Table 4, we can now generate the candidate 3-itemsets, shown in Table 5.

Table 5 Candidate 3-item compound granules (candidate 3-itemsets)

Itemsets | Transactions of granules | Binary expression of granules | Size of granules
{b,c,d}  | {008,009}                | 000000011                     | 2
{b,c,e}  | {007,008}                | 000000110                     | 2
{b,d,e}  | {001,008}                | 100000010                     | 2

These three itemsets are the frequent 3-itemsets; the itemset {c,d,e} is pruned because its granule size is one, which is below the support threshold.

So we can find the association rules from these three frequent itemsets with the support threshold (2/9).

This ends the mining process for the first time granule (the first window), but the stream goes on and new transactions are added to the data set. We then repeat the process explained above; some currently frequent itemsets may become infrequent in the future, while itemsets that are not frequent now may become frequent later.
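
Putting the steps together, one plausible driver for repeating this process on each time granule could look like the following sketch (my own rendering of the level-wise granule mining described above, not an algorithm from the cited papers):

```python
# One plausible driver for a single time granule: level-wise mining with
# granule ANDs, repeated as each new window of transactions arrives.
def mine_window(transactions, min_support):
    items = sorted(set().union(*transactions))
    level = {}
    for i in items:  # level 1: basic granules as integer bit masks
        mask = sum(1 << k for k, t in enumerate(transactions) if i in t)
        if bin(mask).count("1") >= min_support:
            level[frozenset([i])] = mask
    result = dict(level)
    while level:
        size = len(next(iter(level))) + 1
        next_level = {}
        for a in level:
            for b in level:
                u = a | b
                if len(u) == size and u not in next_level:
                    mask = level[a] & level[b]   # AND of compound granules
                    if bin(mask).count("1") >= min_support:
                        next_level[u] = mask
        result.update(next_level)
        level = next_level
    return result

window = [{"b","d","e"}, {"a","d"}, {"c","d"}, {"b","d"}, {"b","c"},
          {"c","d"}, {"b","c","e"}, {"b","c","d","e"}, {"b","c","d"}]
frequent = mine_window(window, 2)   # first time granule; re-run per window
print(sorted(frequent, key=len))    # {c,d,e} is absent: its granule size is 1
```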

Conclusion

In this report we studied how to model a data stream with the help of granular computing, and found that granular computing has great potential to outclass the traditional data stream mining process. A comprehensive study of granular computing techniques would help devise new algorithms and techniques that are simpler and faster than traditional data stream algorithms. In future work we will study other granular computing models and techniques for mining data streams.

References

[22] C.K.-S. Leung, Q.I. Khan, DSTree: a tree structure for the mining of frequent sets from data streams, in: Proc. ICDM, 2006, pp. 928-932.

[23] C.K.-S. Leung, Q.I. Khan, Efficient mining of constrained frequent patterns from streams, in: Proc. 10th International Database Engineering and Applications Symposium, 2006.

[24] H.-F. Li, S.-Y. Lee, Mining frequent itemsets over data streams using efficient window sliding techniques, Expert Systems with Applications 36 (2009) 1466-1477.

[25] H.-F. Li, S.-Y. Lee, M.-K. Shan, An efficient algorithm for mining frequent itemsets over the entire history of data streams, in: Proc. International Workshop on Knowledge Discovery in Data Streams, 2004.

[26] J. Li, D. Maier, K. Tufte, V. Papadimos, P.A. Tucker, No pane, no gain: efficient evaluation of sliding-window aggregates over data streams, SIGMOD Record 34 (1) (2005) 39-44.

[27] C.-H. Lin, D.-Y. Chiu, Y.-H. Wu, A.L.P. Chen, Mining frequent itemsets from data streams with a time-sensitive sliding window, in: Proc. SIAM International Conference on Data Mining, 2005.

[42] Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning About Data, Kluwer, Boston, 1991.

[100] G.S. Manku, R. Motwani, Approximate frequency counts over data streams, in: Proc. VLDB, 2002, pp. 346-357.

[101] J.H. Chang, W.S. Lee, Finding recent frequent itemsets adaptively over online data streams, in: Proc. 9th ACM SIGKDD, 2003, pp. 487-492.

[102] L.A. Zadeh, Fuzzy sets and information granularity, in: M. Gupta, R.K. Ragade, R.R. Yager (Eds.), Advances in Fuzzy Set Theory and Applications, North-Holland, 1979, pp. 3-18.

[105] T.Y. Lin, Granular computing, in: Announcement of the BISC Special Interest Group on Granular Computing, 1997.

[106] L.A. Zadeh, Towards a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic, Fuzzy Sets and Systems 90 (2) (1997) 111-127.

[107] T. Qiu, X. Chen, Q. Liu, H. Huang, Granular computing approach to finding association rules in relational database, International Journal of Intelligent Systems 25 (2) (2010) 165-179.
