Graph Mining Is A Motivated Task Computer Science Essay

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Graph mining is a challenging task that extracts novel and useful knowledge from graph data [1]. Due to novel applications in molecular substructure finding, chemical informatics, social networks, communication networks, computer vision, etc., many large-scale graphs and sets of graphs are becoming available for analysis. Frequent pattern mining on graphs is one of the best ways to obtain useful patterns, e.g. for differentiating graphs in classification and clustering tasks. Schematic graph mining methods presume that the amount of data is limited and that it is therefore possible to store all data in memory or local secondary storage. There is no restriction on processing time, either. In the Data Stream model, we have space and time limitations. Essentially, these limitations involve the use of incremental techniques. Furthermore, as the data source is not need fully stationary, methods must be able to adapt to changes over time in the data as well. Closed patterns are powerful representatives of frequent patterns [2], since they eliminate redundant information. A frequent pattern is closed for a dataset if none of its proper superpatterns has the same support as itself. Other possible definitions of a frequent closed pattern are the following:

a frequent pattern is closed if it is one of the intersections of all transactions that contain it.

a frequent pattern is closed if no superpattern is contained in exactly the same transactions as itself.

Generally, there are many fewer frequent closed graphs than frequent ones. In fact, we can obtain all frequent subgraphs with their support from the set of frequent closed subgraphs with their support. So, the set of frequent closed subgraphs maintains the same information as the set of all frequent subgraphs. The rest of the paper is organized as follows. Sections 2 give background details. Section 3 give details of related work on graph mining. Section 4 gives the details of FCGMiner. Experimental results are given in Section 5, and conclusions are drawn in Section 6.


Generally speaking, a graph is a set of vertexes (nodes) that are interconnected by a set of edges (links). The graphs used in frequent closed graphs mining are assumed to be labeled graphs. In the following paragraphs widely used definitions, used later in this paper, are introduced.

Subgraph: Given two graphs G1(V1, E1, LV1 , LE1 , ϕ1) and G2(V2, E2, LV2 , LE2 , ϕ2), G1 is a subgraph of G2, if G1 satisfies: (i) V1 ⊆ V2, and ∀v ∈ V1, ϕ1(v) = ϕ2(v), (ii) E1 ⊆ E2, and ∀(u, v) ∈ E1, ϕ1(u, v) = ϕ2(u, v). G1 is an induced subgraph of G2, if G1 further satisfies: ∀u, v ∈ V1, (u, v) ∈ E1 ⇔ (u, v) ∈ E2, in addition to the above conditions. G2 is also a supergraph of G1.

Supergraph: If G1 is a subgraph of G2, then G2 is said to be a supergraph of G1.

In general frequent graph pattern mining partakes a common problem which also exits in frequent itemset and sequence mining. Based on the Apriori principle, a frequent n-edge labeled graph may contain 2n frequent subgraphs. This raises a serious problem for mining complete subpatterns: A graph pattern with 64 edges, which is not unusual in molecule structures and chemical compounds, may generate 264 frequent subgraphs, although most subgraphs actually deliver nothing interesting but redundant information if all of them share the same support. In the context of itemset and sequence mining, an interesting solution, called mining closed frequent itemsets and sequences, has been proposed t overcome this difficulty. For example, among 422 chemical compounds that are confirmed to be active in an AIDS antiviral screen dataset provided by Developmental Therapeutics Program in NCI/NIH, there are nearly 1,000,000 frequent graph patterns if the minimum support is 5%. This makes the further analysis on frequent graphs nearly impossible. In the context of itemset and sequence mining, an interesting solution, called mining closed frequent itemsets and sequences, has been proposed t overcome this difficulty. A frequent pattern I with the same support in the dataset. The counterpart alternative in graph mining, mining closed frequent graphs, may lead to promising directions as well. For example, for the dataset mentioned above, among these 1 million frequent graphs, only about 2,000 are closed frequent graphs. If further analysis (such as classification or clustering) is performed on the set of closed graphs instead of on the whole set of frequent graphs, it will achieve the same accuracy with less redundancy and better efficiency. However, until recently, this crucial alternative has not been touched. This is, based on our analysis, partly because of the complexity in graphs.

The goal of this paper is to develop a tool for mining frequent closed graph patterns. Previous research studies in closed itemset mining and subsequence mining have demonstrated that such mining may lead to more efficient methods than mining all patterns by developing some convoluted techniques to prune the pattern search space. However, some elegant properties of itemsets and sequences do not hold in graphs.

Related Work

There is a large body of work on itemset mining from data streams; see the survey [3] and the references therein. We can divide these data stream methods into two different classes depending on whether they use a landmark window, containing all the examples seen so far, or a sliding window. Moment, CFI-Stream and IncMine are state-of-the-art algorithms for mining frequent closed itemsets over a sliding window. CFI-Stream stores only closed itemsets in memory, but maintains all closed itemsets as it does not apply a minimum support threshold, with the corresponding memory penalty. Moment stores much more information besides the current frequent closed itemsets, but it has a minimum support threshold to reduce the number of patterns found. IncMine proposes a notion of semi-FCIs that increases the minimum support threshold for an itemset as it is retained longer in the window. Only a small fraction of these methods deal with frequent closed mining. For trees, the work in [4] shows a general methodology to identify closed patterns in a data stream, using Galois Lattice Theory. The approach is different to the one presented in this paper, as it does not use coresets [5], or weighted frequent mining techniques. In terms of graphs, two main algorithms exist for mining frequent closed graphs:

CloseGraph [6]: based on gSpan [7], a miner for finding frequent subgraphs, based on depth-first search (DFS)

MoSS [8]: an extension to MoFa [9] based on breadth- first search (BFS).

Aggarwal [4] present a mining methodology to find frequent and dense patterns in graph streams. Their notion of density is based both on node-occurrence and edge density, and they present an approach based on finding an approximation of the exact method with the use of a minhash approach. This paper presents a tool for interactive and efficient Mining frequent closed graphs from the evolving data streams [1].


We are interested in sets of graphs, capable with a partial order relation among these graphs. Given two graphs g and g', we say that g is a subgraph of g', or g' is a supergraph of g, if g g'. Two graphs g, g' are said to be comparable if g g' or g' g. Otherwise, they are incomparable. Also we write g g' if g is a proper subgraph of g' (that is, g g' and g ≠ g'). The input to our data mining process is a dataset D of weighted transactions, where each transaction s € D consists of a transaction identifier tid, and a graph. The dataset is a finite set in the standard setting, and a potentially infinite sequence in the data stream setting. Tids are supposed to run sequentially from 1 to the size of D. Following standard usage, we say that a transaction s supports a graph g if g is a subgraph of the graph in transaction s. The number of transactions in the dataset D that support g is called the support of the graph g. A subgraph g is called frequent if its support is greater than or equal to a given threshold minsup. The frequent subgraph mining problem is to find all frequent subgraphs in a given dataset. Any subgraph of a frequent graph is also frequent and, therefore, any supergraph of an infrequent graph is also infrequent (the antimonotonicity property). We define a graph g to be closed (implicitly, w.r.t. to D) if none of its proper supergraphs has the same support as g has in D.

In this paper we focus our examples and experiments on molecular graphs, but our approach is general enough to be applied to any type of graph.

FCGMiner: Frequent Closed Graphs Miner

Use either SI (MKS) FCGMiner is a tool for mining frequent closed graphs from evolving data streams [3] it assumes that the graph data coming from a time varying source is maintained as a graph database. It then generates the implicated graphs from the constructed graph instances which are taken from the evolving data streams. These implicated graphs are saved on the main memory for further processing. From these implicated graphs the subgraph mining is done. A frequent subgraph is said to be closed if it has support count greater than the support count of its supergraph. Eventually, super graphs are generated by the FCGMiner and these super graphs are compared with the subgraphs which are obtained from the subgraph mining. If the subgraph support count is greater than the supergraph support count then the graph is said to be a frequent closed graph.

Interactive retrieval of Closed graphs tree

The generated closed graphs are saved inside the secondary memory as serialized object files [10]. These closed graphs which are in the auxiliary memory can be viewed as a tree. Then using the Stream Query button we can retrieve the closed graphs based on the time period given as the input. Hence it provides an interactive mining of frequent closed graphs.

Experimental Work

We tested our algorithms on synthetic and real datasets. First, we compare our new FCGMiner with MoSS and FFSM [11] as to show the benefits in memory and time. All experiments were performed on a 1.6 GHz Pentium machine with 1.5 GB of memory main memory, running Windows XP SP2 operating system. MoSS is a batch framework for finding frequent molecular substructures and discriminative fragments in a database of molecule descriptions. MoSS is not restricted to molecular data sets; it can mine arbitrary data sets of attributed graphs. Apart from the default MoSS/MoFa algorithm, this framework also contains the gSpan and CloseGraph algorithms as special processing modes. For our experiments we use the following real datasets:

NCIFCRF(AIDS) The open National Cancer Institute (NCI) dataset [12] consists of approximately 250,000 structures. It is based on a large NCI database, built using samples from organic synthesis submitted to NCI for testing. While about half of the NCI database is not accessible, the other half of the structural data is free of any disclosure and usage restrictions and therefore termed "open". This data is public domain and often referred to as the "Open NCI Database".

The Protein Data Bank (PDB) archive is the single worldwide repository of information about the 3D structures of large biological molecules, including proteins and nucleic acids. These are the molecules of life that are found in all organisms including bacteria, yeast, plants, flies, other animals, and humans. Understanding the shape of a molecule helps to understand how it works. This knowledge can be used to help deduce a structure's role in human health and disease, and in drug development. The structures in the archive range from tiny proteins and bits of DNA to complex molecular machines like the ribosome.

Confirmed moderate active chemicals from NIH anti-HIV virus drug screen contain the structures of the chemical compounds used for the drug against the HIV virus.

First, we compare the runtime and memory usage of the FCGMiner with the others. As all methods are implemented in Java, the memory shown is the memory allocated by the Java Virtual Machine. The results of our experiments in Figures 1 to 7. We test MoSS on PDB dataset and FFSM on NIH dataset. For the Open NCI database, which is the smaller dataset, we see that our FCGMiner need more time but less memory. However, our mining tool outperforms the MoSS for the PDB dataset and FFSM for NIH dataset.

Conclusions and Future work

In this paper, we have presented FCG Miner an interactive GUI tool for mining frequent closed graphs on evolving data streams. From our experimental observations, it is clear that this tool gives efficient results for mining the closed graphs over time varying data and it can also retrieve the data interactively based on the time period. Future work will analyze the impact of the size of batches (batches of graph instances) and also the impact of redundant closed graphs inside the secondary memory.