Network security management in a context-aware enterprise through SAT




Security Management In an enterprise network

Security management in an enterprise network is an undertaking of supreme importance in the world today: corporations, universities, schools, and many government organizations depend on the availability and functionality of these networks for communication, data storage and retrieval, and many other day-to-day activities and responsibilities. Enterprise networks continue to grow in both size and complexity as increasingly more tasks are automated and more information is stored and made available through the network. With this growth, concerns for network security continue to grow as well. Network security management has become an ever-changing, never-completed chore.

Security risks exist in almost every enterprise network, often undetected or unknown to those assigned to maintain it. Vulnerabilities are regularly discovered in almost every software application. The number of vulnerabilities published (that is, discovered and made known to the public) increased dramatically several years ago and has remained consistently high.

Extrapolating from the number of vulnerabilities published as of April 2009, a comparably high number of vulnerabilities will be published in 2009. This trend seems likely to continue, with a large and increasing number of vulnerabilities discovered and published in each ensuing year.


MulVAL is a security analysis tool for automatically recognizing security vulnerabilities and multistage attacks that can potentially lead to exploitation of network resources [39]. The MulVAL attack graph toolkit uses this gathered data for the generation of a logical dependency attack graph [40]. My research and corresponding implementation utilize the MulVAL analysis engine and attack graph toolkit, although it easily could be applied to other attack graph models with similar semantics.

Technical background

In this section, we first recall the basics of the most commonly used DPLL search procedure. Then, we introduce some computational features of modern SAT solvers. Finally, a brief description of multicore-based architectures is given.

2.1 DPLL search

Most state-of-the-art SAT solvers are based on the Davis, Putnam, Logemann and Loveland procedure, commonly called DPLL [10]. DPLL is a backtrack search procedure; at each node of the search tree, a decision literal is chosen according to some branching heuristic. Its assignment to one of the two possible values (true or false) is followed by an inference step that deduces and propagates forced literal assignments such as unit and monotone literals. The assigned literals (the decision literal and the propagated ones) are labeled with the same decision level, starting from 1 and increased at each decision (or branching), until a model is found or a conflict is reached. In the first case, the formula is answered satisfiable, whereas in the second case we backtrack to the last decision level. Many improvements have been proposed over the years to enhance this basic procedure, leading to what are now commonly called modern SAT solvers. We also mention that some look-ahead-based improvements are at the basis of another kind of DPLL SAT solver (e.g., Satz [28], kcnfs [12], march-dl [22]) that is particularly efficient on hard random and crafted SAT categories.
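The decide-propagate-backtrack loop described above can be sketched in a few lines. This is an illustrative Python sketch, not the thesis's implementation: it propagates unit clauses only (no monotone-literal rule, no clause learning), and the clause encoding (lists of signed integers) is an assumption.

```python
def unit_propagate(clauses, assignment):
    """Repeatedly assign literals forced by unit clauses; None on conflict."""
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            unassigned = [l for l in clause if abs(l) not in assignment]
            if any(assignment.get(abs(l)) == (l > 0) for l in clause):
                continue  # clause already satisfied
            if not unassigned:
                return None  # clause falsified: conflict
            if len(unassigned) == 1:
                lit = unassigned[0]
                assignment[abs(lit)] = lit > 0  # forced assignment
                changed = True
    return assignment

def dpll(clauses, assignment=None):
    """Backtrack search: returns a model (dict var -> bool) or None."""
    assignment = dict(assignment or {})
    if unit_propagate(clauses, assignment) is None:
        return None  # conflict -> backtrack
    variables = {abs(l) for c in clauses for l in c}
    free = [v for v in variables if v not in assignment]
    if not free:
        return assignment  # all variables assigned, no conflict: model found
    v = free[0]  # decision variable (a real solver uses a heuristic here)
    for value in (True, False):  # branch on both polarities
        result = dpll(clauses, {**assignment, v: value})
        if result is not None:
            return result
    return None

# (x1 or x2) and (not x1 or x3) and (not x3)
cnf = [[1, 2], [-1, 3], [-3]]
model = dpll(cnf)
```

Here unit propagation alone already decides the formula: [-3] forces x3 false, which forces x1 false, which forces x2 true.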

2.2 Modern SAT solvers

Modern SAT solvers [31, 14] are based on the classical DPLL search procedure [10] combined with (i) restart policies [19, 25], (ii) activity-based variable selection heuristics (VSIDS-like) [31], and (iii) clause learning [30]. The interaction of these three components is performed through efficient data structures (e.g., watched literals [31]). All the state-of-the-art SAT solvers are based on variations of these three important components. Modern SAT solvers are especially efficient on "structured" SAT instances coming from industrial applications.
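Component (ii), the activity-based (VSIDS-like) heuristic, can be sketched minimally: variables appearing in conflicts get their activity bumped, all activities decay periodically, and the free variable with the highest activity is branched on next. The class and method names are hypothetical; real solvers use a heap and exponentially growing bump increments rather than this toy version.

```python
class VSIDS:
    """Toy VSIDS-like heuristic: bump variables seen in conflicts, decay all."""
    def __init__(self, variables, decay=0.95, bump=1.0):
        self.activity = {v: 0.0 for v in variables}
        self.decay, self.bump = decay, bump

    def on_conflict(self, conflict_clause):
        # reward variables involved in the most recent conflict
        for lit in conflict_clause:
            self.activity[abs(lit)] += self.bump

    def on_decay(self):
        # periodically shrink all activities so recent conflicts dominate
        for v in self.activity:
            self.activity[v] *= self.decay

    def pick(self, assigned):
        # branch on the unassigned variable with the highest activity
        free = [v for v in self.activity if v not in assigned]
        return max(free, key=lambda v: self.activity[v]) if free else None
```

For example, after two conflicts involving variable 2, the heuristic prefers branching on 2 over variables 1 and 3.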


The aim of the project is to provide enhanced network security techniques for an enterprise.

- Implementing the SAT applications

- Correcting misconfigurations to achieve both security and usability

Research Problem

Enterprises require secure network management for processing enterprise data. Enterprise networks grow in both size and complexity, and with this growth security management must be enhanced. In any network, configuration management is a crucial problem. Earlier works provide only the security attack path information and do not address how to correct the security problems and reconfigure the network. The present proposal is based on SAT (Boolean satisfiability) solving techniques and helps to reason about the attack, usability requirements, the cost of actions, and so on. This is an enhancement of the existing system, providing efficient network management in enterprises.


Present researchers find it hard to design and maintain a centralized database architecture for global roaming in mobile networking, which increases interest in research toward providing an appropriate database for global roaming. In this project, I have designed a multi-tree database architecture of a centralized nature using Visual Basic .NET. This design shows how a centralized database can maintain each user's registration, service profiles, call setup, and location updates.


The thesis is organized as follows:

Chapter 1: Introduction: It gives the background, aim, objectives, and research problems of the thesis.

Chapter 2: Review of the existing database architecture in cellular mobile communication systems: It presents the strategies proposed to overcome the drawbacks in the existing two-level database architecture.

Chapter 3: System Analysis: It describes the call setup, location registration, and updates of the existing system and the proposed system.

Chapter 4: Design and Development: It provides detail about the database design, software requirements for the design, the data flow diagram, and testing of the system.

Chapter 5: Implementation: It explains how the implemented design works, and screenshots of the output are provided.

Chapter 6: Conclusion and future work: It concludes the thesis and suggests directions for future work.


High Dimensional Indexing:

A number of techniques have been introduced to address the high-dimensional indexing problem such as the X-tree [5] and the GC-tree [6]. Although these index structures have been shown to increase the range of effective dimensionality, they still suffer performance degradation at higher index dimensionality.

Feature Selection

Feature selection techniques are a subset of dimensionality reduction targeted at finding a set of untransformed attributes that best represent the overall data set. These techniques are also focused on maximizing data energy or classification accuracy rather than query response. As a result, selected features may have no overlap with queried attributes.

Index Selection

The index selection problem has been identified as a variation of the Knapsack Problem, and several papers proposed designs for index recommendations based on optimization rules. These earlier designs could not take advantage of modern database systems' query optimizer. Currently, almost every commercial RDBMS provides the users with an index recommendation tool based on a query workload and uses the query optimizer to obtain cost estimates. A query workload is a set of SQL data manipulation statements. The query workload should be a good representative of the types of queries that an application supports.

Automatic Index Selection

The idea of a database that can tune itself by automatically creating new indexes as queries arrive has been proposed. In one such proposal, a cost model is used to identify beneficial indexes and decide when to create or drop an index at runtime. Costa and Lifschitz propose an agent-based database architecture to deal with automatic index creation. Microsoft Research has proposed a physical-design alerter to identify when a modification to the physical design could result in improved performance.

Index Selection

Index selection is also a method of artificial selection in which several useful traits are selected simultaneously. First, each trait that is going to be selected is assigned a weight reflecting the importance of the trait. For example, if you were selecting for both height and coat darkness in dogs, and height was more important to you, you would assign it a higher weighting: height's weighting could be ten and coat darkness's could be two. This weighting value is then multiplied by the observed value in each individual animal, and the scores for the characteristics are summed for each individual. The result is the index score, which can be used to compare the worth of each organism being selected. Only those with the highest index scores are selected for breeding via artificial selection.
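The weighted-sum arithmetic above can be illustrated directly; the weights and trait values below are hypothetical examples, not data from the text.

```python
def index_score(weights, observations):
    """Selection index: weighted sum of observed trait values."""
    return sum(weights[trait] * observations[trait] for trait in weights)

# hypothetical weights: height matters more than coat darkness
weights = {"height": 10, "coat_darkness": 2}

# hypothetical observed trait values for two individuals
dogs = {
    "rex":  {"height": 60, "coat_darkness": 8},
    "fido": {"height": 55, "coat_darkness": 9},
}

scores = {name: index_score(weights, obs) for name, obs in dogs.items()}
best = max(scores, key=scores.get)  # individual chosen for breeding
```

Rex scores 10*60 + 2*8 = 616 against Fido's 10*55 + 2*9 = 568, so Rex is selected despite the lighter coat.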

This method has advantages over other methods of artificial selection, such as tandem selection, in that traits are selected simultaneously rather than sequentially. Thereby, no useful trait is excluded from selection at any one time, so none will start to regress while the breeder concentrates on improving another property of the organism. However, its major disadvantage is that the weightings assigned to each characteristic are inherently hard to calculate precisely and so require some trial and error before they become optimal for the breeder.

Query Access pattern:

The advantage of using data access objects is the relatively simple and rigorous separation between two important parts of an application which can and should know almost nothing of each other, and which can be expected to evolve frequently and independently. Changing business logic can rely on the same DAO interface, while changes to persistence logic do not affect DAO clients as long as the interface remains correctly implemented.

In the specific context of the Java programming language, Data Access Objects can be used to insulate an application from the particularly numerous, complex and varied Java persistence technologies, which could be JDBC, JDO, EJB CMP, Hibernate, or many others. Using Data Access Objects means the underlying technology can be upgraded or swapped without changing other parts of the application.
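The separation described above can be sketched as an abstract interface plus one swappable implementation. This Python sketch mirrors the Java DAO idiom; the names (`UserDao`, `InMemoryUserDao`, `register_user`) are hypothetical, and a production implementation would sit over JDBC, Hibernate, or a similar persistence technology.

```python
from abc import ABC, abstractmethod

class UserDao(ABC):
    """Interface the business logic depends on; persistence details hidden."""
    @abstractmethod
    def find(self, user_id): ...
    @abstractmethod
    def save(self, user_id, record): ...

class InMemoryUserDao(UserDao):
    """One interchangeable implementation (a real one might use JDBC or Hibernate)."""
    def __init__(self):
        self._rows = {}
    def find(self, user_id):
        return self._rows.get(user_id)
    def save(self, user_id, record):
        self._rows[user_id] = record

def register_user(dao: UserDao, user_id, name):
    # business logic sees only the DAO interface, never the storage technology
    dao.save(user_id, {"name": name})
    return dao.find(user_id)
```

Swapping `InMemoryUserDao` for a database-backed class leaves `register_user` untouched, which is exactly the insulation the DAO pattern promises.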

Dynamic index analysis framework:


Existing System

Existing works in enterprise network security analysis, such as MulVAL, can identify all possible attack paths in an enterprise system and output them in a graph structure.

This structure provides a good foundation for addressing how to automatically find the best way to correct the security problems presented in the analysis results.

After several more iterations of reassessing and reassigning costs, the suggested changes are to patch the existing vulnerability in webServer and either remove one employee's account on the VPN server or ensure that the employee's log-in information will not be compromised.

Query response does not perform well if query patterns change, because it uses a static query workload. Its performance may also degrade as the database size increases.

Traditional feature selection techniques may offer little or no data-pruning capability for the given query attributes.

Disadvantages of the Existing System

* Enterprise network security analysis tools such as MulVAL can identify all possible attack paths in an enterprise system and output them in a graph structure, but they do not by themselves determine how best to correct the problems found.

Proposed System

* In this approach, we use two SAT solving techniques:

MinCostSAT can utilize user-provided discrete cost values, associated with changing a given configuration setting or allowing an attacker a given amount of access, to find a mitigation solution that minimizes the cost in terms of both security risk and usability impairment.
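As a sketch of what MinCostSAT computes, the following exhaustive Python toy finds the satisfying assignment of minimum total cost, where a cost is paid for each variable set true (e.g., a privilege an attacker gains, or a configuration that must be disabled). The encoding is an illustrative assumption; real MinCostSAT solvers use branch-and-bound pruning rather than enumeration.

```python
from itertools import product

def min_cost_sat(clauses, costs):
    """Exhaustive MinCostSAT sketch: cheapest satisfying assignment.
    costs[v] is the cost incurred when variable v is set True."""
    variables = sorted({abs(l) for c in clauses for l in c})
    best, best_cost = None, float("inf")
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        satisfied = all(
            any(assignment[abs(l)] == (l > 0) for l in c) for c in clauses
        )
        cost = sum(costs.get(v, 0) for v, val in assignment.items() if val)
        if satisfied and cost < best_cost:
            best, best_cost = assignment, cost
    return best, best_cost
```

For the clause (x1 or x2) with cost 5 on x1 and cost 1 on x2, the minimum-cost model sets only x2, mirroring how a cheap configuration change is preferred over an expensive one.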

By examining the UnSAT core, a minimal set of configurations and policy requirements that conflict, we narrow the complexity of a reconfiguration dilemma to a straightforward choice between options.
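The UnSAT core idea can be illustrated with a deletion-based minimization sketch: starting from an unsatisfiable clause set, drop every clause whose removal keeps the set unsatisfiable; what remains is a minimal core of genuinely conflicting requirements. The brute-force `satisfiable` helper is a stand-in for a real SAT solver, and the encoding is an assumption for illustration.

```python
from itertools import product

def satisfiable(clauses):
    """Brute-force SAT check over all assignments (toy stand-in for a solver)."""
    variables = sorted({abs(l) for c in clauses for l in c})
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(any(assignment[abs(l)] == (l > 0) for l in c) for c in clauses):
            return True
    return False

def unsat_core(clauses):
    """Deletion-based minimal unsatisfiable core: keep only clauses that are
    essential to unsatisfiability."""
    core = list(clauses)
    i = 0
    while i < len(core):
        trial = core[:i] + core[i + 1:]
        if trial and not satisfiable(trial):
            core = trial  # clause i was not needed for the conflict
        else:
            i += 1        # clause i is essential
    return core
```

For the set {x1, not x1, (x2 or x3)}, the core is just {x1, not x1}: the third clause plays no role in the conflict, so the user is shown only the two requirements that actually clash.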

Past policy decisions by the human user are placed in a partial-order lattice and used to further reduce the scope of the decisions presented to the user.

By this approach, the human user is not expected to fully comprehend the effects, both good and bad, of all aspects of network configuration, but only to make decisions on the immediate relative value of specific instances of usability and security.

We develop a flexible index selection framework to achieve both static and dynamic index selection for high-dimensional data.

A control feedback technique is introduced for measuring performance.

Through this feedback, we can determine when the database could benefit from an index change.

The index selection minimizes the cost of the queries in the workload.

Online index selection is motivated by the fact that the query pattern may change over time. By monitoring the query workload and detecting when there is a change in the query pattern, the system is able to sustain good performance as query patterns evolve.
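One simple way to detect such a change is to compare attribute-usage frequencies between an old and a recent workload window. The threshold and trigger logic below are illustrative assumptions, not the thesis's actual mechanism.

```python
from collections import Counter

def attribute_frequencies(workload):
    """Fraction of queries in which each attribute appears."""
    counts = Counter(a for query in workload for a in set(query))
    return {a: counts[a] / len(workload) for a in counts}

def pattern_changed(old_workload, new_workload, threshold=0.3):
    """Flag a query-pattern change when any attribute's usage frequency
    shifts by more than `threshold` (a hypothetical trigger for
    re-running index selection)."""
    old_f = attribute_frequencies(old_workload)
    new_f = attribute_frequencies(new_workload)
    attrs = set(old_f) | set(new_f)
    return any(abs(old_f.get(a, 0) - new_f.get(a, 0)) > threshold for a in attrs)
```

When the flag fires, the index selection loop can be re-run against the recent window so the recommended indexes follow the evolving workload.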





Input design is the process of converting incoming inputs from the user into a computer-based format. It is one of the most essential functions of a computerized system and one of the major problem areas in system design.


Output design refers to the information and results generated by the system for the end users. It acts as a main motivation in developing the system and a basis for determining the usefulness of the application. The output is designed so that it is attractive, convenient, and informative. Forms are designed in VB.NET with various features that make the console output more pleasing.

As the output acts as an important source of information to the users, a well-considered output design should be implemented to help in decision making. Form design elaborates the way output is presented and the layout available for capturing information.


4. Hardware and Software Requirements

Hardware:

RAM : 512 MB DDR RAM
Hard disk : 20 GB
Floppy drive : 1.44 MB
CD drive : LG 52X

Software:

Front end : Java, Swing
Tools used : NetBeans IDE 6.1
Operating system : Windows XP




Implementation and Testing

In this section, I will show the full application of these techniques to the sample graph shown in Figure 4.3. I will explain first how the various system policies were constructed for this test; next, I will demonstrate the usefulness of the iterative UnSAT core elimination approach; and, finally, I will present the MinCostSAT approach for producing network reconfiguration suggestions.

4.6.1 Policy Construction

Three types of system policies have been described in this chapter:

Security policy - Specifies network privileges that should never be acquired by an attacker. It is likely that this policy does not demand full security, but includes only privileges that truly must not be held by an attacker.

Usability policy - Specifies network configuration settings that should not be altered. Similar to the security policy, it is important that this policy does not insist upon all current network permissions, but includes only the configurations that are fundamentally necessary to the continued usefulness of the network.

Cost policy - Assigns discrete cost values to each variable in the formula C. For privilege variables, the value is determined by the cost incurred if an attacker gains that privilege. For configuration variables, the value is determined by the cost incurred by disabling that configuration setting, thereby altering or reducing the usability of the network.

The most essential part of the system development life cycle is testing the system. In a newly designed system, the number and nature of errors depend on the specification of the system and the time frame given for the design.

All the subsystems in a newly designed system should work together, whereas in the original process the subsystems work independently. During this process, all the subsystems are gathered into one pool and tested to determine whether the whole is able to meet the user requirements.

Testing has been done at two levels: testing the entire system and testing individual modules. During system testing, we make sure that the design works according to our proposed database structure, and the system has been used experimentally. The intention behind designing each test case is to find errors in the system and to see how the system processes the error.

One of the most important stages in software development is testing, as it helps in analyzing the efficiency and reliability of the software. Software testing is performed in the stages given below.

  • Unit Testing
  • System Testing
  • Integration Testing
  • Acceptance Testing

UNIT TESTING:

Unit testing is the preliminary stage of software testing, in which every module in the system is tested with respect to the specifications generated in the integration process. This test is performed to validate the internal logic of each module. Testing of the logic governing interaction between modules is avoided at the beginning. The received inputs and generated outputs are also validated to check whether they fall in the expected range. Unit testing is executed in sequence (i.e., one module at a time), starting from the bottom, which is the lowest module.

The programs in the software application were tested to check whether the applied logic is correct and to identify any possible errors in the code. Each module of the software application was properly tested, and all identified errors were rectified. The system is functioning properly.


SYSTEM TESTING:

In this stage of testing, each module is checked to see whether it is properly integrated with the system. The main objective is to verify whether the software addresses all the requirements identified in the design stage.

System testing is performed to identify faults that the previous stage was unable to find. It is performed from the perspective of the system user in the operational environment, and each module is validated with a view to rectifying forced system failures. Under this testing, low volumes of transactions, generally based on live data, are processed; this volume is increased until the maximum level for each transaction type is reached. The entire system is tested to check whether it can recover after a major system failure, and data integrity is checked to avoid data loss.


INTEGRATION TESTING:

In integration testing, the tested modules are combined into sub-systems, which are then tested. The goal of integration testing is to check whether the modules can be integrated properly, with emphasis on the interfaces between modules.

The different modules were linked together and integration testing done on them.


ACCEPTANCE TESTING:

The objective of the acceptance test is to tell the user about the validity and reliability of the system. It verifies whether the system operates as specified and whether the integrity of important data is maintained. User motivation is very important for the successful performance of the system. All the modules were tested individually using both test data and live data. After each module was ascertained to be working correctly, it was integrated with the system, and the system was tested as a whole. We had the system tested with different types of users. The system design, data flow diagrams, procedures, etc. were well documented so that the system can easily be maintained and upgraded by any computer professional at a later date. Acceptance testing is done with live data provided by the client to ensure that the software works satisfactorily. This test focuses on the external behavior of the system. Data was entered and acceptance testing was performed.


Module Description


Initialize the abstract Representation

Calculate the Query Cost

Index Selection loop

Calculate the performance


Module 1: Initialize the Abstract Representation:

In this module, we monitor the user queries and initialize the abstract representation. We collect the user transactions, find the frequently selected items, and, by applying association rules, calculate the relationships between records and find the support and confidence. Based on these, we initialize the abstract representation. The initialization step uses a query workload and the data set to produce a set of potential indexes P, a query set Q, and a multidimensional histogram H according to the support, confidence, and histogram size specified by the user. The outputs and how they are generated are described as follows.

Potential index set P. This is a collection of attribute sets that could be beneficial as an index for the queries in the input query workload. This set is computed using traditional data mining techniques. Considering the attributes involved in each query from the input query workload to be a single transaction, P consists of the sets of attributes that occur together in a query at a ratio greater than the input support. Formally, the support of a set of attributes A is defined as support(A) = |{i : A ⊆ Qi}| / n, where Qi is the set of attributes in the ith query, and n is the number of queries. For instance, if the input support is 10 percent and attributes 1 and 2 are queried together in greater than 10 percent of the queries, then a representation of the set of attributes {1, 2} will be included as a potential index. Note that because a subset of an attribute set that meets the support requirement will also necessarily meet the support, all subsets of attribute sets meeting the support will also be included as potential indexes (in the example above, both sets {1} and {2} will be included). As the input support is decreased, the number of potential indexes increases. Note that our particular system is built independently of a query optimizer, but the sets of attributes appearing in the predicates from a query optimizer log could just as easily be substituted for the query workload in this step.

If a set occurs nearly as often as one of its subsets, an index built over the subset will likely not provide much benefit over the query workload if an index is built over the attributes in the full set. Such an index will only be more effective in pruning data space for those queries that involve only the subset's attributes. In order to enhance analysis speed with limited effect on accuracy, the input confidence is used to prune the analysis space. Confidence is the ratio of a set's occurrence to the occurrence of a subset. While mining the frequent attribute sets in the query workload to determine P, we also maintain the association rules for disjoint subsets and compute the confidence of these rules. The confidence of an association rule is the ratio with which the antecedent (left-hand side of the rule) and consequent (right-hand side of the rule) appear together in a query, given that the antecedent appears in the query. Formally, the confidence of an association rule {set of attributes A} → {set of attributes B}, where A and B are disjoint, is defined as confidence(A → B) = support(A ∪ B) / support(A). In our example, if every time attribute 1 appears, attribute 2 also appears, then the confidence of {1} → {2} is 1.0. If attribute 2 appears without attribute 1 as many times as it appears with attribute 1, then the confidence of {2} → {1} is 0.5. If we have set the confidence input to 0.6, then we will prune the attribute set {1} from P, but we will keep attribute set {2}. We can also set the confidence level based on the attribute set cardinality: since the cost of including extra attributes that are not useful for pruning increases with increased indexed dimensionality, we want to be more conservative with respect to pruning attribute subsets, so the confidence could take on a value that is dependent on the set cardinality. Although the Apriori algorithm was appropriate for the relatively small attribute sets per query in our domain, a more efficient algorithm such as the FP-Tree [24] could be applied if the attribute sets associated with queries are too large for the Apriori technique to be efficient. Although it is desirable to avoid examining a high-dimensional attribute set as a potential index, another possible solution in the case where a large number of attributes are frequent together would be to partition a large closed frequent item set into disjoint subsets for further examination. Techniques such as CLOSET [25] could be used to arrive at the initial closed frequent item sets.

Query set Q. This is the abstract representation of the query workload. It is initialized by associating with each query the potential indexes that could be beneficial for that query. These are the indexes in the potential index set P that share at least one common attribute with the query. At the end of this step, each query has an identified set of possible indexes.

Multidimensional histogram H. An abstract representation of the data set is created in order to estimate the query cost associated with using each query's possible indexes to answer that query. This representation is in the form of a multidimensional histogram H. A single bucket represents a unique bit representation across all the attributes represented in the histogram. The input histogram size dictates the number of bits used to represent each unique bucket in the histogram. These bits are designated to represent only the single attributes that met the input support in the input query workload. If a single attribute does not meet the support, then it cannot be part of an attribute set appearing in P, and there is no reason to sacrifice data representation resolution for attributes that will not be evaluated. The number of bits that each represented attribute gets is proportional to the log of that attribute's support; this gives more resolution to those attributes that occur more frequently in the query workload. Data for an attribute that has been assigned b bits is divided into 2^b buckets. In order to handle data sets with uneven data distribution, we define the ranges of each bucket so that each bucket contains roughly the same number of points. The histogram is built by converting each record in the data set to its representation in bucket numbers. As we process data rows, we only aggregate the count of rows with each unique bucket representation, because we are just interested in estimating the query cost. Note that the multidimensional histogram is based on a scalar quantizer designed on data and access patterns, as opposed to just data in the traditional case. A higher accuracy in representation is achieved by using more bits to quantize the attributes that are more frequently queried.

For illustration, Table 2 shows a simple multidimensional histogram example. This histogram covers three attributes and uses 1 bit to quantize attributes 2 and 3, and 2 bits to quantize attribute 1, assuming that it is queried more frequently than the other attributes. In this example, for attributes 2 and 3, values from 1 to 5 quantize to 0, and values from 6 to 10 quantize to 1. For attribute 1, values 1 and 2 quantize to 00; 3 and 4 quantize to 01; 5, 6, and 7 quantize to 10; and 8 and 9 quantize to 11. The dots in the column "Value" denote attribute boundaries (that is, attribute 1 has 2 bits assigned to it). Note that we do not maintain any entries in the histogram for bit representations that have no occurrences. Thus, we cannot have more histogram entries than records and will not suffer from an exponentially increasing number of potential multidimensional histogram buckets for high-dimensional histograms.
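The support and confidence computations used to build the potential index set can be sketched over a toy workload. The workload, the 0.25 support threshold, and the helper names are hypothetical; each query is reduced to the set of attributes it touches, exactly as the "one query = one transaction" view above describes.

```python
from itertools import combinations

def support(attr_set, workload):
    """support(A) = |{i : A is a subset of Q_i}| / n over the query workload."""
    return sum(1 for q in workload if attr_set <= set(q)) / len(workload)

def confidence(antecedent, consequent, workload):
    """confidence(A -> B) = support(A union B) / support(A)."""
    return support(antecedent | consequent, workload) / support(antecedent, workload)

# hypothetical workload: each query is the set of attributes it queries
workload = [[1, 2], [1, 2], [1], [3]]
all_attrs = {a for q in workload for a in q}

# attribute sets meeting the (hypothetical) 0.25 support threshold
potential = [
    set(c)
    for size in (1, 2)
    for c in combinations(all_attrs, size)
    if support(set(c), workload) >= 0.25
]
```

Here support({1, 2}) = 2/4 = 0.5, so {1, 2} becomes a potential index, while {1, 3} never co-occurs and is excluded; and since attribute 1 appears alone in one query, confidence({1} -> {2}) is 2/3 while confidence({2} -> {1}) is 1.0.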

Module 2: Calculate the query Cost:

The query cost will be calculated based on the potential index and Query Set.

The query will be used to find the best index. We say the index is best one if it give result for all the querys. Once generated, the abstract representations of the query set Q and the multidimensional histogram H are used to estimate the cost of answering each query by using all possible indexes for the query. For a given query-index pair, we aggregate the number of matches that we find in the multidimensional histogram by looking only at the attributes in the query that also occur in the index (bits associated with other attributes are considered to be don't cares in the query matching logic). To estimate the query cost, we then apply a cost function based on the number of matches that we obtain by using the index and the dimensionality of the index. At the end of this step, our abstract query set

representation has estimated costs for each index that could improve the query cost. For each query in the query set representation, we also keep a current cost field, which we

The current cost for each query is initialized to the cost of performing the query using a sequential scan. At this point, we also initialize an empty set of suggested indexes S.

Cost function. This function estimates the cost associated with using a particular index for a query. The cost function can be varied to accurately reflect the cost model of the database system. For example, one could apply a cost function that amortizes the cost of loading an index over a certain number of queries, or use a function tailored to the type of index in use. Many cost functions have been proposed over the years. For an R-tree, which is the index type used in this work, the expected number of data page accesses has classically been estimated as (1 + 1/Ceff^(1/d))^d, where d is the dimensionality of the data set and Ceff is the number of data objects per disk page. However, this formula assumes that the number of points N approaches infinity and does not account for the effects of high dimensionality or correlation. A more recently proposed cost model is given in [27], where the expected number of page accesses is determined as a function of r, the radius of the range query, d, the data set dimensionality, N, the number of data objects, and Ceff, the capacity of a data page.

Although these published cost estimates can be effective for estimating the number of page accesses associated with using a multidimensional index structure under certain conditions, they have characteristics that make them less than ideal for the given situation. Each of the cost estimate formulas requires a range radius, so the formulas break down when assessing the cost of a query that is an exact match in one or more of the query dimensions. These estimates also assume that the data distribution is independent between attributes and that the data is uniformly distributed throughout the data space.

In order to overcome these limitations, we apply a cost estimate based on the actual matches that occur over the multidimensional histogram built on the attributes that form a potential index. The cost model for R-trees that we use in this work is d^(d/2) * m, where d is the dimensionality of the index and m is the number of matches returned for the query's matching attributes in the multidimensional histogram. Using actual matches eliminates the need for a range radius. It also ties the cost estimate to the actual data characteristics (that is, it incorporates both the correlation between attributes and the data distribution, whereas the published models produce results that depend only on the range radius for a given index structure).

The cost estimate is conservative in that it provides a result at least as great as the actual number of matches in the database. By evaluating the number of matches over the set of attributes that match the query, it takes into account the multidimensional subspace pruning that different index possibilities can achieve. Higher dimensionality indexes carry an additional cost due to the greater overlap of the hyperspaces within the index structure and the extra cost of traversing the higher dimensional structure; the dimensionality term imposes a corresponding penalty on a potential index. Given equal ability to prune the space, a lower dimensional index therefore translates into a lower cost. The cost function could be made more elaborate in order to model query costs more accurately, for example by crediting complete attribute coverage for coverage queries, or by reflecting the index structures actually used in the database system, such as B+-trees. We used this particular cost model because the index type is appropriate for our purposes.
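The histogram-based cost model above can be sketched in a few lines of Python; `index_cost` and its argument names are illustrative, not part of the system described:

```python
def index_cost(d: int, m: int) -> float:
    """Histogram-based R-tree cost estimate: d^(d/2) * m.

    d -- dimensionality of the candidate index
    m -- number of matches the query's attributes return from the
         multidimensional histogram (replacing the range radius that
         published cost models require)
    """
    # The d^(d/2) term penalizes higher-dimensional indexes for the
    # extra hyperspace overlap and traversal cost they incur.
    return (d ** (d / 2.0)) * m
```

Given equal pruning power (the same m), a 2-dimensional index costs 2m while a 3-dimensional one costs roughly 5.2m, so the lower-dimensional candidate wins, as prescribed.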

Module 3. Index Selection Loop

After initializing the index selection data structures and updating estimated query costs for each potentially useful index for a query, we use a greedy algorithm that takes into

account the indexes already selected, iteratively choosing indexes appropriate for the given query workload and data set. For each index in the potential index set P, we traverse the queries in query set Q that could be improved by that index and accumulate the improvement associated with using that index for each query. The improvement for a given query-index pair is the difference between the query's current cost and the cost of using the index. If the index does not provide any positive benefit for the query, no improvement is accumulated. The potential index i that yields the highest

improvement over the query set Q is considered to be the best index. Index i is removed from the potential index set P and is added to the suggested index set S. For the queries

that benefit from i, the current query cost is replaced by the improved cost. After each i is selected, a check is made to determine whether the index selection loop should continue. The input indexing constraint provides one of the loop stop criteria; it could be any constraint, such as the number of indexes, the total index size, or the total number of dimensions indexed. If no potential index yields further improvement or the indexing constraint has been met, the loop exits. The set of suggested indexes S contains the results of the index selection algorithm.

At the end of a loop iteration, when possible, we prune the complexity of the abstract representations in order to make the analysis more efficient. This includes actions such as eliminating potential indexes that do not provide a better cost estimate than the current cost for any query, and pruning from consideration those queries whose best index is already a member of the set of suggested indexes. The overall speed of the algorithm is tied to the number of potential indexes analyzed, so the analysis time can be reduced by increasing the support or decreasing the confidence.

Different strategies can be used to select the best index. The strategy presented assumes an indexing constraint based on the number of indexes and therefore uses the total benefit derived from an index as the measure of index "goodness." If the indexing constraint is based on total index size, then the benefit per unit of index size may be a more appropriate measure. However, this may result in recommending a lower dimensional index and, later in the algorithm, a higher dimensional index that always performs better. The recommendation set can be pruned in order to avoid recommending an index that is not useful in the context of the complete solution.
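A minimal sketch of this greedy loop, assuming an indexing constraint on the number of indexes; the function name and the `cost_of` callback are hypothetical, standing in for the histogram-based estimator:

```python
def select_indexes(potential, queries, current_cost, cost_of, max_indexes):
    """Greedy index selection sketch.

    potential    -- candidate indexes (the set P)
    queries      -- query identifiers (the set Q)
    current_cost -- dict: query -> current best cost (seq-scan cost initially)
    cost_of      -- function (index, query) -> estimated cost using that index
    max_indexes  -- the indexing constraint used in this sketch
    """
    suggested = []
    P = set(potential)
    while P and len(suggested) < max_indexes:
        best, best_improvement = None, 0.0
        for idx in P:
            # Accumulate only positive per-query improvements.
            improvement = sum(
                max(0.0, current_cost[q] - cost_of(idx, q)) for q in queries
            )
            if improvement > best_improvement:
                best, best_improvement = idx, improvement
        if best is None:  # no candidate yields further improvement
            break
        P.remove(best)
        suggested.append(best)
        for q in queries:  # replace costs for queries that benefit
            current_cost[q] = min(current_cost[q], cost_of(best, q))
    return suggested
```

Note that the accumulated benefit is recomputed against the updated query costs on every iteration, so each selected index is judged in the context of the indexes already chosen.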

Module 4. Calculate the Performance

For each query response, we calculate the performance. Based on that performance, index modification is performed.


System Input

The system input consists of new incoming queries and the current set of indexes I, which is initialized to the suggested indexes S produced by the initial index selection algorithm.


The system simulates query execution over a number of incoming queries; the abstract representations of the last w queries are stored as W, where w is an adjustable window size parameter. W is used to estimate the performance of a hypothetical set of indexes Inew against the current index set I. This representation is similar to the one kept for the query set Q in static index selection. When a new query q arrives, we determine which of the current indexes in I most efficiently answers it and replace the oldest query in W with the abstract representation of q. We also incrementally compute the attribute sets that meet the input support and confidence over the last w queries; this information is used in the control-feedback-loop decision logic. The system also keeps track of the current potential indexes P and the current multidimensional histogram H.
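The sliding window W and its incremental bookkeeping can be sketched as follows; the class and method names are assumptions for illustration, and the support check is simplified to a raw count:

```python
from collections import Counter, deque
from itertools import combinations

class QueryWindow:
    """Abstract representations of the last w queries (the set W)."""

    def __init__(self, w: int):
        self.window = deque(maxlen=w)  # oldest entry is evicted automatically

    def add(self, query_attrs):
        """Replace the oldest query with the new query's representation."""
        self.window.append(frozenset(query_attrs))

    def frequent_attribute_sets(self, support: int):
        """Attribute sets meeting the support threshold over the last w
        queries -- the candidates Pnew for the control-feedback logic."""
        counts = Counter()
        for attrs in self.window:
            for k in range(1, len(attrs) + 1):
                for subset in combinations(sorted(attrs), k):
                    counts[frozenset(subset)] += 1
        return {s for s, c in counts.items() if c >= support}
```

A production version would maintain the counts incrementally on each add/evict rather than rescanning the window, but the data flow is the same.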

System Output

In order to monitor the performance of the system, we compare the query performance using the current set of indexes I to the performance using a hypothetical set of indexes Inew. The query performance using I is the sum, over the windowed queries, of the cost of answering each query with its best index from I. Consider the possible new indexes Pnew to be the set of attribute sets that currently meet the input support and confidence over the last w queries. The hypothetical cost is calculated differently based on the comparison of P and Pnew and on the identified best index iq from P or Pnew for the new incoming query.
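Under these definitions, the comparison driving the control feedback loop might look like the following sketch; the 20% improvement threshold is an illustrative parameter, not taken from the text:

```python
def window_cost(W, indexes, cost_of):
    """Total cost over the windowed queries, answering each query with
    its cheapest index from `indexes`; cost_of is the per-(index, query)
    estimator, e.g. the histogram-based model."""
    return sum(min(cost_of(i, q) for i in indexes) for q in W)

def reselect_indexes(W, current, hypothetical, cost_of, threshold=0.8):
    """Signal index reselection when the hypothetical index set would
    answer the recent workload at under 80% of the current set's cost."""
    return window_cost(W, hypothetical, cost_of) < threshold * window_cost(W, current, cost_of)
```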

Given Input:
Query workload

Expected Output:
Potential Index
Query set
Multidimensional Histogram

Calculate the Query Cost:

Given Input:
Potential Index
Query set
Multidimensional Histogram

Expected Output:
Suggested Indexes

Select Best Indexes:

Given Input:
Suggested Indexes

Expected Output:
Best Indexes

Calculate the Performance:

Given Input:
Incoming queries

Expected Output:
Performance of the System


Module Diagrams:

(Diagrams for Modules 1 through 4)

UML Diagrams:

(System Architecture, Activity, Component, and Collaboration diagrams)

Future Enhancement:

In this project, indexes are selected based on frequent itemset mining; as the number of frequent itemsets increases, the number of computations also increases. Future work will focus on reducing the number of computations required while computing the indexes.