# Implementation Of Automata Theory In Database Engine Computer Science Essay

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Automata Theory is the study of the abstract mathematical models of machines and the problems they are capable of solving. Automata are used to study and test the limitations of computation and focus on the decidability and the intractability of computers. Decidability tests whether or not a problem can be solved while intractability determines the efficiency of the computer with respect to the increase in size or complexity of the problem. An automaton is a self operating electronic or mechanical device. An automaton consists of a finite set of internal states which accept and respond to an external stimulus to produce an output based on the transition rules applied to the internal states.

Simple Automaton

Figure 1 is an example of a simple automaton. It accepts an input at s0 or the start state and switches to a new state within the automaton based on the evaluation of the stimulus. The output of each state determines the next state of the automaton. Figure 1, can be classified as a finite automaton as there are a finite number of possible states (4) of the automaton. Finite automata define a class of automata for which there exists a finite number of states, transitions and actions.

Figure 1

The main concepts of the automata theory are alphabet, strings and languages. The alphabet may be defined as some finite set of symbols; strings as a finite sequence of symbols derived from this alphabet and the language as a set of strings such that all the symbols of the strings are derived from this alphabet. In Figure 1, the alphabet is {A, B, C, D} while the strings, as shown could be {AB}, {C D} or any combination of members of the alphabet. Alan Turing attempted to define the boundaries of computing machines and created Turing machines which exist as the most general form of automata. These machines could compute the same types of algorithms as modern computers and as such, his tests and theories still apply to modern computation. Turing machines can simulate the functionality of modern computers which implement a category of automata called finite automata. These machines implement algorithms defined in regular languages and grammars and are employed in the design and implementation of modern computing languages and software solutions.

Grammars

Precisely defining natural languages has long been regarded as the primary work of linguist Noam Chomsky. Chomsky sought to use mathematical representation to create a structured, flexible and defined model for natural languages. One of his approaches, context-free grammars, was later adopted for modeling computer languages where a grammar may be defined as: â€¦ A quadruple (Î£; V; S; P), where: 1. Î£ is a finite nonempty set called the terminal alphabet. The elements of Î£ are called the terminals. 2. V is a finite nonempty set disjoint from Î£. The elements of V are called the non terminals or variables.

3. S âˆˆ V is a distinguished non terminal called the start symbol. 4. P is a finite set of productions (or rules) of the form ð›¼ƒ  ð›½

where Î± âˆˆâˆª V *V âˆª V * is a string of terminals and non terminals at least one non terminal and Î² is a string of terminals and non terminals. [1]

There are four distinct types of grammars: type 3 or regular grammars, type 2 or context free grammars, type 1 or context sensitive grammars and type 0 or unrestricted . A Context-free grammar (CFG), more generally called a grammar is such that every production is represented by the Backus-Naur form: ð‘‰ƒ  ð‘¡ Where V is a non terminal and t is a terminal symbol from the language's alphabet. Context free grammars allow for flexible syntax structures and the definition strings in natural language. This is then converted into the strict mathematical structures envisioned by Backus Productions in CFG are generally in one of the two approaches; recursive inference or derivation. In recursive inference, strings of the language body variables are concatenated with any terminals in the body. From this it is inferred that the resulting string is in the language of the variable in the head. Derivation involves expanding the start symbol and all subsequent strings by substituting variables with a production until the string is comprised entirely of terminals. Derivation grammars exist as either Leftmost (LL) grammars or rightmost LR grammars. In LL grammars, the production rules are applied on the left most variable in the body of a production. The leftmost variable must always be expressed as terminals before proceeding since left recursion is not permitted. The production is resolved using the lookahead values LL(k) specified in the grammar. Most LL grammars use a value of LL(1), with localised values to resolve ambiguities. In LR grammars, the converse takes place. The grammar is read from left to right; however the production rules are applied from right to left such that the rightmost variable must always be firstly resolved to a terminal. Like LL grammars, the default lookahead value is LR(k), however, localised values can be used to resolve ambiguities. LR grammars, are relatively easy to implement and can detect syntactic errors more readily than LL grammars, however LL grammars are more easily written.

Database Systems

Database

Data persistence at some level has always been a requirement in programming. Early primitive systems stored data in ad-hoc flat files. These databases were designed for purpose, and with minimal interoperability, required a recompilation of client programs subsequent to any semantic or syntactic changes in the database. The lack of optimization, which was inherent in these databases, resulted from the challenges associated with their fragmentation and redundancy. A database is a formally defined, organized collection of related data about the real world which can be accessed, shared and manipulated by users. In order to achieve this, modern databases are required to store data in a defined format, as prescribed by its related metadata and schema. The data must be stored independent of programs and users while remaining fully secured and accessible to both. The Database Management System (DBMS), the component responsible for this, further manages disk storage and optimization techniques for fast and efficient data management.

Relational Database

The Relational Database (RDBMS) exists as the most common database, with its basic object being the table or relation. Each table is structured with a number of rows (tuples) along with columns (attributes) with values describing each row. The relational model enforces constraints such as unique identifiers, required attributes and table referencing where a tuple ð‘ is related to some other tuple ð‘€ in another table. Through relational algebra, it supports the five basic operations of union, selection, difference, projection and product. However, most RDBMSs implement a higher level language such as SQL to allow users to express queries in a more meaningful manner. Common RDBMSs include Oracle, Microsoft SQL Server and Postgre SQL

STRUCTURED QUERY LANGUAGE

SQL

Structured Query Language (SQL) was developed by IBM in the mid 1970's as means of querying relational databases. SQL, over a short period of time, had become the de facto query language for relational database management systems. It subsequently gained acceptance by the American National Standards Institute and the International Organization for Standardization in 1986 and 1987 respectively. Further revisions were adopted in 1989, 1992, 1999, 2003 and 2006. This scope of this paper is restricted to ANSI compliant SQL 1992. As a declarative language, SQL remains confined to "what" instead of "how" when querying databases. The advantage being simplicity as the user defines the conditions for the data set to be returned, but almost always never provides any instructions on how the data is to be retrieved. This task is undertaken by the database engine and varies with the Database Management Software. SQL can be subdivided into two main parts; the Data Definition Language (DDL), and the Data Manipulation Language (DML) for database management.

The DDL is used to define the database and its constituent objects. It is not involved in data manipulation and includes the commands: ð¶ð‘…ð¸ð´ð‘‡ð¸,,,ð‘Žð‘›ð‘‘ð·ð‘…ð‘‚ð‘ƒ. DDL is restricted to database administrators is used to manipulate the database objects and schemas and to set constraints, relationships, indices and namespaces.

The DML is the more commonly used sub-language and is routinely used by all database users for inserting, retrieving, manipulating data. The DML includes the commands: ð¼ð‘ð‘†ð¸ð‘…ð‘‡,,ð¿ð¸ð‘‡ð¸ and the most frequently used ð‘†ð¸ð¿ð¸ð¶ð‘‡. This project will be focused on the translation of the ð‘†ð¸ð¿ð¸ð¶ð‘‡ statement subset of the DML sub-language.

Select Statement

The select statement comprises of three distinct parts: select, from and where. These parts can be directly linked to the structure of relational algebra constructs, where the select clause is the implementation of the projection operation. The from clause is the implementation of the Cartesian product operation. The where clause corresponds with the selection predicate of the relational algebra involving the attributes of the relations expressed in the where clause (13).

For two collections M and N, both having attributes {ð‘Ž,,} a typical SQL query, selecting {ð‘Ž,ð‘,ð‘} from M and N joining on attribute {ð‘} would be expressed as:

## ð‘†ð¸ð¿ð¸ð¶ð‘‡ð‘Ž,ð‘,ð‘

Attributes- Select Clause

## ð¹ð‘…ð‘‚ð‘€ð‘€,ð‘

Relation - From Clause

## ð‘Šð»ð¸ð‘…ð¸ð‘€.ð‘=ð‘.ð‘;

Predicate - Where Clause

The from clause is firstly evaluated, followed by the where clause and finally the select clause.

Select Clause

The select clause contains the list of attributes or columns that will be returned from the underlying query. The standard select statement implies a ð‘ ð‘’ð‘™ð‘’ð‘ð‘¡ð‘Žð‘™ð‘™ operation aimed at reducing the cost overheads of eliminating duplicates from result set. The ð‘‘ð‘-ð‘ ð‘¡ð‘-ð‘›ð‘ð‘¡ function can be optionally applied to remove duplicate tuples.

It may also contain aggregation and grouping functions that are applied to the attributes before projection. These include numeric operators +, -, * and / as well as aggregators: sum, avg, min, max and count. SQL allows for aggregation to be applied between attributes or a combination of attributes and real numbers (e.g. 1.2, 2.5) or integers (e.g. 1, 2, and 3). A query, returning the average of the sum of the attributes {a, b} of all tuples in M may be expressed as:

## ð¹ð‘…ð‘‚ð‘€ð‘€;

In order to return all the attributes, the asterisk symbol could be substituted for attributes names indicating that all the attributes resulting from the from clause be returned by the query. This is expressed as:

## ð¹ð‘…ð‘‚ð‘€ð‘€;

From Clause

The from clause of the select statement contains the relations to be scanned while evaluating the underlying query expression. An example is as follows:

## ð¹ð‘…ð‘‚ð‘€ð‘€,ð‘

When expressed as a statement with only select and from clauses, the from clause is evaluated as the cartesian product of all the relations. Further filtering can be achieved by explicitly defining joins or filtering in the where clause. Nested queries are fully supported in the from clause and like the relations, can be manipulated within the query.

Where Clause

The where clause of the select statement is the filtering clause allowing a query to return those tuples which are of interest to the user. It is expressed as follows:

ð‘Šð»ð¸ð‘…ð¸ð‘€.ð‘Ž> 5 ð‘Žð‘›ð‘‘ð‘.ð‘<10

This clause follows the from clause and consists of filtering conditions using the logical connectives (and, not and or) as well as string and numeric comparison operators (<,>,=,>=,<=,<>). Both string and numeric expressions can be evaluated as well as special types such as dates (13). Nested queries are fully supported in the where clause and like the relations, can be manipulated within the query.

Aggregation

SQL provides for five built in aggregation functions: avg, min, max, sum and count. Aggregation is supported across different data types including strings and numeric types. However, the functions avg and sum can only operate on numeric data types. A query, returning the average of the sum of the attributes {a, b} of all tuples in M may be expressed as:

## ð¹ð‘…ð‘‚ð‘€ð‘€;

This query produces a result set containing a single value for the average value. The aggregation function is more commonly used with group by functions to produce averages based on a common grouping of tuples instead of the entire relation. The retention of duplicate tuples will affect the accuracy of any aggregation function executed. In relations, this maybe done on a commonly repeated attribute, such as a sales item attribute in a sales table, the removal of any sales item tuple based on this attribute could result in an inaccurate aggregate returned. Duplicates however can be eliminated using the distinct function when needed. The distinct function, while legal by the SQL BNF, effects no changes to the result of max and min and is not permitted with the count (*) function.

Where grouping is used along with an aggregation, further filtering can be accomplished by means of the ð‘•ð‘Žð‘£ð‘-ð‘›ð‘” clause. This clause allows a higher level of filtering based on the values returned and unlike the where clause, filters on groups of tuples rather than individually.

Nested Queries

SQL provides full support for the arbitrary nesting of sub-queries. A sub-query is a select-from-where expression that is nested within another query (13). The nested query may be used in the from clause in instead of a schema relation. The result set produced from the execution of the nested query provides the tuples for the outer query. A query with nesting in the from clause may be expressed as:

## ð¹ð‘…ð‘‚ð‘€ð‘€

ð‘Šð»ð¸ð‘…ð¸ð‘€.ð‘Ž>10

## ) ð‘Žð‘ ð‘‡;

When nested in the ð‘“ð‘Ÿð‘œð‘š clause as a derived relation, the nested query must be renamed using the ð‘Žð‘  clause and the attributes may also be renamed to avoid ambiguities in the result set. This renaming can occur within the derived query or in the select clause of the outer query.

SQL further allows for testing set membership using the ð‘-ð‘› or ð‘›ð‘œð‘¡ð‘-ð‘› functions. This is typically employed in the ð‘¤hð‘’ð‘Ÿð‘’ clause for filtering. An example of testing for test membership is:

## ð¹ð‘…ð‘‚ð‘€ð‘€

ð‘Šð»ð¸ð‘…ð¸ð‘Ž>10

## );

Where, for each tuple in M, its attribute is evaluated for membership in the set returned from the sub-query. The sub-query could return a single aggregate value and as such the where clause could evaluate this using the standard comparison operators. Such an example would be:

## );

Set Operators

SQL supports full set operations including union, intersection, union all and minus. SQL requires that the relations participating in the operation must be compatible. They must have the same number and types of attributes. The union and union all operators append both the results of both queries to produce a result set. The union query by default eliminates duplicate tuples, however union all may be used when needed to retain all tuples. A simple union query on all the attributes of M and N maybe expressed as:

## (ð‘†ð¸ð¿ð¸ð¶ð‘‡âˆ-ð¹ð‘…ð‘‚ð‘€ð‘€);

All the set operators can be similarly used. They can both be nested or contain full SQL nested queries.

These represent the basic structure of SQL statements.

Data Types

IQL like SQL supports integers (e.g. 0, 1, and 2) and Boolean types (e.g. True | False) as well as floats (e.g. 1.5, 0.012); implemented as Java primitives. It also supports strings enclosed in single quotes ('e.g. Jamie') and date time objects (e.g. dt '2007-08-14'), both of which are represented as Java String objects. Other data types supported include lists (e.g. 1, A, C), bags and sets

For example the following SQL UNION ALL statement: ð‘†ð¸ð¿ð¸ð¶ð‘‡ð‘-ð‘‘,ð‘…ð‘‚ð‘€ð‘ ð‘¡ð‘Žð‘“ð‘“ð‘Šð»ð¸ð‘…ð¸ð‘-ð‘‘> 5; ð‘ˆð‘ð¼ð‘‚ð‘ð´ð¿ð¿ð‘†ð¸ð¿ð¸ð¶ð‘‡ð‘-ð‘‘,ð‘›ð‘Žð‘šð‘’ð¹ð‘…ð‘‚ð‘€ð‘ ð‘¡ð‘¢ð‘‘ð‘’ð‘›ð‘¡; ð‘Šð»ð¸ð‘…ð¸ð‘-ð‘‘> 5;

The second list or bag is simply appended to the first maintaining duplicate tuples. The SQL ð‘ˆð‘ð¼ð‘‚ð‘ has an equivalent operator in IQL. Where a query such as ð‘ˆð‘ð¼ð‘‚ð‘ð´ð¿ð¿ returns duplicates, the DISTINCT function ð‘‘ð‘-ð‘ ð‘¡ð‘-ð‘›ð‘ð‘¡ð‘ ð‘’ð‘¡1 ++ ð‘ ð‘’ð‘¡2 can be used to remove duplicates and create the equivalent UNION result list. Alternatively the ð‘ˆð‘ð¼ð‘‚ð‘ operator in IQL (ð‘ˆð‘ð¼ð‘‚ð‘ð‘ ð‘’ð‘¡1 ð‘ ð‘’ð‘¡2) creates an SQL ð‘ˆð‘ð¼ð‘‚ð‘ equivalent result set. The SQL INTERSECT statement: ð‘†ð¸ð¿ð¸ð¶ð‘‡ð‘-ð‘‘,ð‘…ð‘‚ð‘€ð‘ ð‘¡ð‘Žð‘“ð‘“ð‘Šð»ð¸ð‘…ð¸ð‘-ð‘‘> 5; ð¼ð‘ð‘‡ð¸ð‘…ð‘†ð¸ð¶ð‘‡

ð‘†ð¸ð¿ð¸ð¶ð‘‡ð‘-ð‘‘,ð‘…ð‘‚ð‘€ð‘ ð‘¡ð‘¢ð‘‘ð‘’ð‘›ð‘¡ð‘Šð»ð¸ð‘…ð¸ð‘-ð‘‘> 5;

SQL

UNION

UNION ALL

INTERSECTION

MINUS

The complete list of comparison operators as translated in the where clause is listed below

## =

in

not in

like

not

The group by clause is moved to the head of the comprehension and the grouping co column separated using parenthesis and commas. This is straightforward and requires simple rules, while making no changes to the body of the comprehension. Having identified the possible areas for attention in translating from IQL to SQL, the translator can be implemented based in these defined rules. The approach provides a suitable development environment and language to create an accurate and useable solution.

Supported Query Types

The translator was created to support as a large a subset as possibly from the SQL ANSI 1992 standard. The development time and effort needed, meant it was kept confined to a more focused subset. The subset of queries supported includes: The SELECT FROM WHERE GROUP BY structure. The select clause requires fully qualified column names of the form table.name. It allows for aggregation over the columns returned as well as arithmetic operations involving one or more columns. Aggregation and arithmetic functions supported are:

Sum

Count

Avg

Max

Min

## /

The FROM clause supports the use of both tables and nested queries. Queries can be arbitrarily nested; however nested queries with aggregate columns would return longer columns names since they cannot be aliased. Therefore a nested query which performs some aggregation on two columns {a, b} would have the resulting column referenced in the outer query as {a_b}. This holds true for nested queries in the WHERE clause. Correlated queries are not supported by the translator in both the FROM and WHERE clauses. The translator supports the standard WHERE clause, including set membership and comparisons. However it does not support the SQL functions "EXISTS" or "BETWEEN" as well as their negated forms. The GROUP BY clause is supported by the translator, however aggregation is only allowed over a single column and the HAVING clause is not supported. The translator supports the distinct function across the entire result set, however does not allow a count (*) as is used in SQL without specifying a column. The list below shows the general types of queries supported:

Simple SELECT FROM

Simple SELECT FROM WHERE (Evaluating Numbers)

Simple SELECT FROM WHERE (Evaluating Strings)

Nested Queries in WHERE Clause.

Set Operators in the Outer Statement

Nested Queries in the FROM Clause

Select With Aggregation

Select With Group By

Nested Set Operators

Aggregation over Grouping

Implementation of Automata in Database

Finite State Automata for SQL Query

Figure 2

The database server as a finite-state automatonhttp://i.msdn.microsoft.com/ms810436.servrapp_2%28en-us,MSDN.10%29.gif

Figure 3

We can rewrite the control flow as a diagram as depicted in Figure 3. Each circle is a state that the client-server communication can be in, and the arcs between the circles depict actions that the server performs to transform the communication from one state to the other. Note that at the end of each arc label there is an I/O call-either a read from or a write to the client, and furthermore, each I/O call is immediately followed by a state transition. Thus, with the information about the state the communication is in, the server knows exactly what to do with the I/O that just completed. Rewriting this diagram into code is easy.

State diagram of the SmallLibrary in SQL

Aa902653.smalllibrarysample_fig2(en-us,SQL.80).gif

Figure 4

Figure 4. State diagram of the SmallLibrary Web application, including notification services actions that have been added to the existing system.

Eliminating SQL Injection Attacks - A Transparent Defense Mechanism

The World Wide Web has experienced remarkable growthin recent years. Businesses, individuals and governmentshave found that web applications can offer efficient and reliablesolutions to challenges of communicating and conductingcommerce in the 21th century. Various corporate bodieswhose business model completely focuses on the Weblike Google, Yahoo, Amazon etc. have taken web interactions

to newer heights. As many enterprise applicationsdealing with sensitive financial and medical data turn online,the security of such web applications has come underclose scrutiny. Compromise of these applications representsa serious threat to organizations that have deployedthem, and also to users that trust these systems to store confidential data. The potential downtime and damages thatcould easily amount to millions of dollars have also prohibitedmany mission critical applications, which could greatly benefit users, from going online. Hence, it is crucial to protect these applications from targeted attacks.

However, the current state of application security leaves much to be desired. The 2002 CSI and FBI revealed that, on a yearly basis, over half of all databases experience at least one security breach and an average episode results in close to \$4 million in losses. A recent penetration testing study of more than 250 Web applications concluded that at least 92% of Web applications are vulnerable to some form

of malicious intrusions [2]. Recent U.S. industry regulations such as Sarbanes-Oxley Act, try to enforce strict security compliance by application vendors [3] and there is an urgent need to find means of satisfying these requirements. SQL Injection Attacks (SQLIAs) constitute an important class of attacks against web applications. SQLIAs can give attackers direct access to the database underlying an application and allow them to leak/alter con_dential information [4] or to even execute any malicious code [5]. There are many examples of SQLIAs with serious consequences, and the list of victims includes high-pro_le organizations, such as Travelocity, Tower Records, RIAA etc. [6]. The increasing number of web applications falling prey to these attacks is alarmingly high [7] [8] [9]. In fact, SQLIAs have been included in list of top 10 threats to web applications [10].

SQL Injection Attacks

In this section, we present a web application that is vulnerable to a SQLIA and explain how an attacker could exploit this vulnerability. We also discuss various other techniques that can be employed to gain illegitimate access to systems. Consider a typical web application in which an user on a client machine can access services provided by a web server, having a database backend, like an online email account.

When the user enters a login and a password in the web form and presses the Submit button, an URL is generated (http://foo.com/home.jsp?login=guest&pass=test) and sent to the web server. The user input is interpreted by the servlet home.jsp, which then in turn builds a dynamic SQL query, submits the query to the database and uses the response from the database to generate HTML-pages that are sent

back to the user. Suppose query in servlet page is of form;

If the login and password as provided by the user are used, the query to be submitted to the database takes the form;

SELECT * FROM user WHERE login='guest' AND pass='test'

A web site that uses this servlet would be vulnerable to SQLIAs. If the user were to enter [' OR 1=1 ô€€€ô€€€] and [ ] instead of [guest] and [test], the query would take the form;

SELECT * FROM user WHERE login=' ' OR 1=1 ô€€€ô€€€' AND pass=' '

The characters "- -"mark the beginning of a SQL comment, and anything beyond is ignored. The query as interpreted by the database now has a tautology and is always satis_ed; hence returning information about all users. Thus an attacker can bypass all authentication modules gaining unrestricted access to critical information on the server. An SQL Injection Attack (SQLIA) is a subset of the unverified/unsanitized input vulnerability and occurs when an attacker attempts to change the logic, semantics or syntax of a legitimate SQL statement by inserting new SQL keywords or operators into the statement. This definition includes, but is not limited, to attacks based on tautologies, injected additional statements, exploiting untyped parameters, stored procedures, overly descriptive error messages, alternate encodings, length limits, second-order injections and injection of .UNION SELECT., .ORDER BY. And .HAVING.clauses. A detailed explanation of different forms of SQLIAs and ways in which they can be exploited are available in the public domain [11] [12].The widely deployed defense today is to train the programmersand web-developers about the security implicationsof their code and to teach them corrective measures andgood programming practices, as outlined in [13]. However,rewriting or revising the entire lot of existing legacy code isnot an easy process and is not anancially viable option formany organizations. Even this does not guarantee any foolproofdefense and hence we need automated processes todetect the vulnerability and eliminate them. Various othertechniques like use of stored procedures [14], prohibitingdisplay of database server error messages and use of escape

sequences (available in PHP as Magic Quotes) for sanitizinguser inputs are employed as a quick fix solution. Unfortunately,even these security measures are inadequate againsthighly sophisticated attacks as outlined in [15]. Recently,better detection strategies like SQLIA signature detectionhave been proposed by IDS/IPS vendors [16], but their successis still limited to a small subset of the whole range of

attack mechanisms [17]. It is of even greater concern thatproducts like Microsoft SQL Server etc. provide attackersdirect access to command line shell, registry using methodslike xpcmdshell, xpregreadetc.

Related Work

Various SQLIA detection techniques have been proposed in literature but many of them suffer in terms of immediate usability and deployability. Many existing techniques, such as filtering, information flow analysis, penetration testing, and defensive coding, can detect and prevent a subset of the vulnerabilities that lead to SQLIAs. Techniques that employ input validation are prone to a large number of false positives and yet there is no guarantee that there are no false negatives. Safe Query Objects [18] and SQLDOM [19] use encapsulation of database queries to provide a safe and reliable way to access databases but they require developers to learn and use a new programming paradigm. SQLrand

[20] provided a radical shift in the way this problem can be approached using query randomization [21]. However, it could be circumvented if the key used for randomization were to be exposed.

Another popular mechanism has been static analysis of the code for vulnerabilities [22]. The Java String Analysis library [23] provides us with a mechanism for generating models for Java strings and can be extended to generate fairly accurate SQL-query models. JDBCChecker [24] [25] statically checks for the type correctness of dynamically generated SQL queries. Although these techniques are effective, they cannot capture more general forms of SQLIAs that generate syntactically and type correct queries. The authors use automated reasoning in [26] to detect tautologies in the dynamically generated SQL

queries, but other forms of SQLIAs still go undetected. Recently, researchers have been exploring the use of static analysis in conjunction with runtime validation [27] to detect instances of SQLIAs. In [28], the authors have proposed the use of parse trees to detect malicious user input. In [29] [30], the authors have used an automaton construction technique to defend against SQLIAs. However, these techniques still require modification of the application source code which may not be preferable to most developers

and their organizations in general. Also there is an additional runtime analysis overhead in terms of execution time which cannot be avoided due to the sequential nature of the analysis techniques. Also the element of access control is not captured in such models which can, in the theoretically worst case, still allow SQLIAs to occur for certain poor implementations of the application.

Contribution

SQLIA detection combining static & runtime analysis with following features:

1. No code modi_cation required, simple web server patch required (SQLIAs captured by altered data flow path).

2. Optimized runtime analysis using SQL-graphs, and SQL query validation in parallel, for faster webpage accesses.

3. SQLIAs using access control violations in the script and different character encodings also captured.

SQL-graph Representation

We can thus construct a SQL-FSM for each of the hotspots in the program. These data structures now capture the semantics of the different SQL queries that are to be sent to the database at runtime. Any user input would be compared against this template and any change in the SQL-FSM structure would indicate a possible SQLIA.We note that running each and every query under the scanner at runtime could be an expensive process. Given that the user input would realistically consist of a few strings only but the number of SQL queries that get executed in a program could be very large, we now try to optimize number of queries that need to be put under the scanner during runtime to ensure the validity of dynamically generated queries, using a SQL-graph.

Figure 5. NDFA and SQL-FSM for Hotspot

The SQL-graph in Fig. 6 represents 4 different SQL queries in the program as nodes within a logical boundary, and 3 different user inputs as being outside the logical boundary. If a particular user input (I) is used in a SQL query (Q), the relationship (R) between the two nodes is indicated by an undirected link between the 2 nodes. We now define dependencies (D) in the SQL-graph as links that point from one

SQL query to another SQL query such that the user inputs used by the former is a proper superset of the user inputs used by the latter. For SQL queries that use the same set of user inputs, one of them is chosen as a representative query and is made to point to the others. We see the dependencies

represented as directed arrows in the SQL-graph. Drawing equivalence to Code 1, Q1, Q2 and Q3 represent the 3 different SQL queries (also the 3 different hotspots in this case), while I1 and I2 represent the user inputs login and pass. Q4 and I3 could possibly correspond to some other hotspot in the program not represented in the code snippet.

Figure 6. SQL-graph Representation

The concept of SQL-graph is used to reduce runtime scanning overhead by restricting the number of queries that need to be scanned along any execution path that is taken in the program. SQL queries that do not use user inputs are not included in the SQL-graph. Only the SQL queries that are exposed to the user inputs in some form or the other (string manipulations included) are included in the SQL-graph epresentation. The choice of such a representation and the resulting bene_ts in terms of runtime verhead would be explained as part of Runtime Validation. 4.2. Runtime Validation During runtime, the SQL queries (with the user inputs embedded) are compared against the corresponding SQLFSMs

to check for their validity. If the user inputs cause the dynamically generated SQL queries to not conform to the semantics of the intended SQL queries as in the SQLFSMs, then they are _agged as SQLIAs, else they are passed through. Fig. 4 shows the case where an SQLIA is not caused and the query is passed through. Also, it shows the second example where an SQLIA has been caused and

hence gets rejected as a potentially malicious query. The literals along both the static SQL-FSM and the runtime SQL-FSM, as one traverses from the Start node to the End node, should be identical. The other check that can be enforced is that the length of the SQL-FSM chain for a particular instance is exactly the same for the static and runtime SQL-FSMs. Thus SQLIAs employing tautologies and injecting additional statements can be captured by this technique. The case where alternate encodings like URL Encoding,

UTF-8 etc. are used by the attackers can also be addressed by requiring the runtime validation to occur only after all the user input has been converted to a single encoding format as interpreted by SQL Engine in database server.

Figure 7. Program Execution Time w/ and w/o SQLIA Protection (Sequential/Parallel Architecture)