This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.
The use of web applications has become increasingly popular in our routine activities, such as reading the news, paying bills, and shopping on-line. As the availability of these services grows, we are witnessing an increase in the number and sophistication of attacks that target them: the theft of personal information and bank funds and other high-tech crime cases can be heard here and there . All the above is the reflection of the information disclosure of the web application back-end data-base. Web application security largely depends on the back-end data-base security.
The data clients request is usually extracted by performing queries assembled using input provided by the users of the applications. If user input is not sanitized rigorously, it is possible to mount a variety of attacks that leverage web applications to compromise the security of back-end data-bases. Unfortunately, due to development time constraints, Web application programmers mostly focus on the web application functionality to serve end-users, but with little regard for the security of programs.The result is that poorly-developed code, riddled with security flaws, is deployed and made accessible to the whole internet. For example, one program may have regardless of the rigorous validation and integrity constraints of user inputs, which gives attackers an opportunity to exploit this flaw: one can change the structure of the query and query original meaning through crafting user input. If those happen, attackers can unauthorized manipulate the back-end data-base, steal secret information about users. All these will result in the destruction of the data-base and data information leakage and so on, leading to serious consequences.
Now data-base protection systems mostly work as bypass equipments, but it will be better if malicious operations of remote user can be blocked directly. The DB-FW we present here is just for the above thinking: it listens the queries are being executing, analyzes them, and blocks the query requests directly if it is unsafe.
II. ATTACKS AGAINST WEB APPLICATION BACK-END DATA-BASE
A. SQL Injection Attacks
SQL injection refers to a class of code-injection attacks in which data provided by the user is included in a SQL query in such a way that part of the user's input is treated as SQL code, thus changing the structure of an SQL query or changing the meaning of the query in an undesirable way. For example, a web application that lets the user list all her registered credit cards of a given type. The pseudo code for this functionality might be as follows:
Result=SQL ("SELECT nb FROM creditcards WHERE user='"+username+"'AND type='"+cctype+"' ;"); Print (Result).
If the user A needs to list all the VIP creditcards, the following SQL query will be executed: SELECT nb FROM creditcards WHERE user = 'A' AND type =' VIP '. There is SQL injection vulnerability in this code. If A wants to view all the creditcards belonging to user B, he could ask for a list of cards of type' OR user ='B. This would cause the following query to be executed: SELECT nb FROM creditcards WHERE user=A' AND type='' OR user='B'; this query returns a list of all B's credit cards to the attacker, causing the user B's information disclosure.
Cross-site Scripting Attack
Two main classes of XSS attacks exist: stored attacks and reflected attacks. In a stored XSS attack, the attacker persistently stores the malicious code in a resource managed by the web application, such as a data-base. The actual attack is carried out at a later time, when the victim requests a dynamic page that is constructed from the contents of this resource. For the dynamic Web sites relying on user inputs in URLs or Web forms to assemble queries dynamically, a malicious script can be hidden in URLs or in Web form inputs. Often these kinds of attacks are used to steal login credentials or other personal information from back-end data-base which can be considered an attack against the data-base   .
If the data submitted by the users of a web-based application is inserted into a data-base, cross-site scripting attempts can be observed at the data-base level and since the malicious scripts are visible in the SQL queries when the data is inserted into the data-base, it is possible to detect this kind of XSS attempts by observing all values as they are inserted   .
C. The Data-centric Attacks Against the Back-end Data-base
Such attacks can also be realized by crafting the user input strings. They are as follows: the first kind, which is often the case that a certain Web form field should only take on limited types of values. A user type field might have the values of Teacher or Student. If a user type of xxx is seen, this might be evidence of an attack  .
The second kind of attack needs two steps: the attacker inserts a specially crafted string into the data-base firstly; it will cause an SQL injection when it is processed at a later time. For example¼Œa website allows users to sign up with whatever username they desire. The web site periodically deletes inactive users with the following script:
old = now () - 3 months
users = sql ("SELECT uname FROM users
WHERE last_login < "+old+";")
for u in users: sql ("DELETE FROM users WHERE uname='" + u + "' ;")
This code is vulnerable to a two-step SQL injection attack. The attacker first creates a user named' OR '1'= '1. Assuming the user creation code is free from SQL injection vulnerabilities, the system correctly creates a new user with the following SQL statement: INSERT INTO USERS VALUES ('\' OR \'1\' = \'1'). The true attack is executed when the periodical cleanup script is run and the script tries to delete this user. Because of the carefully selected username, the script generates the following query to delete the user: DELETE FROM users WHERE uname='' OR '1' = '1'; since the expression '1' = '1' is always true, this statement would delete all users in the data-base. The two kinds of attacks above can be considered as attacks against back-end data-base, and can be detected by analyzing the SQL query statements and user inputs.
III. Related Work
For some Web applications, security of the back-end data-base information is crucial, including high-profile companies and associations, such as Travelocity, FTD.com,
Creditcards.com, RIAA and so on. So we must be with special emphasis on the back-end data-base protection¼Œwhich should begin with analyzing whether the query requests to the back-end data-base is anomaly or not , and study the tool that detects and blocks the attacks.
 propose two SQL injection detection methods. They are both virtual data-base connection drivers. SQLBlock in  can be considered as the prototype of a data-base firewall. GreenSQL  is an open-source data-base firewall which is configured to work in a variety of modes, but it has high false positives due to the detection method.
To determine whether the query requests to the back-end data-base is anomaly or not, we need to analyze the SQL query to be executed.  proposes a new technique for detecting SQL injection attacks, which combines static and dynamic analyzing method, but it has to modify the Web application source code.  presents a system that learns the profiles of the normal data-base access performed by web-based applications using a number of different models which allow for the detection of unknown attacks with reduced false positives and limited overhead. We refer to the statistical models proposed in this paper. Our DB-FW refers to the existing anomaly SQL query detection method and makes some improvement based on the GreenSQL.It analyzes the SQL query from both structure and user inputs. Users can also configure it to work under different ways.
IV. The basic work mode of our DB-FW
The data-base firewall we proposed here is to protect against web application back-end data-base attacks. It works as a proxy, which means that SQL query requests received from the client will first be passed to our DB-FW rather than the original data-base server. The DB-FW analyzes the request: the queries which are analyzed to be anomaly will be blocked by our DB-FW and an empty result will be returned to the client. Otherwise, it will call original data-base server to execute the query.
Figure1 is the architecture of the data-base firewall. It is basically a virtual data-base connectivity driver. By default
Figure 1. DB-FW Architecture
it listens on local port 127.0.0.1:3305 redirecting SQL requests to 127.0.0.1:3306 (the default MySQL setting) .It can work under different ways through configuring it working in the different modes.
A. Data-base Intrusion Protection Mode
In this mode, the system will block the SQL query and return an empty query to the client if it is analyzed anomaly using the heuristic anomaly detection method, else the system will call the original data-base server to execute the query.
1¼‰Heuristic anomaly detection method
Using heuristic anomaly detection method, we must first define some anomaly features of the SQL statements and set a score for each feature. Some of the SQL query anomaly features are as follows: some administrative queries including client query requests are used to create or delete tables and data-bases, change structure of the data tables; some queries which will lead to the sensitive information disclosure if executed, such as queries used to display sensitive data tables, data-base server version (e.g.:version ()) and the current user information(e.g., current_user); others like there is one comment inside SQL query statements, an empty password string, an 'or' token inside a query, an SQL expression that always returns true (SQL tautology) in a SQL query statement. Some of the anomaly features and their corresponding scores are as TABLE I.
Table I. Corresponding Score of Anomaly Features
Access To Sensitive Information
Exist Comment In SQL Query
Expression Always Returns True
Comparison Of Two Constants
There may be many anomaly features (T1, T2 ,T3, T4 ,â€¦,Tn) in one SQL query statement, corresponding anomaly scores are S(T1), S(T2), S(T3), S(T4), â€¦, S(Tn), the sum score of the anomaly features in one SQL statement, which we call anomaly score of the query: SUM(query)=S(T1)+S (T2)+S(T3)+S(T4)+â€¦+S(Tn). It also defines the threshold of the anomaly score T (By default T=30, it can be reconfigured) of the query. In order to find anomalies, our firewall uses its own SQL language laxer to find SQL tokens. It calculates anomaly score for each listening SQL query; any SQL query will be blocked if the anomaly score exceeds the threshold anomaly score that has been defined, e.g. the SQL query statement Q1:
Select * from user where name=''or '1'='1'and pwd=''
For the statement above, there are 3 anomaly features and anomaly score as TABLE II:
Table II¼Ž Corresponding Score of Anomaly Features
Anomaly Features In Q1
exists expression always returns true: 1=1
exists empty password: pwd=''
Anomaly score for Q1
For the statement above, There are 3 anomaly features: T1:exists expression always returns true:1=1;T2:existsemptypassword:pwd='';T3:have'or'token.SUM(query)=S(T1)+S(T2)+S(T3)=30+30+5=65>30.The query will considered being anomaly and be blocked by our database firewall.
In learning mode, the firewall learns normal data-base accessing features of the legitimate users: SQL query statement structure and the user input constant characteristics.
1) Normalization and Data Statistical Models
a)Normalization SQL query statements will be first abstracted as a sequence of tokens. Each token has a flag, which indicates whether the token is a constant or not. Constants are the only elements of an SQL query that should contain user supplied input. Tokens representing user inputs are augmented by a datatype attribute, including string and data here. In general, as specified by SQL standards, strings variables are single quoted, numbers are not quoted. In our design, we consider quoted numbers to be strings as well. And then a feature vector is initialized through extracting all tokens marked as constant and inserting them with their attributes into a list in the order in which they appear in the query . Then a query structure is generated by replacing all occurrences of constants in the SQL query statement with an empty place holder. The query structure characterizes the structure of the SQL query statement. We record the query structure in another list, which we called query structure whitelist, e.g.for the query:
SELECT x FROM y WHERE a='sdd' AND b=92 OR c='str1'
We extract all tokens marked as constant and insert them with their attributes into a list in the order in which they appear in the query, which result in the following feature vector (here we didn't mark the attribute of the constants):
Feather Vector¼šsdd 9 str1 The query structure is: SELECT x FROM y WHERE a='place holder1' AND b= place holder 2 OR c=' place holder 3'. For a lot of queries:
SELECT * FROM user_table WHERE user_id='john' and password='mysecret'
SELECT * FROM user_table WHERE user_id='mary'and password='love777'
SELECT * FROM employee WHERE name='john'
SELECT * FROM assets WHERE price>5000 and category='computer'
We get the following feature vector list:
ID1:SELECT * FROM user_table WHERE user_id='place holder 1' and password='place holder2'
ID2: SELECT * FROM employee WHERE name='place holder 1'
ID3: SELECT * FROM assets WHERE price> place holder 1 and category='place holder 2'
Each feature vector has its corresponding query structure, and in turn each query structure must have its corresponding feature vector. Lots of feature vectors may correspond to only one query structure: such as list of feature vectors above, the feature vectors 1, 2 correspond to only one query structure. And john is an element of the feature vector 1.
b)Data Statistical Model We can get the SQL query structure and feature vector after normalization. As we know, user input must appear in constants, different user inputs should result in the same SQL query structure. An SQL injection may change the query structure and have a different one. But attackers can also change the original meaning of the query without changing the structure of it. All this can be done through crafting user inputs, so it is necessary to analyze the user inputs of the SQL query. The user input corresponds to the feature vector elements after normalization. We know that there are only 2 types of the elements: string and number. We characterize the two types of elements using different statistical models. Here we use the models proposed in : they are String Length, String Character Distribution model, String Prefix and Suffix Matcher model, String Structure Inference model, which are five string-based models plus a data type independent model: Token Finder model. These statistical data models are mainly used to characterize the user inputs, allowing for the detection of uncommon attacks.
2)The two phase in learning mode
In training phase, the FW listens SQL queries from clients, and record script-name of the SQL query (The reason script name should be recorded is that one query may be legal in one Web application, but may not in another one.). And then, the system normalize the SQL query statement, adding the query structure to the query structure whitelist as well as initializing some data statical models for each element of the feather vector that corresponding to the SQL query according to the type of the element. We call the map in which recording the script-name of the SQL query, the structure of it and the models characterize the elements of the corresponding feature vector, a SQL query profile. Lots of query profiles we call a profile set. The system listens SQL queries from the client continuously in training phase, and normalizes them. If a profile of the query is found for the current script-name/structure combination, then each element of the feature vector is fed to its corresponding models in order to update the models' "sense" of normality. If not, a new profile is created and inserted into the profile set data-base. A profile is created by instantiating a set of models for each element of the feature vector. The type of models instantiated is dependent on the data type of the element. For example, an element of type char is associated with models suitable for modeling strings, while an element of type int would be connected to models capable of modeling numerical elements. Here we only define models for two types of elements: string and number. The elements quoted are considered as string type. In order to make sure the accuracy and integrity of the SQL query profile set, the system must listen legitimate SQL queries as many as possible in training phase.
E.g. for the SQL query SELECT a FROM b WHERE x='me' AND y=9OR z='ch'¼Œthe process above is as Figure 2.
Figure 2. Normalization of a query statement
b)Determining threshold of anomaly score
The system still needs to normalize the listening SQL queries, and looks for the profile the same way as previous phase. But the feature vector in this phase is not used to update the models. Instead, the models are used to generate a score that measures how well the feature vector fits the models. An aggregated score called anomaly score of the query is calculated as the sum of the negative logarithm of each individual model score as in . For each profile the highest anomaly score seen during this phase is recorded, which is called the threshold anomaly score of the query. If no profile is found, the firewall control center will create a new profile for the query automatically.
C. Anomaly Detection Mode
The system works in this mode still normalize the SQL query and then check whether the query structure is in the query structure whitelist. If not, the firewall will block the query; else the system will calculate an anomaly score of the query in the way similar to the previous phase, but if the anomaly score exceeds the threshold anomaly score recorded in the previous phase by a certain percentage, the query will be blocked. The system working process in this mode is shown in Figure 3:
Figure 3. Working process in Anomaly Detection Mode
V. Implementation and Evaluation of the System
Our DB-FW system is developed in Linux environment. It is implemented based on the existing open-source data-base firewall-GreenSQL. But we make some improvements on the detection method. The port setting is basically the same as GreenSQL. The evaluation of the system is also under the Linux environment,too.
A. Experiment Configuration
Our test server is a 2 GHz Pentium 4 with 1 GB of RAM Linux 2.6.X. The server was configured with an Apache web server (v2.0.52), the MySQL data-base (v4.1.8), and Tomcat server.
 proposed a SQL injection attack detection technique, which combines static and dynamic analyzing methods. For the performance evaluation, it developed a tool, Amnesia, which implements the technique and used the tool to evaluate the technique on seven Web applications. In the evaluation it targeted the subject applications with a large number of both legitimate and malicious inputs and measured how many attacks the technique detected and prevented. We used the seven Web applications here for the DB-FW performance evaluation and refer to the test bed.
We should change the web application (Java+MySQL+Apache) settings here. For we have access to the application code we just need to add 3305 as a MySQL connection string.
B. Performance Evaluation
For performance evaluation, we configured it working in two different ways: when working in the first way, the DB-FW only needs to detect the anomaly queries and block them using heuristic anomaly detection method; when working under the second way, it first has to learning the normal access to the data-base. The aim is mainly to evaluate our firewall performance working under different ways, including the ability to block anomaly SQL queries and the overhead of the system. We also create some uncommon attacks manually to check whether the firewall can block uncommon attacks in the two ways.
1) The first way
When the DB-FW working under this way, we just need to configured it working in data-base intrusion protection mode, we also need to change the web application setting here. We referred to the testbed proposed in . We also produced a consequence of unknown attacks by manually operating the web sites to check whether our firewall can prevent new types of attacks when working in this way. The results are as TABLE III:
Table III¼Ž Results when working in the first way
Size¼ˆLines Of Code¼‰
The above web applications are all commercial applications, they are different in size. These applications are not large, so we can make comprehensive understanding of the applications and so make it easier for the test data producing. The (legitimate+macious) means that number of legitimate and macious query requests that have been executed, we produce different number of legitimate and macious queries for each web application due to the vulnerabilities in the web application and to produce different kinds of attacks (known and unknown attacks) against back-end data-base for each w b application, the legitimate queries and some of the macious queries we refer to the testbed in . Besides, we produce another set of macious queries manually for each web application. Prevent means the number of requests that have been blocked by the FW; False negatives means that the number of attacks that haven't been blocked. From the results we can see that the false negative is relatively high when it works under this way.
2) The second way
When working under this way, we first need to configure the FW working in learning mode, after the learning mode, our firewall works in anomaly detection mode automatically. To learn the profile of the normal data-base access performed by web applications comprehensively and accurately, we must produce query requests as many as possible in learning mode. Here we produce another certain number of attack-free query requests manually for each web application besides the legitimate data sets used in Amnesia. These data sets are used in two phases of the learning mode. The four web applications are not large in size, so it is easier to learn the profile of the normal data-base access accurately. In the detection mode, we use the same datasets used in the first way. The number of queries used in the two phases of learning mode and the experiment result are as table IV¼š
Table IV¼Ž Results when working in the second way
Size¼ˆLines Of Code¼‰
Anomaly Detection Mode¼ˆLegitimate+Macious¼‰
From the result we can see that, using the same testbed, the false negatives of the firewall working under this way is lower than that in the first way. It is due to the detection method we used here which can not only block the known attacks but also the unknown ones against the back-end data-base.
C. The Performance Overhead
As we know, our DB-FW works as a SQL proxy. As a result, the system adds an additional layer to the existing application infrastructure. In addition, it creates another network socket when connecting to the original MySQL server, from the network point of view. All these add some overhead . We test the system performance overhead using the speedy configuration. Application logs were disabled. We use Apache Bench tool to measure response time of the web site and number of requests executed per second. Here, for the web application Bookstore, we use 500 legitimate requests of the testbed, and send some requests from the client concurrently; we perform roughly measurement of the response time of the website's homepage in different ways. We also test the response time when there is no FW working.Results are as Figure 4:
Figure 4. Performance overhead test
The above are the results, we tested web application, that makes heavy usage of database, we get to performan-
ce decrease of 4-16 % (depends on the number of concurrent requests). For most of the sites, when our firewall is in use, you will feel almost no changes.
VI. Conclusion and future work
In this paper, we propose a data-base firewall which can block the known and known attacks against the web application back-end data-base. It can work under different ways through different mode configuration. From the evaluation, we can see that the false negatives and the false positives are both very low and the performance overhead is relatively low. We emphasize here is that we can not learn every executing query requests to the web application during the learning mode, but we still can ensure the accuracy of the profile sets that have been learned. For one thing, we only need to learn the query structures and characterize the user inputs in the query; all these are a process of inducting and generalizing. For the other thing, there are two phases in learning mode which also improve the accuracy of the profile set defined in learning mode. When the firewall detects an attack, it returns an empty result to the client rather than the error information, which not only can avoid the disclosure of the data-base information, but also ensure the execution of the next query normally.