This chapter proposes benchmarking as the research methodology for this study. 2 types of benchmarking are proposed - microbenchmarking and synthetic benchmarking. The chapter begins with an overview of the benchmarking methodology, followed by descriptions of the virtual machine setups used in this study; afterwards, the CPU microbenchmark, disk microbenchmark, and network microbenchmark are discussed in detail. TPC Benchmark H (TPC-H) is then proposed as a synthetic benchmark and discussed in detail. A summary is provided at the end of this chapter.
In order to quantify the isolation issues and how they affect the performance of decision support systems in the IaaS public cloud, a benchmarking methodology is proposed. A benchmark is defined as a program coded in a specific language and executed on the machine being evaluated (Lucas, 1971). Benchmarking is a common practice for system evaluation when there is a need to highlight certain characteristics of the system or to compare the system with other systems (Trancoso, Adamou & Vandierendonck, 2005).
Since the performance of virtual machines in the IaaS public cloud is likely to fluctuate over time due to isolation issues, a long period of observation is required to gain accurate, meaningful, and comparable results. The benchmarking methodology is therefore the most appropriate: benchmarks can be developed to run over a long period of time using predefined parameters and require little human supervision.
To quantify the performance of decision support systems in the IaaS public cloud, 2 types of benchmarking methodology are proposed - microbenchmarking and synthetic benchmarking.
A microbenchmark is a type of benchmark that focuses only on a small characteristic or basic component of the system being evaluated. The main advantage of microbenchmarks is that they are easy to develop and can be used on a wide variety of systems (Ehliar & Liu, 2004). As shown in Barker and Shenoy's measurements, the performance fluctuation of each provisioned resource is different; in order to understand the performance characteristics of a virtual machine in the public cloud, each provisioned resource must be microbenchmarked. The resources include CPU, disk, and network, as they are shared among virtual machines; however, as previously mentioned in chapter 2, RAM performance will not be quantified because RAM is statically provisioned to each virtual machine. The results from the microbenchmarks will facilitate better performance analysis when combined with the decision support system results from the synthetic benchmarks.
To quantify the performance of decision support systems and how the isolation issues affect that performance, a synthetic benchmarking methodology is proposed. A synthetic benchmark is a type of benchmark that covers a wide range of the application behaviour space using parameterisation (Skadron et al., 2003). A suitable set of parameters allows the synthetic benchmark to demonstrate behaviours similar to those of real applications. By using synthetic benchmarking, the performance and various behaviours of decision support systems in the IaaS public cloud can be quantified.
3.2 Virtual machine setups
Amazon EC2 was chosen as the IaaS public cloud provider because it provides customers with extensive management tools for launching virtual machines (instances, in Amazon EC2's terms) and a large selection of setups to launch them with. The virtual machines that will be benchmarked to quantify their performance and how isolation issues affect them are as follows -
Standard small instance
  - Customised AMI based on ami-c517bb9
  - 1 virtual core (1 EC2 compute unit)
  - 1.7 GB memory
  - 160 GB ephemeral storage (moderate I/O performance)
  - 30 GB EBS storage
  - 40 GB RAID0 EBS storage (4x10GB)
  - Microsoft Windows Server 2008 R1 SP2 Data Center Edition
  - 32 bit platform
Standard large instance
  - Customised AMI based on ami-c9517bbd
  - 2 virtual cores (2 EC2 compute units each)
  - 7.5 GB memory
  - 850 GB ephemeral storage (high I/O performance)
  - 30 GB EBS storage
  - 40 GB RAID0 EBS storage (4x10GB)
  - Microsoft Windows Server 2008 R1 SP2 Data Center Edition
  - 64 bit platform
Both ami-c517bb9 and ami-c9517bbd are customised only by installing the applications needed in this study, which makes launching virtual machines faster. 6 applications are installed on both AMIs - OpenVPN for secured access to the virtual machines, LogMeIn for administration purposes, the CPU microbenchmark application, the SQLIO Disk Subsystem Benchmark tool, the Iperf network testing tool, and Microsoft SQL Server 2008 Developer Edition. All virtual machines in this study are instantiated in the EU (Ireland) region (availability zone eu-west-1a), and benchmarks are performed either at different times or at the same time on different virtual machines.
It should be noted that virtual machines based on Windows Server 2008 are instantiated with only a 30 GB EBS storage and do not come with ephemeral storage. To launch a Windows Server 2008 based virtual machine with ephemeral storage, the Amazon EC2 API must be invoked using the following command -
ec2-run-instances <AMI ID> -k <Keypair> -g default -b "xvdg=ephemeral0" -b "xvdh=ephemeral1" -t <Instance type> --availability-zone eu-west-1a --region eu-west-1
3.3 CPU microbenchmark
The purpose of the CPU microbenchmark is to understand the performance characteristics of the provisioned CPU and how the isolation issues affect it. In order to quantify the provisioned CPU effectively, a program that utilises only the provisioned CPU and no other resources is needed. To satisfy this condition, a single-threaded floating point calculation program is used. This program is developed in C++ and compiled without compiler optimisation. The pseudocode of the calculation is provided below (source code is provided in the appendix section) -
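float a = 11.1111111111f;
float b = 10;
float c;
for (int loopNumber = 0; loopNumber < loop; loopNumber++)
{
    b = 10;
    start = clock();
    // inner loop: the timed calculation; innerLoop is a placeholder,
    // the iteration count is not shown here (see the appendix source code)
    for (int i = 0; i < innerLoop; i++)
    {
        b = b*10;
        c = pow(a,b);
    }
    stop = clock();
}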
The calculation (the inner for loop) is tuned to take around 5.5 seconds on 1 EC2 compute unit. After the calculation is completed, the calculation time is recorded in milliseconds, and the program then sleeps for 5 minutes before starting the calculation again. The program runs continuously for 24 hours on a virtual machine with 1 EC2 compute unit in a virtual core (small instance) and on a virtual machine with 2 EC2 compute units in a virtual core (large instance). The reason for quantifying the virtual core with 2 EC2 compute units is to observe how isolation issues affect the performance of a virtual core with multiple EC2 compute units compared to a virtual core with a single EC2 compute unit.
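The timed part of the microbenchmark can be sketched as a small self-contained C++ function. This is only an illustration under stated assumptions, not the study's program: the function name and the `innerIterations` parameter are hypothetical, and `std::chrono::steady_clock` is used here in place of `clock()`.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>

// Runs the floating point calculation innerIterations times and returns the
// elapsed wall-clock time in milliseconds.
// innerIterations is a hypothetical parameter; the value used in the study
// is not stated in the text (it is tuned so one pass takes about 5.5 s).
double timeCalculationMs(long innerIterations) {
    assert(innerIterations >= 0);
    const float a = 11.1111111111f;
    float b = 10.0f;
    volatile float c = 0.0f;  // volatile so the stores are not optimised away
    const auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < innerIterations; ++i) {
        b = b * 10.0f;        // b soon overflows to infinity, as in the pseudocode
        c = std::pow(a, b);
    }
    const auto stop = std::chrono::steady_clock::now();
    (void)c;
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

A driver would call `timeCalculationMs` once, record the returned value, sleep for 5 minutes, and repeat for 24 hours.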
3.4 Disk microbenchmark
The purpose of the disk microbenchmark is to understand the performance characteristics of the provisioned disk and how isolation issues affect it. To quantify the performance of the provisioned disk, a program that utilises the disk extensively while keeping the usage of other provisioned resources to a minimum is needed. The SQLIO Disk Subsystem Benchmark Tool is used to satisfy this condition. SQLIO is a program developed by Microsoft and can be downloaded from Microsoft's website (Microsoft, 2010). SQLIO simulates disk I/O operations according to several parameters. In this study, SQLIO is used to measure the performance of ephemeral storage with moderate and high I/O performance, and of EBS storage in both a traditional setup and a RAID0 setup. The RAID0 setup consists of 4 EBS storages, each containing 10 GB of available space.
SQLIO is set up to perform I/O operations on a 1 GB file that it generates itself. A Windows shell script instructs SQLIO to perform multiple benchmark runs for 24 hours, or 288 runs (full script provided in the appendix section). In each benchmark run, 4 types of I/O operations are performed - sequential write, sequential read, random write, and random read. Each operation is performed using 2 threads and a 64 KB block size for 60 seconds. The I/O bandwidth result from each operation is recorded in megabytes per second. A benchmark run is declared complete when all 4 operations have been performed, which takes exactly 4 minutes, after which the shell script sleeps for 60 seconds before starting the next benchmark run.
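The figure of 288 runs follows from the run length: 4 operations of 60 seconds plus a 60 second sleep give 300 seconds per run, and 24 hours contain 288 such runs. A minimal sketch of this arithmetic, with hypothetical function names, is:

```cpp
#include <cassert>

// Duration of one benchmark run in seconds: every I/O operation runs for
// secondsPerOperation, then the script sleeps once.
int runDurationSeconds(int operations, int secondsPerOperation, int sleepSeconds) {
    assert(operations >= 0 && secondsPerOperation >= 0 && sleepSeconds >= 0);
    return operations * secondsPerOperation + sleepSeconds;
}

// Number of complete runs that fit into the observation period.
int runsInPeriod(int periodSeconds, int runSeconds) {
    assert(runSeconds > 0);
    return periodSeconds / runSeconds;
}
```

Calling `runDurationSeconds(4, 60, 60)` gives 300, and `runsInPeriod(24 * 60 * 60, 300)` gives 288.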
3.5 Network microbenchmark
The purpose of the network microbenchmark is to understand the performance characteristics of the network shared among virtual machines and how isolation issues affect it. To quantify the network performance, network latency and bandwidth must be measured. Therefore, a network microbenchmark run consists of a latency microbenchmark and a bandwidth microbenchmark. A Windows shell script is used to perform multiple benchmark runs for 24 hours, or 288 runs (full script provided in the appendix section). A benchmark run is declared complete after the latency microbenchmark and the bandwidth microbenchmark have been performed, which takes approximately 10-13 seconds. The script then sleeps for 5 minutes before starting the next benchmark run. 3 network routes are benchmarked as follows -
1. Virtual machine to Virtual machine (on the same physical machine)
2. Virtual machine to Virtual machine (on different physical machines)
3. Virtual machine to a machine in Brunel University's network
3.5.1 Latency microbenchmark
For latency microbenchmarking, a program called Traceroute is used (tracert on the Windows platform). Traceroute is a network tool that can determine the route to a destination as well as the latency from the machine running Traceroute to each node in the route. Traceroute sends packets towards the destination with a time-to-live value attached. The time-to-live value is the number of hops a packet may make before the node it has reached discards it. When a node discards the packet, it sends a reply back to Traceroute, so the latency between the machine running Traceroute and that node can be measured in milliseconds. For each time-to-live value, Tracert sends 3 packets; after receiving all replies from a host, Tracert increases the time-to-live value by one and sends 3 packets again.
For the virtual machine to a machine in Brunel University's network case, latency is measured up to only the third hop. This is because after the third hop, the packets travel outside of Amazon EC2's network and the latency will be affected by numerous variables.
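The probing pattern described above can be illustrated with a small simulation. It sends no real packets and is not Tracert's implementation; the `ProbeReply` type, the `probePath` function, and the path lengths used are hypothetical and only show which node answers for each time-to-live value and how a hop limit truncates the measurement.

```cpp
#include <cassert>
#include <vector>

struct ProbeReply {
    int ttl;                  // time-to-live value the probe was sent with
    int hop;                  // 1-based index of the node that answered
    bool reachedDestination;  // true when the answering node is the destination
};

// Simulates time-to-live limited probing over a path of pathLength nodes,
// the last of which is the destination. Probing stops at maxTtl or when the
// destination answers, whichever comes first.
std::vector<ProbeReply> probePath(int pathLength, int maxTtl, int probesPerTtl = 3) {
    assert(pathLength >= 1 && maxTtl >= 1 && probesPerTtl >= 1);
    std::vector<ProbeReply> replies;
    for (int ttl = 1; ttl <= maxTtl; ++ttl) {
        // Each node decrements the value; the node where it reaches zero replies.
        const int hop = ttl < pathLength ? ttl : pathLength;
        const bool atDestination = (hop == pathLength);
        for (int p = 0; p < probesPerTtl; ++p) {
            replies.push_back({ttl, hop, atDestination});
        }
        if (atDestination) {
            break;  // the route is complete
        }
    }
    return replies;
}
```

With a hop limit of 3 on a longer path, `probePath(10, 3)` yields 9 replies (3 probes for each of the first 3 hops), which mirrors the third-hop cut-off used for the Brunel University route.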
3.5.2 Bandwidth microbenchmark
For bandwidth microbenchmarking, the Iperf tool is used (Iperf, 2010). Iperf is a network testing tool that allows users to measure both TCP and UDP bandwidth performance. Iperf consists of a server and a client, and the bandwidth can be measured unidirectionally or bidirectionally. For this study, a client is set up to upload data to a server for 10 seconds in order to measure the maximum upload bandwidth of the client.
For the virtual machine to a machine in Brunel University's network case, due to security restrictions of the university's network, OpenVPN software is used to set up a virtual private network between the virtual machine and the machine in Brunel University's network.
3.6 TPC Benchmark H overview
TPC Benchmark H (TPC-H) is a synthetic decision support benchmark defined by the Transaction Processing Performance Council (TPC) (Seng, 2003; TPC, 2010). TPC-H consists of 22 business oriented ad-hoc queries and concurrent data modifications. The queries and data used in TPC-H are deemed to have industry wide relevance and are given a realistic context. TPC-H evaluates the performance of decision support systems by executing the queries against a standard database under controlled conditions. The performance metric reported by TPC-H is the TPC-H Composite Query-per-Hour Performance Metric (QphH@Size). In this study, TPC-H standard specification revision 2.11.0 is used as the benchmark implementation guideline (TPC, 2010). The purpose of using TPC-H is to understand the performance and various behaviours of decision support systems in the IaaS public cloud.
TPC-H is performed in 6 setups as follows -
1. Small instance using an ephemeral storage
2. Small instance using an EBS storage
3. Small instance using RAID0 EBS storage
4. Large instance using an ephemeral storage
5. Large instance using an EBS storage
6. Large instance using RAID0 EBS storage
In all setups, Microsoft SQL Server 2008 Developer Edition is used as the database management system instead of the Express edition, which is preinstalled on most Windows Server 2008 based AMIs. The Developer edition is needed because it can utilise more than 1 GB of RAM, which the Express edition cannot, and because it provides tools that allow faster TPC-H benchmark implementation.
For each small instance setup, 20 TPC-H benchmarks are executed, which take around 24 hours to complete. For each large instance setup, 50 TPC-H benchmarks are executed, which also take around 24 hours to complete.
3.6.1 TPC-H execution process
A TPC-H benchmark consists of a load test and a performance test; a performance test consists of 2 runs. All tests must be executed under the same conditions, same hardware, same software configuration, same data manager, and same operating system parameters. The execution process (figure 3.1) is implemented by using mainly Windows shell script to call various applications related to TPC-H (full script provided in the appendix section). The sub process of each component in the execution process as well as the implementation is explained in later sections.
Figure 3.1: Execution process of TPC-H benchmark
3.6.2 Load test
A load test is the process that prepares a decision support system for a performance test and can be divided into 4 sub processes - database creation, table creation, data loading, and primary key and foreign key creation (figure 3.2). The load test is implemented as a Windows shell script that executes the related SQL queries using the SQLCMD tool (full script provided in the appendix section). SQLCMD is a command line tool that allows users to execute an SQL query against SQL Server 2008 by specifying the SQL query and a database.
Figure 3.2: Execution process of a load test
The first sub process is TPC-H database creation. This sub process creates a database named TPC-H that will be used for the whole TPC-H benchmark.
The second sub process is table creation. The TPC-H schema used in the table creation consists of 8 tables - region, nation, supplier, part supplier, part, customer, lineitem, and orders (Figure 3.3). The data types and table layouts used to create the tables conform to TPC-H standard specification (SQL query provided in the appendix section).
Figure 3.3: TPC-H schema
The third sub process is data loading. 8 flat files, each named after its corresponding table, are generated using a tool called DBGEN (data generation is explained in a later section). Each flat file is loaded into the TPC-H database using the Integration Services platform provided with SQL Server 2008.
The fourth sub process is primary key and foreign key creation. The TPC-H standard specification does not require primary keys and foreign keys; however, they can improve overall query performance and do not invalidate the benchmark's results. The SQL queries used to create primary keys and foreign keys can be found in the appendix section.
3.6.3 Performance test
A performance test consists of 2 runs. There are 6 components in a run as follows -
1. A query - is defined as one of the 22 TPC-H queries.
2. A query set - is defined as the sequential execution of the 22 TPC-H queries.
3. A query stream - is defined as a single emulated user executing a query set.
4. A refresh stream - is defined as the sequential execution of a pair of refresh functions.
5. A pair of refresh functions - contains a data insertion function (refresh function1) and a data deletion function (refresh function2). The purpose of the refresh functions is to simulate events in which new orders and lineitems are inserted into the database and old ones are deleted from it.
6. A session - is defined as the process of the execution of either a query stream or a refresh stream.
Each run is composed of 2 tests - a power test and a throughput test. Each test utilises all the aforementioned components of a run. A power test is executed after a load test or, in the case of a second run, after the first run is completed (figure 3.4). All query streams and refresh function2 executions are performed using SQLCMD (SQL queries provided in the appendix section). All refresh function1 executions are performed using Integration Services.
Figure 3.4: Execution process of a performance test
3.6.3.1 Power test
The purpose of the power test is to measure the raw query execution power of the decision support system under test by simulating an event where a single user is using the system. The execution process of a power test consists of a query stream (Stream00) and a refresh stream (Refresh00) as illustrated in figure 3.5. The refresh stream is split by the query stream in the power test: refresh function1 is executed before the query stream and refresh function2 after it.
Figure 3.5: Execution process of a power test
3.6.3.2 Throughput test
The purpose of the throughput test is to measure the ability of the decision support system under test to process queries when multiple users are using the system at the same time. The execution process of a throughput test is illustrated in figure 3.6. In this study, 5 query streams (Stream01-05) and 5 refresh streams (Refresh01-05) are executed in a throughput test, which represents 5 concurrent users (N=5).
Figure 3.6: Execution process of a throughput test
3.6.4 TPC-H Data generation
DBGEN is used to generate data for this study. DBGEN is a data generation tool provided by TPC. DBGEN used in this study is version 2.4.0 and compiled using GNU Compiler Collection (GCC) under Cygwin environment (Cygwin, 2010; GCC, 2010). To allow quick benchmark result collection, this study uses data with scale factor of 1 (1 GB) which is the minimum scale factor requirement. The cardinalities and estimated table sizes are shown in table 3.1.
Table 3.1: Cardinalities and estimated table sizes (MB) of the TPC-H database with scale factor 1
Orders and Lineitem are generated with 75% sparse primary keys to allow refresh function1 and refresh function2 to insert and delete data respectively between the primary keys. A refresh function1 inserts new rows equal to 0.1% of the total rows in both tables, while a refresh function2 deletes 0.1% of the total rows in both tables. The data used by the refresh functions to insert and delete rows are also generated by DBGEN. The data for each refresh pair are different; in this study, 12 pairs are used in a TPC-H benchmark (6 pairs for each run).
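The refresh volumes and the count of 12 refresh pairs can be checked with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and the row counts in the usage note (1,500,000 orders and 6,001,215 lineitems) are the standard TPC-H cardinalities at scale factor 1, assumed here rather than taken from the text.

```cpp
#include <cassert>

// Rows touched by one refresh function: 0.1% of the table's rows
// (integer division, so any fraction of a row is dropped).
long rowsPerRefresh(long totalRows) {
    assert(totalRows >= 0);
    return totalRows / 1000;
}

// Refresh pairs needed per benchmark: each run has one power test
// (1 refresh stream) and one throughput test (1 refresh stream per
// query stream).
int refreshPairsPerBenchmark(int runs, int throughputStreams) {
    assert(runs >= 0 && throughputStreams >= 0);
    return runs * (1 + throughputStreams);
}
```

Under those assumed cardinalities, `rowsPerRefresh(1500000)` gives 1500 orders and `rowsPerRefresh(6001215)` gives 6001 lineitems per refresh function, and `refreshPairsPerBenchmark(2, 5)` gives the 12 pairs stated above.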
3.6.5 Query generation and modification
TPC-H contains 22 business oriented ad-hoc queries. These queries are generated using QGEN which is a query generation tool provided by TPC. This study uses QGEN version 2.4.0 compiled using GCC under Cygwin environment.
Each query contains substitution parameters to prevent the decision support system under test from gaining an advantage by returning the result of an identical earlier query. The parameters are generated by specifying a seed number in QGEN. The seed numbers used in a TPC-H benchmark are shown in table 3.2. Queries are generated using the SQLSERVER setting and modified to support SQL Server 2008. The modification is based on Microsoft TPC-H Benchmark kit version 1.0.0, which can be downloaded from the TPC website (TPC, 2010), and complies with the specification (SQL queries provided in the appendix section).
Table 3.2: Seed number used in each query stream (by query stream number)
In this study, 2 primary metrics are used - load time and the Composite Query-per-Hour metric (QphH@Size).
3.6.6.1 Load time
Load time is the metric for load tests. The load time is the duration starting from the table creation to the end of primary and foreign key creation.
3.6.6.2 Composite Query-per-Hour metric (QphH@Size)
QphH@Size is the performance metric for TPC-H. It is the geometric mean of Power@Size and Throughput@Size, and can be calculated as follows -
$$\mathrm{QphH@Size} = \sqrt{\mathrm{Power@Size} \times \mathrm{Throughput@Size}}$$
Power@Size is the performance metric for power tests, based on a query-per-hour rate (factored by 3600). Power@Size is the inverse of the geometric mean of the timing intervals. The units of Power@Size are (Queries per hour x Scale Factor), and it can be calculated as follows -
$$\mathrm{Power@Size} = \frac{3600 \times SF}{\sqrt[24]{\prod_{i=1}^{22} QI(i,0) \times \prod_{j=1}^{2} RI(j,0)}}$$
QI(i,0) is the timing interval in seconds of query $Q_i$ in the query stream of the power test. The interval starts when the query is submitted to the decision support system and ends when the next query is submitted. For the last query in the stream, the interval ends when the query's output is received.
RI(j,0) is the timing interval in seconds of refresh function $RF_j$. The interval starts when the first character in the first request for execution of the refresh function is submitted to the decision support system and ends when the last transaction of the refresh function is completed successfully.
SF is the scale factor and Size is the corresponding database size. For this study, the size is 1 GB (scale factor 1).
Throughput@Size is the performance metric for throughput tests. It is the ratio of the total number of queries executed to the length of the measurement interval. The units of Throughput@Size are (Queries per hour x Scale Factor), and it can be calculated as follows -
$$\mathrm{Throughput@Size} = \frac{S \times 22 \times 3600}{T_s} \times SF$$
$T_s$ is the measurement interval in seconds, which starts when the first query stream submits the first character of its first query and ends when the last refresh function is completed.
S is the number of query streams used in the throughput test. For this study, 5 query streams are used.
SF and Size are as defined for Power@Size. For this study, the size is 1 GB (scale factor 1).
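The three metrics can be computed from recorded timing intervals as in the sketch below. It assumes the standard TPC-H definitions (power as 3600 x SF over the geometric mean of the 22 query and 2 refresh intervals, throughput as S x 22 x 3600 / T_s x SF, and the composite as the square root of their product); the function names are hypothetical and this is not the study's own tooling.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Power@Size: 3600 * SF divided by the geometric mean of all timing
// intervals (query intervals QI(i,0) and refresh intervals RI(j,0)), in seconds.
double powerAtSize(const std::vector<double>& queryIntervals,
                   const std::vector<double>& refreshIntervals,
                   double scaleFactor) {
    const std::size_t n = queryIntervals.size() + refreshIntervals.size();
    assert(n > 0);
    double logSum = 0.0;  // summing logarithms avoids overflow in the product
    for (double q : queryIntervals) {
        assert(q > 0.0);
        logSum += std::log(q);
    }
    for (double r : refreshIntervals) {
        assert(r > 0.0);
        logSum += std::log(r);
    }
    const double geometricMean = std::exp(logSum / static_cast<double>(n));
    return 3600.0 * scaleFactor / geometricMean;
}

// Throughput@Size: queries executed per hour over the measurement interval
// T_s (seconds), multiplied by the scale factor.
double throughputAtSize(int streams, int queriesPerStream,
                        double intervalSeconds, double scaleFactor) {
    assert(streams > 0 && queriesPerStream > 0 && intervalSeconds > 0.0);
    return streams * queriesPerStream * 3600.0 / intervalSeconds * scaleFactor;
}

// QphH@Size: geometric mean of the power and throughput metrics.
double qphhAtSize(double power, double throughput) {
    assert(power >= 0.0 && throughput >= 0.0);
    return std::sqrt(power * throughput);
}
```

As a sanity check with made-up numbers: if every interval in the power test were 1 second, the power metric at scale factor 1 would be 3600, and 5 streams of 22 queries finishing in 3600 seconds would give a throughput metric of 110.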
The performance of the virtual machines in Amazon EC2 is quantified by using a CPU microbenchmark, a disk microbenchmark, and a network microbenchmark. All microbenchmarks run for approximately 24 hours. The results from the microbenchmarks will allow for better performance analysis of the decision support systems. TPC-H is used as a synthetic benchmark in this study to understand the performance and various behaviours of decision support systems in the IaaS public cloud. 20 TPC-H benchmarks are executed for the decision support systems deployed on small instances and 50 TPC-H benchmarks for those on large instances; both take approximately 24 hours to complete. In the next chapter, the results obtained from the benchmarks are presented and discussed in detail.