Approaches For Placement Of Replica Distributed Systems Computer Science Essay

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Data replication is a key technology in distributed systems. It enables higher availability and performance. Data replication consists of maintaining multiple copies of data, called replicas, on separate computers. Replication improves availability by allowing access to the data even when some of the replicas are unavailable. Also, it improves performance through reduced latency, by letting users access nearby replicas.

Replication [1],[13] - is a process of creating and managing duplicate versions of data in multiple sites. It has been receiving considerable attention in distributed systems. There major motivations for replication - ensuring reliability, increase availability and improving performance of the system [1],[14],[15]. It creates redundant information allows the system to remain operational if a failure occurs. If local copy of the data is unavailable because of a failure in the network, the client can still access the remote copy of the data thus increase reliability. Also, client can access the local data rather than connecting to remote server over a network in order to improve performance by reducing response time and reduce communication cost.

Despite benefits of data replication to improve availability and performance, it is costly [17] and requires more storage and more processing time to update the replicated data [18]. Therefore, many researches have been done in this filed in order to find an optimal solution that support data replication availability and improve performance with less storage and updating processing time.

Many issues within any distributed systems should be considered [12] such as where, when, and by whom replicas should be placed, and subsequently which mechanism to use for keeping the replicas consistent. Data consistency [2],[14],[15],[16] is the major challenge within replication. Replication creates several copies residing on different sites (servers) which may introduce inconsistency between copies of the same replica therefore to ensure consistency; the system must behave like a single copy. Changes made to any replica are reflected in all the others.

We are focusing only in server placement replication which is concerned with finding the best locations to place a server that can host a data store. And there are various replication schemes that solve this problem, these algorithms are described in the literature.

Literature Review

The literature contains several solutions address replicated server placement problem. We will review the relevant previous literature in detail in order to compare between the different algorithms that solve the placement problem and find the optimal solution.

[3] Lili Qiu, Venkata N. Padmanabhan and Geoffrey M. Voelker consider the problem of placement strategies for Web server replicas with the context of content distribution networks (CDNs) that offer Web server hosting services. They propose several placement algorithms, including a simple greedy placement. The developed algorithms uses workload information, such as client latency and request rates, that can automate the server placement decision. They consider the following scenario; a popular web site aims to improve its performance by pushing its content to some hosting services. The problem is to choose M replicas among N potential sites (N> M) such that some objective function is optimized under a given traffic pattern. They assume that each client uses a single replica (multiple clients can use the same replica). They take the distance between clients and locations as their starting point. Distance can be measured in terms of latency or bandwidth [12].

As an alternative, [4] propose to ignore the position of clients and only take the topology of the Internet as formed by the autonomous systems [12]. They assume that each client selects the closest (in number of hops) replica. The second assumption they make is that they do not limit the number of a clients that can be assigned to a replica.

[7] Michal Szymaniak and Guillaume Pierre presents HotZone, an algorithm to place replicas in a wide-area network such that the client-to-replica latency is minimized. HotZone relies on the fact that Internet latencies can be modeled in an M-dimensional geometric space. In this model, nodes are assigned M-dimensional coordinates. The latencies between any two nodes are modeled as the distance between their corresponding coordinates. HotZone identifies network regions as groups of nodes with proximal coordinates and places replicas in the most active regions. This effectively reduces the cost of placing K replicas among N potential replicas locations from O(N2) to O(N.max (log N, K).

In [2] they propose a placement algorithm based on the assumption that the underlying network topologies are trees and solve it using dynamic programming techniques. The algorithm is designed for Web proxy placement but is also relevant to server placement. The algorithm works by dividing a tree T into smaller sub-trees Ti, they show that the best way to place t proxies is by placing ti proxies for each Ti such that ∑ ti = t. They focus on two major factors: the overall traffic and access latency. Their objective is to minimize the overall latency of searching a target web server subject to the system resources and traffic pattern.

[6] propose a dynamic replica placement algorithm for scalable content delivery. This algorithm uses a dissemination-tree-based infrastructure for content delivery and peer-to-peer location service provided by Tapestry for locating objects. There are two crucial design issues that they try to address. First, how to dynamically choose the number and placement of replicas while satisfying QoS requirements and server capacity constraints. Second, How to disseminate updates to these replicas with small delay and bandwidth consumption. Both must be addressed without explicit knowledge of the global network topology. Further, they would like to scale to millions of objects, clients, and servers.

In [5] J.Kangasharju, J. Roberts and K. Ross study the problem of optimally replicating objects in CDN servers. They consider each Autonomous System (AS) as a node in a graph with one CDN server with finite storage capacity for replicating objects. The optimization problem is to replicate objects so that the average number of ASs traversed is minimized when clients fetch objects from the nearest CDN server containing the requested object.

Where in [8] Eliana-Dina Tîrsa, Mugurel Ionu, Andreica and Alexandru Costan consider two offline optimal replica placement problems, for which they devise efficient algorithmic solutions. The first one is concerned with distributing the replicas of k files over the nodes of a tree network, such that every node stores at most one replica. In the second problem they compute the optimal (best case) and worst case replica creation strategies, given the data transfer reliabilities and availability probabilities of the nodes.

The data availability in Data Grid system is complicated by node failure, data catalog error and an unreliable network. [10] Presents two new metrics to measure the system data availability. They are the system file missing rate (SFMR) and the system bytes missing rate (SBMR). SFMR represents the ratio of the number of files potentially unavailable and the number of all the files requested by all the jobs. SBMR represents the ratio of the number of bytes potentially unavailable and the total number of bytes requested by all jobs.

In [11] a dynamic data replication mechanism is proposed, which is called Latest Access Largest Weight (LALW). It selects a popular file for replication and calculates a suitable number of copies and grid sites for replication. The popularity of a file can be deduced from its access rate from the clients. They set different weight for the records according to their lifetimes in the system in order to find the recently potential popular files.

[9] Presents a dynamic minimum access cost based replication strategy (MAC replication strategy). MAC replication strategy takes into account the access frequency, the status of the network connection and average response time to perform optimal replication. MAC replication strategy includes three steps: to decide the original which should be replicated, to decide the placement where the new replica should be places, and to decide the replication source. It first selects the popular files. To each popular file, the average response time is calculated to determine which logical resource should be replicated. Then MAC replication strategy calculates an appropriate site to replicate for better shortening the response time of the data resource.