Distributed Storage Local Reconstruction Codes Computer Science Essay

Published: Last Edited:

This essay has been submitted by a student. This is not an example of the work written by our professional essay writers.

Abstract- Cloud computing delivers computing as a service, through which resources, software, and information are provided to users as a utility over the internet. Economic gains are the foremost driver for the Cloud. Though data outsourcing eliminates the burden of local data storage and maintenance, it also removes their storage reliability and security. Hence correctness of data is a prime concern. In order to address this problem a reliable storage system based on Local Reconstruction Codes (LRC) can be used. While reconstructing data fragments LRC reduces the number of erasure coding fragments that need to be read, but keeping the storage overhead low. The vital benefits of LRC are that it moderates the bandwidth and input output operations required for repair reads, while still agreeing a major reduction in storage overhead. With the support of Third Party Auditing this scheme guarantees data error correction along with integrity checking.

Keywords- Cloud, Integrity, Security, Distributed storage, Local Reconstruction Codes


According to National Institute of Standards and Technology (NIST) Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage,

applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction [1]. Cloud computing users do not own the physical infrastructure; rather they rent service from a cloud provider. They consume resources and pay only for what they use. Cloud storage systems are housed in facilities called datacenters. Client can communicate with one of the servers. But there can be several datacenters on which data can be stored. Since users don't have to care about the difficulties of direct hardware management, moving files into the cloud offers great convenience to them. A malicious cloud provider or any other outsider can modify the client's data and hence, the integrity of the data will be lost. Maintaining the correctness of user files is not so easy because the data files tend to be updated frequently by the user.

The coolest approach to provide integrity for data is to duplicate it. But evidently, replication uses a lot of storage. The storage cost, or overhead will be high. Erasure coding schemes are the other key solution to address this problem. In erasure coding, a file F consisting of k blocks can be encoded using an (n; k) erasure code to create n coded blocks out of the original k file blocks, and stores them at n servers on the basis of one encoded block per server. Thus, the original file can be regenerated from any k out of the n servers. When the client discovers corruption of one of the encoded blocks, it can make use of the remaining healthy blocks to reconstruct the corrupted block. The desirable properties of an erasure coding scheme used in a cloud storage system should possess the following characteristics:

Reconstruction of data from minimum number of fragments

Less storage overhead

Local Reconstruction Codes (LRC) provides the above properties. In addition, we can analyze the Reed Solomon encoding which is a widely accepted erasure code along with Local Reconstruction Codes.


Cloud computing make it possible to store large amounts of data. But data integrity is the main problem with cloud storage. Juels et al. [2] described a proper "proof of retrievability" (POR) model for ensuring the remote data correctness. Their scheme ensures both ownership and retrievability of files on archive service systems. Shacham et al. [3] built a random linear function based homomorphic authenticator which enables infinite number of queries and requires not as much of communication overhead. Bowers et al. [4] proposed an better framework for POR protocols that simplifies both Juels and Shacham's effort. Later in their succeeding work, Bowers et al. [7] stretched POR model to distributed systems as a High Availability and Integrity Layer (HAIL). Still, all these systems are concentrating on static data. Any alterations to the contents of the file, even few bits, must disseminate through the error correcting code, consequently introducing substantial computation and communication complexity.

Ateniese et al. [5] proposed the "provable data possession" (PDP) model for guaranteeing possession of file on untrusted storages which is based on public key based homomorphic tags for reviewing the data file, therefore ensuring public verifiability. But, this approach requires sufficient computation overhead and can be expensive for an entire file. Later, Ateniese et al. [6] defiined a PDP scheme that uses only symmetric key cryptography for which overhead is lesser than preceding scheme and permits future modifications. But, this approach concentrates only on single server. Curtmola et al. [9] suggested a solution to this problem by allowing data possession of multiple replicas. Shah et al. [10] offered allowing a TPA to keep online storage honest by first scrambling the data then transferring a number of pre-computed symmetric-keyed hashes over the encrypted data to the auditor. But, their scheme only works for encrypted files.


If encoding and decoding takes a lot of computational resources then it annoys other operations like encryption, compression and so on. Because of the generality of the Reed-Solomon codes they are widely used in distributed storage systems. Reed-Solomon coding is designed for deep space communication, because mistakes occur frequently in that type of communication. However, the data centers act in a different way from that of deep space communication. A well-designed and -supervised data center has very less failure rate which in turn entails that furthermost data files in the data centers are strong, with no failures. Even though some errors are present they will be very less in amount and will last for a short duration. Also error will be repaired as fast as possible and will be brought them to healthy state.

The new coding approach enables data to be reconstructed more rapidly than with Reed-Solomon codes. This is because lesser data fragments must be accessed to regenerate the original data. Only half the fragments are necessary. Moreover, LRCs are mathematically simpler than former procedures. The local in the coding technique's name denotes, in the event of a failure, the code required to recreate data is not spread across all the servers. The data required to be accessible quickly, without reading too much data.

Data durability is outstanding with these type of codes such that a data fragment can be rebuilt with complete accuracy. The reduction in communication overhead helps to decrease the storage cost and it has faster reconstruction costs than previously available codes.

It is well known that erasure-correcting code may be used to allow numerous failures in distributed storage systems. In cloud storage systems, we can depend on this method to disperse the data file redundantly across a set of servers. Message to be transmitted is divided into several blocks. An (m,k) Reed-Solomon erasure-correcting code is used to produce k redundancy parity vectors from m data vectors in such a way that the original m data vectors can be recreated from any m out of the m + k data and parity vectors. By placing each of the m + k vectors on different servers, the original data file can tolerate the failure of any k of the m + k servers without any data loss, with a storage overhead of k=m[8].

Fig. 1 Reed Solomon Codes Definitions

As an example[13] consider a (6, 3) Reed-Solomon code contains 6 data fragments and 3 parity fragments, in which each parity is calculated from all the 6 data fragments. When failure occurs for the reconstruction all the 6 fragments are required. For simplicity consider reconstruction cost as the number of fragments required to rebuild an inaccessible data fragment. Therefore here it is equals to 6. The aim of LRC is to decrease the reconstruction cost. It attains this by calculating some of the parities from a subset of the data chunks.

With the same example LRC creates 4 parities. The first two parities represented as p0 and p1are known as global parities and they are computed from all the data fragments. LRC splits the data chunks into two equal size clusters and computes a local parity for each cluster.

Fig. 2 Local Reconstruction Code Example

When we analyze this type of codes we can realize that the reconstruction of any single data fragment needs only 3 fragments, half the number needed by the Reed-Solomon code. LRC increases parity by one than Reed-Solomon.

A (k,l,r) LRC divides k data fragments into l groups, with k/l data fragments in each group. It computes one local parity within each group. In addition, it computes r global parities from all the data fragments. Let n be the total number of fragments (data + parity). Then n = k + l + r. Therefore, the normalized storage overhead is n/k = 1 + (l + r)/k. So a (6, 2, 2) LRC has storage cost of 1 + 4/6 = 1.67x, as demonstrated in Figure 3 [13].

Fig. 3 Decoding Failures in LRC

We must choose the coding equations such that LRC can achieve the Maximally Recoverable (MR) property [14], which means it can interpret any failure which is information-theoretically decodable. Given the (6, 2, 2) LRC, with 4 parity chunks and can survive up to 4 failures. But, LRC is not Maximum Distance Separable and thus cannot tolerate arbitrary 4 failures. For example, say the 4 failures are x1, x2, x3 and px. This is non-decodable. Failure patterns that are likely to recreate are so-called information-theoretically decodable.

In common, the vital properties of a (k,l,r) LRC are: i) single data chunk failure can be decoded from k/l fragments; ii) arbitrary failures up to r + 1 can be decoded. Based on the subsequent theorem, these properties impose a lower bound on the number of parities [12].

Theorem 1. For any (n,k) linear code (with k data sym-bols and n − k parity symbols) to have the property:

Arbitrary r + 1 symbol failures can be decoded;

Single data symbol failure can be recovered from symbols,

The following condition is necessary:

n − k ≥ l + r.

Since the number of parities of LRC meets the lower bound exactly, LRC accomplishes its properties with the minimal number of parities.

Before distributing the file into the provider's location user can make use of LRC and result will be a collection of data vectors with parity vectors. As a next step user computes short verification tokens on individual vector based on pseudo random functions. After blinding the parity information, user disperses all encoded vectors across the cloud servers. Later user can challenge the cloud server and Correctness verification and the identification of the misbehaving server can be done with the help of a Third Party Auditor (TPA). Intervention of TPA reduces the computation overhead of the user.

The data stored in the cloud may not only be accessed but also be frequently updated by the users including insertion, deletion, modification, appending, etc. This scheme further supports secure and efficient dynamic operation on data blocks without the need to download all the data from the cloud servers and recompute the whole parity blocks as well as verification tokens[11].


Erasure coding is important to lessen the expense of cloud storage. Most important factor in these codes is fast reconstruction of offline data chunks. A (k,l,r) LRC divides k data fragments into l local groups and encodes l local parities, one for each local group, and r global parities. Single data chunk failure can be decoded from k/l fragments within its local cluster. Additionally, LRC reaches Maximally Recoverable property. It accepts up to r + 1 arbitrary fragment failures. It also tolerates failures more than r +1 (up to l +r), as long as those are information-theoretically decodable. As a final point, LRC provides low storage overhead. For the reconstruction, LRC needs the minimum number of parities. Local Reconstruction Codes as a way to decrease the number of fragments that need to be read from to perform this reconstruction and has similar latency for small I/Os and improved latency for large I/Os.