The DSpace system

The DSpace system is a digital research repository built to address a real need among academic institutions: the growing volume of scholarly work produced by faculty and students that received little exposure and occasionally suffered from preservation problems. The system, a joint venture between the Massachusetts Institute of Technology (MIT) and HP Labs, aims to provide an online repository that stores, organizes, and preserves scholarly work, giving that work broader exposure and a longer, if not indefinite, lifespan [18].

Here we provide greater detail about the DSpace digital repository software. We include architectural diagrams, reprinted with the permission of the DSpace project, to better illustrate the software's object model, architecture, and internal processes.

DSpace Overview

DSpace is a digital repository system that lets users submit and store information that may have broad appeal, and lets others read and use it. With a specific focus on the preservation of stored data, DSpace employs digital preservation functions such as storing checksums alongside digital objects in order to track and verify a file's conformance with the original. DSpace is an open source software project written entirely in Java [19]. DSpace was created breadth first, so that most functionality required by organizations seeking to use such a product was covered in a simple and basic way [18]. DSpace has a developed underlying model that drives how users submit and use content and how administrators organize and configure the system. The software's underlying code base provides APIs that administrators and third-party applications can use to interact with the DSpace system. To be more usable for different types of users, the software provides a configurable submission and workflow process that can be fitted to any organization's policies and practices [18].
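
The checksum idea mentioned above can be illustrated with a minimal sketch. It is not DSpace code: the class name ChecksumSketch, its method names, and the choice of SHA-256 are assumptions made for illustration, and the essay's sources do not say which algorithm DSpace uses. The sketch records a digest when content is stored and later recomputes it to check that the file still conforms to the original.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration; not part of the DSpace API.
final class ChecksumSketch {

    // Returns the hex-encoded digest of the given bytes.
    static String checksum(byte[] content, String algorithm) {
        try {
            MessageDigest digest = MessageDigest.getInstance(algorithm);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("Unsupported algorithm: " + algorithm, e);
        }
    }

    // True when the content still matches the checksum recorded at ingest.
    static boolean matchesStored(byte[] content, String storedChecksum, String algorithm) {
        return checksum(content, algorithm).equals(storedChecksum);
    }

    public static void main(String[] args) {
        // SHA-256 is an assumed algorithm choice for this sketch.
        byte[] original = "thesis draft".getBytes(StandardCharsets.UTF_8);
        String stored = checksum(original, "SHA-256");

        byte[] altered = "thesis drafT".getBytes(StandardCharsets.UTF_8);
        System.out.println("original intact: " + matchesStored(original, stored, "SHA-256"));
        System.out.println("altered intact:  " + matchesStored(altered, stored, "SHA-256"));
    }
}
```

Storing the digest next to the object, rather than recomputing it only on demand, is what allows a later audit to detect silent corruption.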

DSpace Object Model

While a software system's functionality matters most to its users, fully understanding the system requires looking beyond its appearance to what happens on the backend. DSpace is no different. By simply browsing a DSpace repository, a user can get a feeling for the structure of a digital library (DL), while the exact organization and division of its underlying object model remain relatively hidden. DSpace administrators, on the other hand, get a taste of DSpace's underlying structure no matter how much they rely on the built-in, web-based administrative interfaces.

DSpace stores digital content, often referred to as digital objects; thus a very important part of the overall object model of DSpace systems is those objects themselves, called "Items". Items are organized in a hierarchy in which similar items are grouped and submitted into Collections of similar content. The highest level of content organization in the system is Communities, which are groups of Collections. In the model, Collections are distinct from the Communities they make up, so a Collection can belong to more than one Community. Each Item stored in a DSpace repository is made up of a bundle of bitstreams, so as many files as needed can be stored in a single digital object [20]. Bitstreams adhere to the Bitstream Formats that the system knows about, and DSpace behaves differently with different types of objects; for example, images may have their thumbnails displayed when browsing the system, but .exe files cannot. The DSpace object model diagram is provided in Figure 3.
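
The relationships described above can be sketched as a handful of plain Java types. This is a simplified illustration under stated assumptions, not the real DSpace class hierarchy: every class, field, and method name here (ObjectModelSketch, thumbnailCapable, itemCount, and so on) is hypothetical. It shows an Item holding a bundle of Bitstreams, each with a format, and one Collection shared by two Communities.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical types for illustration only; these are not the real DSpace classes.
final class ObjectModelSketch {

    // A file format the repository knows about.
    static final class BitstreamFormat {
        final String mimeType;
        final boolean thumbnailCapable; // assumed flag: true for images, false for executables

        BitstreamFormat(String mimeType, boolean thumbnailCapable) {
            this.mimeType = mimeType;
            this.thumbnailCapable = thumbnailCapable;
        }
    }

    // A single stored file together with its format.
    static final class Bitstream {
        final String name;
        final BitstreamFormat format;

        Bitstream(String name, BitstreamFormat format) {
            this.name = name;
            this.format = format;
        }
    }

    // A digital object: a bundle of one or more bitstreams.
    static final class Item {
        final String title;
        final List<Bitstream> bundle = new ArrayList<>();

        Item(String title) {
            this.title = title;
        }
    }

    // A group of similar items.
    static final class Collection {
        final String name;
        final List<Item> items = new ArrayList<>();

        Collection(String name) {
            this.name = name;
        }
    }

    // A group of collections; the same collection may appear in several communities.
    static final class Community {
        final String name;
        final List<Collection> collections = new ArrayList<>();

        Community(String name) {
            this.name = name;
        }
    }

    // Counts the items reachable from a community through its collections.
    static int itemCount(Community community) {
        int total = 0;
        for (Collection collection : community.collections) {
            total += collection.items.size();
        }
        return total;
    }

    public static void main(String[] args) {
        Item thesis = new Item("Sample thesis");
        thesis.bundle.add(new Bitstream("thesis.pdf", new BitstreamFormat("application/pdf", false)));
        thesis.bundle.add(new Bitstream("figure1.png", new BitstreamFormat("image/png", true)));

        Collection theses = new Collection("Theses");
        theses.items.add(thesis);

        // The same Collection object is placed in two Communities.
        Community engineering = new Community("Engineering");
        Community graduateSchool = new Community("Graduate School");
        engineering.collections.add(theses);
        graduateSchool.collections.add(theses);

        System.out.println("Items under Engineering: " + itemCount(engineering));
        System.out.println("Items under Graduate School: " + itemCount(graduateSchool));
    }
}
```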

DSpace Architecture

As detailed in Figure 4, the DSpace software is divided into a relatively common three-tiered architecture [18]. These three layers are the Application layer, the Business Logic layer, and the Storage layer.

On the lowest layer of this architecture, the Storage layer, all bitstreams in the repository are stored as files on the system's file system. References to these files, along with most metadata, settings, and other information that drives the behavior of the system, are stored in a relational database, usually PostgreSQL. This marriage of techniques allows quick, relationally oriented access to metadata and runtime data while keeping stored documents in a regular file system. Together, the Storage layer aspects of DSpace make up the Storage API.

The Business Logic layer is made up of a set of classes or modules that embody the inner workings of many DSpace object types, including user-related functions, browsing and searching, content management, and others. Business Logic classes make up DSpace's Public API, which allows third-party code to interact with DSpace in the same way that typical interaction within the software occurs.

Lastly, the Application layer is the highest layer of functionality in DSpace and brings together the backend functionality to provide the services that users see when they use the system. Included in the Application layer are the import/export functionality, statistics tools, and the web-based user interface. Given DSpace's open source nature, the source code for all of these components is available to organizations using the system and can be tweaked and customized to better meet their needs.
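
The Storage layer's split between files and database records can be sketched in a few lines. This is a toy model under loud assumptions, not the DSpace Storage API: the class StorageLayerSketch and its methods are hypothetical, and an in-memory map stands in for the relational table that would hold references to the stored files.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; the real DSpace Storage API is not reproduced here.
final class StorageLayerSketch {

    private final Path assetDirectory;
    // Stand-in for a relational table mapping bitstream ids to file locations.
    private final Map<Integer, Path> bitstreamTable = new HashMap<>();
    private int nextId = 1;

    StorageLayerSketch(Path assetDirectory) {
        this.assetDirectory = assetDirectory;
    }

    // Convenience factory that keeps the files in a fresh temporary directory.
    static StorageLayerSketch inTempDirectory() {
        try {
            return new StorageLayerSketch(Files.createTempDirectory("assetstore"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes the bytes to the file system and records a reference to the file.
    int store(byte[] content) {
        int id = nextId++;
        Path file = assetDirectory.resolve("bitstream-" + id);
        try {
            Files.write(file, content);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        bitstreamTable.put(id, file);
        return id;
    }

    // Looks up the file reference, then reads the bytes back from disk.
    byte[] retrieve(int id) {
        Path file = bitstreamTable.get(id);
        if (file == null) {
            throw new IllegalArgumentException("Unknown bitstream id: " + id);
        }
        try {
            return Files.readAllBytes(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StorageLayerSketch storage = inTempDirectory();
        int id = storage.store(new byte[] {10, 20, 30});
        System.out.println("Stored bitstream " + id + " with " + storage.retrieve(id).length + " bytes");
    }
}
```

Keeping only a reference in the lookup structure mirrors the design described above: lookups stay fast and relational, while the content itself remains an ordinary file.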

DSpace Ingestion

DSpace is a software system that serves as a repository for digital content. In a system with such a goal, perhaps the most critical aspect is how that data enters the system. This occurs in two main ways within DSpace. The web-based UI allows a user to submit items to collections as long as they are logged in as a registered user. When users do so, they go through a configurable workflow [21] in which they upload and describe their submissions. (Workflows in DSpace are discussed further in Section

Alternatively, DSpace administrators who have a large amount of content to be batch imported may take advantage of the import/export functionalities of the system [20]. The Item Importer is a command line tool that comes bundled with the system and allows users to import collections of content into the system.

The Item Importer uses DSpace's simple archive format, which is a simple directory structure that holds items for import into the system. (We provide an example of a simple archive in Figure 6.) A top-level archive directory contains uniquely named directories, each of which contains everything necessary to import a single item. Each sub-folder is required to contain two files in addition to the actual content to be imported. The required file "dublin_core.xml" contains an XML representation of qualified Dublin Core element names and the textual content that those metadata records should contain, including author, title, and so on. A plain text "contents" file lists, one per line, the filename of each file that will be included in that digital object. Once this structure is in place, the tool can simply be run and all content will be imported into the repository in question. After running, the tool provides a "map file" that details all items that were imported and their new locations within the system. This file can help with future exports or removal of groups of imported content [20].
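
The directory layout described above can be produced with a short program. The following is a sketch under stated assumptions rather than a DSpace tool: the class SimpleArchiveSketch and its methods are hypothetical, the item directory name and sample metadata are made up, and the dcvalue element and qualifier attribute layout reflects the commonly documented form of "dublin_core.xml", which should be checked against the documentation for the DSpace version in use.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper that lays out one item in a simple archive directory.
final class SimpleArchiveSketch {

    // Minimal escaping so metadata values are safe inside XML element text.
    static String escapeXml(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Creates a fresh top-level archive directory in the temp area.
    static Path createTempArchive() {
        try {
            return Files.createTempDirectory("simple-archive");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes archiveRoot/itemName/ with dublin_core.xml, contents, and one content file.
    static Path writeItem(Path archiveRoot, String itemName, String title, String author,
                          String fileName, byte[] fileContent) {
        try {
            Path itemDir = Files.createDirectories(archiveRoot.resolve(itemName));

            // Assumed dcvalue layout; verify against the DSpace documentation.
            String xml = "<dublin_core>\n"
                    + "  <dcvalue element=\"title\" qualifier=\"none\">"
                    + escapeXml(title) + "</dcvalue>\n"
                    + "  <dcvalue element=\"contributor\" qualifier=\"author\">"
                    + escapeXml(author) + "</dcvalue>\n"
                    + "</dublin_core>\n";
            Files.write(itemDir.resolve("dublin_core.xml"), xml.getBytes(StandardCharsets.UTF_8));

            // The contents file lists one filename per line.
            Files.write(itemDir.resolve("contents"),
                    (fileName + "\n").getBytes(StandardCharsets.UTF_8));

            Files.write(itemDir.resolve(fileName), fileContent);
            return itemDir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path root = createTempArchive();
        Path item = writeItem(root, "item_000", "Sample Thesis", "Doe, Jane",
                "thesis.pdf", new byte[] {37, 80, 68, 70});
        System.out.println("Wrote item directory: " + item);
    }
}
```

Calling writeItem once per item with a distinct directory name yields the top-level archive of uniquely named sub-folders that the importer expects.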

DSpace Workflow

DSpace is one of the first open source repository systems to successfully combat the problems that lie in different requirements for submission of different types of information to different collections [18]. The DSpace submission workflow system is a critical part of the DSpace architecture that allows for the submission, processing, and final addition of content to the live repository. DSpace's underlying model includes EPeople, users who have registered with the system and have certain authorizations, roles, rights, and privileges that translate to abilities to complete certain tasks within the DSpace system. A typical submission begins with the system asking the user a couple of questions about the publication history of the item and the number of files involved in the submission. Then the system guides the user through the different steps of the process, which are outlined in Table 3.

Manakin for DSpace

DSpace is a very mature project. Since its inception in 2000, many developments and iterations have strengthened the overall feature set, interaction, and configurability of the software. Although DSpace separates its presentation and business logic layers through its JSP / Servlet design, one of the greatest issues users of the software experience is difficulty in branding, that is, creating a customized look and feel for DSpace's web-based interface.

One project to combat this problem is what Texas A&M refers to as their DSpace XML UI project, the second and current iteration of which is called Manakin [22]. Manakin acts as a layer above the traditional JSP presentation layer of DSpace and allows each community or collection within the repository to have its own look and feel. The software splits its implementation into Aspects and Themes [22]. Aspects are different parts of a DSpace website that can be customized or toggled on and off in a DL instance, such as the login mechanism. Themes define the visual character of the Aspects that make up a DSpace website. Combined, these two logical parts of Manakin provide a more customizable and attractive web-based interface for DSpace. Manakin is an "add-on" of sorts to DSpace, so a working copy of DSpace must be installed and configured before it can be used.

DSpace Future Roadmap

Many drawbacks of DSpace identified by the user community are slated to be addressed in the published roadmap for DSpace 2.0, which is beginning to be formulated and will likely have initial versions released in 1½ to 2 years [3]. The new design will focus particularly on making the system more scalable by allowing a greater capacity of items, improving the ingestion rate of objects in bulk-add situations, and making the system's processing more concurrent. While some of the largest known DSpace repositories contain hundreds of thousands of digital objects, the goal is for the new architecture to store and support up to 10 million items [3]. There is also a desire to make the system more interoperable; work in this area will include a much more concrete and descriptive object model and core interface, both of which would be clearly published to better facilitate interoperation for administrators and third parties. The item model will be revised to include versioning of items, along with other changes. To make the software more plug-and-play oriented, an extension model is being investigated that would greatly ease the creation and use of components that snap into a DSpace instance [3].