IDS Repository Architecture and Ingest Pipelines
Functional Architecture
The IDS Repository uses Fedora Commons [1] as the underlying repository framework.
Figure 1 shows the main functional entities of the IDS Repository along the lines of the OAIS Reference Model [2, Section 4.1].
Figure 1: IDS Repository Functional Architecture
The main components are realized in Fedora Commons. Archival Information Packages (AIPs), together with descriptive and technical metadata, are represented as Fedora Digital Objects, which are serialized in the Fedora XML format (FOXML) and stored as self-contained XML files (Archival Storage). A resource typically consists of multiple digital objects that are related to each other. A digital object in turn consists of several datastreams, which represent metadata and actual data, possibly in several formats.
For Data Management, Fedora Commons uses a relational database and a triple store to enable efficient querying, navigation, and checking of referential integrity.
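The structure of a digital object and its datastreams can be illustrated with a minimal sketch. It assumes the FOXML 1.1 element and attribute names (digitalObject, objectProperties, datastream, datastreamVersion, contentLocation); the PID, labels, and file locations are hypothetical placeholders, and the objects produced for the IDS Repository carry considerably more metadata than shown here.

```python
import xml.etree.ElementTree as ET

FOXML_NS = "info:fedora/fedora-system:def/foxml#"
MODEL_NS = "info:fedora/fedora-system:def/model#"
ET.register_namespace("foxml", FOXML_NS)


def q(tag):
    """Return the namespace-qualified name of a FOXML element."""
    return f"{{{FOXML_NS}}}{tag}"


def build_digital_object(pid, label, datastreams):
    """Build a skeletal FOXML digital object.

    datastreams is a list of (datastream ID, MIME type, content URL) tuples.
    All values passed in are illustrative, not taken from the repository.
    """
    obj = ET.Element(q("digitalObject"), {"VERSION": "1.1", "PID": pid})
    props = ET.SubElement(obj, q("objectProperties"))
    ET.SubElement(props, q("property"),
                  {"NAME": MODEL_NS + "state", "VALUE": "Active"})
    ET.SubElement(props, q("property"),
                  {"NAME": MODEL_NS + "label", "VALUE": label})
    for ds_id, mime_type, location in datastreams:
        # One datastream per metadata record or data file; "M" = managed content.
        ds = ET.SubElement(obj, q("datastream"), {
            "ID": ds_id, "STATE": "A",
            "CONTROL_GROUP": "M", "VERSIONABLE": "true"})
        version = ET.SubElement(ds, q("datastreamVersion"), {
            "ID": f"{ds_id}.0", "MIMETYPE": mime_type, "LABEL": ds_id})
        ET.SubElement(version, q("contentLocation"),
                      {"TYPE": "URL", "REF": location})
    return obj


if __name__ == "__main__":
    example = build_digital_object(
        "demo:1", "Example resource",
        [("CMDI", "text/xml", "file:///tmp/example.cmdi.xml"),
         ("DATA", "application/xml", "file:///tmp/example.tei.xml")])
    print(ET.tostring(example, encoding="unicode"))
```

Each tuple in the list becomes one datastream, which mirrors the description above: metadata and data live side by side in the same self-contained XML file.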
Ingest is supported by means of a simple Administration web interface (currently only enabled for repository administrators) and by scripts for bulk ingest. The IDS Repository typically uses bulk ingest, which is fed by custom Ingest Pipelines that generate FOXML objects from Submission Information Packages (SIPs) provided by the producer.
For Access, the IDS Repository uses the Fedora Commons REST API, which also supports a web interface for Search. Data consumers have direct access to the archived objects via the web, provided that access requirements have been met. These Dissemination Information Packages (DIPs) consist either of individual datastreams for metadata and data or of a packed archive of the entire resource. In addition, the IDS Repository supports harvesting of its metadata via an OAI provider.
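As an illustration of datastream-level access and metadata harvesting, the following sketch only constructs request URLs and performs no network access. It assumes the datastream dissemination path of the Fedora 3.x REST API (objects/{pid}/datastreams/{dsID}/content) and the standard OAI-PMH ListRecords verb; the base URLs, the PID, and the datastream ID are hypothetical, and the actual endpoints of the IDS Repository may differ.

```python
from urllib.parse import quote, urlencode


def datastream_url(base_url, pid, ds_id):
    """URL of a single datastream's content (Fedora 3.x REST API path assumed)."""
    return (f"{base_url.rstrip('/')}/objects/{quote(pid, safe=':')}"
            f"/datastreams/{quote(ds_id)}/content")


def oai_list_records_url(oai_endpoint, metadata_prefix="oai_dc", set_spec=None):
    """URL of an OAI-PMH ListRecords request for the given metadata format."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{oai_endpoint}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical host names and identifiers, for illustration only.
    print(datastream_url("https://repo.example.org/fedora", "demo:1", "CMDI"))
    print(oai_list_records_url("https://repo.example.org/oai"))
```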
Ingest Pipelines
Corpus resources to be archived in the IDS Repository are represented in a wide variety of models and formats, and are maintained with a wide variety of systems, including relational databases (Corpus: Diskurs in der Weimarer Republik), file systems (partially on DVDs) (Corpus: Mannheimer Korpus Historischer Zeitungen und Zeitschriften), and other custom solutions (Dereko, FOLK). This variety has partly historical and partly technical reasons, the latter arising from domain-specific requirements for the acquisition and use of resources.
For sustainably archiving these resources, they typically need to undergo extensive processing. Figure 2 shows the main steps of this process:
Figure 2: Generic Ingest Pipeline
In the first step (Alignment), metadata and data are aligned with each other. Often, metadata are represented as comma-separated values with references to the actual data. This step ensures a one-to-one correspondence between data and metadata, which often requires normalization of references and identifiers.
In the second step (Validation/Curation), data formats are validated using format-specific validators, and for data that are not available in one of the recommended formats, additional representations are generated (typically XML-based DocBook or TEI for written corpora).
In the third step (Metadata Extraction), additional metadata, such as title or date of issue, are extracted from the data, taking advantage of the curated XML-based data formats generated in the previous step.
In the fourth step (CMDI Generation), metadata are transformed to a suitable profile of the Component Metadata Infrastructure (CMDI). Often, this step involves the specification of a new profile, partially based on existing CMDI components, to appropriately represent the available metadata.
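The alignment step can be sketched as follows. The example assumes that metadata arrive as CSV text with a column holding file references and that references can be normalized to a lower-cased file stem; the column name "file", the normalization rule, and the sample paths are assumptions made for illustration, since real resources typically need resource-specific rules.

```python
import csv
import io
from pathlib import PurePosixPath


def normalize_ref(ref):
    """Reduce a file reference to a comparable key:
    unify path separators, drop directories and extension, lower-case."""
    return PurePosixPath(ref.strip().replace("\\", "/")).stem.lower()


def align(metadata_csv, data_files, ref_column="file"):
    """Match metadata rows to data files via normalized references.

    Returns (aligned, missing_data, missing_metadata, duplicates):
    aligned maps each matched data file to its metadata row, the other
    three list the normalized keys that violate one-to-one correspondence.
    """
    meta_by_key, duplicates = {}, []
    for row in csv.DictReader(io.StringIO(metadata_csv)):
        key = normalize_ref(row[ref_column])
        if key in meta_by_key:
            duplicates.append(key)  # two metadata rows point at the same data
        else:
            meta_by_key[key] = row
    data_by_key = {normalize_ref(path): path for path in data_files}
    shared = meta_by_key.keys() & data_by_key.keys()
    aligned = {data_by_key[key]: meta_by_key[key] for key in shared}
    missing_data = sorted(meta_by_key.keys() - data_by_key.keys())
    missing_metadata = sorted(data_by_key.keys() - meta_by_key.keys())
    return aligned, missing_data, missing_metadata, duplicates


if __name__ == "__main__":
    sample = "file,title\nScans\\Issue_01.TIF,First issue\nscans/issue_03.tif,Third issue\n"
    print(align(sample, ["scans/issue_01.tif", "scans/issue_02.tif"]))
```

The three lists of unmatched or duplicated keys correspond to the kind of inconsistencies that are reported back to the producer.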
All these steps typically require the implementation or configuration of custom tools for a particular resource. Often these steps reveal (minor) inconsistencies in the resource, such as misaligned or missing metadata or invalid data formats. These inconsistencies are documented and communicated to the producer for resolution. The result is a Submission Information Package (SIP) that consists of interrelated digital objects described with CMDI metadata.
On this basis, the last step generates FOXML for batch ingest into Fedora Commons. This step is implemented as a generic, configurable pipeline detailed in Figure 3:
Figure 3: FOXML Generation Pipeline
First, CMDI records are complemented with missing information, such as headers, and validated against their profile, which is represented as an XML Schema. Then, for all CMDI records and actual data, FOXML datastreams are generated using FOXML templates that specify additional technical metadata required by Fedora Commons, such as MIME type, state, or local identifier. In addition, the (meta)datastreams required by Fedora Commons (DC for Dublin Core and RELS-EXT for enabling OAI harvesting) are generated from the CMDI records by means of XSLT stylesheets. Other (meta)datastreams, e.g., for human-readable presentation of metadata, can be generated in this step too. The resulting datastreams are again validated against their schemas. Finally, for all datastreams that are disseminated by the repository, a persistent identifier (PID) is registered at a handle server, and all (local) references to digital objects and datastreams are substituted with the corresponding PID. To allow for local maintenance of PIDs, the underlying mapping between PIDs and references is stored in the repository. The resulting FOXML digital objects are ingested into Fedora Commons in a batch process.
All datastreams and versions are equipped with an MD5 checksum, which is checked in coordination with the backups as described below.
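The final substitution of local references with PIDs can be sketched as below. Registration at the handle server is replaced by a stub that mints sequential identifiers under a placeholder prefix; the function names, the prefix hdl:00000, and the sample references are hypothetical and do not reflect the pipeline's actual implementation.

```python
import re


def mint_pid(local_ref, mapping, prefix="hdl:00000"):
    """Stand-in for handle registration: assign a PID to a local reference once
    and record it in the mapping that would be stored in the repository."""
    if local_ref not in mapping:
        mapping[local_ref] = f"{prefix}/{len(mapping) + 1:04d}"
    return mapping[local_ref]


def substitute_pids(text, mapping):
    """Replace every known local reference in the text with its PID."""
    if not mapping:
        return text
    # Try longer references first so that a reference which is a prefix of
    # another one (object vs. datastream) does not match partially.
    refs = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(ref) for ref in refs))
    return pattern.sub(lambda match: mapping[match.group(0)], text)


if __name__ == "__main__":
    pid_map = {}
    for ref in ("local:obj1", "local:obj1/CMDI"):
        mint_pid(ref, pid_map)
    snippet = '<ref target="local:obj1/CMDI"/><ref target="local:obj1"/>'
    print(substitute_pids(snippet, pid_map))
    print(pid_map)
```

Keeping the mapping as an explicit data structure is what allows PIDs to be maintained locally after ingest, as described above.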
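A fixity check of this kind can be sketched with Python's standard hashlib module. The function names and the chunk size are illustrative choices; how the stored checksums are actually compared during backup is not specified here.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 checksum of a file, reading it in chunks
    so that large datastreams do not have to fit into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, expected_md5):
    """Return True if the file still matches its recorded checksum."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

A mismatch between the recorded and the recomputed checksum indicates that the stored file has changed and should be restored from a backup.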
Storage Procedures
The IDS repository runs on a three-node virtualization cluster hosted by IDS.
The necessary storage is provided by a redundant storage system. The machines are housed in a modern data centre that was completely overhauled in 2014. It provides redundant air conditioning and redundant uninterruptible power supplies, as well as an early fire detection and fire suppression system that uses Novec 1230 as the suppression agent. Access to the data centre is limited to authorized staff. Maintenance of the systems is performed by a team of trained personnel.
Access to the virtual server is restricted by a firewall. The storage hardware and the hardware for the virtual machines are replaced at regular intervals to keep up with the state of the art.
The IDS repository, that is, data and operating system, is backed up incrementally Monday through Thursday. On Fridays, either a full backup (4th Friday of the month) or a differential backup (1st, 2nd, 3rd, and 5th Friday) is performed. Backups have a retention period of three months and are stored on disks and tapes on a dedicated backup server. The backup system is co-located in the data centre of the University of Mannheim, which resides in a different part of the city.
The repository runs on a current and supported version of CentOS/RHEL as a virtual machine on a three-node VMware vSphere cluster. The ESX hosts run on HPE servers, and a dual-controller NetApp storage appliance provides the necessary storage. The data paths between those systems are redundant. Software and firmware updates for all components are regularly applied.
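The weekly backup rotation can be expressed as a small function, shown here as a sketch. The text does not mention weekends, so returning None for Saturdays and Sundays is an assumption, as is the function name.

```python
from datetime import date


def backup_type(day):
    """Return the kind of backup scheduled for the given date."""
    weekday = day.weekday()  # Monday is 0, Sunday is 6
    if weekday <= 3:
        return "incremental"  # Monday through Thursday
    if weekday == 4:
        nth_friday = (day.day - 1) // 7 + 1  # 1st to 5th Friday of the month
        return "full" if nth_friday == 4 else "differential"
    return None  # weekend: not covered by the description (assumption)


if __name__ == "__main__":
    for day in (date(2024, 1, 3), date(2024, 1, 5), date(2024, 1, 26)):
        print(day.isoformat(), backup_type(day))
```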
The IDS repository virtual machine, the backup server and other critical infrastructure (storage systems, virtualization cluster, and network equipment) are monitored with Icinga, a network and service monitoring software.
Integrity of the data is ensured by the version control feature of the Fedora Commons backend. Metadata are stored as a datastream within the digital object and, as such, are version-controlled like object data. CLARIN subscribes to the idea of reproducible research. Therefore, updates or new versions of resources are typically assigned a new persistent identifier (PID). Only marginal changes to CMDI metadata are versioned without registering a new PID.
Making Data Future-Proof
Crisis Management
Crisis management is based on the technical infrastructure described above. In addition, the IDS repository archives all metadata and data in such a way that they can be easily migrated to and mirrored at other CLARIN (European CLARIN, German branch) resource centres. All metadata and data have a persistent identifier (PID), and are stored as self-contained XML files.
In case of a withdrawal of funding for a repository, the repository content will be transferred to another CLARIN centre. Legal aspects of relocating data to another institution are addressed by templates of license agreements provided in CLARIN.
File Formats and Data Interpretability
The following measures are taken to enhance the chance of future interpretability of the data.
The number of accepted file formats is small and well documented to make future conversions to other formats more feasible. As much as possible, open (non-proprietary) file formats are used. For textual resources, XML formats are used whenever possible, to ensure future interpretation of the files even if the tool that was used to create them no longer exists. Formats recommended by the IDS repository for the deposited data are listed in detail in the IDS section of the CLARIN Standards Information System. In addition, for spoken corpora, the data formats of FOLKER (Documentation in German, XML Schema) and EXMARaLDA (Documentation, Document Type Definitions) are currently accepted. The encoding for textual sources (plain text, XML, etc.) should be Unicode, to ensure future interpretability.
The IDS's participation in relevant networks such as CLARIN and nestor keeps it informed about recent developments in file formats and encodings. Plans to migrate or convert files will be developed if new standards arise; all relevant features of the old formats will be preserved using reliable procedures.
References
1. Carl Lagoze, Sandy Payette, Edwin Shin, and Chris Wilper: “Fedora: An Architecture for Complex Objects and their Relationships”, International Journal on Digital Libraries 6(2): 124–138, 2006 [preprint].
2. Reference Model for an Open Archival Information System (OAIS), Recommended Practice, CCSDS 650.0-M-2 (“Magenta Book”) Issue 2, June 2012.