IDS Repository Architecture and Ingest Pipelines
|Figure 1: IDS Repository Functional Architecture|
The main components are realized in Fedora Commons. Archival Information Packages (AIPs), together with descriptive and technical metadata, are represented as Fedora Digital Objects, which are serialized in the Fedora XML format (FOXML) and stored as self-contained XML files (Archival Storage). A resource typically consists of multiple digital objects that are related to each other. A digital object in turn consists of several datastreams, which represent metadata and actual data, possibly in several formats.
For Data Management, Fedora Commons uses a relational database and a triple store to enable efficient querying, navigation, and checking of referential integrity.
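The relationship between resources, digital objects, and datastreams can be illustrated with a small data model. The following is only a sketch: the class and field names (`Resource`, `DigitalObject`, `Datastream`, `related_to`, `dangling_relations`) are assumptions made for illustration and are not part of Fedora Commons, which performs such checks with its own database and triple store.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Datastream:
    """One datastream of a digital object (metadata or data in one format)."""
    ds_id: str       # illustrative identifier, e.g. "CMDI" or "DATA"
    mime_type: str
    location: str    # illustrative: path or URL of the content


@dataclass
class DigitalObject:
    """A digital object with its datastreams and relations to other objects."""
    local_id: str
    datastreams: List[Datastream] = field(default_factory=list)
    related_to: List[str] = field(default_factory=list)  # ids of related objects


@dataclass
class Resource:
    """A resource groups several interrelated digital objects."""
    name: str
    objects: Dict[str, DigitalObject] = field(default_factory=dict)

    def add(self, obj: DigitalObject) -> None:
        self.objects[obj.local_id] = obj

    def dangling_relations(self) -> List[Tuple[str, str]]:
        """Return (source, target) pairs whose target is not in the resource.

        This is a toy version of a referential integrity check.
        """
        return [
            (obj.local_id, target)
            for obj in self.objects.values()
            for target in obj.related_to
            if target not in self.objects
        ]
```

An empty result from `dangling_relations()` means that every relation points to an object that is actually part of the resource.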
Ingest is supported by means of a simple Administration Web Interface (currently only enabled for repository administrators) and by scripts for bulk ingest. The IDS Repository typically uses bulk ingest, which is fed by custom Ingest Pipelines that generate FOXML objects from Submission Information Packages (SIPs) provided by the producer.
For Access, the IDS Repository uses Fedora Commons' REST API, which supports a Web Interface for Search. Data consumers have direct access to the archived objects via the web, provided that the access requirements are met. The resulting Dissemination Information Packages (DIPs) consist either of individual datastreams for metadata and data or of a packed archive of the entire resource. In addition, the IDS Repository supports harvesting of its metadata via an OAI provider.
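As a rough illustration of the two access paths, the sketch below builds request URLs for a single datastream and for metadata harvesting. It assumes the path layout of the Fedora Commons 3.x REST API (`/objects/{pid}/datastreams/{dsID}/content`) and the standard OAI-PMH `ListRecords` request; the host name, endpoint paths, PID, and function names are placeholders, not the actual addresses of the IDS Repository.

```python
from urllib.parse import quote, urlencode


def datastream_content_url(base_url: str, pid: str, ds_id: str) -> str:
    """URL of the content of one datastream (assumes Fedora 3.x REST paths)."""
    return (
        f"{base_url.rstrip('/')}/objects/{quote(pid, safe=':')}"
        f"/datastreams/{quote(ds_id)}/content"
    )


def oai_list_records_url(oai_endpoint: str, metadata_prefix: str = "oai_dc") -> str:
    """URL of an OAI-PMH ListRecords request for the given metadata format."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{oai_endpoint}?{query}"


if __name__ == "__main__":
    # Placeholder host and identifiers, for illustration only.
    print(datastream_content_url("https://repo.example.org/fedora", "demo:1", "CMDI"))
    print(oai_list_records_url("https://repo.example.org/oai"))
```

Whether a request built this way succeeds depends on the access requirements attached to the object in question.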
Corpus resources to be archived in the IDS Repository are represented in a wide variety of models and formats and are maintained with a wide variety of systems, including relational databases (Corpus: Diskurs in der Weimarer Republik), file systems, partially on DVDs (Corpus: Mannheimer Korpus Historischer Zeitungen und Zeitschriften), and other custom solutions (Dereko, FOLK). This variety has partly historical reasons, but also technical ones that arise from domain-specific requirements for the acquisition and use of resources.
To be archived sustainably, these resources typically need to undergo extensive processing. Figure 2 shows the main steps of this process:
|Figure 2: Generic Ingest Pipeline|
In the first step (Alignment), metadata and data are aligned with each other. Metadata are often represented as comma-separated values with references to the actual data. This step ensures a one-to-one correspondence between data and metadata, which often requires normalization of references and identifiers.
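The alignment step can be sketched as follows. This is a minimal illustration, not the tooling used for any particular corpus: the column name `file`, the normalization rules in `normalize_ref`, and the function names are assumptions, and real resources usually need resource-specific rules.

```python
import csv
import io
from typing import Dict, List, Tuple


def normalize_ref(ref: str) -> str:
    """Illustrative normalization: trim, unify path separators, lowercase."""
    return ref.strip().replace("\\", "/").lower()


def align(
    metadata_csv: str, data_files: List[str], ref_column: str = "file"
) -> Tuple[Dict[str, dict], List[str], List[str], List[str]]:
    """Match metadata rows to data files via normalized references.

    Returns the aligned pairs plus three lists of inconsistencies:
    references without data, data without metadata, duplicate references.
    """
    rows: Dict[str, dict] = {}
    duplicates: List[str] = []
    for row in csv.DictReader(io.StringIO(metadata_csv)):
        key = normalize_ref(row[ref_column])
        if key in rows:
            duplicates.append(key)  # more than one metadata row per data file
        else:
            rows[key] = row

    files = {normalize_ref(name): name for name in data_files}
    aligned = {files[key]: rows[key] for key in rows if key in files}
    missing_data = sorted(key for key in rows if key not in files)
    missing_metadata = sorted(files[key] for key in files if key not in rows)
    return aligned, missing_data, missing_metadata, duplicates
```

A one-to-one correspondence holds when the three inconsistency lists are empty; otherwise their contents are the kind of findings that would be reported back to the producer.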
In the second step (Validation/Curation), data formats are validated using format-specific validators. For data that are not available in one of the recommended formats, additional representations are generated (typically XML-based DocBook or TEI for written corpora).
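A format-specific validator can be as simple or as strict as the format requires. The sketch below is a deliberately lightweight stand-in for TEI documents: it only checks well-formedness, the root element, and the presence of a `teiHeader`, using the Python standard library. Full validation against a TEI schema would need a schema-aware validator, which is not shown here; the function name `check_tei` is an assumption.

```python
import xml.etree.ElementTree as ET
from typing import List

TEI_NS = "http://www.tei-c.org/ns/1.0"


def check_tei(xml_text: str) -> List[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # Not well-formed XML: no further checks are possible.
        return [f"not well-formed: {exc}"]

    problems = []
    if root.tag != f"{{{TEI_NS}}}TEI":
        problems.append(f"unexpected root element: {root.tag}")
    if root.find(f"{{{TEI_NS}}}teiHeader") is None:
        problems.append("missing teiHeader")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to document all inconsistencies of a file in one pass.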
In the third step (Metadata Extraction), additional metadata, such as the title or the issue date, are extracted from the data, taking advantage of the curated XML-based data formats generated in the previous step.
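For TEI data, such an extraction might look like the following sketch. The element paths used here (`titleStmt/title` and `publicationStmt/date` in the TEI header) are common TEI locations for these values, but which paths actually carry the title and date in a given corpus is an assumption that has to be checked per resource; the function name and the output keys `title` and `issued` are illustrative.

```python
import xml.etree.ElementTree as ET
from typing import Dict

NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def extract_metadata(tei_xml: str) -> Dict[str, str]:
    """Extract title and issue date from a TEI header, where present."""
    root = ET.fromstring(tei_xml)
    found: Dict[str, str] = {}

    title = root.find("tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title", NS)
    if title is not None and title.text and title.text.strip():
        found["title"] = " ".join(title.text.split())  # collapse whitespace

    date = root.find("tei:teiHeader/tei:fileDesc/tei:publicationStmt/tei:date", NS)
    if date is not None:
        # Prefer the machine-readable @when attribute over the element text.
        issued = date.get("when") or (date.text or "").strip()
        if issued:
            found["issued"] = issued

    return found
```

Values that cannot be found are simply left out, so missing metadata stay visible instead of being filled with defaults.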
In the fourth step (CMDI Generation), metadata are transformed to a suitable profile of the Component Metadata Infrastructure (CMDI). Often this step involves the specification of a new profile, partially based on existing CMDI components, to appropriately represent the available metadata.
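To give an impression of the target format, the sketch below wraps flat metadata in a CMDI envelope. It assumes the CMDI 1.1 envelope (namespace `http://www.clarin.eu/cmd/` with `Header`, `Resources`, and `Components`); the component name `ExampleDocument`, the function name, and the flat one-element-per-field layout are hypothetical and do not correspond to any profile used by the IDS Repository. A real record has to follow the structure defined by its profile.

```python
import xml.etree.ElementTree as ET
from typing import Dict

CMD_NS = "http://www.clarin.eu/cmd/"  # assumption: CMDI 1.1 namespace


def build_cmdi(
    metadata: Dict[str, str], profile_id: str, component: str = "ExampleDocument"
) -> str:
    """Serialize flat metadata as a minimal CMDI-style record (sketch only)."""
    ET.register_namespace("", CMD_NS)  # serialize with a default namespace

    def q(name: str) -> str:
        return f"{{{CMD_NS}}}{name}"

    root = ET.Element(q("CMD"), {"CMDVersion": "1.1"})

    header = ET.SubElement(root, q("Header"))
    ET.SubElement(header, q("MdProfile")).text = profile_id

    # Empty resource lists; a real record would list the described data here.
    resources = ET.SubElement(root, q("Resources"))
    for name in ("ResourceProxyList", "JournalFileProxyList", "ResourceRelationList"):
        ET.SubElement(resources, q(name))

    # Hypothetical component: one child element per metadata field.
    comp = ET.SubElement(ET.SubElement(root, q("Components")), q(component))
    for key, value in metadata.items():
        ET.SubElement(comp, q(key)).text = value

    return ET.tostring(root, encoding="unicode")
```

The profile identifier is passed in by the caller because it depends on the profile that was selected or newly specified for the resource.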
All these steps typically require the implementation or configuration of custom tools for a particular resource. Often these steps reveal (minor) inconsistencies in the resource, such as misaligned or missing metadata or invalid data formats. These inconsistencies are documented and communicated to the producer for resolution. The result is a Submission Information Package (SIP) that consists of interrelated digital objects described with CMDI metadata.
On this basis, the last step generates FOXML for batch ingest into Fedora Commons. This step is implemented as a generic, configurable pipeline detailed in Figure 3:
|Figure 3: FOXML Generation Pipeline|
First, CMDI records are complemented with missing information, such as headers, and validated against their profile, which is represented as an XML Schema. Then, for all CMDI records and actual data, FOXML datastreams are generated using FOXML templates that specify additional technical metadata required by Fedora Commons, such as MIME type, state, or local identifier. In addition, (meta)datastreams required by Fedora Commons (DC for Dublin Core and RELS-EXT for enabling OAI harvesting) are generated from the CMDI records by means of XSLT stylesheets. Other (meta)datastreams, e.g., for a human-readable presentation of metadata, can be generated in this step as well. The resulting datastreams are again validated against their schemas. Finally, for all datastreams that are disseminated by the repository, a persistent identifier (PID) is registered at a handle server, and all (local) references to digital objects and datastreams are substituted with the corresponding PID. To allow for local maintenance of PIDs, the underlying mapping between PIDs and references is stored in the repository. The resulting FOXML digital objects are ingested into Fedora Commons in a batch process.
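Two parts of this pipeline, the substitution of local references with PIDs and the wrapping of XML content in FOXML datastreams, can be sketched as follows. The sketch assumes FOXML 1.1 element and attribute names and inline XML datastreams (control group `X`); the function names, the dictionary keys, and the handle-like values in the mapping are hypothetical. Registration at a handle server, XSLT-based generation of DC and RELS-EXT, and schema validation are not shown.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

FOXML_NS = "info:fedora/fedora-system:def/foxml#"
ET.register_namespace("foxml", FOXML_NS)


def fx(name: str) -> str:
    """Qualify a name with the FOXML namespace."""
    return f"{{{FOXML_NS}}}{name}"


def substitute_refs(elem: ET.Element, pid_map: Dict[str, str]) -> None:
    """Replace text and attribute values that exactly equal a local reference.

    Exact matching avoids rewriting a longer identifier that merely
    starts with a mapped one.
    """
    for node in elem.iter():
        if node.text and node.text.strip() in pid_map:
            node.text = pid_map[node.text.strip()]
        for name, value in list(node.attrib.items()):
            if value in pid_map:
                node.set(name, pid_map[value])


def build_foxml(pid: str, datastreams: List[dict]) -> ET.Element:
    """Build a FOXML digital object with inline XML datastreams.

    Each datastream dict has the keys "id", "mime_type", "label", and
    "content" (an ElementTree element); these keys are illustrative.
    """
    obj = ET.Element(fx("digitalObject"), {"VERSION": "1.1", "PID": pid})
    props = ET.SubElement(obj, fx("objectProperties"))
    ET.SubElement(props, fx("property"), {
        "NAME": "info:fedora/fedora-system:def/model#state",
        "VALUE": "Active",
    })
    for ds in datastreams:
        stream = ET.SubElement(obj, fx("datastream"), {
            "ID": ds["id"], "STATE": "A",
            "CONTROL_GROUP": "X", "VERSIONABLE": "true",
        })
        version = ET.SubElement(stream, fx("datastreamVersion"), {
            "ID": ds["id"] + ".0",
            "MIMETYPE": ds["mime_type"],
            "LABEL": ds["label"],
        })
        ET.SubElement(version, fx("xmlContent")).append(ds["content"])
    return obj
```

In this sketch the PID mapping is supplied by the caller, which mirrors the idea of keeping the mapping between PIDs and local references available for later maintenance.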