Digital Document Quarterly

Perspectives on Trustworthy Information

Volume 6, Number 1, 1Q2007

 

HMG Consulting

Saratoga, CA 95070

©  2007, H.M. Gladney

 

ISSN: 1547-8610


Digital Preservation

Digital preservation activities seem to be entering a new phase, shifting from a need to solve basic problems to a need to implement solutions and establish repository procedures.  This is signaled by new books attempting to treat their topics comprehensively.

Public symposia and workshops focusing on digital preservation seem ever more frequent, with roughly one somewhere in the world each month.  As the topic is maturing, it would help everyone interested if conference organizers, and also authors, would clearly indicate their focal levels, e.g., one of

·         Fundamental principles for knowledge and information preservation (epistemology);

·         Methodological principles and software architecture;

·         Software tools and other technology implementations;

·         Archival institution management; or

·         Tutorial presentations, particularly for librarians and archivists.

A Reader’s Comment on Preserving “Everything”

John Erickson commented on recent DDQ numbers: [1]

“… the myth of expert archiving and dangers of not preserving "everything."  There has been much hubbub over the past few years (partly inspired by the Da Vinci Code) over non-canonical documentation of early Christianity and how it should be considered.  Documents such as the Nag Hammadi (Dead Sea) scrolls, which introduced non-canonical sources into the mix, have given us a broader context for considering these traditions. …

“… a small sampling of "alternative" documents can have a major impact, … [suggesting] preserving as much as possible.  Surely there were other contemporary sources which did not survive.  And, considering how the "Canon" was formulated (esp. via the Council of Nicene) and its subsequent effect on what was preserved or destroyed, … "expert" preservation [should] be considered dangerous and suspect, because every such instance is merely an opinionated selection … of what will be [wanted] by future scholars.

“… related to this issue of "preserving as much as possible" [as described in] MyLifeBits[2]  … exemplifies the problem of not knowing at artifact creation time what data might be useful or required at use time.  Most efforts to simplify the data set will invariably reduce the usefulness of the artifact.  A scientist knows never to destroy original data, but our IT culture does it every day; in fact, many corporations have policies for deleting old data for both policy reasons and to save space.”

Erickson’s reminders of the selection challenges of building a long-term digital collection are appropriate, but need to be balanced by practicalities.

Saving “everything” needs to be tempered by anticipation of future users’ ability to find what might interest them.  An extreme case illustrates the point.  During one of the Government-IBM antitrust lawsuits, the judge ordered that every document copy which a recipient had annotated was to be saved, telephone books included.  IBM instructed employees to deposit all discards in large collection bins.  Allegedly all the content was dumped in warehouses, without any organization or inventory.  A hallway joke was that, whenever the court order would be lifted, the price of scrap paper would drop.  And it did!

The concern with derivative documents that include editorial change, illustrated with the Nicene Creed, can readily be overcome by responsible conservators.  If they accompany each edited document by honest and adequate provenance information that is firmly bound within a resultant archival object, that object will not mislead future scholars.  In fact, it will be an authentic archival object.[1]

New Books and Reports on Digital Preservation[3]

The five books reported below are so new that, as far as I know, no independent reviews have yet appeared.  It would help the cultural heritage community if DDQ readers were to review some of these books.

Deegan & Tanner, Digital Preservation[4]

This essay collection sketches recent progress, including how digitization is changing in archives, libraries, and museums.  The essays also suggest, in view of digital preservation as a moving target, the value of periodic overviews for deciding next activities.  Chapters cover:

·         Key issues in and strategies for digital preservation

·         The status of preservation metadata in the digital library community

·         Web archiving

·         The costs of digital preservation

·         European approaches to digital preservation

·         Digital preservation project case studies.

Digital Preservation is intended to be a guide for information managers, librarians and archivists, as well as for students in library and information studies courses.

Borghoff et al., Long-Term Preservation of Digital Documents[5]

These authors briefly describe markup and document description languages (TIFF, PDF, XML, Dublin Core, …), explain migration and emulation techniques, and present the OAIS (Open Archival Information System) Reference Model.  To complement this technical background, they present selected repository projects (at Cornell University and the National Library of the Netherlands).  A rated survey of systems and tools completes the book.

This work is intended for librarians, computer scientists, and information managers engaged with social and methodological requirements for long-term information access.

Masanès, Web Archiving[6]

Masanès has collected essays about tools, tasks, processes, and standards needed to preserve portions of the WWW.  His book can serve as an introduction to keeping online information alive.  It covers issues related to building, using and preserving Web archives for computer scientists and librarians.  This book is intended to be a state-of-the-art overview for practitioners.

Batini & Scannapieco, Data Quality[7]

Batini and Scannapieco systematically introduce quality issues for federated data[8], web data, and other time-dependent data, classified according to frequency of change.  The book describes methodologies from core data quality research as well as data mining, probability theory, statistical data analysis, and machine learning.  It ends with critical comparison of tools and practical methodologies for data quality problems.

This book is broadly targeted—for researchers, students, and engineers who want an introductory course or self-study on its topics.

Gladney, Preserving Digital Information[9]

Preserving an information collection is a different challenge than managing archives.  Preserving Digital Information addresses[10] fundamentals and software design for preserving a file collection[11] indefinitely—methodology claimed to be complete and optimal for any information types (representations) whatsoever.  In a nutshell:

·         Any information pattern can be protected against loss by replicating its carrier object in independent repositories.

·         A perpetually unique document identifier (easily constructed) that is embedded in each preservation object will enable durable indexing for global search engines.

·         Ensuring that eventual users can exploit any preserved document as its authors intend can be achieved by augmenting its source version with representation in a lingua franca appropriate to its genre.

·         Making a document trustworthy can be achieved by firmly binding evidence, using cryptographic signatures of individuals or enterprises that have little to gain and much to lose by endorsing misrepresentations.

·         Essential document relationships that define collections and critical dependencies can be reliably preserved by binding inter- and intra-document links to digital document hash codes.

·         Everything that is new and essential[12] can be implemented by straightforward extensions of office document software based on a relatively small number of widely used international standards, such as core portions of XML, character coding standards, cryptography, and the Church-Turing thesis.
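The hash-binding idea in the list above can be illustrated with a minimal sketch.  It assumes SHA-256 as the hash function (the summary above does not name one), and the function names and the identifier "doc-0001" are invented for illustration; this is not the book's implementation.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return the SHA-256 digest of a document's bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def make_link(target_id: str, target_bytes: bytes) -> dict:
    """Build a link record that names the target and pins its content hash."""
    return {"target_id": target_id, "sha256": content_hash(target_bytes)}


def link_is_intact(link: dict, candidate_bytes: bytes) -> bool:
    """Check that a retrieved copy still matches the hash stored in the link."""
    return content_hash(candidate_bytes) == link["sha256"]


# Hypothetical document content and identifier, for illustration only.
original = b"Minutes of the 2007 preservation workshop."
link = make_link("doc-0001", original)

replica_ok = link_is_intact(link, original)
replica_tampered = link_is_intact(link, original + b" (edited)")
```

A link built this way identifies one exact byte sequence, so a replica fetched from any independent repository can be checked against the link without trusting the repository that supplied it.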

A synopsis and table of contents are available online.  Another Web page provides actionable links to the book’s citations of Web-accessible references.

We call the two methodological components contributed by my colleagues and me “durable encoding” and “durable evidence”.[13]  A durable encoding prototype has been implemented in the National Library of the Netherlands DIAS and is being elaborated in Kopal.  Durable evidence methodology is being pursued in ArchiSafe.  Both projects enjoy German Government funding.

Database Preservation

A workshop announcement suggests that attention to preserving databases is timely:

Most of scientific research is now based on digital data resources, and databases are playing an increasingly important role.  Much of the data is either impossible … to reproduce or can only be recovered at enormous costs ….  Nearly every reference manual, dictionary and gazetteer benefits from some form of database management support, … The need for preservation is self-evident.

While considerable thought has been given to the preservation of fixed "digital objects" studied in the past, the preservation of databases, which have an internal structure and which may change over time, poses new challenges.    Libraries, the traditional curators of … reference material, have largely abrogated their archival responsibility to databases.  Database preservation raises new technical, economic and legal issues.

This announcement continues by posing 13 questions (reproduced below).  Addressing even these questions thoroughly seems beyond what a one-day workshop can accomplish.  The workshop purpose might be advanced by identifying what is already known about database preservation.  On the other hand, an unasked question seems worthwhile: Where should we look for ideas and for software for preserving databases? 

To keep what follows brief, we limit its scope and style.  As in the workshop announcement, what follows assumes that how to preserve static files is sufficiently understood.  It further assumes that, if a file can be reliably and reversibly converted to a file format that we know how to preserve, we can preserve that file by converting it to the understood format and saving that together with the reverse conversion rule. 
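The reversibility assumption can be made concrete with a small round-trip check.  The sketch below is only an illustration: it treats JSON as the "understood" format and a simple CSV file as the source, and the function names and sample rows are invented.  It holds for CSV text in the canonical form the writer emits; other inputs would need a richer reverse conversion rule.

```python
import csv
import io
import json


def csv_to_json(csv_text: str) -> str:
    """Forward conversion: parse CSV rows and store them as a JSON array."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    return json.dumps(rows)


def json_to_csv(json_text: str) -> str:
    """Reverse conversion rule: regenerate the CSV text from the JSON array."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(json.loads(json_text))
    return out.getvalue()


# Hypothetical source file content.
source = "id,species\n1,Quercus alba\n2,Acer rubrum\n"

preserved = csv_to_json(source)
round_trip_ok = json_to_csv(preserved) == source
```

A repository could run such a check at ingest time and refuse to rely on a conversion whose round trip does not reproduce the source exactly.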

Metadata are as important for databases as they are for other preserved objects, but are not discussed below because database metadata present the same challenges as other metadata.  Finally, this article does not attempt to justify what it sketches.[14] 

The term ‘database’ has many meanings, including “a set of files with internal structure conforming to a well-known schema class, such as that of a relational database, and to prescriptions of a database management system (DBMS), such as IBM’s DB2.”   Another definition is “any data collection that has a relatively simple and orderly structure.”  Various meanings are discussed in a Wikipedia article that also describes structure models.  Of the meanings for ‘database’, what follows assumes the kind of relational database managed by IBM’s DB2 except where some other database type is explicitly mentioned.[15]

Relational databases (RDBs) have mostly displaced hierarchical and network databases.  Instances of the latter models can be converted to relational form.  Other providers’ relational databases can be converted to work with IBM’s DB2.[16]   In fact, any structural information can be represented relationally.[17]  Accomplishing such conversions does make explicit a challenge for any data whatsoever—distinguishing essential information from accidental information, i.e., separating authors’ intentions from irrelevancies of their written works.[18]

What differences between RDBs and ordinary files are important for preservation methodology?  I can think of only three that are critical: (1) databases tend to contain little implicit or explicit contextual information; (2) RDBs are dynamic; and (3) RDBs are often much bigger than the biggest ordinary files, sometimes 1,000 times larger or even more.

By “dynamic”, we mean that a database might have to remain open to changes even while someone undertakes preservation actions, with many changes occurring in any time interval similar to that required to make a database copy.

The size challenge is that it can be impractical to use digital networks to copy an entire database from one location (computing environment) to another—too costly and too slow.

Commercial DBMSs handle the dynamic and size challenges in ways that suit preservation copying.  The “dynamic” challenge is handled by snapshot and logging functions, such as those that IBM’s DB2 has included for more than 20 years.[19]  A snapshot and a log can be combined to create a representation of the database state at any time between the snapshot execution and the time that logging ended.  Reconstituting an RDB from a snapshot and logs will be supported in commercial DBMSs for the foreseeable future.  New software is not needed.
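How a snapshot and a log combine into a state at a chosen time can be shown with a toy model.  The sketch below is not DB2's log format or recovery interface; the table, timestamps, and the function name state_at are all invented for illustration, and the log is assumed to hold complete row images in time order.

```python
from copy import deepcopy

# Hypothetical snapshot: table rows keyed by primary key, taken at time 100.
snapshot_time = 100
snapshot = {
    1: {"name": "Ada", "city": "London"},
    2: {"name": "Ted", "city": "San Jose"},
}

# Hypothetical change log: (timestamp, operation, key, new row or None), in time order.
log = [
    (110, "update", 2, {"name": "Ted", "city": "Saratoga"}),
    (120, "insert", 3, {"name": "Jim", "city": "Almaden"}),
    (130, "delete", 1, None),
]


def state_at(snapshot, snapshot_time, log, when):
    """Rebuild the table as of time `when` by replaying logged changes."""
    if when < snapshot_time:
        raise ValueError("cannot reconstruct a state older than the snapshot")
    state = deepcopy(snapshot)  # leave the preserved snapshot untouched
    for timestamp, op, key, row in log:
        if timestamp > when:
            break  # the log is ordered, so later entries are not needed
        if op == "delete":
            state.pop(key, None)
        else:  # "insert" and "update" both store the new row image
            state[key] = row
    return state


at_115 = state_at(snapshot, snapshot_time, log, 115)
at_130 = state_at(snapshot, snapshot_time, log, 130)
```

The same snapshot and log yield different historical states depending only on the requested time, which is why preserving those two artifacts is enough to preserve every intermediate state.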

Compared to scientific/cultural data collections, operational databases tend to be very large, to change rapidly over longer periods, and to have readily justified values to their owning enterprises.  Critical commercial databases are replicated to remote sites almost continuously.[20]  For scientific/cultural databases as large as a few terabytes, a practical method of replication is by Sneakernet, i.e., parcel post of external storage devices.[21]
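The case for Sneakernet is simple arithmetic.  The figures in the sketch below are assumptions chosen only for illustration (a 3-terabyte database, a sustained 100 Mbit/s link, and two-day parcel delivery); readers should substitute their own numbers.

```python
def transfer_hours(size_terabytes: float, megabits_per_second: float) -> float:
    """Hours needed to send the data over a link at a sustained rate."""
    bits = size_terabytes * 1e12 * 8  # decimal terabytes converted to bits
    seconds = bits / (megabits_per_second * 1e6)
    return seconds / 3600


# Assumed figures, not measurements.
network_hours = transfer_hours(3, 100)
shipping_hours = 48  # assumed two-day parcel delivery

sneakernet_wins = shipping_hours < network_hours
```

Under these assumptions the network copy takes about 67 hours, so the parcel arrives first; with a faster link or a smaller database the comparison reverses.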

With this background, we can suggest answers to the questions asked by the DB workshop organizers.  From a software engineering perspective, answers to the technical aspects are readily available in commercial DBMSs.  If the workshop took into account what is already known and in practice, its focus could shift to working out economical procedures packaged for cultural sector repository staffs.[22] 

What are the salient features of a database that should be preserved?  Snapshots and logs managed as described in DBMS documentation are sufficient to preserve any portion of a relational database.  The necessary snapshot and logging software support is part of any competent DBMS.[23]  The choice of database portion to preserve and related external information to provide context are subjective decisions similar to the choices for any document collection.

What are the different stages in the database preservation's life cycle?  All times in the life of a DB in use are equivalent, except that the DB content is changing.  With snapshots and logs, any desired state in the history of a DB can be preserved for later inspection.

How do we keep archived databases readable and usable in the long term (at acceptable cost)?  The formats of individual DB fields usually conform to standards (e.g., for floating point numbers), because that is part of "industrial strength" DBMS support.  Given that, the snapshots/logs mentioned above are sufficient.

How do we separate the data from a specific database management environment?  A motivation for Ted Codd's 1970 invention of the relational database model[24] was to make the data independent of DBMS implementation details.  An RDB is portable from the environment provided by one DBMS provider into that of any other DBMS provider.

How can we preserve the original data semantics and structure?  The structure (viz., the table, column, and field definitions) of an RDB is described in its system catalog tables, which are themselves part of this RDB.  The structure of these administrative tables is described in textbooks.[25]  As to semantics, it depends what one means by semantics.  This might be (1) how the DB responds to SQL queries and update actions, or (2) the relationship of DB content to real-world facts.  (1) is fully defined by the DB structure combined with SQL functionality.  Handling (2) is much the same as for the content of any book whatsoever, except that books tend to include or imply much more context than a typical relational database does.  For this reason, it is likely to be helpful to preserve a document collection to describe the database and its connections with the world.[26]
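That an RDB describes its own structure can be seen in any relational system.  The sketch below uses SQLite, which ships with Python, purely as a stand-in; DB2's catalog tables have different names and richer content, and the table "specimen" is invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE specimen ("
    "id INTEGER PRIMARY KEY, species TEXT NOT NULL, mass_g REAL)"
)

# SQLite keeps the schema in its own catalog table, sqlite_master,
# which is queried with ordinary SQL like any other table.
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk).
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(specimen)")]

conn.close()
```

Because the structural description is itself stored relationally, a snapshot of the database carries its table and column definitions along with the data; only the real-world meaning of those columns needs separate documentation.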

How can we preserve data while it continues to evolve?  Combine snapshots with DB logs.

How can we have efficient preservation frameworks, while retaining the ability to query different database versions?  Reconstitution of a prior DB state from a snapshot and logs creates a DB representation that can be queried just as the original data were queried.  Doing this would be efficient in the sense that it requires no new software and no fresh user education.

How can multi-user online access be provided to hundreds of archived databases containing terabytes of data?  Presumably the challenge alluded to is that academic repositories typically do not have the powerful computers needed to serve large databases, or that the value (to some constituency) of preserving big databases is not sufficient to justify the implied cost.

Can we move from a centralized model to a distributed, redundant model of database preservation?  This is primarily an issue of the cost and management of powerful computers and networks.  See the discussion of Sneakernets above.

What documentation is preserved together with a database, and in what format?  Those attributes of an RDB that distinguish it from other RDBs are preserved in its system catalog tables, which themselves are an RDB.  Everything that one needs to know about the latter is defined in textbooks.  As already mentioned, a preservationist needs to preserve an accompanying document collection to provide context.

What are the legal encumbrances on database preservation?  Such encumbrances are qualitatively similar to those for any other kind of intellectual property.

What can be learned from traditional archival appraisal for the selection of databases for preservation?  Selection is always a subjective decision of someone, or of some institution, deciding how to spend resources.  This is the case for DBs in the same sense as it is for books, with only costs and values different case by case.

To what extent can the preservation strategies, and procedural policies developed by archivists be adapted for databases?  This question is insufficiently defined for answers to be apparent.

Where should we look for ideas and for software for preserving databases?  Private sector database technology and deployment seem to be a decade ahead of what is discussed in academic sources apart from Computer Science departments.  For instance, Sun Microsystems is offering an immense Sneakernet implementation: an entire data center on wheels.  And seeMore’s Virtual Database Server (sVDBS) seems to include most of what might be needed to implement database preservation.[27]