Digital Document Quarterly: Perspectives on Trustworthy Information
Volume 6, Number 1, 1Q2007
HMG Consulting
© 2007, H.M. Gladney ISSN: 1547-8610
Digital preservation activities seem to be entering a new phase, shifting from the need to solve basic problems to the need to implement solutions and establish repository procedures. This is signaled by new books attempting to treat their topics comprehensively.
Public symposia and workshops focusing on digital preservation seem ever more frequent, with roughly one somewhere in the world each month. As the topic is maturing, it would help everyone interested if conference organizers, and also authors, would clearly indicate their focal levels, e.g., one of
· Fundamental principles for knowledge and information preservation (epistemology);
· Methodological principles and software architecture;
· Software tools and other technology implementations;
· Archival institution management; or
· Tutorial presentations, particularly for librarians and archivists.
A Reader’s Comment on Preserving “Everything”
John Erickson commented on recent DDQ numbers: [1]
“… the myth of expert archiving and dangers of not preserving "everything." There has been much hubbub over the past few years (partly inspired by the Da Vinci Code) over non-canonical documentation of early Christianity and how it should be considered. Documents such as the Nag Hammadi (
“… a small sampling of "alternative" documents can have a major impact, … [suggesting] preserving as much as possible. Surely there were other contemporary sources which did not survive. And, considering how the "Canon" was formulated (esp. via the Council of [Nicaea]) and its subsequent effect on what was preserved or destroyed, … "expert" preservation [should] be considered dangerous and suspect, because every such instance is merely an opinionated selection … of what will be [wanted] by future scholars.
“… related to this issue of "preserving as much as possible" [as described in] MyLifeBits[2] … exemplifies the problem of not knowing at artifact creation time what data might be useful or required at use time. Most efforts to simplify the data set will invariably reduce the usefulness of the artifact. A scientist knows never to destroy original data, but our IT culture does it every day; in fact, many corporations have policies for deleting old data for both policy reasons and to save space.”
Erickson’s reminders of the selection challenges of building a long-term digital collection are appropriate, but need to be balanced by practicalities.
Saving “everything” needs to be tempered by anticipating future users’ ability to find what might interest them. An extreme case illustrates the point. During one of the Government-IBM antitrust lawsuits, the judge ordered that every document copy which a recipient had annotated be saved, including telephone books. IBM instructed employees to deposit all discards in large collection bins. Allegedly all the content was dumped in warehouses, without any organization or inventory. A hallway joke was that, whenever the court order was lifted, the price of scrap paper would drop. And it did!
The concern with derivative documents that include editorial change, illustrated with the Nicene Creed, can readily be overcome by responsible conservators. If they accompany each edited document with honest and adequate provenance information that is firmly bound within the resulting archival object, that object will not mislead future scholars. In fact, it will be an authentic archival object.[1]
New Books and Reports on Digital Preservation[3]
The five books reported below are so new that, as far as I know, no independent reviews have yet appeared. It would help the cultural heritage community if DDQ readers would review some of these books.
Deegan & Tanner, Digital Preservation[4]
This essay collection sketches recent progress, including how digitization is changing in archives, libraries, and museums. The essays also suggest that, because digital preservation is a moving target, periodic overviews are valuable for deciding what to do next. Chapters cover:
· Key issues in and strategies for digital preservation
· The status of preservation metadata in the digital library community
· Web archiving
· The costs of digital preservation
· European approaches to digital preservation
· Digital preservation project case studies.
Digital Preservation is intended to be a guide for information managers, librarians and archivists, as well as for students in library and information studies courses.
Borghoff et al., Long-Term Preservation of Digital Documents[5]
These authors briefly describe markup and document description languages (TIFF, PDF, XML, Dublin Core, …), explain migration and emulation techniques, and present the OAIS (Open Archival Information System) Reference Model. To complement this technical background, they present selected repository projects (at
This work is intended for librarians, computer scientists, and information managers engaged with social and methodological requirements for long-term information access.
Masanès, Web Archiving[6]
Masanès has collected essays about tools, tasks, processes, and standards needed to preserve portions of the WWW. His book can serve as an introduction to keeping online information alive. It covers issues related to building, using and preserving Web archives for computer scientists and librarians. This book is intended to be a state-of-the-art overview for practitioners.
Batini & Scannapieco, Data Quality[7]
Batini and Scannapieco systematically introduce quality issues for federated data[8], web data, and other time-dependent data, classified according to frequency of change. The book describes methodologies from core data quality research as well as data mining, probability theory, statistical data analysis, and machine learning. It ends with critical comparison of tools and practical methodologies for data quality problems.
This book is broadly targeted: it is for researchers, students, and engineers who want an introductory course or self-study on its topics.
Gladney, Preserving Digital Information[9]
Preserving an information collection is a different challenge than managing archives. Preserving Digital Information addresses[10] fundamentals and software design for preserving a file collection[11] indefinitely—methodology claimed to be complete and optimal for any information types (representations) whatsoever. In a nutshell:
· Any information pattern can be protected against loss by replicating its carrier object in independent repositories.
· A perpetually unique document identifier (easily constructed) that is embedded in each preservation object will enable durable indexing for global search engines.
· Ensuring that eventual users can exploit any preserved document as its authors intend can be achieved by augmenting its source version with representation in a lingua franca appropriate to its genre.
· Making a document trustworthy can be achieved by firmly binding evidence, using cryptographic signatures of individuals or enterprises that have little to gain and much to lose by endorsing misrepresentations.
· Essential document relationships that define collections and critical dependencies can be reliably preserved by binding inter- and intra-document links to digital document hash codes.
· Everything that is new and essential[12] can be implemented by straightforward extensions of office document software based on a relatively small number of widely used international standards, such as core portions of XML, character coding standards, cryptography, and the Church-Turing thesis.
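The idea of binding links to document hash codes can be illustrated with a minimal Python sketch. It is not the book's implementation: the choice of SHA-256, the class `HashBoundLink`, and the identifier string are assumptions made only for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class HashBoundLink:
    """A link that names its target and pins the target's exact content."""
    target_id: str       # hypothetical durable identifier of the target
    target_sha256: str   # hex digest of the target's bytes when the link was made


def make_link(target_id: str, target_bytes: bytes) -> HashBoundLink:
    """Create a link bound to the current content of the target document."""
    return HashBoundLink(target_id, hashlib.sha256(target_bytes).hexdigest())


def link_is_intact(link: HashBoundLink, candidate_bytes: bytes) -> bool:
    """Check that a retrieved copy is the same bit string the link was bound to."""
    return hashlib.sha256(candidate_bytes).hexdigest() == link.target_sha256


if __name__ == "__main__":
    link = make_link("doc:example-1", b"original text")
    print(link_is_intact(link, b"original text"))  # the unchanged copy
    print(link_is_intact(link, b"altered text"))   # a modified copy
```

A reader who retrieves a replica from any repository can recompute the digest and learn whether that replica is the document the link's author meant. Showing who made the link would additionally need a signature, which this sketch omits.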
A synopsis and table of contents is available on-line. Another Web page provides actionable links to the book’s citations of Web-accessible references.
We call the two methodological components contributed by my colleagues and myself “durable encoding” and “durable evidence”.[13] A durable encoding prototype has been implemented in the National Library of the Netherlands DIAS and is being elaborated in Kopal. Durable evidence methodology is being pursued in ArchiSafe. Both projects enjoy German Government funding.
A workshop announcement suggests that attention to preserving databases is timely:
Most of scientific research is now based on digital data resources, and databases are playing an increasingly important role. Much of the data is either impossible … to reproduce or can only be recovered at enormous costs …. Nearly every reference manual, dictionary and gazetteer benefits from some form of database management support, … The need for preservation is self-evident.
While considerable thought has been given to the preservation of fixed "digital objects" studied in the past, the preservation of databases, which have an internal structure and which may change over time, poses new challenges. …
Libraries, the traditional curators of … reference material, have largely abrogated their archival responsibility to databases. Database preservation raises new technical, economic and legal issues.
This announcement continues by posing 13 questions (reproduced below). Addressing even these questions thoroughly seems beyond what a one-day workshop can accomplish. The workshop purpose might be advanced by identifying what is already known about database preservation. On the other hand, an unasked question seems worthwhile: Where should we look for ideas and for software for preserving databases?
To keep what follows brief, we limit its scope and style. As in the workshop announcement, what follows assumes that how to preserve static files is sufficiently understood. It further assumes that, if a file can be reliably and reversibly converted to a file format that we know how to preserve, we can preserve that file by converting it to the understood format and saving that together with the reverse conversion rule.
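The reversibility assumption can be tested mechanically for any given file. The following Python sketch is illustrative only: the function name `is_reversible` is hypothetical, and base64 stands in for a real format conversion. A successful round trip on one file shows reversibility for that file, not for the format in general.

```python
import base64
from typing import Callable


def is_reversible(original: bytes,
                  convert: Callable[[bytes], bytes],
                  revert: Callable[[bytes], bytes]) -> bool:
    """Return True if revert(convert(x)) reproduces x bit for bit."""
    return revert(convert(original)) == original


if __name__ == "__main__":
    sample = b"\x00\x01 Mixed Case payload \xff"

    # A lossless stand-in conversion: the reverse rule restores every byte.
    print(is_reversible(sample, base64.b64encode, base64.b64decode))

    # A lossy stand-in conversion: lower-casing discards information,
    # so no reverse rule can restore the original.
    print(is_reversible(sample, bytes.lower, lambda b: b))
```

A repository would run such a check before discarding the source format, and would save the reverse rule alongside the converted file.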
Metadata are as important for databases as they are for other preserved objects, but are not discussed below because database metadata present the same challenges as other metadata. Finally, this article does not attempt to justify what it sketches.[14]
The term ‘database’ has many meanings, including “a set of files with internal structure conforming to a well-known schema class, such as that of a relational database, and to prescriptions of a database management system (DBMS), such as IBM’s DB2.” Another definition is “any data collection that has a relatively simple and orderly structure.” Various meanings are discussed in a Wikipedia article that also describes structure models. Of the meanings for ‘database’, what follows assumes the kind of relational database managed by IBM’s DB2 except where some other database type is explicitly mentioned.[15]
Relational databases (RDBs) have mostly displaced hierarchical and network databases. Instances of the latter models can be converted to relational form. Other providers’ relational databases can be converted to work with IBM’s DB2.[16] In fact, any structural information can be represented relationally.[17] Accomplishing such conversions does make explicit a challenge for any data whatsoever—distinguishing essential information from accidental information, i.e., separating authors’ intentions from irrelevancies of their written works.[18]
What differences between RDBs and ordinary files are important for preservation methodology? I can think of only three that are critical: (1) databases tend to contain little implicit or explicit contextual information; (2) RDBs are dynamic; and (3) RDBs are often much bigger than the biggest ordinary files, sometimes 1,000 times larger or more.
“Dynamic” means that a database might have to remain accessible for change even while someone undertakes preservation actions, with many changes occurring in any time interval similar to that required to make a database copy.
The size challenge is that it can be impractical to use digital networks to copy an entire database from one location (computing environment) to another—too costly and too slow.
Commercial DBMSs already handle the dynamic and size challenges well enough for preservation copying. The “dynamic” challenge is handled by snapshot and logging functionalities, such as those that IBM’s DB2 has included for more than 20 years.[19] A snapshot and a log can be combined to create a representation of the database state at any time between the snapshot execution and the time that logging ended. Reconstituting an RDB from a snapshot and logs will be supported in commercial DBMSs for the foreseeable future. New software is not needed.
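How a snapshot and a log combine into a state for a chosen time can be shown with a toy model in Python. This is a sketch of the principle only, not DB2's recovery mechanism: the record layout `LogRecord`, the key/value table model, and the integer timestamps are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class LogRecord:
    timestamp: int          # logical time of the change (assumed integer clock)
    key: str                # hypothetical row key
    value: Optional[str]    # new row value, or None for a delete


def state_at(snapshot: Dict[str, str],
             snapshot_time: int,
             log: Iterable[LogRecord],
             as_of: int) -> Dict[str, str]:
    """Rebuild the table contents as of `as_of` from a snapshot plus its log.

    Assumes the snapshot already reflects every change up to snapshot_time.
    """
    if as_of < snapshot_time:
        raise ValueError("requested time precedes the snapshot")
    state = dict(snapshot)  # copy, so the preserved snapshot is never mutated
    for rec in sorted(log, key=lambda r: r.timestamp):
        if rec.timestamp <= snapshot_time:
            continue        # already contained in the snapshot
        if rec.timestamp > as_of:
            break           # later than the requested state
        if rec.value is None:
            state.pop(rec.key, None)
        else:
            state[rec.key] = rec.value
    return state


if __name__ == "__main__":
    snap = {"a": "1", "b": "2"}
    log = [LogRecord(11, "a", "10"), LogRecord(12, "b", None), LogRecord(13, "c", "3")]
    print(state_at(snap, 10, log, 12))
```

The same snapshot and log yield a different state for each `as_of` value. This is why preserving the pair preserves the whole history between the snapshot and the end of logging.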
Compared to scientific/cultural data collections, operational databases tend to be very large, to change rapidly over longer periods, and to have readily justified values to their owning enterprises. Critical commercial databases are replicated to remote sites nearly continuously.[20] For scientific/cultural databases as large as a few terabytes, a practical method of replication is by Sneakernet, i.e., parcel post of external storage devices.[21]
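A rough calculation shows why shipping storage devices can beat a network copy. In the Python sketch below, the link rate, the 80% usable-bandwidth factor, and the shipping time are illustrative assumptions, not measurements from any repository.

```python
def network_transfer_hours(size_terabytes: float,
                           link_megabits_per_s: float,
                           efficiency: float = 0.8) -> float:
    """Hours needed to push a database copy through a network link.

    `efficiency` is an assumed fraction of the nominal link rate that is
    actually usable for the copy.
    """
    if size_terabytes < 0 or link_megabits_per_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("size must be >= 0, rate > 0, and 0 < efficiency <= 1")
    total_bits = size_terabytes * 1e12 * 8            # decimal terabytes to bits
    usable_bits_per_s = link_megabits_per_s * 1e6 * efficiency
    return total_bits / usable_bits_per_s / 3600


def sneakernet_wins(size_terabytes: float,
                    link_megabits_per_s: float,
                    shipping_hours: float) -> bool:
    """True if shipping the storage devices beats the network copy."""
    return shipping_hours < network_transfer_hours(size_terabytes, link_megabits_per_s)


if __name__ == "__main__":
    hours = network_transfer_hours(2, 100)   # 2 TB over a 100 Mbit/s link
    print(f"network copy: {hours:.1f} h")
    print("ship it instead:", sneakernet_wins(2, 100, shipping_hours=48))
```

Under these assumed figures, copying 2 TB over a 100 Mbit/s link takes roughly 56 hours, so a parcel delivered in two days arrives first. With a faster link or a smaller database the comparison reverses.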
With this background, we can suggest answers to the questions asked by the DB workshop organizers. From a software engineering perspective, answers to the technical aspects are readily available in commercial DBMSs. If the workshop took into account what is already known and in practice, its focus could shift to working out economical procedures packaged for cultural sector repository staffs.[22]
What are the salient features of a database that should be preserved? Snapshots and logs managed as described in DBMS documentation are sufficient to preserve any portion of a relational database. The necessary snapshot and logging software support is part of any competent DBMS.[23] The choice of database portion to preserve and related external information to provide context are subjective decisions similar to the choices for any document collection.
What are the different stages in the database preservation's life cycle? All times in the life of a DB in use are equivalent, except that the DB content is changing. With snapshots and logs, any desired state in the history of a DB can be preserved for later inspection.
How do we keep archived databases readable and usable in the long term (at acceptable cost)? The formats of individual DB fields usually conform to standards (e.g., for floating point numbers), because that is part of "industrial strength" DBMS support. Given that, the snapshots/logs mentioned above are sufficient.
How do we separate the data from a specific database management environment? A motivation for Ted Codd's 1970 invention of the relational database[24] was to make the data independent of DBMS implementation details. An RDB is portable from the environment provided by one DBMS provider into that of any other DBMS provider.
How can we preserve the original data semantics and structure? The structure (viz., the table, column, and field definitions) of an RDB is described in its system catalog tables, which are themselves part of this RDB. The structure of these administrative tables is described in textbooks.[25] As to semantics, it depends on what one means by semantics. This might be (1) how the DB responds to SQL queries and update actions, or (2) the relationship of DB content to real-world facts. (1) is fully defined by the DB structure combined with SQL functionality. Handling (2) is much the same as for the content of any book whatsoever, except that books tend to include or imply much more context than a typical relational database does. For this reason, it is likely to be helpful to preserve a document collection to describe the database and its connections with the world.[26]
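That a relational database describes its own structure can be demonstrated with a short Python sketch. SQLite, from the Python standard library, is used here purely as a stand-in: its catalog (`sqlite_master` and `PRAGMA table_info`) differs in detail from DB2's catalog tables. The `specimen` table is a made-up example.

```python
import sqlite3


def describe_schema(conn: sqlite3.Connection) -> dict:
    """Return {table_name: [(column_name, declared_type), ...]} read from the catalog."""
    schema = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info yields rows of (cid, name, type, notnull, dflt_value, pk)
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [(col[1], col[2]) for col in columns]
    return schema


# A made-up table, standing in for a scientific or cultural data collection.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE specimen (id INTEGER PRIMARY KEY, species TEXT NOT NULL, collected TEXT)"
)

if __name__ == "__main__":
    for table, columns in describe_schema(conn).items():
        print(table, columns)
```

The catalog answers only the structural question. What a `species` value means in the world is the second kind of semantics, and belongs in the accompanying document collection.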
How can we preserve data while it continues to evolve? Combine snapshots with DB logs.
How can we have efficient preservation frameworks, while retaining the ability to query different database versions? Reconstitution of a prior DB state from a snapshot and logs creates a DB representation that can be queried just as the original data were queried. Doing this would be efficient in the sense that it requires no new software and no fresh user education.
How can multi-user online access be provided to hundreds of archived databases containing terabytes of data? Presumably the challenge alluded to is that academic repositories typically do not have the powerful computers needed to serve large databases, or that the value (to some constituency) of preserving big databases is not sufficient to justify the implied cost.
Can we move from a centralized model to a distributed, redundant model of database preservation? This is primarily an issue of the cost and management of powerful computers and networks. See the discussion of Sneakernets above.
What documentation is preserved together with a database, and in what format? Those attributes of an RDB that distinguish it from other RDBs are preserved in its system catalog tables, which themselves are an RDB. Everything that one needs to know about the latter is defined in textbooks. As already mentioned, a preservationist needs to preserve an accompanying document collection to provide context.
What are the legal encumbrances on database preservation? Such encumbrances are qualitatively similar to those for any other kind of intellectual property.
What can be learned from traditional archival appraisal for the selection of databases for preservation? Selection is always a subjective decision of someone, or of some institution, deciding how to spend resources. This is the case for DBs in the same sense as it is for books, with only costs and values different case by case.
To what extent can the preservation strategies and procedural policies developed by archivists be adapted for databases? This question is insufficiently defined for answers to be apparent.
Where should we look for ideas and for software for preserving databases? Private sector database technology and deployment seem to be a decade ahead of what is discussed in academic sources apart from Computer Science departments. For instance, Sun Microsystems is offering an immense Sneakernet implementation: an entire data center on wheels. And seeMore’s Virtual Database Server (sVDBS) seems to include most of what might be needed to implement database preservation.[27]