Digital Document Quarterly
Perspectives on Trustworthy Information
Volume 1, Number 4, 4Q2002
HMG Consulting, 20044 Glen Brae Drive, Saratoga, CA 95070, (408) 867-5454
© 2002, H.M. Gladney
This number continues DDQ’s 2002 emphasis on digital content preservation, starting to identify what we believe to be a solution to the technical challenges.
“Preservation of digital information is complex because of the dependency digital information has on its technical environment. … digital resources present more difficult problems than conventional analog media such as paper-based books. … there is a lack of proven preservation methods to ensure that the information will continue to be readable.” [Lee]
It seems to be common opinion that there are just two possibilities for making complex content durably intelligible: transformative migration and preservation emulation. We agree with a developing consensus that the proposals for accomplishing either alternative are deeply flawed. As far as we know, next to no-one is considering other ways of addressing the challenge; comments such as Lee’s are frequently encountered, suggesting prevailing pessimism. Two failed methods do not, however, demonstrate that no method will work—that nothing new will be found. This DDQ number announces a series of articles that propose how to handle the technical aspects of digital preservation and describe work in progress.
Before sketching this proposed solution, DDQ 1(4) comments on the perception that “digital resources present more difficult problems than conventional analog media such as paper-based books”—an opinion that we feel overlooks facts about the technologies and about people’s growing expectations. Furthermore, pessimism related to short-lived media seems premature. It would be more reasonable to revive such skepticism after further attempts to preserve information, but not before.
Papyrus and Digital Preservation
Pessimism about digital preservation is sometimes accompanied by a comparison of the durability of paper-like media with the durability of magnetic and optical media of the kinds used to hold digital bitstreams and analog recordings.
“Truth is embedded in the symbols and artifacts that we create and then keep by choice or by accident. And yet, as we approach the end of the twentieth century, we find ourselves confronting a dilemma such as the one faced by Howard Carson, Macaulay's amateur digger: a vast void of knowledge filled by myth and speculation. Information in digital form—the evidence of the world we live in—is more fragile than the fragments of papyrus found buried with the Pharaohs.” [Conway in CLIR 92]
While there is truth in what Conway asserts, this is only marginally pertinent because nobody seriously proposes that digital information be stored forever on today’s media, which are designed for very high density storage and for very rapid access to what they hold.
In fact, the specifics do not support a firm assertion that “information in digital form is more fragile than … papyrus (or paper)”. Let’s consider the circumstances more carefully than is commonly done.[1]
Ø The usual interpretation pits some surviving papyrus against an individual digital object. What fraction of original papyri have been lost? We don’t know.
Ø By the time that the surviving papyri were created, their technology had been refined for several centuries. Digital documents were first created about 40 years ago. Who knows how durable digital media will be in 2102?
Ø Papyrus that survived was probably of the best quality and might have had durability treatment as good as that applied to the deceased pharaohs. It was intended as a preservation medium; a 2002 digital counterpart would be nickel disks.[2]
Ø Important digital content is replicated many times, thereby increasing its likelihood of survival. Did the Egyptians create even 5 copies of important papyri?
Ø Storage spaces for the surviving papyri—Egyptian pyramids—were engineered for durability, not for ready access. The modern comparison might be a sealed stone cavern.[3]
Ø Many documents printed between about 1880 and about 1930 are on “acid paper”—paper bleached by sulphurous acid. This treatment left residual sulphuric acid that is gradually burning up the paper. For instance, consider:
“The Canadian Printer and Publisher is the most significant journal for the history of printing and publishing in Canada, and is of importance to researchers in numerous disciplines including the history of books and printing, graphic design, literary studies and business history. However, in its current printed form, the material is extremely fragile, and is quickly becoming very brittle. Eventually the paper will crumble, ultimately resulting in these primary resources being inaccessible even to the most serious scholars. Thankfully, through the use of the newest technologies we will be able to save these original materials for future generations.” University of Toronto Libraries Newsletter, Fall 2002.
A test for durability of a digital medium might be to write something onto each of 100 nickel crystals, to seal and store each crystal under 50 feet of nearly solid rock, and to inspect their contents a century later.[4]
The points above compare media and information retention—a misleading focus. The survival of a few papyri teaches something about the likely survival of paper records, but very little about the durability of digital content compared to durability of content recorded by printing or with analog formats—a quite different issue.
Digital preservation might seem more difficult than preservation on paper if you do not know how to achieve it. After methods are chosen and institutionalized, digital preservation is likely to seem simpler and to be less expensive than preservation on paper.
Why Do Digital Data Seem to Present Difficulties?
Precisely what causes the problem? Answers to this question can help by suggesting solution characteristics.
We can read from paper without machinery, but need mechanical assistance for access to digital data for at least the following reasons:
Ø Machinery is needed for information types, such as recordings of live performances, beyond those that paper can handle well.
Ø We are generating far more content than ever before, and want to preserve more of this than would be affordable with old technology.
Ø Practical digital information density is far beyond what our eyes can resolve.[5]
Ø High performance and reliability depend on complex encoding that would make human reading impractical even if we could distinguish the bits.
Ø Machinery and software enable rapid and inexpensive searching for content of interest.
The first reason is qualitative, with digital technology providing sufficient advantages that it has replaced many analog predecessors.[6] The other reasons are “merely” quantitative, but digital performance is hugely superior to what paper allows.
Society has chosen, implicitly valuing immediate performance over ill-defined benefits years in the future. Furthermore, the technical parameters unambiguously indicate that digital technology is the only way to solve preservation challenges that digital technology created.
Digital information handling that many people older than 30 years find unnatural and difficult is accepted as natural and easy by many in the next generation. Many of us have personal experience with that, so we do not need to belabor the implications. Nevertheless, a recent anecdote[7] might provoke a smile as it illustrates the point.
A man was puzzled by a photograph showing six toddlers, each in a big flowerpot and wearing a wreath. He was amazed that every child was smiling and looking in the same direction. He mused aloud, “How did the photographer get them all to sit still simultaneously?”
His teen-aged daughter looked over his shoulder. “Simple, Dad. They just clicked them in!”
A seldom-mentioned factor in comparisons of reading from paper and of exploiting its digital counterpart is our formal and informal education practices. We each spent much of our first decade learning to write on and read from paper; later schooling concentrates on writing well and interpreting complex information represented in natural language. However, as adults we tend to be impatient with whatever effort might be needed to master the digital replacements.[8] To interpret accurately what someone else has written (and also to write well) depends on an immense body of shared experience, shared language, and shared world views. This shared knowledge is mostly implicit in what we write.
As we consider digital preservation, we should consider what complex of shared experience is needed and how this is best provided. Part of the necessary knowledge infrastructure is likely to appear through uncoordinated actions in society. Part will probably already be in place in durable paper collections; however, much of this is likely to be difficult to find and use for the user accustomed to computer network tools. The question is what information is in places or forms that will not be readily accessible when some consumer needs it.[9] Partly because the transformation to digital communication is happening so rapidly and partly because it has social implications that are, as yet, only ill-understood, we cannot confidently assume that adequate digital knowledge infrastructure will happen without explicit attention.
In addition, our expectations for the precision and accuracy of modern information tend to be higher than ever before. Many in our society are blessed with better education, more leisure time, and better access to cultural involvement than ever before. Moreover, our practical expectations (for health care, for business efficiency, for government transparency, for educational opportunities, …) depend more on recorded information than ever before. All these factors make it worthwhile to consider structuring explicit digital representations of the shared experience, language, world views, and ontologies implicit in our social fabric.
These topics teach at least two requirements for the digital infrastructure. First, we should identify standards and conventions that will make it easier and less expensive to provide and use the digital infrastructure, especially what’s needed for preservation over and above what’s needed for communication. Second, we should design and use secure links between otherwise independent documents because correct interpretation of most digital documents will depend on the reliability and trustworthiness of references to other documents. Happily, the necessary means are known and mostly deployed; we need only to embed unambiguous identifiers and links in each digital document, to ensure that the documents cannot be altered without this being detectable by anybody who cares, and to manage digital collections defined by digital finding aids.[10]
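The idea of a secure link can be illustrated with a minimal sketch, assuming that a link carries a cryptographic digest of the document it points to. The identifier `urn:example:doc-1` and the function names are hypothetical, chosen only for illustration, and SHA-256 is one reasonable digest, not one prescribed in this article.

```python
import hashlib


def make_link(identifier: str, target_bytes: bytes) -> dict:
    """Build a link record: an identifier plus a SHA-256 digest of the target."""
    digest = hashlib.sha256(target_bytes).hexdigest()
    return {"id": identifier, "sha256": digest}


def link_is_intact(link: dict, retrieved_bytes: bytes) -> bool:
    """Return True if the retrieved document still matches the link's digest."""
    return hashlib.sha256(retrieved_bytes).hexdigest() == link["sha256"]


# Hypothetical example: a referring document records a link to another document.
original = b"Standard for the Management of Electronic Records"
link = make_link("urn:example:doc-1", original)
```

Anyone who later retrieves the referenced document can call `link_is_intact` to detect alteration. The sketch does not show how the referring document itself is protected; that would need a signature over the document containing the link.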
Requirements Studies
DDQ 1(3) alluded to an inquiry seeking preservation requirements analyses; this survey yielded about 40 pointers that collectively identified half a dozen interesting documents. Detailed comments would still be premature, but the interesting sources are identified below.
The survey was intended to identify requirements that we might otherwise overlook in an architecture project. It sought indications of what would be needed to make preservation tools compatible with software an institution was already using, requirements not likely to appear in published assessments. What I had in mind was:
Ø Ideally, such a document would be of sufficient quality and contain sufficient detail to be useful first in an RFP, and later as a part of a contract for offerings and/or services. In the latter role, it would be useful to test compliance and completion by the vendor.
Ø Ideally, every specified requirement would be such that compliance or shortfalls could be objectively determined.
Ø The response would further address all portions of some explicitly identified and clearly defined enterprise objectives, and for any "line item" requirement specify broad aspects of the expected deliverable (e.g., "this is expected to be a software component") or explicitly leave such aspects to the respondent. For instance, it might allow a service to be automated or left as a human clerical step.[11]
Ø Finally, the response would be specific to a real situation in a real institution. For instance, if the archiving service had to use a particular family of UNIX offerings in order to interoperate with pre-existing services in the enterprise, or because the personnel were trained in some particular family of application programs, this would be stated as a requirement.
I've never seen any such ideal requirements statement[12] written by a potential digital library customer; I was seeking good approximations.[13] I’ve chosen the following to be assessment benchmarks for a planned preservation architecture[28], and recommend them to DDQ readers.
National Library of Australia (NLA), Digital Services Project: Request for tender–digital collection management system, at http://www.nla.gov.au/dsp/rft/. The NLA does not propose this as a model, but it would be instructive to anyone contemplating the infrastructure for managing and preserving digital collections. An attached draft contract clearly includes elements specific to Australian Government requirements, but also includes many generic elements. More generally, documents linked at http://www.nla.gov.au/dsp/ might be of interest.
Public Record Office Victoria, Standard for the Management of Electronic Records, PROS 99/007, 1999, at http://www.prov.vic.gov.au/vers/standards/pros9907.htm.
Cornwell Management Consultants, Model Requirements for the Management of Electronic Records (MoReq), 2001. This is a model specification of requirements for Electronic Records Management Systems (ERMS). It was designed to be easily used, and to be applicable throughout Europe.
The Royal Statistical Society and The UK Data Archive, Preserving and Sharing Statistical Material, Working Group on the Preservation and Sharing of Statistical Material: Information for Data Producers, 2002, at http://www.data-archive.ac.uk/home/PreservingSharing.pdf.
Requirements can also be inferred from archives’ published policies, from archivists' best practice guidelines, and from archivists’ periodicals, such as the Journal of the Society of Archivists on-line publications.[14]
EU-NSF Workgroup: Digital Preservation and Archiving
In late 2001, the European Commission and the National Science Foundation asked several international workgroups to identify research needs in certain information technologies. One group was commissioned for digital preservation. Its report should appear during 1Q2003. In the meantime, a 2-page summary is available.[15]
The full report will communicate more than a dozen research areas and justify four objectives:
Ø Ensuring that our descendants can understand and use any information we preserve.
Ø Ensuring that anybody can decide whether saved data can be trusted for his applications.
Ø Replacing human effort by automatic procedures whenever doing so is feasible.
Ø Empowering each information producer to package content and metadata to minimize what a professional archivist or librarian must do.
The first two objectives address information producers’ and consumers’ interests. The final two respond to the fact that the number of digital documents that people want to preserve might be much larger than the number of preserved paper documents.
The Workgroup recognized that practical action is urgent, and separated its research proposals into those likely to yield results quickly and those that might take longer to resolve. Results for the first set would allow many kinds of documents to be safely archived in the near future.
Characteristics of Preservation Solutions
The digital preservation challenge has been widely understood since the seminal [Garrett] appeared. However, the attention lavished on “archival information systems” since 1996 has not led to a persuasive solution. Perhaps this is because the work has focused on repositories and archival institutions, rather than on the content they are intended to safeguard. We therefore focus on what would be needed to preserve an individual digital document and its context.[16]
Consider an environment in which a producer—a human being who endorses a document as being authoritative—conveys information to consumers by the Internet or on storage volumes. Figure 1 suggests the content transfers that must occur. Transmission might be asynchronous, with the producer depositing the information in repositories from which consumers with whom he is not necessarily acquainted obtain it, possibly many years later. For current consumers, the producer might also transmit the information directly. The transfer will often be between machines of different hardware and software architecture; producers cannot generally anticipate what technology consumers will use.

Figure 1: Information interchange and repositories
This figure and common sense suggest solution properties and semantic challenges.
Ø There is no a priori reason to believe that digital preservation mechanisms will mimic those for information stored mostly on paper.
Ø The first two EU-NSF objectives treat information qualities apparent to end users, and are silent about repositories. This suggests that storage bin design is not a central issue.
Ø Digital preservation is an extension of digital information interchange. A solution is likely to incorporate much of the technology being developed for interchange.
Ø We might hope that information consumers understand exactly what authors intend to convey. The arrows with question marks in Figure 1 remind us that communicating intended meanings completely and accurately is impossible in principle.[17] The other arrows depict communications that might include purely syntactic transformations, and necessarily include conversions between analog and digital information representations. We need to figure out, “How close can we come to communicating intended meaning?”
Ø To avoid forcing users to treat repository communications differently than they treat inter-user transmissions, the repository ingest format and the repository delivery format should be identical to the document format that producers share with consumers.[18]
Ø Apart from standard protocols for document receipt and delivery and for inquiry[19], each archive or digital library can choose its design and procedures independently of those that other archives choose. In fact, apart from such interface protocols, digital preservation might impose no constraints on archive design.
Ø The last two EU-NSF objectives suggest that, because of the large and growing number of digital documents, archival institutions should automate all possible clerical steps.
Ø Since only content producers can know how their intended meanings map into their content encodings, archives should persuade producers to produce such mappings.[20] Success in this would alleviate significant administrative and trust problems.
Ø To stem serious on-going digital content loss, it is desirable to exploit current digital library offerings and related infrastructure with a minimum of change.[21]
Nothing: Trickier than You Might Think
As we turn from requirements analysis towards practical answers, it is time to be concerned about avoidable complexity, partly because of the cost implications, but more because what’s presented to end users should be simple and intuitive. We want to write as few new programs as possible, because even the simplest program increases project costs, is likely to require users to learn new things, and might harbor bugs that tend to be even more expensive to users than they are to software providers.
The reader might pause to consider a riddle before reading its answer immediately below. Suppose you want a program so general that it contributes equally to every application. What program would you write?
—————————
The answer to the riddle is that you should write a program that does nothing at all.
This answer teaches that, to make a program more general than it already is, you should reduce its functionality—take things out of it—rather than adding new function.[22]
The following anecdote might help you remember this principle, which motivates making [OAIS] submission, distribution, and archival information packages identical[18] whenever possible and will be seen to influence other aspects of our preservation solution.[23]
You might think that writing a null program is very easy; the anecdote suggests otherwise.
When assembly-language programming was still common, a programming manager asked a recruit to produce a version of the IBM OS/MVS null program called IEFBR14.[24] IEFBR14 was to use standard MVS calling conventions; all it was to do was return successfully.[25]
The first version was something like:
IEFBR14  START
         BR    14            Return address in R14 -- branch at it
         END
First bug: An MVS program indicates successful completion by zeroing register 15 before returning; this version of the null program "failed" every time. The second version was:
IEFBR14  START
         SR    15,15         Zero out register 15
         BR    14            Return address in R14 -- branch at it
         END
Much better. However, this caused problems with the MVS linkage editor, since the END statement didn't specify the subroutine entry point. The third version was:
IEFBR14  START
         SR    15,15         Zero out register 15
         BR    14            Return address in R14 -- branch at it
         END   IEFBR14
At least now, the null program was functionally correct. However, dump analysis was impaired because the program didn't include its own name as an "eyecatcher"—a time-honored convention. Null program, mark four, was:
IEFBR14  START
         USING IEFBR14,15    Establish addressability
         B     GO            Skip over our name
         DC    AL1(L'ID)     Length of name
ID       DC    C'IEFBR14'    Name itself
         DS    0H            Force alignment
GO       SR    15,15         Zero out register 15
         BR    14            Return address in R14 -- branch at it
         END   IEFBR14
The next change had something esoteric to do with save-area chaining conventions—again, to keep dump analysis tools happy. Notice that the "null program" has tripled in size: both in the number of source lines and in the number of instructions executed!
About a year after the program was shipped, a bug was reported! "This program wasn't link-edited with the 'RENT' attribute (to make it re-entrant) and it won't go in the VS link pack area!"
This sad story illustrates that even the simplest program could have a bug in it, that all programs should be tested somehow, and that all programs need on-going maintenance and support.
Such stories lead us to examine every new program proposed, to inquire whether it can be simplified, whether it can be partitioned into simpler reusable pieces, and whether it is needed at all. We try to exploit the prior and continuing expenditures for information interchange and for digital content management by using much of the infrastructure they create—doing so without asking for changes in existing program offerings and minimizing the new programs we request.
Happily, we find that existing software provides most of what is needed for preservation.
Towards a Solution for the Technical Aspects
We believe that we know how to address the first two objectives identified by the EU-NSF Workgroup and that we can propose how to handle much simple content that is worth preserving in a way that can be extended to difficult types of digital entities. A by-product of the solution we will propose is that it will help minimize the human effort needed.
Part of this solution is encapsulation of content bitstreams and metadata in XML structures that include encrypted authenticity certificates. Another element is the use of a virtual computer to propagate information free of irrelevancies describing the computing environment and software that happened to be used to edit the content data. A series of articles is being written to provide specifics and identify work still needed.
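To make the encapsulation idea concrete, here is a minimal sketch under stated assumptions. The element names (`package`, `metadata`, `field`, `content`, `seal`) are hypothetical and not taken from any published schema. The seal is a keyed hash (HMAC-SHA256) with a shared key, standing in for the public-key signatures and certificates that a real design would presumably use.

```python
import base64
import hashlib
import hmac
import xml.etree.ElementTree as ET


def seal(content: bytes, metadata: dict, key: bytes) -> bytes:
    """Wrap a content bitstream and its metadata in XML and append a seal."""
    root = ET.Element("package")
    meta = ET.SubElement(root, "metadata")
    for name, value in sorted(metadata.items()):
        ET.SubElement(meta, "field", name=name).text = value
    encoded = base64.b64encode(content).decode("ascii")
    ET.SubElement(root, "content", encoding="base64").text = encoded
    body = ET.tostring(root)  # the bytes that the seal covers
    tag = hmac.new(key, body, hashlib.sha256).hexdigest()
    ET.SubElement(root, "seal", algorithm="HMAC-SHA256").text = tag
    return ET.tostring(root)


def verify(package: bytes, key: bytes) -> bool:
    """Recompute the seal over everything except the seal element itself."""
    root = ET.fromstring(package)
    seal_elem = root.find("seal")
    if seal_elem is None or seal_elem.text is None:
        return False
    root.remove(seal_elem)
    expected = hmac.new(key, ET.tostring(root), hashlib.sha256).hexdigest()
    return hmac.compare_digest(seal_elem.text, expected)


# Hypothetical example values.
package = seal(b"hello", {"title": "Note"}, b"demo-key")
```

Verification relies on re-serializing the parsed XML to the same bytes that were sealed, which holds for this simple structure but is fragile in general. A production design would use a canonical XML form before hashing.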
A preliminary version of the first article, Trustworthy 100-Year Digital Objects: Durable Encoding for When It’s Too Late to Ask, is planned for January 2003 release and will be available on request. It focuses on encoding the individual content bitstreams that will be embedded in the XML structures, and has the following abstract:
How can an author store digital information so that readers are likely to understand it as he intends, even years later when he is no longer available to answer any question?
Methods that depend on something that might work in the future are not good enough; data preserved today must be reliably interpretable whenever someone wants to use them. Prior proposals (called “migration” and “emulation”) fail because what they save is confounded with details about today’s information technology, details that are difficult to define, extract, and save completely and accurately. We present an alternative, doing so in a style intended for readers not steeped in computer programming.
Specifically, we discuss a method for creating content bitstreams that do not depend on information irrelevant to what their producers intend to communicate.[26] Its central idea is to specify and employ a simple “universal virtual computer” (UVC) that can handle any computation whatsoever. Today’s conservator would obtain or write a UVC program to interpret whatever information should be useful in the future. If her tests show that we have correctly specified every detail of the UVC and of the UVC programs, her descendants will surely be able to execute these programs on the computers of their time, and they would generate interpretations of the information saved.
This general solution might be more elaborate than needed to interpret ordinary text, image, audio, or video data. Sufficiently simple files can be preserved by encoding them in conformance with well-known standards. We project practical methods for files ranging from simple structures to those containing computer programs, treating simple cases here and deferring more complex cases and programs for future work.
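The virtual-computer idea in the abstract can be illustrated with a toy sketch. This is not the UVC specification: the instruction set (`read`, `emit`, `dec`, `jneg`, `jzero`, `jmp`, `halt`), the two-register machine, and the run-length format are all assumptions invented for this example. The point is only that the decoder is saved alongside the data as a program for a very small machine, so a future reader needs to reimplement the machine alone, not today's software.

```python
def run_vm(program, data):
    """Execute a toy program over input bytes and return the output bytes."""
    regs = [0, 0]   # two general-purpose registers
    pc = 0          # program counter
    pos = 0         # read position in the input bitstream
    out = bytearray()
    while True:
        op, *args = program[pc]
        pc += 1
        if op == "read":        # load next input byte, or -1 at end of input
            regs[args[0]] = data[pos] if pos < len(data) else -1
            pos += 1
        elif op == "emit":      # append a register's value to the output
            out.append(regs[args[0]])
        elif op == "dec":       # decrement a register
            regs[args[0]] -= 1
        elif op == "jneg":      # jump if the register is negative
            if regs[args[0]] < 0:
                pc = args[1]
        elif op == "jzero":     # jump if the register is zero
            if regs[args[0]] == 0:
                pc = args[1]
        elif op == "jmp":       # unconditional jump
            pc = args[0]
        elif op == "halt":
            return bytes(out)
        else:
            raise ValueError(f"unknown instruction: {op}")


# Decoder for a hypothetical run-length format: (count, value) byte pairs.
RLE_DECODER = [
    ("read", 0),      # 0: r0 = count
    ("jneg", 0, 7),   # 1: end of input, so halt
    ("read", 1),      # 2: r1 = value
    ("jzero", 0, 0),  # 3: this run is finished, read the next pair
    ("emit", 1),      # 4: output the value once
    ("dec", 0),       # 5: one fewer copy remaining
    ("jmp", 3),       # 6: loop
    ("halt",),        # 7
]

decoded = run_vm(RLE_DECODER, bytes([3, 65, 2, 66]))
```

In this sketch the archived object would be the pair of `RLE_DECODER` and the encoded bytes; `run_vm` is the only part that must be rewritten for each new generation of computers.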
The reader might be skeptical. Since attempts to solve the problem began several years ago, but have not yet been successful, we would agree that skepticism is reasonable. Peer review is conventional for resolving such doubt; we invite readers’ specific questions and criticisms.[27]
It would be naive to think that providing digital preservation is simply a technical challenge. Full success would hinge on agreements of the parties involved in creating new content and of archiving institutions for a range of technical, organizational, and social issues. However, such agreements cannot be achieved until supportive technical knowledge is available and broadly appreciated. The Trustworthy 100-Year Digital Objects series will try to sketch every essential technical component, leaving social, legal, and organizational aspects to other works.
Trustworthy … articles are planned to treat (1) encapsulation and cryptographic sealing of content with sufficient metadata to convert documents into evidentiary records; (2) the limits of automation for preservation as determined by the boundary between objective facts and subjective opinions; (3) conveying authentic meaning as intended by authors, and minimizing accidental aspects of messages; and (4) economics for digital content resources. An overview article will show how the solution components support each other. Absent any unexpected problem, the articles will appear at monthly intervals.
A top-down digital preservation architecture is also being prepared.[28] It treats preservation as incremental to readily available digital library software and to basic infrastructure, doing so in such a way that implementations will be modest extensions from whatever software and human procedures an institution might already have. That is, digital preservation will be upwards compatible from almost any digital library offering or digital content management product.
The top-down approach taken in the Trustworthy … series has strengths as well as some significant weaknesses. The strengths are the possibility of eventually discovering everything needed for the problem at hand and of helping identify opportunities that combine urgent applications and relatively inexpensive solutions. The weaknesses are that, by itself, this work does not provide a complete practical solution and that it risks not delving deeply enough to discover hidden problems that hinder a true and comprehensive solution.[29]
MIT recently announced plans to offer free access to its online curriculum content, an initiative which might set a world-wide trend, making a profound impact on scholarship and education.[30]
10 Choices Critical to the Internet's Success
We spend more energy jeering at our government than cheering at its successes. Scott Bradner of Harvard University provides a tidy opinion of governmental decisions that contributed to the immense success of the Internet.[31] We should all cheer!
RSA 64-Bit Encryption Broken for One Message
A team of more than 300,000 volunteers using spare time on home computers “broke” a 33-character secret message. The search took 4 years and the equivalent of 46,000 2 GHz machines.[32]
This suggests that, if processing power continues to double every 18 months and if no mathematical or quantum computing invention intervenes, 128-bit keys will be good enough for most preservation authenticity signatures for another century.
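The arithmetic behind this estimate can be sketched as follows, assuming that exhaustive key search is the only attack, that each additional key bit doubles the search effort, and that each doubling of processing power offsets exactly one bit. The function name is hypothetical.

```python
def years_until_equivalent(extra_bits: int, doubling_months: float = 18.0) -> float:
    """Years of hardware doublings needed to offset `extra_bits` of key length."""
    return extra_bits * doubling_months / 12.0


# Going from a 64-bit key to a 128-bit key adds 64 bits of search effort.
extra_bits = 128 - 64
years = years_until_equivalent(extra_bits)  # 64 doublings at 18 months each
```

Under these assumptions, about 96 years must pass before searching a 128-bit key space costs what the 64-bit search cost in 2002, which is consistent with the estimate of roughly a century.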
E-Government Act of 2002 (HR 2458)
On November 15, the Senate passed H.R. 2458, the E-Government Act of 2002 and cleared it for the President. This legislation establishes a broad framework that requires using Internet-based information technology to enhance citizens’ access to Government information.[33]
NISO Guide to Standards for Library Systems
The National Information Standards Organization (NISO) has published The RFP Writer's Guide to Standards for Library Systems, a manual intended to aid in library procurement of software. For free download, see http://www.niso.org/standards/resources/RFP_Writers_Guide.pdf.