Digital Document Quarterly

Perspectives on Trustworthy Information

Volume 1, Number 2, 1Q2002

 

©  2002, H.M. Gladney

HMG Consulting

20044 Glen Brae Drive

Saratoga, CA 95070

(408) 867-5454

The current issue continues the DDQ 1(1) focus on preservation of digital documents, a topic discussed in an increasing number of reports, lectures, and workshops.  Most of these merely convey the existence and nature of the challenge; a smaller set focuses on remedies based on collection management and institutional procedures that probably need to change. 

In contrast, DDQ emphasizes technical measures centered on representation of individual documents,[1] responding to basic interests of consumers—that they can prudently decide whether or not information received is sufficiently trustworthy for how it is to be used.[2]

We believe that:

     Sufficient fundamentals are known for making any digital document survive forever in useful form.  However, considerable engineering effort will be needed to provide practical and economical prescriptions for repository managers and end users.[3]

     Basing institutional practices on the solution for individual documents will achieve economies not inherent in the current research library consensus direction. 

     Achieving trust by audit trails built into digital documents will be surer and cheaper than seeking this objective by defining institutional procedures certified by outside committees.

     Current preservation research insufficiently emphasizes well-known engineering practices that include forestalling potential failures, objective assertions about system properties and document authenticity, and technology that helps any end user test such assertions.

     Current uncertainties and confusions can be reduced by clear distinctions between assertions that are facts and those that are opinions.  Early 20th-century philosophy teaches the extent to which such distinctions are possible and the pertinent limits of language.  To proceed without this foundation is hazardous.[4]

The reader might correctly expect careful exposition of these issues to be both lengthy and tedious, because the topic at hand is preserving forever whatever kinds and amounts of information people might choose.  Doing so reliably requires that we anticipate and forestall every likely failure.[5]  However, while the engineering needed for a comprehensive and sound solution is likely to seem slow and expensive, the importance and applicable breadth of reliable solutions more than justify the anticipated effort.  In fact, accomplishing quality design before investing heavily in preservation will pay for itself many times over.[6] 

DDQ avoids burdening any reader with more detail than (s)he cares for by a top-down, divide-and-conquer exposition that parallels common engineering practice for large artifact design.  DDQ 1(2) enlarges on this.

Recent Reports

Acquiring a clear and comprehensive grasp of digital preservation can be time-consuming.  The literature is voluminous and highly redundant,[7] but it must be inspected if one wants to catch everything new and significant.  However, a few recent reports, considered together, convey the state of the art and also articulate the dominant opinion of how to proceed. 

An NSF scientific computing proposal is on a different topic, but requires critical attention.

(1) U.S. National Strategy for Digital Preservation

The National Digital Information Infrastructure and Preservation Program was established by Congress to "develop a national strategy to collect, archive, and preserve the burgeoning amounts of digital content, especially materials that are created only in digital formats, for current and future generations."  As an initial step, the Library of Congress scheduled a series of meetings with groups from the technology, business, entertainment, academic, legal, archival, and library communities.  Background papers were commissioned to give "environmental scans."  These papers, collected as Building a National Strategy for Digital Preservation: Issues in Digital Media Archiving [CLIR/LC], are available on the new NDIIPP web site. 

(2) U.K. Cedars Final Report

The CEDARS project (1998-2002 at Leeds, Oxford, and Cambridge Universities) aimed to provide best-practice guidance for long-term digital preservation.  It concluded in May with the following guides:

       The Cedars Guide to Intellectual Property Rights

       The Cedars Guide to Preservation Metadata

       The Cedars Guide to Digital Collection Management

       The Cedars Guide to Digital Preservation Strategies

       The Cedars Guide to the Distributed Archiving Prototype

These reports collect British opinion on digital preservation, but contain little that is not better expressed elsewhere.

(3) Glasgow Effective Records Management

[Currall] teaches aspects of digital repository service in ways that are ready to be applied, and can therefore be recommended.  This report discusses:

       identifying the roles and responsibilities of records creators;

       creating, using, disseminating, and eradicating digital records;

       training users in digital record management practice;

       improving retrieval speed for documents and elements within them;

       increasing the accuracy of locating items within a document and of choosing individual documents from a collection; and

       reducing organizational risk from unmanaged records.

(4) Prototype Digital Archive Technology at SDSC

The U.S. National Archives and Records Administration has for about two years sponsored prototype archival storage management development at the San Diego Supercomputer Center.  There’s enough meat in this that I have not yet absorbed its teachings and implications.  [Ludasch] will interest the technically-inclined reader.

(5) “Trusted Digital Repositories …”

DDQ 1(1) commented critically [HMG2] on the mid-2001 draft of a Trusted Digital Repositories report [RLG], partly because its reliance on “Trusted Computing Base (TCB)” ideas would lead to expensive overkill.  The final 68-page report [RLG2] corrected this, but did not respond to other problems communicated in October:

     Certification of archive procedures[8] is inherently more expensive and less reliable than basing an archive on stored objects that individually embed their own audit trails;

     Technology cost trends favor data replication and other storage design elements not considered, such as not requiring the centers of trust to be the only repositories; and

     [RLG] did not consider how many documents might deserve archival preservation—a number potentially so large that procedures proposed might be unaffordable.

The implications of the word “trusted” are taken up below.  Beyond that, since the issues raised by [RLG 2] are too tricky for my analysis to be complete, I’ve posted a slightly edited version of the October critique. The interested reader might think through all this himself while we consider how to continue.[9]

(6) NSF Scientific Computing Proposal

In April, an NSF Advisory Panel released Revolutionizing Science and Engineering through Cyberinfrastructure.  This first draft has flaws, such as insufficiently identifying what will be accomplished with the $650M it calls for.[10]  Even though the panel requested comments by 1st May, it is likely to address criticism received later.  In view of the amount sought from taxpayers, you may wish to inspect the proposal and comment.

(7) Promising Items Just Received

Towards a Continuing Access and Digital Preservation Strategy … is a U.K. call for comment. 

By e-mail I received a 2001 NASA ESDIS Data Center Best Practices and Benchmark Report. 

OAI has released the Protocol for Metadata Harvesting v.2.

What is the Information Revolution?

DDQ 1(1) considered When Was the Information Revolution?  We might further wonder what changes have happened, are happening, or will happen, and how they should influence decisions for digital archives.  Four years ago, [Neal] ventured an academic librarian’s perspectives; shortened, reorganized, and adapted for the passage of time and a somewhat different viewpoint, a subset of his 25 cataclysmic changes in information exploitation is:

·         Individualized expanding power to access, communicate, and analyze information.

·         Office tools automating the mechanical aspects of writing and enabling self-publication.

·         Knowledge management technology making data gathering and management valuable.

·         Hypertext reducing the tedious drudgery of following bibliographic references.

·         Virtual reality enabling experiential entertainment, education, and engineering.

·         Digital storage enabling storage and search of vast multimedia resources.

·         Networks expanding personal communication and information sharing.

·         Push transmission enabling massive customized communication to select audiences.

·         Cellular communication freeing us from constraints of location.

·         Intellectual property made valuable and more at risk by easy copying and distribution.

·         Security and encryption providing us all with tools to control information access.

·         Lowered barriers easing market entry and experimentation for innovative services.

·         Self-service (such as ATM banking) for enterprise efficiency and customer convenience.

·         Outsourcing permitting us all to shed routine tasks in order to focus on core competencies.

·         Partnerships with more cooperation and sharing as essential for success and efficiency.

·         Concentration in which organizations absorb each other seeking market advantage.

·         Student population shifts to include social groups that have not been much involved.

·         Global awareness with rapid communication internationalizing all aspects of life.

·         Political and social volatility enhanced by easier, faster information use and misuse.

These factors are stimulating organizations to plan new services.  Enterprise survival is often an issue.

The list reflects the writings, talks and informal conversation of librarians and scholars.  Is it in any sense complete?  Probably not.  For instance, it does not hint at induced employment shifts.

If any such perception[11] will be used to guide policy and action, it merits careful investigation to estimate its economic significance and influence on decisions, e.g., for managing libraries [Wolf].  We have too many opportunities and not enough skills and resources.  We need to decide which designs have greatest leverage and will most please users—both end users and service personnel.  We need to make choices and would do this better with more quantitative estimates than have been published.

While quantitative estimates are needed, even the qualitative nature of Neal’s list merits more inquiry.  For instance, for the information marketplace, Neal wrote, “The information as commodity revolution is increasingly viewing data and its synthesized products, knowledge, as articles of commerce and sources of profit rather than property held in common for societal good.”  This can be read as a social critique demanding remedy, and some people do so with evident emotion.  However, this interpretation should be balanced by recognition that the information business is the livelihood of an increasing fraction of the population and that the information easily available without payment is also increasing rapidly.[12] 

Fundamentals

“A good archival system will avoid value judgments whenever possible and will publicly expose its mechanisms, leaving the remainder as value claims that the eventual user of each stored datum needs to accept or reject according to her own values and needs.” [DDQ 1(1)]  We are seeking an archival document representation scheme that can express everything that admits unambiguous, objective, and testable representation—and nothing more, leaving to other language all values and judgments that must forever remain debatable.[13]  Whether or not a document will be trusted is such a value judgment.

Testable is critical.  Every important property of each archived document and of each archiving system component must be explicitly specified (asserted), and the archived data must include whatever is needed to test these assertions.[14]  These assertions and test procedures—what certified public accountants call audit trails and audit tools—must be firmly bound to each document.  The assertions will nearly always need to identify “whodunit”, and the “whodunit” information will itself be documents or document portions that require their own audit trails.  This collection of documents and procedures is what we mean by external evidence of authenticity and provenance, and in combination with internal evidence[15] is what can make the document trustworthy.  An eventual user will trust a document if such associated evidence is accessible, reliable, and sufficient for the use that (s)he intends to make of the document content.

This might seem a prescription for endless tests that themselves must be tested.  How can we avoid recursive explosion—a chain reaction that blows up in our faces?  We can share each individual fact and test among many objects and can end each recursion with facts that are widely known and trusted.[16]  The technical mechanisms will use identifiers, references, pointers, links, and XML namespaces, and the document representations will be isomorphic to directed acyclic graphs rather than merely trees.[17]  DDQ starts to describe the technology below.
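The idea of sharing facts among many objects and ending each recursion at widely trusted facts can be illustrated with a small sketch.  It is not DDQ's design: the node names, the `verify` function, and the stand-in test functions are all hypothetical, and the graph is assumed to be acyclic (no cycle detection is attempted).  Each piece of evidence has its own test and a list of further evidence it depends on; results are cached so that evidence shared by several documents is tested only once.

```python
# Hypothetical evidence graph: identifier -> (own test, identifiers it depends on).
# The lambdas stand in for real checks such as signature or format validation.
graph = {
    "doc-A": (lambda: True, ["signer-key", "format-spec"]),
    "doc-B": (lambda: True, ["signer-key"]),
    "signer-key": (lambda: True, ["root-authority"]),
    "format-spec": (lambda: True, ["root-authority"]),
}

# Facts assumed to be widely known and trusted; recursion stops here.
trusted = {"root-authority"}


def verify(ident, graph, trusted, cache, tested):
    """Return True if `ident` and all evidence it depends on pass their tests."""
    if ident in trusted:
        return True               # end of a recursion chain
    if ident in cache:
        return cache[ident]       # shared evidence is not tested again
    own_test, depends_on = graph[ident]
    tested.append(ident)          # record which tests were actually run
    ok = own_test() and all(
        verify(dep, graph, trusted, cache, tested) for dep in depends_on
    )
    cache[ident] = ok
    return ok


cache = {}
tested = []
result_a = verify("doc-A", graph, trusted, cache, tested)
result_b = verify("doc-B", graph, trusted, cache, tested)
```

Because "signer-key" is referenced by both documents, the structure is a directed acyclic graph rather than a tree, and the cache ensures that its test runs once even though two documents rely on it.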

How Can We Use Wittgenstein’s Philosophy?

Who is to decide what evidence is important and what tests are worth applying?  That has to be the end user—the person for whom the document in question is preserved.  This will invariably be an economic decision—a weighing of the cost of evidence testing against the costs of a wrong decision.  A role of technology is to minimize these costs.

For this we want the soundest possible foundation and believe we can do no better than a basis in Wittgenstein’s work.  He provided a nexus in the theory of language and logic (and therefore computing), arguably completing inquiries by Bertrand Russell, Immanuel Kant, Ernst Mach and many lesser-known central European philosophers, and setting the stage for later work by Rudolf Carnap, Kurt Gödel, Alan Turing, and John von Neumann.  The famed Wiener Kreis analyzed the Tractatus Logico-Philosophicus (TLP) twice, and later investigation rarely reached backward earlier than TLP.

We might want a concise and complete prescription for communicating meaning, but trying to compress into a few words what hundreds of philosophical writings have not completed would be foolish.  Nevertheless, a focus on Wittgenstein’s notion of a rule may help towards understanding how his teachings apply to digital preservation methodology.[18]  We cannot convincingly say why Wittgenstein’s teachings are the best available ground, but we can show what use we make of them.  To evaluate DDQ analyses, the reader should look at how they use thinking taught by Wittgenstein.  Specifically:

     We are sensitized to the limits of language and the consequences of misunderstanding communications expressed in words.  In particular, potential misunderstandings between different professions figure prominently.

     We look at how authors use key words, and try to decide whether each sentence is a statement of a fact or an opinion.

     Such distinctions help us stay within the engineering role, avoiding inadvertently infecting technical solutions with value judgments that clients might find inappropriate.

     The limits of what can be automated or specified as clerical tasks are rules that expand into a finite number of steps for any particular objective.

     Wittgenstein’s discussions of language suggest that we cannot articulate meaning any better than is accomplished by current ontologies (reference models) and the RDF language.

The critical distinction is perhaps best expressed by Paul Engelmann’s metaphor:

"Positivism holds—and this is its essence—that what we can speak about is all that matters in life, whereas Wittgenstein passionately believes that all that really matters in human life is precisely what, in his view, we must be silent about.  When he nevertheless takes immense pains to delimit the unimportant [i.e., the scope and limits of ordinary language], it is not the coastline of that island which he is bent on surveying with such meticulous accuracy, but the boundary of the ocean."    [Janik, p.191]

Although we cannot define the boundary in words, we can illustrate it by tabulating word pairs:

Island (terra firma) | Ocean | Comments
objective | subjective | We can relate any subjective assertion to some objective one, e.g., “DDQ writes that the digital preservation literature is highly redundant.”
Natural Philosophy | Ethics | Until academic topics became highly differentiated early in the 20th century, these were the major categories of learning.
physics, mathematics, logic | metaphysics | The Greek meta- means beyond.  I.e., metaphysics is the part of philosophy that deals with value judgments.
facts | Values and opinions | Unfortunately, value has many fundamentally different meanings; here the sense of “value judgments” is intended
die Darstellung → representation | die Vorstellung → idea | English translations of Wittgenstein’s die Darstellung are sometimes misleading.
evidence | opinion | evidence comes from the Latin evidens (visible, clear, evident); also from ex + vident (according to + what they see)
 |  | Many more word pairs might appear here.
form or syntax | meaning or semantics | In modern markup languages, XML deals with form and structure, RDF with meaning.
action | intention | [LW 39, II] treats “intention” carefully.
intensional | extensional[19] | Sets and sequences are communicated either intensionally, “(3n+1) where n is a natural number”, or extensionally, “1, 4, 7, 10, …”.
“Prove it!” | “Let your light so shine before men, that they may see your good works, …”[20] | Wittgenstein’s view about what cannot be said, but must be shown, is foreshadowed in the Bible.
 |  | Still more word pairs might appear here.
rules, procedures | intuitions | Whether a man “knows or not is simply a question of whether he does it as we taught him; it’s not a question of intuition”. [LW 39, II]
possibly computable | surely not computable | The technical word in mathematics and computer science is decidable rather than computable.
trustworthy | trusted | The creator of an object can make it trustworthy, but only its end user can judge whether or not it is to be trusted.

Imagine having to write an essay comparing the value of Shakespeare’s Hamlet to that of Lucasfilm’s Star Wars.  Your statements might oscillate from side to side of the boundary.

Originators, editors, and librarians can make a preservation document trustworthy (deserving of trust) by binding metadata and test procedures to it reliably, i.e., so that the binding is firm over time and not itself susceptible to undetectable modification.  Whether or not the document will be trusted by its eventual user will depend both on how well those who prepare it choose and bind the metadata and tests and also on the user’s judgment.
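One way to picture a binding that is "not itself susceptible to undetectable modification" is a keyed seal computed over both the metadata and a digest of the content.  The sketch below is only an illustration under stated assumptions: the function names `bind_metadata` and `check_binding`, the record layout, and the demonstration key are hypothetical, and an HMAC with a shared secret stands in for the public-key signatures a real archive would more plausibly use so that any end user could run the test.

```python
import hashlib
import hmac
import json


def bind_metadata(content: bytes, metadata: dict, key: bytes) -> dict:
    """Bind metadata to content by sealing the metadata plus a content digest."""
    record = {
        "metadata": metadata,
        "content_sha256": hashlib.sha256(content).hexdigest(),
    }
    # Canonical serialization so the same record always yields the same seal.
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    seal = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {"record": record, "seal": seal}


def check_binding(content: bytes, package: dict, key: bytes) -> bool:
    """Test the assertion that content and metadata are unchanged since binding."""
    record = package["record"]
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    seal_ok = hmac.compare_digest(expected, package["seal"])
    content_ok = hashlib.sha256(content).hexdigest() == record["content_sha256"]
    return seal_ok and content_ok


key = b"demo-key-not-for-real-use"          # assumption: illustrative key only
document = b"Minutes of a 2002 meeting"
package = bind_metadata(document, {"creator": "example archive"}, key)

intact = check_binding(document, package, key)
content_changed = check_binding(document + b"!", package, key)

# Alter the metadata while keeping the old seal.
forged = {
    "record": {**package["record"], "metadata": {"creator": "someone else"}},
    "seal": package["seal"],
}
metadata_changed = check_binding(document, forged, key)
```

The check makes the document trustworthy in the sense used here: it gives the eventual user something to test.  Whether a passing test is sufficient for the document to be trusted remains that user's judgment.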

Trust, Trusted, Trustworthy 

The very naming of [RLG 2], “Trusted Digital Repositories”, is troublesome because it sets a trap for the unwary—a sort of false advertising.  “Trusted” should not be used when what is meant is “trustworthy”. 

We might have passed silently over this point were it not for pervasive confounding of objective and subjective aspects of digital preservation.  Trust is of fundamental importance in document delivery services, not only for scholarly work [Lynch], but also for business transactions that will grow to include pharmaceutical development records and perhaps even personal medical records.[21]  Some business applications are tempting fraud opportunities.

The [RLG2] authors did not invent sloppy use of trusted.  The term appeared in 1975 in the Trusted Computing Base (TCB)—a processor kernel whose behavior could not be improperly altered by any executing process—intended for critical control and cryptographic services.[22]  The designers called the kernel trusted because they also built the trusting entity—an operating system that needed the TCB to meet defense security criteria.  This operating system was itself protected by the kernel against modification, and was certified as secure by teams other than its implementers. [NCSC2]  To meet its objectives, the secure operating system needs to trust the kernel; i.e., we can say the TCB is trusted by the operating system.

In contrast, the relationship between a digital repository and its users cannot be engineered to ensure that the repository is trusted by its users—only that it might deserve their trust.

By 1990 people had apparently forgotten why trusted made sense only within systems that included the user as a controlled component.  Xerox used the terminology for network printers that were supposed to enforce rules for sensitive documents.  As far as I know, these machines were flawed and failed in the marketplace, perhaps because the intended customers understood the problem.  But the misuse of trusted persisted and was amplified in [Stefik] and related works; Stefik evaded an invitation to respond to criticism raised in 1997 [HMG3].

“Trusted Systems” are not necessarily trustworthy, and trustworthy systems are not necessarily trusted by their intended users.  Misleading language impedes achieving the trust a repository needs to be effective.  An eloquent argument for correct English usage in this is:

`I don't know what you mean by "glory,"' Alice said.

Humpty Dumpty smiled contemptuously. `Of course you don't—till I tell you.  I meant "there's a nice knock-down argument for you!"'

`But "glory" doesn't mean "a nice knock-down argument,"' Alice objected.

`When I use a word,' Humpty Dumpty said in rather a scornful tone, `it means just what I choose it to mean—neither more nor less.'

`The question is,' said Alice, `whether you can make words mean so many different things.'

`The question is,' said Humpty Dumpty, `which is to be master—that's all.'

     Lewis Carroll, [Carroll, p.213]

Lewis Carroll was fully aware of the profundity in Humpty Dumpty's whimsical discourse on semantics.  … the point of view known in the Middle Ages as nominalism; the view that universal terms do not refer to objective existences but are [merely] … verbal utterances.  [This] view was skillfully defended by William of Occam and is now held by almost all contemporary logical empiricists.

Even in logic and mathematics, where terms are usually more precise than in other [disciplines], enormous confusion often results from a failure to realize that words mean "neither more nor less" than what they are intended to mean.   

On the other hand, if we wish to communicate accurately, we are under a kind of moral obligation to avoid Humpty's practice of giving private meanings to commonly used words.

     Martin Gardner, [Carroll, p.213, note 11].

Digital Document Preservation

Like the [RLG 2] authors, our long-term objective is “to reach consensus on the characteristics and responsibilities of … digital repositories for large-scale, heterogeneous collections held by cultural organizations.”  We believe that everything essential to trustworthy and durably useful preservation of decidable[23] information is known in principle.  The rest of the work needed is engineering reduction to economical practice.

Digital document preservation overlaps only partially with the collection management thinking that is the focus of [OAIS] and of writings that depend on [OAIS] (Figure 1).

Figure 1: Digital document preservation and cultural collection management ([CLIR/LC], [RLG 2]) are topics that overlap only incompletely. [24]

Any document or blob can be represented by a bit-stream—a sequence of 0’s and 1’s.  Computer programs are specialized documents.  Collection preservation will be achieved if we:

·         Save the bits so that somewhere a copy survives and that copy can be found.

·         Ensure that the bits can be interpreted.

·         Make the bits trustworthy by reliably associating sufficient metadata.

·         Include library content lists among the set of saved documents.

If these requirements are met, digital libraries can be constructed or reconstructed.
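The first requirement, that somewhere a copy survives and can be found, can be sketched as a search across replicas for one whose digest still matches the value recorded when the document was archived.  This is a minimal illustration, not a prescribed mechanism: the site names, the `first_intact_copy` function, and the use of SHA-256 as the recorded digest are assumptions made for the example.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Digest recorded at archiving time and recomputed at retrieval time."""
    return hashlib.sha256(data).hexdigest()


def first_intact_copy(replicas: dict, expected_digest: str):
    """Return (site, data) for the first replica matching the digest, else None.

    `replicas` maps a site name to the bytes held there, or to None if the
    copy at that site has been lost.
    """
    for site, data in replicas.items():
        if data is not None and sha256_hex(data) == expected_digest:
            return site, data
    return None


original = b"\x00\x01 example bit-stream"
catalog_digest = sha256_hex(original)       # kept with the library content list

replicas = {
    "site-a": None,                         # copy lost
    "site-b": original[:-1] + b"X",         # copy damaged
    "site-c": original,                     # copy intact
}

found = first_intact_copy(replicas, catalog_digest)
none_left = first_intact_copy({"site-a": None}, catalog_digest)
```

Here one intact replica out of three is enough to recover the bits.  Interpreting them and judging their trustworthiness are the separate requirements listed above.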

For simple document types, extra work to ensure long-term interpretability is not essential, merely cost-effective.  Document collections can contain sufficient redundancy for digital archeology (rescuing content from obsolete technology) when the content is wanted.  Choosing to prepare a document for retention is an economic decision that depends on the expected number of retrievals and on how much one is willing to spend on behalf of unknown future beneficiaries.  For archivists, digital archeology is almost a “do nothing” tactic that leaves most of the work to whoever is interested in each saved document.

Computer program preservation cannot depend on digital archeology because programs rarely contain sufficient redundancy. 

We believe that progressive dissection of the above solution components will expose no unsolved problem.  However, we will not be sure of this until dissection is sufficiently advanced, and we will not be able to persuade everyone we would like to persuade.  The question is whether the technology for every kind of information is covered without further invention. 

Absent an orderly dissection and analysis to sufficient depth, mistakes or shortfalls in document schema or handling procedures might not be discovered until the damage they cause can no longer be corrected, risking information loss.  Due care would include written reliability and security analysis of schemas and programs—analysis more careful than is common.[25]  Ideally, every risk would be examined and minimized.  Complete analysis is unlikely.  Even so, careful analysis will be tedious and at times arcane.

“Top Down” Engineering

The circumstances as well as the technology lend themselves to top-down analysis.

     The topic is large, embracing all kinds of information; some kinds call for prompt treatment, and new kinds are likely to be designed.

     Designing well for each kind of information before its instances are archived will minimize overall costs, i.e., up-front design is cost-effective for designs that are used.

     Different communities will be interested in different branches of the analysis, and in different depths; almost nobody will want to examine everything at the same time.