Digital Document Quarterly: Perspectives on Trustworthy Information
Volume 1, Number 2, 1Q2002
© 2002, H.M. Gladney
HMG Consulting, 20044 Glen Brae Drive, Saratoga, CA 95070, (408) 867-5454
In contrast, DDQ emphasizes technical measures centered on representation of individual documents,[1] responding to basic interests of consumers—that they can prudently decide whether or not information received is sufficiently trustworthy for how it is to be used.[2]
We believe that:
· Sufficient fundamentals are known for making any digital document survive forever in useful form. However, considerable engineering effort will be needed to provide practical and economical prescriptions for repository managers and end users.[3]
· Basing institutional practices on the solution for individual documents will achieve economies not inherent in the current research library consensus direction.
· Achieving trust by audit trails built into digital documents will be surer and cheaper than seeking this objective by defining institutional procedures certified by outside committees.
· Current preservation research insufficiently emphasizes well-known engineering practices that include forestalling potential failures, objective assertions about system properties and document authenticity, and technology that helps any end user test such assertions.
· Current uncertainties and confusions can be reduced by clear distinctions between assertions that are facts and those that are opinions. Early 20th-century philosophy teaches the extent to which such distinctions are possible and the pertinent limits of language. To proceed without this foundation is hazardous.[4]
The reader might correctly expect careful exposition of these issues to be both lengthy and tedious, because the topic at hand is preserving forever whatever kinds and amounts of information people might choose. Doing so reliably requires that we anticipate and forestall every likely failure.[5] However, while the engineering needed for a comprehensive and sound solution is likely to seem slow and expensive, the importance and applicable breadth of reliable solutions more than justify the anticipated effort. In fact, accomplishing quality design before investing heavily in preservation will pay for itself many times over.[6]
DDQ avoids burdening any reader with more detail than (s)he cares for by a top-down, divide-and-conquer exposition that parallels common engineering practice for large artifact design. DDQ 1(2) enlarges on this.
Acquiring a clear and comprehensive grasp of digital preservation can be time-consuming. The literature is voluminous and highly redundant,[7] but it must be inspected if one wants to catch everything new and significant. However, a few recent reports, considered together, convey the state of the art and also articulate the dominant opinion of how to proceed.
An NSF scientific computing proposal is on a different topic, but requires critical attention.
(1) U.S. National Strategy for Digital Preservation
The National Digital Information Infrastructure and Preservation Program was established by Congress to "develop a national strategy to collect, archive, and preserve the burgeoning amounts of digital content, especially materials that are created only in digital formats, for current and future generations." As an initial step, the Library of Congress scheduled a series of meetings with groups from the technology, business, entertainment, academic, legal, archival, and library communities. Background papers were commissioned to give "environmental scans." These papers, collected as Building a National Strategy for Digital Preservation: Issues in Digital Media Archiving [CLIR/LC], are available on the new NDIIPP web site.
The CEDARS project (1998-2002 at Leeds, Oxford, and Cambridge Universities) aimed to provide best-practice guidance for long-term digital preservation. It concluded in May by publishing:
· The Cedars Guide to Intellectual Property Rights
· The Cedars Guide to Preservation Metadata
· The Cedars Guide to Digital Collection Management
· The Cedars Guide to Digital Preservation Strategies
· The Cedars Guide to the Distributed Archiving Prototype
These reports collect British opinion on digital preservation, but contain little that is not better expressed elsewhere.
(3) Glasgow Effective Records Management
[Currall] teaches aspects of digital repository service in ways that are ready to be applied, and can therefore be recommended. This report discusses:
· identifying the roles and responsibilities of records creators;
· creating, using, disseminating, and eradicating digital records;
· training users in digital record management practice;
· improving retrieval speed for documents and elements within them;
· increasing accuracy for items within a document and individual document choice from a collection; and
· reducing organizational risk from unmanaged records.
(4) Prototype Digital Archive Technology at SDSC
The U.S. National Archives and Records Administration has for about two years sponsored prototype archival storage management development at the San Diego Supercomputer Center. There’s enough meat in this that I have not yet absorbed its teachings and implications. [Ludasch] will interest the technically-inclined reader.
(5) “Trusted Digital Repositories …”
DDQ 1(1) commented critically [HMG2] on the mid-2001 draft of a Trusted Digital Repositories report [RLG], partly because its reliance on “Trusted Computing Base (TCB)” ideas would lead to expensive overkill. The final 68-page report [RLG2] corrected this, but did not respond to other problems communicated in October:
· Certification of archive procedures[8] is inherently more expensive and less reliable than basing an archive on stored objects that individually embed their own audit trails;
· Technology cost trends favor data replication and other storage design elements not considered, such as not requiring the centers of trust to be the only repositories; and
· [RLG] did not consider how many documents might deserve archival preservation—a number potentially so large that procedures proposed might be unaffordable.
The implications of the word “trusted” are taken up below. Beyond that, since the issues raised by [RLG 2] are too tricky for my analysis to be complete, I’ve posted a slightly edited version of the October critique. The interested reader might think through all this himself while we consider how to continue.[9]
(6) NSF Scientific Computing Proposal
In April, an NSF Advisory Panel released Revolutionizing Science and Engineering through Cyberinfrastructure. This first draft has flaws, such as insufficiently identifying what will be accomplished with the $650M it calls for.[10] Even though the panel requested comments by 1st May, it is likely to address criticism received later. In view of the amount sought from taxpayers, you may wish to inspect the proposal and comment.
(7) Promising Items Just Received
Towards a Continuing Access and Digital Preservation Strategy … is a U.K. call for comment.
By e-mail I received a 2001 NASA ESDIS Data Center Best Practices and Benchmark Report.
OAI has released the Protocol for Metadata Harvesting v.2.
DDQ 1(1) considered When Was the Information Revolution? We might further wonder what changes have happened, are happening, or will happen, and how they should influence decisions for digital archives. Four years ago, [Neal] ventured an academic librarian’s perspectives; shortened, reorganized, and adapted for the passage of time and a somewhat different viewpoint, a subset of his 25 cataclysmic changes in information exploitation is:
· Individualized expanding power to access, communicate, and analyze information.
· Office tools automating the mechanical aspects of writing and enabling self-publication.
· Knowledge management technology making data gathering and management valuable.
· Hypertext reducing the tedious drudgery of following bibliographic references.
· Virtual reality enabling experiential entertainment, education, and engineering.
· Digital storage enabling storage and search of vast multimedia resources.
· Networks expanding personal communication and information sharing.
· Push transmission enabling massive customized communication to select audiences.
· Cellular communication freeing us from constraints of location.
· Intellectual property made valuable and more at risk by easy copying and distribution.
· Security and encryption providing us all with tools to control information access.
· Lowered barriers easing market entry and experimentation for innovative services.
· Self-service (such as ATM banking) for enterprise efficiency and customer convenience.
· Outsourcing permitting us all to shed routine tasks in order to focus on core competencies.
· Partnerships with more cooperation and sharing as essential for success and efficiency.
· Concentration in which organizations absorb each other seeking market advantage.
· Student population shifts to include social groups that have not been much involved.
· Global awareness with rapid communication internationalizing all aspects of life.
· Political and social volatility enhanced by easier, faster information use and misuse.
These factors are
stimulating organizations to plan new services. Enterprise survival is often an issue.
The list reflects the writings, talks and informal conversation of librarians and scholars. Is it in any sense complete? Probably not. For instance, it does not hint at induced employment shifts.
If any such perception[11] will be used to guide policy and action, it merits careful investigation to estimate its economic significance and influence on decisions, e.g., for managing libraries [Wolf]. We have too many opportunities and not enough skills and resources. We need to decide which designs have greatest leverage and will most please users—both end users and service personnel. We need to make choices and would do this better with more quantitative estimates than have been published.
While quantitative estimates are needed, even the
qualitative nature of Neal’s list merits more inquiry. For instance, for the Information
marketplace, Neal wrote, “The information
as commodity revolution is increasingly viewing data and its synthesized
products, knowledge, as articles of commerce and sources of profit rather than
property held in common for societal good.”
This can be read as a social critique demanding remedy—and some people
do so with evident emotion. However,
this interpretation should be balanced by recognition that the information
business is the livelihood of an increasing population fraction and also that
the information easily available without payment is also increasing rapidly.[12]
“A good archival system will avoid value judgments whenever possible and will publicly expose its mechanisms, leaving the remainder as value claims that the eventual user of each stored datum needs to accept or reject according to her own values and needs.” [DDQ 1(1)] We are seeking an archival document representation scheme that can express everything that admits unambiguous, objective, and testable representation—and nothing more, leaving to other language all values and judgments that must forever remain debatable.[13] Whether or not a document will be trusted is such a value judgment.
Testable is critical. Every important property of each archived document and of each archiving system component must be explicitly specified (asserted), and the archived data must include whatever is needed to test these assertions.[14] These assertions and test procedures—what certified public accountants call audit trails and audit tools—must be firmly bound to each document. The assertions will nearly always need to identify “whodunit”, and the “whodunit” information will itself be documents or document portions that require their own audit trails. This collection of documents and procedures is what we mean by external evidence of authenticity and provenance, and in combination with internal evidence[15] is what can make the document trustworthy. An eventual user will trust a document if such associated evidence is accessible, reliable, and sufficient for the use that (s)he intends to make of the document content.
This might seem a prescription for endless tests that themselves must be tested. How can we avoid recursive explosion—a chain reaction that blows up in our faces? We can share each individual fact and test among many objects and can end each recursion with facts that are widely known and trusted.[16] The technical mechanisms will use identifiers, references, pointers, links, and XML namespaces, and the document representations will be isomorphic to directed acyclic graphs rather than merely trees.[17] DDQ starts to describe the technology below.
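The sharing and termination argument above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Assertion` class, the `verify` function, and identifiers such as `key:A` and `ca:root` are hypothetical names, not part of any DDQ design, and the actual test attached to each assertion (for example a signature check) is omitted. The point is that two documents may reference the same piece of evidence, that the shared evidence is evaluated once, and that each chain stops at a fact already trusted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    """A claim about a document plus the identifiers of the evidence behind it."""
    ident: str            # hypothetical identifier, e.g. a URI or namespaced name
    claim: str
    supports: tuple = ()  # identifiers of the assertions this one depends on


def verify(ident, assertions, trusted_roots, cache):
    """Return True if every chain of evidence from `ident` ends at a trusted root."""
    if ident in trusted_roots:       # recursion ends at widely trusted facts
        return True
    if ident in cache:               # shared evidence is evaluated only once
        return cache[ident]
    node = assertions.get(ident)
    if node is None or not node.supports:
        cache[ident] = False         # unknown or unsupported claim
        return False
    cache[ident] = False             # provisional: an accidental cycle fails closed
    cache[ident] = all(
        verify(s, assertions, trusted_roots, cache) for s in node.supports
    )
    return cache[ident]


# Two documents share one piece of evidence, so the structure is a DAG, not a tree.
ASSERTIONS = {
    "doc1:signed": Assertion("doc1:signed", "doc1 was signed by A", ("key:A",)),
    "doc2:signed": Assertion("doc2:signed", "doc2 was signed by A", ("key:A",)),
    "key:A": Assertion("key:A", "this key belongs to A", ("ca:root",)),
}
TRUSTED = {"ca:root"}

cache = {}
ok1 = verify("doc1:signed", ASSERTIONS, TRUSTED, cache)
ok2 = verify("doc2:signed", ASSERTIONS, TRUSTED, cache)  # reuses the cached "key:A"
```

Because results are cached by identifier, the cost of checking grows with the number of distinct assertions rather than with the number of paths through them.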
How Can We Use Wittgenstein’s Philosophy?
Who is to decide what evidence is important and what tests are worth applying? That has to be the end user—the person for whom the document in question is preserved. This will invariably be an economic decision—a weighing of the cost of evidence testing against the costs of a wrong decision. A role of technology is to minimize these costs.
For this we want the soundest possible foundation and believe we can do no better than a basis in Wittgenstein’s work. He provided a nexus in the theory of language and logic (and therefore computing), arguably completing inquiries by Bertrand Russell, Immanuel Kant, Ernst Mach and many lesser-known central European philosophers, and setting the stage for later work by Rudolf Carnap, Kurt Gödel, Alan Turing, and John von Neumann. The famed Wiener Kreis analyzed the Tractatus Logico-Philosophicus (TLP) twice, and later investigation rarely reached backward earlier than TLP.
We might want a concise and complete prescription for communicating meaning, but trying to compress into a few words what hundreds of philosophical writings have not completed would be foolish. Nevertheless, a focus on Wittgenstein’s notion of a rule may help towards understanding how his teachings apply to digital preservation methodology.[18] We cannot convincingly say why Wittgenstein’s teachings are the best available ground, but we can show what use we make of them. What the reader should look for to evaluate DDQ analyses is how it uses thinking taught by Wittgenstein. Specifically:
· We are sensitized to the limits of language and the consequences of misunderstanding communications expressed in words. In particular, potential misunderstandings between different professions figure prominently.
· We look at how authors use key words, and try to decide whether each sentence is a statement of a fact or an opinion.
· Such distinctions help us stay within the engineering role, avoiding inadvertently infecting technical solutions with value judgments that clients might find inappropriate.
· The limits of what can be automated or specified as clerical tasks are rules that expand into a finite number of steps for any particular objective.
· LW’s discussions of language suggest that we cannot articulate meaning any better than is accomplished by current ontologies (reference models) and the RDF language.
The critical distinction is perhaps best expressed by Paul Engelmann’s metaphor:
"Positivism holds—and this is its essence—that what
we can speak about is all that matters in life, whereas Wittgenstein
passionately believes that all that really matters in human life is precisely
what, in his view, we must be silent about. When he nevertheless takes immense pains to delimit the unimportant
[i.e., the scope and limits of ordinary language], it is not the coastline of
that island which he is bent on surveying with such meticulous accuracy, but
the boundary of the ocean." [Janik, p.191]
Although we cannot define the boundary in words, we can illustrate it by tabulating word pairs:
| Island (terra firma) | Ocean | Comments |
|---|---|---|
| objective | subjective | We can relate any subjective assertion to some objective one, e.g., “DDQ writes that the digital preservation literature is highly redundant.” |
| Natural Philosophy | Ethics | Until academic topics became highly differentiated early in the 20th century, these were the major categories of learning. |
| physics, mathematics, logic | metaphysics | The Greek meta- means beyond. I.e., metaphysics is the part of philosophy that deals with value judgments. |
| facts | values and opinions | Unfortunately, value has many fundamentally different meanings; here the sense of “value judgments” is intended. |
| die Darstellung → representation | die Vorstellung → | English translations of Wittgenstein’s die Darstellung are sometimes misleading. |
| evidence | opinion | evidence comes from the Latin evidens—visible, clear, evident; also from ex + vident—according to + what they see |
| … … | … … | Many more word pairs might appear here. |
| form or syntax | meaning or semantics | In modern markup languages, XML deals with form and structure, RDF with meaning. |
| action | intention | [LW 39, II] treats “intention” carefully. |
| intensional | extensional[19] | Sets and sequences are communicated either intensionally—“(3n+1) where n is a natural number”—or extensionally—“1, 4, 7, 10, …”. |
| “Prove it!” | “Let your light so shine before men, that they may see your good works, …”[20] | Wittgenstein’s view about what cannot be said, but must be shown, is foreshadowed in the Bible. |
| … … | … … | Still more word pairs might appear here. |
| rules, procedures | intuitions | Whether a man “knows or not is simply a question of whether he does it as we taught him; it’s not a question of intuition”. [LW 39, II] |
| possibly computable | surely not computable | The technical word in mathematics and computer science is decidable rather than computable. |
| trustworthy | trusted | The creator of an object can make it trustworthy, but only its end user can judge whether or not it is to be trusted. |
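The intensional/extensional row of the table can be made concrete with a small Python sketch. It assumes that the natural numbers start at 0, which is what makes the rule agree with the listed values 1, 4, 7, 10; the names `intensional` and `EXTENSIONAL` are illustrative only.

```python
from itertools import count, islice


def intensional():
    """The sequence given by its rule: 3n + 1 for n = 0, 1, 2, ..."""
    return (3 * n + 1 for n in count(0))


# The same sequence given extensionally, by listing its first members.
EXTENSIONAL = [1, 4, 7, 10]

# A rule can be expanded into as many members as wanted; a listing cannot be extended.
first_four = list(islice(intensional(), len(EXTENSIONAL)))
```

The rule is a finite description of an unbounded sequence, while the listing conveys only the members actually written down and leaves the continuation to the reader.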
Imagine having to write an essay comparing the value of Shakespeare’s Hamlet to that of Lucasfilm’s Star Wars. Your statements might oscillate from side to side of the boundary.
Originators, editors, and librarians can make a preservation document trustworthy (deserving of trust) by binding metadata and test procedures to it reliably, i.e., so that the binding is firm over time and not itself susceptible to undetectable modification. Whether or not the document will be trusted by its eventual user will depend both on how well those who prepare it choose and bind the metadata and tests and also on the user’s judgment.
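One way to picture a binding that cannot be altered undetectably is the following Python sketch. It is an illustration under assumptions, not a DDQ specification: the function names `seal` and `check` are hypothetical, and a keyed hash (HMAC) from the standard library stands in for the public-key signature that a real archive would more plausibly use, since an HMAC can be verified only by someone who holds the key.

```python
import hashlib
import hmac
import json


def _canonical(record: dict) -> bytes:
    """Serialize deterministically so that equal records give equal bytes."""
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def seal(content: bytes, metadata: dict, key: bytes) -> dict:
    """Bind metadata to content: the sealed record covers both the digest and the metadata."""
    record = {"sha256": hashlib.sha256(content).hexdigest(), "metadata": metadata}
    tag = hmac.new(key, _canonical(record), hashlib.sha256).hexdigest()
    return {"record": record, "seal": tag}


def check(content: bytes, package: dict, key: bytes) -> bool:
    """Return True only if neither the content nor the metadata has changed."""
    record = package["record"]
    expected = hmac.new(key, _canonical(record), hashlib.sha256).hexdigest()
    seal_ok = hmac.compare_digest(expected, package["seal"])
    content_ok = hashlib.sha256(content).hexdigest() == record["sha256"]
    return seal_ok and content_ok


KEY = b"illustrative key only"
package = seal(b"original text", {"author": "A. Writer", "year": 2002}, KEY)
```

Passing such a check shows only that the binding is intact. Whether the metadata is adequate, and whether the document is then trusted, remains the user's judgment.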
The very naming of [RLG 2],
“Trusted Digital Repositories”, is troublesome because it sets a trap for the
unwary—a sort of false advertising. “Trusted”
should not be used when what is meant is “trustworthy”.
We might have passed silently over this point were it not for pervasive confounding of objective and subjective aspects of digital preservation. Trust is of fundamental importance in document delivery services, not only for scholarly work [Lynch], but also for business transactions that will grow to include pharmaceutical development records and perhaps even personal medical records.[21] Some business applications are tempting fraud opportunities.
The [RLG2] authors did not invent sloppy use of trusted. The term appeared in 1975 in the Trusted Computing Base (TCB)—a processor kernel whose behavior could not be improperly altered by any executing process—intended for critical control and cryptographic services.[22] The designers called the kernel trusted because they also built the trusting entity—an operating system that needed the TCB to meet defense security criteria. This operating system was itself protected by the kernel against modification, and was certified as secure by teams other than its implementers. [NCSC2] To meet its objectives, the secure operating system needs to trust the kernel; i.e., we can say the TCB is trusted by the operating system.
In contrast, the relationship between a digital repository and its users cannot be engineered to ensure that the repository is trusted by its users—only that it might deserve their trust.
By 1990 people had apparently forgotten why trusted made sense only within systems that included the user as a controlled component. Xerox used the terminology for network printers that were supposed to enforce rules for sensitive documents. As far as I know, these machines were flawed and failed in the marketplace, perhaps because the intended customers understood the problem. But the misuse of trusted persisted and was amplified in [Stefik] and related works; Stefik evaded an offer to respond to criticism raised in 1997 [HMG3].
“Trusted Systems” are not necessarily trustworthy, and trustworthy systems are not necessarily trusted by their intended users. Misleading language impedes achieving the trust a repository needs to be effective. An eloquent argument for correct English usage in this is:
`I don't know what you mean by "glory,"' Alice said.
Humpty Dumpty smiled contemptuously. `Of course you don't—till I tell you. I meant "there's a nice knock-down argument for you!"'
`But "glory" doesn't mean "a nice knock-down argument,"' Alice objected.
`When I use a word,' Humpty Dumpty said in rather a scornful tone, `it means just what I choose it to mean—neither more nor less.'
`The question is,' said Alice, `whether you can make words mean so many different things.'
`The question is,' said Humpty Dumpty, `which is to be master—that's all.'
Lewis Carroll, [Carroll, p.213]
Lewis Carroll was fully aware of the profundity in Humpty Dumpty's whimsical discourse on semantics. … the point of view known in the Middle Ages as nominalism; the view that universal terms do not refer to objective existences but are [merely] … verbal utterances. [This] view was skillfully defended by William of Occam and is now held by almost all contemporary logical empiricists.
Even in logic and mathematics, where terms are usually
more precise than in other [disciplines], enormous confusion often results from
a failure to realize that words mean "neither more nor less" than
what they are intended to mean. …
On the other hand, if
we wish to communicate accurately, we are under a kind of moral obligation to
avoid Humpty's practice of giving private meanings to commonly used words.
Martin Gardner, [Carroll, p.213, note 11].
Like the [RLG 2] authors, our long-term objective is “to reach consensus on the characteristics and responsibilities of … digital repositories for large-scale, heterogeneous collections held by cultural organizations.” We believe that everything essential to trustworthy and durably useful preservation of decidable[23] information is known in principle. The rest of the work needed is engineering reduction to economical practice.
Digital document preservation overlaps only partially with collection management thinking that is the focus of [OAIS] and writings that depend on [OAIS]. (Figure 1)
Figure 1: Digital document preservation and cultural collection management ([CLIR/LC], [RLG 2]) are topics that overlap only incompletely. [24]
Any document or blob can be represented by a bit-stream—a sequence of 0’s and 1’s. Computer programs are specialized documents. Collection preservation will be achieved if we:
· Save the bits so that somewhere a copy survives and that copy can be found.
· Ensure that the bits can be interpreted.
· Make the bits trustworthy by reliably associating sufficient metadata.
· Include library content lists among the set of saved documents.
If these requirements are met, digital libraries can be constructed or reconstructed.
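The first and last requirements can be illustrated with a short Python sketch: given a saved content list and several replica stores, find for each listed document some copy whose bits are still intact. All names here (`recover`, `MANIFEST`, the two stores) are hypothetical, replicas are modeled as in-memory dictionaries rather than real storage, and SHA-256 digests are assumed as the means of recognizing an undamaged copy.

```python
import hashlib


def digest(bits: bytes) -> str:
    """Fixity value used to recognize an undamaged copy."""
    return hashlib.sha256(bits).hexdigest()


def recover(manifest: dict, replicas: list) -> dict:
    """Rebuild a collection from whichever replica still holds intact bits.

    `manifest` maps document names to expected digests (the saved content list);
    each replica maps document names to the bits it currently holds.
    """
    recovered = {}
    for name, expected in manifest.items():
        for store in replicas:
            bits = store.get(name)
            if bits is not None and digest(bits) == expected:
                recovered[name] = bits
                break                    # one surviving copy is enough
    return recovered


REPORT = b"annual report"
MINUTES = b"meeting minutes"
MANIFEST = {"report": digest(REPORT), "minutes": digest(MINUTES)}

store_a = {"report": b"annual rep0rt", "minutes": MINUTES}  # one corrupted copy
store_b = {"report": REPORT}                                # one document missing

collection = recover(MANIFEST, [store_a, store_b])
```

In this example neither store is complete, yet the collection is reconstructed in full because each document survives intact in at least one place.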
For simple document types, extra work to ensure long-term interpretability is not essential—merely cost-effective. Document collections can contain sufficient redundancy for digital archeology—rescuing content from obsolete technology—when the content is wanted. Choosing to prepare a document for retention is an economic decision that depends on the expected number of retrievals and whether one is willing to expend on behalf of unknown future beneficiaries. For archivists, digital archeology is almost a “do nothing” tactic that leaves most of the work to whoever is interested in each saved document.
Computer program preservation cannot depend on digital archeology because programs rarely contain sufficient redundancy.
We believe that progressive dissection of the above solution components will expose no unsolved problem. However, we will not be sure of this until dissection is sufficiently advanced, and we will not be able to persuade everyone we would like to persuade. The question is whether the technology for every kind of information is covered without further invention.
Absent an orderly dissection and analysis to sufficient depth, mistakes or shortfalls in document schema or handling procedures might not be discovered until the damage they cause can no longer be corrected, risking information loss. Due care would include written reliability and security analysis of schemas and programs—analysis more careful than is common.[25] Ideally, every risk would be examined and minimized. Complete analysis is unlikely. Even so, careful analysis will be tedious and at times arcane.
The circumstances as well as the technology lend themselves to top-down analysis.
· The topic is large, embracing all kinds of information; some kinds call for prompt treatment, and new kinds are likely to be designed.
· Designing well for each kind of information before its instances are archived will minimize overall costs, i.e., up-front design is cost-effective for designs that are used.
· Different communities will be interested in different branches of the analysis, and in different depths; almost nobody will want to examine everything at the same time.