Digital Document Quarterly
Perspectives on Trustworthy Information
Volume 3, Number 2, 2Q2004
HMG Consulting, Saratoga, CA 95070
© 2004, H.M. Gladney. ISSN: 1547-8610
“It is with philosophy as with religion: men marvel at the absurdity of other people's tenets, while exactly parallel absurdities remain in their own, and the same man is unaffectedly astonished that words can be mistaken for things, who is treating other words as if they were things every time he opens his mouth to discuss. No one … will deny that the mistaking of abstractions for realities pervaded speculation all through antiquity and the middle ages. The mistake was generalized and systematized [by] Plato. The Aristotelians carried it on. Essences, quiddities, virtues residing in things, were accepted as a bonâ fide explanation of phenomena. Not only abstract qualities, but the concrete names of genera and species, were mistaken for objective existence.” John Stuart Mill[1]
The main section below, Topics in Digital Preservation of Knowledge, argues that the technical component of digital preservation research and development should focus on the design and management of digital objects ‘hardened’ for durability. It leads to today’s most difficult conceptual question: “How can today’s information producers represent their output so that its eventual consumers might be able to understand the meanings that these producers intend to convey?”
To suggest how to proceed towards practical digital preservation, this section combines prior DDQ material with analyses of economic projections and technical trends. For research into answering the difficult conceptual question, we believe that the soundest foundation is early 20th century theories of empirical knowledge. We identify seminal works that collectively seem a sufficient basis—works by Wittgenstein, Carnap, Quine, Popper, and Nimmer. We recommend specific ideas from these sources as starting points for research into preserving knowledge.
Information scientists concerned with digital preservation seem to have focused on repository functionality and management. In contrast, DDQ has consistently focused on preserving documents, partly because digital library technology is well understood and presents no conceptual preservation issues. My colleagues and I believe the focus on repositories, rather than on preserved objects that repositories manage, to be misplaced and not in the best interests of archival institutions or their professional staffs.
Our architectural focus is driven by economic trends and deployed information network characteristics. Since these apparently do not much influence most other digital preservation thinking, we sketch a subset below and suggest why they drive us in directions more fully articulated in our Trustworthy 100-Year Digital Objects (TDO) papers. Since the Trusted Digital Repositories[2] and the [U.S.] National Digital Information Infrastructure and Preservation Program (NDIIPP) reports represent Digital Library Federation consensus and identify a funded technical plan,[3] we take these as the articulation of the focus we believe to be misplaced.
The reader might be surprised at the length and depth of our analysis of design imperatives sketched below. Our care and the reader’s critical attention are mechanisms for avoiding, or discovering and repairing, systematic errors. They are driven by (1) software engineering precepts suggesting that high quality is favored by spending relatively more time on design and less on implementation and deployment; (2) recognition that careful design is inexpensive compared to lifetime utilization costs; and (3) expectation that errors embedded in preserved digital objects might not be noticed by their eventual consumers and, if noticed, not be correctible by them.
We cannot be confident that our preservation method is sound without a profound understanding of what we mean by ‘knowledge’. We need a thorough analysis of preservation objectives that includes an answer to, “Precisely what is it that creators want to communicate to future generations?” This is not in the sense of selecting which documents we want to preserve, but rather in a sense suggested by Levy.[4] The best basis we know for confidence in methodology is certain early 20th century analytical thinking.
"Prediction is difficult. Especially about the future." —attributed to Niels Bohr
We are in the midst of widespread changes[5] in how people interact with information, how it affects their lives, and how information will be managed in a networked world. The information science literature about digital preservation pays less attention to economic factors and technical trends than to examining how current paper-based repository methods can be adapted to a digital world.
The shift of information search from library services to private sector services might be a harbinger for further disintermediation.[6] For instance, academic faculty members and private individuals often provide superb information organization and deliver this directly to consumers.[7] Librarians and library institutions might believe disintermediation undesirable—socially as well as for their professional and institutional futures. If so, they need to start leading part of the information revolution rather than merely following.
Some trends are well known, at least in the sense that they are often mentioned in the business literature.[8] Some whose strategic consequences bear thinking about are: [9]
· The number of people with education, leisure, and interest in reading and writing is much larger than it has ever been, even as a fraction of the total population, and is growing.[10]
· Many people younger than 30 years tend to be more comfortable with digital technology, and more skilled in its use, than most people 50 and older. The latter group includes most of the decision makers in libraries and archives.
· Digital technology is becoming affordable in lesser-developed countries, some of whose people are becoming Internet users, particularly in China[11] and India.
· The amount of digital information that might be preservation-worthy is growing rapidly. The many estimates[12] suggest that the portion represented by research library collections is small and shrinking.
· The number of contributors to information management and search technology is much greater than the digital staff of traditional libraries and archives, and growing.
· The information services industry is changing rapidly to exploit the Internet[13] and to provide scaling to very large digital object collections.
· Any document is potentially linked to many documents of other kinds. We cannot partition the world’s collections into unconnected partial collections. For instance, we can neither define an impervious boundary between cultural documents and business records, nor segregate picture collections from pure text files.
· Every document contains references to other documents that are essential to its interpretation and provenance evidence.[14] These references might not be explicit.[15] We represent such references as citations, or links, or pointers.
· The content of any collection or individual document is the result of some individuals’ subjective choices. As a consequence of this and the prior points, there is no structural distinction between a document collection (a.k.a. ‘library’) and an individual document.[16]
· The information quality and evidence of authenticity that people expect has increased steadily since early in the 20th century (when radio broadcasts and music recordings became popular).
Figure 1: How Much Storage will $200 Buy? The 2004 point has been added to the NY Times chart.
· Automation is now inexpensive compared to human labor.[17] For instance, see Figure 1. It is reasonable to plan a home computer with a terabyte of storage!
· Information consumers, information producers, and information service providers will not change their tools to accommodate digital preservation, except for very modest upwards-compatible modifications. The provider who plausibly promises the least disruption will win.
· Many applicable technical specialties are highly refined, with their own extensive and deep literature, and active interest groups. For instance, information retrieval is represented by ACM SIGIR.
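The points above about linked documents, and about the lack of a structural distinction between a collection and a document, can be sketched as a data structure. This is only an illustration under our own assumptions: the class `DigitalObject`, the function `reachable`, and the sample object names are hypothetical and do not come from the TDO papers.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    """Hypothetical model: a document and a collection share one structure."""
    name: str
    content: str = ""
    # Links to other objects: citations, pointers, or collection members.
    links: list["DigitalObject"] = field(default_factory=list)


def reachable(start: DigitalObject) -> set[str]:
    """Return the names of every object reachable from `start` via links."""
    seen: set[str] = set()
    stack = [start]
    while stack:
        obj = stack.pop()
        if obj.name in seen:
            continue  # links may form cycles, so visit each object once
        seen.add(obj.name)
        stack.extend(obj.links)
    return seen


# A business record and a cultural document refer to each other.
invoice = DigitalObject("invoice-1998", "Payment for a commissioned mural")
mural = DigitalObject("mural-photo", "Photograph of the mural", [invoice])
invoice.links.append(mural)

# A 'library' is just another object whose links are its holdings.
library = DigitalObject("city-archive", links=[mural])

names = reachable(library)
```

In this sketch the archive holds only the photograph, yet interpreting the photograph pulls in the invoice, which suggests why collections cannot be partitioned cleanly by document kind.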
Librarians have been thorough in investigating what history teaches about preservation.[18] They might balance that by similar care in looking forward.
Archivists have more than once changed their collective opinion about what information representations are worth preserving. Levy and O’Toole suggest that it is time for another change, without specifying the nature of the change precisely.[19]
Large-scale digital preservation will be affordable only if we automate every human processing step that can be replaced by a machine procedure. However, we should not preclude any human intervention based on human judgement and values.
The literature suggests practical urgency because older digital content is already being lost. A second urgency is that metadata—provenance information needed to convert a document into an archival record—are best created and packaged with the active participation of each document’s creator(s).
Protocol and data representation standards for information interchange are a key focal topic for digital preservation.[20] At some level, all documents being interchanged must share structural schema.
To avoid troublesome ambiguities of reference, we must assign a unique reference name (a.k.a. ‘identifier’) to each digital object. We often find it useful to assign more than one name to an object.[21]
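A minimal sketch of such naming follows. It assumes, purely for illustration, that the primary identifier is derived from a SHA-256 digest of the object's bytes and that additional names are aliases that can never be rebound to a different object. The class `NameRegistry` and its methods are hypothetical; other identifier schemes would serve equally well.

```python
import hashlib


class NameRegistry:
    """Hypothetical registry mapping many names to one primary identifier."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}  # primary id -> object bytes
        self._aliases: dict[str, str] = {}    # any name -> primary id

    def register(self, data: bytes) -> str:
        """Store an object and return its primary identifier."""
        # Assumption: a content digest serves as the unique reference name.
        primary = "sha256:" + hashlib.sha256(data).hexdigest()
        self._objects[primary] = data
        self._aliases[primary] = primary
        return primary

    def add_alias(self, alias: str, primary: str) -> None:
        """Bind an extra name to an object; a name may never be rebound."""
        if primary not in self._objects:
            raise KeyError(f"unknown object: {primary}")
        if self._aliases.get(alias, primary) != primary:
            raise ValueError(f"name already in use: {alias}")
        self._aliases[alias] = primary

    def resolve(self, name: str) -> bytes:
        """Return the object bytes for a primary identifier or an alias."""
        return self._objects[self._aliases[name]]


registry = NameRegistry()
object_id = registry.register(b"Annual report, 2004")
registry.add_alias("report-2004", object_id)
same_object = registry.resolve("report-2004") == registry.resolve(object_id)
```

Refusing to rebind a name is the design choice that keeps references unambiguous: several names may point at one object, but no name ever points at two.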
Integrating digital networks from lightly coupled components has been common for more than a decade. For content management the accepted infrastructure components that need to accommodate long-term preservation include:
· File storage management,
· File replication,
· Primary catalog management in relational databases (DBMSs),
· Search index management,
· Search engines,
· File formats conforming to international standards,
· Metadata conforming to international standards,
· Access control and digital rights management services,
· A document storage subsystem binding files and catalog records (see Figure 2), and
· A document manager layer in which all local customization is implemented.
Figure 2: Relationships of components of a digital document repository. In contrast to the usual usage of ‘trusted’ in “Trusted Digital Repositories”, the usage here is correct.[22]
System layering[23] is essential to partition technical responsibilities, to enable software porting across hardware and operating system platforms, and to permit customization wanted by different institutions and sometimes by individual users. Figure 2 suggests some of the layering and some of the functional components of content management services.
OAIS permits differences between a ‘Submission Information Package (SIP)’ and its corresponding ‘Archival Information Package (AIP)’ and ‘Distribution Information Package (DIP)’. However, as can be inferred from Figure 3, to ensure that the document representation that a consumer receives is independent of the network path by which it reaches him,[24] each DIP needs to be identical to its corresponding SIP. Repository clients (producers and consumers) will not care how AIPs are represented.
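The requirement can be illustrated with a small sketch. It assumes a hypothetical repository whose internal AIP is simply a compressed copy of the SIP; the functions `ingest`, `disseminate`, and `digest` are our own illustrative names, not part of OAIS. Whatever the internal representation, the consumer can confirm that copies arriving by different paths are bit-for-bit the same package.

```python
import hashlib
import zlib


def ingest(sip: bytes) -> bytes:
    """Hypothetical repository ingest: the AIP here is a compressed SIP."""
    return zlib.compress(sip)


def disseminate(aip: bytes) -> bytes:
    """Hypothetical dissemination: rebuild the DIP from the stored AIP."""
    return zlib.decompress(aip)


def digest(package: bytes) -> str:
    """SHA-256 digest used to compare packages byte for byte."""
    return hashlib.sha256(package).hexdigest()


sip = b"<report>Preserved content with its metadata</report>"

# Path A: the consumer receives a copy forwarded directly by the producer.
dip_direct = sip

# Path B: the consumer receives a copy disseminated by a repository.
aip = ingest(sip)
dip_from_repository = disseminate(aip)

# The consumer sees the same package on either path ...
paths_agree = digest(dip_direct) == digest(dip_from_repository) == digest(sip)
# ... although the repository's internal representation is different.
aip_differs = aip != sip
```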
Figure 3: Digital object paths from producer to consumer. Copies of a particular object might reach the consumer by paths that he cannot control and that might be different from time to time.
Repository institutions should work to encourage content producers to submit objects already packaged for preservation to share preservation costs, to exploit producers’ knowledge and competence, and to mitigate the challenges of scaling to large collections.
“Documents are talking things. … The brilliance of writing is the discovery of a way to make artifacts talk, coupled with the ability to hold that talk fixed, so that a fixed, stable message can be carried through space and time. It is something that documents do well and people by and large don't. It is not that we are incapable of performing in such a manner … but it is not of our essence to do so. Yet it is exactly of the essence of documents, a defining characteristic.” David Levy[25]
Figure 4: Data, information, knowledge, and understanding[26]
Assuming that what we want to preserve is knowledge, we might start by agreeing what we mean by ‘knowledge’. Popper’s ‘World 3’ definition (see below) is particularly apt, and consistent with modern articulations such as that suggested by Figure 4.
Beyond that, what we know in principle about the technical parts of digital preservation includes:
· How to protect information packages from being lost.
· How to package information so that its eventual users can reliably test its trustworthiness.
· How to encode information so that it can be rendered reliably.[27] In this context, ‘rendering’ includes execution of computer programs.
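The second point above, packaging so that a user can test what he receives, can be illustrated in a much simplified form. The sketch below seals a payload and its metadata with a SHA-256 digest; the JSON layout and the names `make_package` and `check_package` are hypothetical. Note the limits of the assumption: a bare digest detects only accidental or careless change, because anyone who alters the package can recompute it. Testing authenticity would require something stronger, such as a digital signature, which this sketch does not attempt.

```python
import hashlib
import json


def make_package(payload: bytes, metadata: dict[str, str]) -> str:
    """Bundle payload, metadata, and a digest over both (hypothetical layout)."""
    body = {"metadata": metadata, "payload": payload.hex()}
    canonical = json.dumps(body, sort_keys=True).encode("utf-8")
    body["sha256"] = hashlib.sha256(canonical).hexdigest()
    return json.dumps(body, sort_keys=True)


def check_package(package: str) -> bool:
    """Recompute the digest; False means the package changed after sealing."""
    body = json.loads(package)
    claimed = body.pop("sha256", None)
    canonical = json.dumps(body, sort_keys=True).encode("utf-8")
    return claimed == hashlib.sha256(canonical).hexdigest()


package = make_package(
    b"Minutes of the 2004 board meeting",
    {"creator": "Example Archive", "date": "2004-06-30"},
)
intact = check_package(package)

# Simulate a change to the provenance metadata after sealing.
tampered = package.replace("2004-06-30", "1994-06-30")
detected = not check_package(tampered)
```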
An open engineering challenge is illustrated by word processor documents whose users want preservation of all possible renderings. Saving ‘.doc’ (e.g.) files is not enough, since the renderings are articulated by vendor software that includes operating system components and other vendors’ device drivers. Extracting and saving the necessary programs is made difficult by vendor secrecy.
The most difficult previously expressed digital preservation objective[28] is “ensuring that information consumers can read or otherwise use each preserved object as completely as its producers intended.” Accomplishing this is, in principle, impossible for at least some data types. A prudent revision of the challenge is, perhaps, “how can producers today represent preserved information to minimize each eventual consumer’s misunderstandings of what these producers intended to convey?”
What sound basis exists for choosing how to convey digital documents? Arguably, the best available foundations for analysis are found in early 20th century thinking. Provisionally, a sufficient selection is:
1. Ludwig Wittgenstein’s Tractatus Logico-Philosophicus distinction between objective and syntactical concerns, on the one hand, and subjective and semantic concerns, on the other hand.[29] His Philosophical Investigations teaches that every use of language (a word, a sentence, a report, a book) is comprehensible only in the context of innumerable other communications.[30]
2. Rudolf Carnap’s The Logical Structure of the World[31], which starts with a pragmatic notion of ‘object’:
“The word "object" is here always used in its widest sense, namely, for anything about which a statement can be made. Thus, among objects we count not only things, but also properties and classes, relations in extension and intension, states and events, what is actual as well as what is not.” The Logical Structure of the World, §1.
Carnap grounds a small number of objective definitions in ostensive use of relations and outlines a construction method for articulating more complex objects.
3. Karl Popper’s 1967 essay Knowledge: Subjective versus Objective,[32] which includes:
“… without taking the words `world' or `universe' too seriously, we may distinguish … first, the world of physical objects or of physical states; secondly, the world of states of consciousness, or of mental states, or perhaps of behavioural dispositions to act; and thirdly, the world of objective contents of thought, especially of scientific and poetic thoughts and of works of art.
“… consider two thought experiments:
“Experiment (1). All our machines and tools are destroyed, and all our subjective learning, including our subjective knowledge of machines and tools, and how to use them. But libraries and our capacity to learn from them survive. Clearly, after much suffering, our world may get going again.
“Experiment (2). As before, machines and tools are destroyed, and our subjective learning, including our subjective knowledge of machines and tools, and how to use them. But this time, all libraries are destroyed also, so that our capacity to learn from books becomes useless.
“If you think about these two experiments, the reality, significance, and degree of autonomy of world 3 (as well as its effects on worlds 1 and 2) may perhaps become a little clearer to you. For in the second case there will be no re-emergence of our civilization for many millennia.”
4. Willard Van Orman Quine’s Word and Object teaches how to map normal language usage to relatively unambiguous forms inspired by formal logic.
“According to an influential doctrine of Wittgenstein's, the task of philosophy is not to solve problems but to dissolve them by showing that there were really none there. This doctrine has its limitations, but it aptly fits explication. For when explication banishes a problem it does so by showing it to be in an important sense unreal; viz., in the sense of proceeding only from needless usages [of language].
“…
“It is ironical that those philosophers most influenced by Wittgenstein are largely the ones who most deplore the explications just now enumerated. In steadfast laymanship they deplore them as departures from ordinary usage, failing to appreciate that it is precisely by showing how to circumvent the problematic parts of ordinary usage that we show the problems to be purely verbal.” Word and Object, §53.
5. David Nimmer’s Adams and Bits: Of Jewish Kings and Copyrights[33] identifies what can be protected, and therefore much of what is worth preserving.
“News Item: Fire swept through the converted grain silo that Naomi Marra has called home … Feared lost among the charred ruins is the last extant copy of her lyric ode, Ruthless Boaz. … devotees hope that, following her many public declamations of the work, most or all of it may remain preserved in her memory. … Query: Is Ruthless Boaz still subject to statutory copyright protection?”
With this hypothetical case, Nimmer analyzes the protection of intangible value—patterns inherent in the reproductive instances of each document.[34] The essential patterns of a document are those needed to allow it to be Levy's “talking thing”.[35]
For the purposes at hand, we need not read earlier than 1920. Collectively, Wittgenstein, Carnap, and Quine acknowledged and progressed from the work of Immanuel Kant, Auguste Comte, Heinrich Hertz, Karl Weierstrass, Ernst Mach, Gottlob Frege, David Hilbert, and Bertrand Russell. All later epistemological thinking was based on the work of these masters.
The treatment above emphasizes permanently significant aspects of long-term digital preservation. It provides part of the reasoning that leads us to believe that the architecture described in our Trustworthy 100-Year Digital Objects work is forced by the existing information infrastructure and by end users’ needs.[36]
What follows su