Digital Document Quarterly
Perspectives on Trustworthy Information
Volume 1, Number 3, 3Q2002
HMG Consulting, 20044 Glen Brae Drive, Saratoga, CA 95070, (408) 867-5454
© 2002, H.M. Gladney
In its first number, DDQ projected a 2002 emphasis on digital preservation archiving. The current number is influenced by workgroup discussions of preservation research needed[1] and by similar considerations for a (U.S.) National Academy of Sciences study requested by the National Archives and Records Administration (NARA).
These discussions suggest questions[2] that have been insufficiently addressed. In this and future numbers, DDQ will identify a few questions that deserve increased attention and suggest unconventional possibilities. It starts with:
Ø For which communities and which kinds of public documents is digital preservation most important and most urgent? (See Selection Criteria below.)
Ø In view of unavoidable difficulties with language,[3] what are the limits of saving meaning, semantics, and knowledge for future generations? (See Semantics ... below.)
Ø How can deployments be planned so that unanswered technical questions neither hamper nor delay ingesting content? (See Why So Slow ... below.)
Readers might consider their own answers to these questions. To give plausible, effective answers does not require specialized expertise, but rather good judgment.
I formerly thought that selection for preservation was a difficult challenge, but no longer believe so, at least not in the sense of being hampered by worthwhile, but unanswered, research issues. Once the technical and organizational challenges are overcome,[4] digital preservation is likely to become a routine activity with priorities set by each institution’s resource allocation process. Institutional objectives are likely to dominate selection criteria.
Today’s selection costs are exacerbated by the accelerating transformation from information scarcity to information overflow. Writing and dissemination were relatively rare and relatively slow in earlier centuries. For instance, the British Departments of State had about 50 clerks at the time of the American revolution; these clerks wrote with quill pens; their letters to North America took 6 to 10 weeks to deliver. Compare that to the sizes of today’s bureaucracies and the tools they use to create and disseminate information. Selection is much less challenging for old documents than for modern content; de facto, for old content, we benefit from a form of selection at the source.
It is today neither possible nor desirable to save everything. Decisions will occur, either by default or with varying degrees of care and insight. For governments and ordinary folk, Titanic 2020: A Call for Action [Lysakowski] suggests disaster for office files in popular formats. A concerned reader might start by thinking about which documents he would like to see saved.
In the public sector, the visible efforts towards preservation of “born-digital stuff” are focused on cultural content, on scientific data,[5] and on records of national significance. The public discussion and literature make few allusions to the interests of smaller political units, to educational priorities other than those of research scholars, to judicial systems, to health delivery systems, or to administrative collections of interest to ordinary citizens. Is this appropriate? How might taxpayers prioritize public records for preservation?[6]
For at least two decades, some people have shared a dream of the “longitudinal patient record”—a medical history that accompanied each person from birth to grave. Since the useful lifetime of today’s digital records is much less than healthy human lifetimes, preservation technology would be needed to fulfill this dream.[7]
A personal letter from a schoolmate illustrated other needs:[8]
Speaking of the [Immigration and Naturalization Service], we are trying to see if [my son] qualifies for [U.S.] citizenship on the basis of the fact that I did the border shuffle [between Canada and the U.S.] for most of my natural life. Now it is a question of proving I exist, it seems.
[I am] trying to unearth papers to prove to the lawyers that I actually spent about half my time on either side of the border from birth until I married! Did you know that anyone who attended HS still in the 1950s is clearly so far back in the Dark Ages as to be almost a non-person? Welcome to the real world. The schools in [city] IL, where I attended the first 3 grades, tell me they have no records of any students born between 1931 and 1942; so much for that!
The school board in [location] says [XYZ] High School no longer exists. "If we had records, they would have been forwarded to the HS you went to." [The Canadian city] HS fortunately had registered me as having come in from [XYZ] HS, but kept no transcripts ….
And we haven't even been bombed or anything. No wonder half of people who lose their papers die of despair. Bureaucracy is immovable! Yet, a front page story about a restaurant on my block starts out with how the executive chef came to this country as an illegal immigrant from Mexico! Any time I have had dealings with the INS they have been expensive and exceedingly unpleasant. So what else is new?
In 1991, an IBM Research group and a department of the California Department of Transportation (DOT) considered a digital library pilot for the construction and inspection records of thousands of bridges. During a proposal work-up, they visited the records room of a DOT regional office. It was in a clay-floored basement, with 30-year-old drawings, handwritten notes, and typescripts stored in cardboard boxes. A sprinkler system had been installed for fire protection. Imagine working with soggy, partially burned records! Are critical state records protected today by the digital equivalent of cardboard boxes and sprinkler systems?
Can we devise a way to establish what the public’s priorities would be and to adjust expenditures to represent these better than might today be the case? Or is doing so politically infeasible?
Between 1988 and 1994, scholarly advisory committees considered preservation content selection for History, Renaissance Studies, Philosophy, Mediaeval Studies, Modern Language and Literature, and Art History. [George] While this work was mostly done before digital capture was a practical option, similar considerations seem applicable today.
“One theme is the understandable reluctance of scholars to make choices because of the unpredictability of research needs. Scholars are loath to say, 'this book will be more useful for future research than that one,' because the history of their fields shows that writers and subjects that seem inconsequential to scholars in one era may become of great interest in the next, and vice versa. Moreover, discovery and serendipity may lead to lines of inquiry unforeseen. ...” [George]
This suggests a somewhat discouraging prospect for scholarly needs. Students' wants are easier to satisfy, as the secondary school student or college undergraduate assigned a term paper will choose the first pertinent and interesting material that (s)he encounters. Digitization can provide more interesting material than has been commonly available. It is becoming realistic for teachers to require students to find and work from original sources rather than from secondary opinions and other people's selections.
“Knowledge management” seems more prominent today than a few years ago. Under various names, this topic has been considered by the artificial intelligence community for at least three decades, and by librarians grappling with information discovery and library catalogues.[9]
Some people regard knowledge management as a key component of digital preservation research. Doing so may be appropriate to a limited extent, but supporting pronouncements often blur the boundary with all the rest of scholarship, inevitably diminishing focus on the special challenges of preservation.
What can we preserve of meaning for future generations?
Preservation schema[10] should enable representing whatever authors want to express, within the limits of what language can express. This includes representing any relationships whatsoever among documents. However, a goal demanded by some preservation researchers—completeness of collections—is infeasible because completeness is a value judgment that cannot be expressed objectively.[11] Furthermore, only trivial collections have no references to works outside their own contents.
Are existing methods of representing meaning and linking digital documents sufficient for expressing anything that can be objectively expressed? Some future DDQ number will consider RDF [Beckett], and perhaps also languages invented for artificial intelligence research. However, to start that inquiry, we must remind ourselves of the meaning of “meaning”.
The Meaning of “Meaning”: Ontologies
The word ontology is used more now than it was ten years ago, and in a different sense than in philosophy. The reader can easily find many Web pages about what’s intended. For instance,
[For] knowledge sharing, … ontology [means] specification of a conceptualization … description (like a formal specification of a program) of the concepts and relationships that can exist for … a community of agents. This definition is consistent with … set-of-concept-definitions, but more general. …
What is important is what an ontology is for. … An ontology is … used for making … commitments. … [W]e choose to write an ontology as a set of definitions of formal vocabulary. Although this isn't the only way to specify a conceptualization, it has some nice properties for knowledge sharing … (e.g., semantics independent of reader and context). An ontological commitment is an agreement to use a vocabulary (i.e., ask queries and make assertions) in a way that is [internally] consistent …. We build agents that commit to ontologies. We design ontologies [to] share knowledge with and among these agents.
Tom Gruber at http://www-ksl.stanford.edu/kst/what-is-an-ontology.html
DDQ uses ontology approximately as Gruber suggests. It is a synonym for reference model as used in [OAIS], and is related to librarians’ subject classification, ACM’s Categories and Subject Descriptors, and recent discussions of The Semantic Web.[12]
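As a rough illustration of the notion quoted above, an ontology can be sketched as a set of vocabulary definitions, and an ontological commitment as a check that assertions use only that vocabulary in a consistent way. The concept and relation names and the `commits_to` helper are hypothetical, invented for this sketch; they are taken from neither Gruber's text nor [OAIS].

```python
# Hypothetical sketch: an ontology as a set of vocabulary definitions.
# All term names are illustrative assumptions, not taken from Gruber or OAIS.
ONTOLOGY = {
    "concepts": {"Document", "Archive", "Producer"},
    "relations": {
        # relation name -> (subject concept, object concept)
        "deposited_in": ("Document", "Archive"),
        "created_by": ("Document", "Producer"),
    },
}


def commits_to(assertions, ontology):
    """Return True if every (subject_type, relation, object_type) triple
    uses a relation the ontology defines, with the defined concept types."""
    for subject_type, relation, object_type in assertions:
        signature = ontology["relations"].get(relation)
        if signature != (subject_type, object_type):
            return False
    return True


consistent = [("Document", "deposited_in", "Archive")]
inconsistent = [("Archive", "created_by", "Document")]
```

An agent whose assertions pass `commits_to` is, in this toy sense, committed to the ontology; nothing here says how the agent is built, which is the distinction between a vocabulary and an architecture.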
The introduction to DDQ held that, “Computers manipulate symbols that are surrogates for what they mean. A computer model is good if its pattern follows the pattern of what it stands for. … language consists of symbols taking meaning from how they are used.” We use reference models to identify what information we are conveying or services we are providing.[13]
My attention was drawn back to the Open Archival Information Systems Reference Model (OAIS) by a referee of a commissioned survey about business enterprises’ archiving. He castigated me for not discussing OAIS, supposing that I did not know of it. He might have been even more scathing had he known that I had noticed that business authors never referred to OAIS, and saw no reason to judge this inappropriate.
The incident stimulated a fresh reading of [OAIS]. The more I considered how OAIS was used in digital preservation articles, the more it puzzled me. I could not firmly determine whether the authors, who mostly were Research Libraries Group (RLG) affiliates, viewed OAIS as an ontology or were planning to use it as an architecture. Ontology and architecture each have a role in technical and procedural designs. [OAIS §1.1] clearly articulates an intention to provide an ontology. However, its diagrams illustrating relationships can mislead people into reading it as an architecture. It does cross the line in:
Though the OAIS reference model does not focus on these emerging techniques, it should provide architectural basis for the prototyping and comparison of these techniques. [OAIS, p.2-4]
Excerpts from unpublished draft recommendations[14] illustrate the confusion.
“The Open Archival Information Systems (OAIS) Reference Model, a draft ISO standard developed by the space data community with leadership from NASA and the European Space Agency, is gaining rapid acceptance as a framework for the basic technical architecture for digital repositories.”
“The OAIS functional model defines the core functions of a repository as administration, ingest (more commonly known as accession), archival storage, data management, access, and preservation planning. Its information model defines various types of information packages, which if implemented together with the functional model, provide a means for separating long-term storage of bits or data streams from the management of data and collections. This model has been used to build prototype persistent archives that can preserve data independent of any particular hardware and software configuration.”
The confusion led to A Bigger Problem Called “OAIS” in DDQ 1(1); this squib stimulated debates that included an OAIS author, but that did not settle the issue.
Consider the following fragment of a reference model for residences, a fragment cribbed with obvious modifications from the beginning of OAIS Ingest [OAIS §4.1.1.2].
A residence may contain one or more areas called Kitchens. The functions of a Kitchen are illustrated in:

Figure 1: Kitchen function (compare Ingest function [OAIS Figure 4-2])
The Receive Groceries function provides the appropriate storage capability and entrance to receive a shipment from the Grocer. Groceries may be delivered by truck, or fetched by the cook, into temporary storage convenient for unpacking. The Receive Groceries function will represent a legal transfer of ownership of the groceries, and may require that special controls be placed on the shipments. This function provides a receipt to the Grocer, which may include a request to send missing items.
The Quality Assurance function validates correct receipt in the unpacking area. This might include tasting a sample of each item, and the use of a log to record and identify any shortfalls.
The Prepare Meal function transforms one or more packages into one or more dishes that conform to culinary and health standards. This may involve boiling, frying, baking, or blending of contents of grocery shipments. The Cooking function may issue recipe requests to a cookbook to obtain descriptions needed to produce the menu. This function sends sample dishes for approval to a critic, and receives back an appraisal.
The Generate Menu function extracts …
This fragment suggests how we might map the OAIS reference model onto our residence model. Each OAIS function would correspond to a residence area.
How much does this reference model help towards building a residence? It provides builders and eventual residents a shared vocabulary. However, each builder also needs instructions about what kind of residence to construct: a single family detached home, an apartment building, a military barracks, a college residence, or a prison? Just as our reference model says what it means to be a place to live (a residence), [OAIS] articulates what it means to be a place to hold information (a library or archive). What’s missing is an architecture.
Instructions for a builder should include dimensions, location, and many other factors. Such detail would not appear in our reference model, just as [OAIS] does not distinguish among a research library, a state government archive, a corporate archive, or a personal collection. Missing in each case is high level design that differentiates among structural alternatives, quantifies spaces, resources, and flows, describes materials and surface finishes, specifies utilities and safety factors, and so on.
How much qualitative and quantitative detail must an architecture express? The customer decides. He will often accept conventional levels and styles of description, but will also have his own ideas and emotions about what is important. The architecture would describe every aspect on which the customer insists. It would be an essential part of a prudent construction contract.
[OAIS] was first published in 1999, and has been only slightly refined since then. It is disturbing that we can find few published architectures for digital archives, and that the preservation community pays little attention to those that exist.[15]
Articles talk about OAIS. Librarians express concern about rapidly disappearing digital content. Nevertheless, in three years, nobody seems to have taken the next steps prescribed by conventional engineering practice. The Tractatus Logico-Philosophicus anticipates an apt summary for [OAIS] as a context for preservation archiving.
“… the truth of the thoughts that are here communicated seems to me unassailable and definitive. I therefore believe myself to have found, on all essential points, the final solution of the problems. And if I am not mistaken in this belief, then the second thing in which the value of this work consists is that it shows how little is achieved when these problems are solved.” Ludwig Wittgenstein, TLP Introduction
Consensus as an Impediment to Progress
The OAIS-related problems raised by DDQ have to do with how it is misunderstood, rather than with the content of the ISO proposal. Acceptance of OAIS by the research library community is insufficiently forward-looking. Since OAIS first became popular about 3 years ago, little technical progress has occurred in the areas which it articulates.
Rapid progress is impeded by the library community’s emphasis on consensus, when consensus is used to squelch productive conversation, particularly between librarians and engineers.[16] The research library community seems to value consensus so highly as to blind it to other values, such as the productivity of professional debate. [Marcum]
This criticism does not imply that consensus is unimportant. Consensus is, in fact, critical when action would be ineffective or inefficient without agreement about carefully selected aspects. These often exist in service delivery situations, such as common behavior in many libraries from which readers benefit. It is extremely valuable that any of us can, without special training or assistance, use the card catalogue of almost any research library in the world. Similarly, in the emerging digital world, we need to agree on information interchange standards. However, what works well for service delivery can be an immense nuisance when it is not yet decided what service should be delivered and how best to deliver it.
Behavior that’s good for running today’s libraries can be an impediment to inquiring how to run future libraries. Research thrives on debate that includes candid criticism of proposed solutions.[17] Researchers see little point in writing or talking about what they agree with.[18] We make little progress by discussing favorably what's written. To see what we should do next or what research problems might exist, we must pay attention to what is not yet satisfactorily addressed.
I’ve encountered misuse of consensus in discussions of OAIS. Several times in the last year, the response to a technical objection or question did not address the point raised, but rather was along the lines of “there is community consensus for OAIS,” and it seemed that the respondent was surprised that this did not satisfy me.
That OAIS has the approval of many research librarians is unsurprising. After all, [OAIS] is an ontology for how librarians refer to what they already do, extended to digitally conveyed information. For excellent reasons, it is conservative—laying language groundwork for minimum disruption in library processes by the addition of digital media to physical information carriers. To the extent that professional librarians feel OAIS accomplishes minimum change, they will approve.
Furthermore, librarians are encouraged by OAIS consensus among their peers, because their community values consensus so highly. However, emphasizing consensus rather than encouraging debate sets the stage for insufficient attention to research opportunities and potential economies.[19]
Concern about OAIS-focus would be silly without attractive alternatives. Trustworthy, Durable Digital Documents (TDDD) in DDQ 1(2) suggests another way of looking at the preservation challenge; there may be further alternatives. However, with few services committed to any particular approach to digital preservation yet, we should consider questions before jumping to proposed solutions. Collectively, the right questions are likely to suggest a much less expensive solution than seems inherent in today’s dominant focus.
[OAIS] asks, "What reference model for a research library (people, resources, processes, ...) is appropriate for archiving?" TDDD asks, "What characteristics will make document representations useful into the indefinite future?"
Such different questions are unlikely to lead to the same answer. The TDDD question suggests a simpler solution that is compatible with likely solutions to the OAIS question.
Some future DDQ number will include more about TDDD. For the time being, however, it seems best to consider direct benefits of focusing on digital document structures. The common feature of the following points is that all specifications are in terms of input/output or boundary conditions—it’s a “black box” approach that says next to nothing about the internal workings of an archive.[20]
Ø Archive customers (information producers and consumers) see and care only about document characteristics[21] and how to find helpful documents. They are not interested in how remote service machines accomplish what they ask for, just how well the service machines do so.[22]
Ø To specify static document properties and external properties of services (such as performance) is much easier than specifying how computers and service organizations should accomplish what they promise. Furthermore, we can easily test document properties and search performance. We would find it much more difficult to test the internal workings of an archive.
Ø Archive managers gain immense flexibility and freedom of action by committing only to input/output behavior. This flexibility will be immediately useful in acquiring or creating the technical components that accommodate idiosyncratic characteristics of each archiving institution.[23] It also clarifies what can be allowed to change with technology evolution.[24]
Ø Promising users and stakeholders only input/output specifications much reduces managers’ need to let outsiders meddle with the workings of archives. In the near future, we will show that external audits to establish trust can be much simplified from what [RLG2] suggests.
Ø It is, a priori, not obvious that archives in the style of current institutions will be either the most responsive or the most economical for born-digital content.[25] Requiring conformance only to “black box” properties[20] will facilitate marketplace exploration and experimentation.
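The “black box” idea in the points above can be sketched as an interface that promises only input/output behavior. Everything named here (`ArchiveService`, `deposit`, `retrieve`, the in-memory implementation, and the use of a SHA-256 digest as identifier) is an assumption made for illustration; none of it comes from [OAIS] or TDDD.

```python
import hashlib
from typing import Protocol


class ArchiveService(Protocol):
    """Hypothetical contract: only input/output behavior is promised."""

    def deposit(self, document: bytes) -> str:
        """Store a document and return an identifier for it."""
        ...

    def retrieve(self, identifier: str) -> bytes:
        """Return exactly the bytes that were deposited."""
        ...


class InMemoryArchive:
    """One possible implementation; its internals are not part of the contract."""

    def __init__(self) -> None:
        self._store = {}

    def deposit(self, document: bytes) -> str:
        # Assumption for this sketch: the identifier is a content digest.
        identifier = hashlib.sha256(document).hexdigest()
        self._store[identifier] = document
        return identifier

    def retrieve(self, identifier: str) -> bytes:
        return self._store[identifier]


def round_trip_ok(archive: ArchiveService, document: bytes) -> bool:
    """External test that checks only what a customer can observe."""
    return archive.retrieve(archive.deposit(document)) == document
```

A test such as `round_trip_ok` could be run against any implementation, whether a research library, a commercial service, or something not yet invented, without inspecting how it works inside.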
Since the challenges were articulated in 1996 [Garrett], many conferences have been held and many papers have been written on the topic. They include reminders of urgency, because irreplaceable and valuable digital content is allegedly disappearing.
Why is visible progress towards deployed digital preservation so slow in the United States? A deployment exists in the National Library of Australia. The Koninklijke Bibliotheek (KB) is making steady progress. Almost two years have elapsed since the Congress granted funding for a national digital preservation program managed by the Library of Congress. However, little seems to be happening,[26] except perhaps a higher pace of meetings[27] and broadly based studies that do not include the nitty-gritty of practical engineering.[28]
Is the problem in fact urgent? Is progress in fact slow? An eminent librarian once pointed out that “urgent” and “slow” have different meanings within the Washington beltway than they do to denizens of Silicon Valley.
Is it that the responsible managers believe that prompt action would risk massive wasted effort because unsolved technical problems exist for some kinds of data? If so, they should tell us specifically what these risks are and which data classes are affected. Alternatively, if non-technical risks are the effective impediments,[29] they should be specifically articulated for consideration by the best minds available.
For NARA, perhaps more than for any other American institution, digital preservation is mission-critical.[30] The sole big risk that haste might create is that a preservation format used today for each member of large collections might later prove to have a flaw needing human intervention for each affected document.[31] However, large classes of government records are almost surely represented by text files and are sufficiently uniform that metadata could be semi-automatically generated. This is true for most e-mail.[32] [Reference to Carlin and lawsuit.] Existing digital standards are surely sufficient to ensure their perpetual interpretability, and conventional computing center security measures can control authenticity as well as is done for records on paper. But action towards up-to-date technology is delayed pending research results.[33]
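The remark above that metadata for uniform text records such as e-mail could be semi-automatically generated can be illustrated with a minimal sketch using Python's standard `email` package. The sample message, the choice of metadata fields, and the `extract_metadata` helper are assumptions for illustration only; an archive would define its own metadata schema.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Invented sample message; addresses and content are illustrative only.
SAMPLE = """\
From: clerk@example.gov
To: registry@example.gov
Subject: Bridge inspection report
Date: Mon, 02 Sep 2002 10:15:00 -0700

Report attached.
"""


def extract_metadata(raw_message: str) -> dict:
    """Derive basic descriptive metadata from the message headers."""
    msg = message_from_string(raw_message)
    return {
        "creator": msg["From"],
        "recipient": msg["To"],
        "title": msg["Subject"],
        # Normalize the date header to ISO 8601 text.
        "date": parsedate_to_datetime(msg["Date"]).isoformat(),
    }


metadata = extract_metadata(SAMPLE)
```

Subject classification and appraisal would still need human judgment, which is why the text says "semi-automatically".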
Phased Deployment of Digital Preservation
Public discussions of digital preservation suggest that the big American institutions are focused on research, and are paying little attention to technology deployment issues that arise later in practice. Here, “later” means “logically later”, which need not imply “chronologically later”.
If disappearing data truly makes deployment urgent, creating an archive need not be a purely sequential process, but can be managed with concurrent activities. Solution integration and installation is likely to take one to three years even if all research problems were solved. The needed work would be mostly unaffected by different answers to the open questions, partly because some kinds of data encounter no open questions. The issue is how to decide what can be expedited. Careful, specific requirements analyses would help decisions.
Some things can be done quickly without much risk. Other things need careful thought.
A “just do it” approach[34] was adopted in digital preservation projects in Australia [Waugh] [Webb] and in the Netherlands [van der Werf]. Their much-admired projects seem to have avoided any risks of future costly rework.
Commercial Off-the-Shelf Software
Writings from the research library community that touch on technology omit important aspects. They include almost nothing about possible exploitation of commercial offerings,[35] not even arguments why the existing offerings might be unsatisfactory and what needs these do not meet. They articulate too little about their institutions’ needs for software engineers and enterprise planners to be helpful.[36]
Negative statements about commercial content management software are seldom accompanied by anything more specific than, “it seems too complicated”. Discussions of possible prototypes say little about how they will overcome the deficiencies of commercial software. Current enthusiasm for building prototype digital archives is likely to lead to requests for public funding. If so, the proposal referees should demand articulation of specific plans for solving the problems of prior software. For instance, it is all too easy to achieve simplicity by omitting scaling and reliability mechanisms that make production software relatively complicated.
Sustainability and Impact of Archiving Software
Today’s research goals for digital content preservation[37] should be methods that rapidly come into practice.[38] To be effective, projects must include credible plans for transferring their deliverables to organizations that commit to maintaining them for delivery to archives or to the public, as appropriate. For instance, a project whose output includes software tools needs to ensure that these tools survive usefully beyond the dissolution of the project team.
The nature of deliverables and of technology transfer targets will depend on the kind of research results. For instance, guidelines for classifying documents can only be effective if cataloguers follow them—they are expressions of normative human procedures. In contrast, some metadata can be automatically generated—the implementation would include computer programs. What’s appropriate for bringing software into productive use will not work for normative human procedures.
The needs of two archival institutions will usually include significant differences. These might be differences in scale, in the types of ingestion data streams, in user community expectations, in organizational training, in software applications that must interface seamlessly with storage subsystems and catalog management software, and so on. It is unlikely that a few prepackaged solutions will be satisfactory or economically viable. To be useful, software deliverables must be components that can be assembled into enterprise-sensitive systems.
Some information scientists express enthusiasm for archive prototypes and pilots. While a few demonstrated prototypes are desirable, pilot installations will not be enough. Profound differences distinguish prototype/pilot versions and “industrial strength” versions suitable for integration into pre-existing environments with sufficient scaling, error handling, user education, and support infrastructure to satisfy different institutions.[39] “Commercial Off The Shelf” (COTS) software [King] differs from demonstration versions less in its functional features than in its robustness, in its ability to scale from very small to very large data collections,[40] and in having a support infrastructure for customer service.
To convert prototype software into robust packages suitable for customers that need sustained productivity, we can distinguish at least three conventional routes. A research project can work with commercial vendors to transfer responsibility for conversion to “industrial strength” and eventual customer support. It can attempt to launch “open source” software.[41] Finally, small packages can be delivered and maintained by research groups. The first two alternatives are suitable for server machines; the last works best for client machines (PCs and workstations).[42]
To be durable, “open source” software needs a committed user group of sufficient size and stability to provide on-going problem fixing, integration, functional refinement, and consulting services. It is not easy to establish such an interest group; doing so is likely to be achieved only for widely used components.[43] However, the cultural collection community is probably not large enough for this; it would need to attract many other communities to the tools that it wants.
The Australian and Dutch “just do it” digital preservation efforts produced written requirements analyses. Discussions in the EU-NSF Workgroup[1] suggest that such requirements analyses are few and far between among public and educational sector institutions, and that the specificity that they induce would be a helpful addition to the generality communicated by most of the available literature.
Although a formal requirements analysis is usually written only when I/T services or technology are about to be purchased, one can be extremely helpful to distinguish among today’s needs, needs that can wait for several years, and generic needs that are irrelevant to that institution. A requirements statement can expedite informing the institution what is available from existing tools and offerings, what needs to be integrated into the institutional environment, and which pertinent needs require research and development. It can also inform about specific training needed for the staff.
An informal inquiry sent to the DDQ address list in early September stimulated about 40 replies that identified a half-dozen interesting examples. Some time is needed to analyze what they teach; comment before DDQ 1(4) would be premature.
Elsevier Science and KB have agreed that KB will be the first official digital archive for Elsevier journals. KB will receive digital copies of 1,500 Elsevier journals in science, technology, and medicine. Elsevier is digitizing older journal issues going back to v.1 #1. See http://www.infotoday.com/newsbreaks/nb020903-2.htm.
The Swedish government issued a decree concerning the Royal Library's work in acquiring, preserving, and making accessible the Swedish Internet. The Library had been collecting Web content since about 1996, but holding it in a dark archive because of intellectual property uncertainties. The decree authorizes access to archived content for visitors on library premises.
The multilingual GABRIEL Web site offers consistently structured information about 41 European national libraries, their printed and electronic collections, and their online catalogues. See http://www.kb.nl/gabriel/news/2002/contents/relaunch/history_public_launch20020709.html
A policy statement, entitled Preserving the Memory of the World in Perpetuity, was agreed by the International Federation of Library Associations and Institutions (IFLA) and the International Publishers Association (IPA). See http://www.ifla.org/V/press/ifla-ipa02.htm
The Conservation Information Network announced a Web site containing almost 200,000 bibliographic records on conservation. See