Digital Document Quarterly
Perspectives on Trustworthy Information
Volume 1, Number 3, 3Q2002
HMG Consulting, 20044 Glen Brae Drive, Saratoga, CA 95070, (408) 867-5454
© 2002, H.M. Gladney
In its first number, DDQ projected a 2002 emphasis on digital preservation archiving. The current number is influenced by workgroup discussions of preservation research needed[1] and by similar considerations for a (U.S.) National Academy of Sciences study requested by the National Archives and Records Administration (NARA).
These discussions suggest questions[2] that have been insufficiently addressed. In this and future numbers, DDQ will identify a few questions that deserve increased attention and suggest unconventional possibilities. It starts with:
Ø For which communities and which kinds of public documents is digital preservation most important and most urgent? (See Selection Criteria below.)
Ø In view of unavoidable difficulties with language,[3] what are the limits of saving meaning, semantics, and knowledge for future generations? (See Semantics ... below.)
Ø How can deployments be planned so that unanswered technical questions neither hamper nor delay ingesting content? (See Why So Slow ... below.)
Readers might consider their own answers to these questions. To give plausible, effective answers does not require specialized expertise, but rather good judgment.
I formerly thought that selection for preservation was a difficult challenge, but no longer believe so, at least not in the sense that it raises worthwhile but unanswered research issues. Once the technical and organizational challenges are overcome,[4] digital preservation is likely to become a routine activity with priorities set by each institution’s resource allocation process. Institutional objectives are likely to dominate selection criteria.
Today’s selection costs are exacerbated by the accelerating transformation from information scarcity to information overflow. Writing and dissemination were relatively rare and relatively slow in earlier centuries. For instance, the British Departments of State had about 50 clerks at the time of the American Revolution; these clerks wrote with quill pens; their letters to North America took 6 to 10 weeks to deliver. Compare that to the sizes of today’s bureaucracies and the tools they use to create and disseminate information. Selection is much less challenging for old documents than for modern content; de facto, for old content, we benefit from a form of selection at the source.
It is today neither possible nor desirable to save everything. Decisions will occur, either by default or with varying degrees of care and insight. For governments and ordinary folk, Titanic 2020: A Call for Action [Lysakowski] suggests disaster for office files in popular formats. A concerned reader might start by thinking about which documents he would like to see saved.
In the public sector, the visible efforts towards preservation of “born-digital stuff” are focused on cultural content, on scientific data,[5] and on records of national significance. The public discussion and literature make few allusions to the interests of smaller political units, to educational priorities other than those of research scholars, to judicial systems, to health delivery systems, or to administrative collections of interest to ordinary citizens. Is this appropriate? How might taxpayers prioritize public records for preservation?[6]
For at least two decades, some people have shared a dream of the “longitudinal patient record”—a medical history that accompanied each person from birth to grave. Since the useful lifetime of today’s digital records is much less than healthy human lifetimes, preservation technology would be needed to fulfill this dream.[7]
A personal letter from a schoolmate illustrated other needs:[8]
Speaking of the [Immigration and Naturalization Service], we are trying to see if [my son] qualifies for [U.S.] citizenship on the basis of the fact that I did the border shuffle [between Canada and the U.S.] for most of my natural life. Now it is a question of proving I exist, it seems.
[I am] trying to unearth papers to prove to the lawyers that I actually spent about half my time on either side of the border from birth until I married! Did you know that anyone who attended HS still in the 1950s is clearly so far back in the Dark Ages as to be almost a non-person? Welcome to the real world. The schools in [city] IL, where I attended the first 3 grades, tell me they have no records of any students born between 1931 and 1942; so much for that!
The school board in [location] says [XYZ] High School no longer exists. "If we had records, they would have been forwarded to the HS you went to." [The Canadian city] HS fortunately had registered me as having come in from [XYZ] HS, but kept no transcripts …. And we haven't even been bombed or anything. No wonder half of people who lose their papers die of despair. Bureaucracy is immovable! Yet, a front page story about a restaurant on my block starts out with how the executive chef came to this country as an illegal immigrant from Mexico! Any time I have had dealings with the INS they have been expensive and exceedingly unpleasant. So what else is new?
In 1991, an IBM Research group and a department of the California Department of Transportation (DOT) considered a digital library pilot for the construction and inspection records of thousands of bridges. During a proposal work-up, they visited the records room of a DOT regional office. It was in a clay-floored basement, with 30-year-old drawings, handwritten notes, and typescripts stored in cardboard boxes. A sprinkler system had been installed for fire protection. Imagine working with soggy, partially burned records! Are critical state records protected today by the digital equivalent of cardboard boxes and sprinkler systems?
Can we devise a way to establish what the public’s priorities would be and to adjust expenditures to represent these better than might today be the case? Or is doing so politically infeasible?
Between 1988 and 1994, scholarly advisory committees considered preservation content selection for History, Renaissance Studies, Philosophy, Mediaeval Studies, Modern Language and Literature, and Art History. [George] While this work was mostly done before digital capture was a practical option, similar considerations seem applicable today.
“One theme is the understandable reluctance of scholars to make choices because of the unpredictability of research needs. Scholars are loath to say, 'this book will be more useful for future research than that one,' because the history of their fields shows that writers and subjects that seem inconsequential to scholars in one era may become of great interest in the next, and vice versa. Moreover, discovery and serendipity may lead to lines of inquiry unforeseen. ...” [George]
This suggests a somewhat discouraging prospect for scholarly needs. Students' wants are easier to satisfy, as the secondary school student or college undergraduate assigned a term paper will choose the first pertinent and interesting material that (s)he encounters. Digitization can provide more interesting material than has been commonly available. It is becoming realistic for teachers to require students to find and work from original sources rather than from secondary opinions and other people's selections.
“Knowledge management” seems more prominent today than a few years ago. Under various names, this topic has been considered by the artificial intelligence community for at least three decades, and by librarians grappling with information discovery and library catalogues.[9]
Some people regard knowledge management to be a key component of digital preservation research. Doing so may be appropriate to a limited extent, but supporting pronouncements often blur the boundary with all the rest of scholarship, inevitably diminishing focus on the special challenges of preservation.
What can we preserve of meaning for future generations?
Preservation schema[10] should enable representing whatever authors want to express, within the limits of what language can express. This includes representing any relationships whatsoever among documents. However, a goal demanded by some preservation researchers—completeness of collections—is infeasible because completeness is a value judgment that cannot be expressed objectively.[11] Furthermore, only trivial collections have no references to works outside their own contents.
Are existing methods of representing meaning and linking digital documents sufficient for expressing anything that can be objectively expressed? Some future DDQ number will consider RDF [Beckett], and perhaps also languages invented for artificial intelligence research. However, to start that inquiry, we must remind ourselves of the meaning of “meaning”.
The Meaning of “Meaning”: Ontologies
The word ontology is used more now than it was ten years ago, and in a different sense than in philosophy. The reader can easily find many Web pages about what’s intended. For instance,
[For] knowledge sharing, … ontology [means] specification of a conceptualization … description (like a formal specification of a program) of the concepts and relationships that can exist for … a community of agents. This definition is consistent with … set-of-concept-definitions, but more general. …
What is important is what an ontology is for. … An ontology is … used for making … commitments. … [W]e choose to write an ontology as a set of definitions of formal vocabulary. Although this isn't the only way to specify a conceptualization, it has some nice properties for knowledge sharing … (e.g., semantics independent of reader and context). An ontological commitment is an agreement to use a vocabulary (i.e., ask queries and make assertions) in a way that is [internally] consistent …. We build agents that commit to ontologies. We design ontologies [to] share knowledge with and among these agents.
Tom Gruber at http://www-ksl.stanford.edu/kst/what-is-an-ontology.html
DDQ uses ontology approximately as Gruber suggests. It is a synonym for reference model as used in [OAIS], and is related to librarians’ subject classification, ACM’s Categories and Subject Descriptors, and recent discussions of The Semantic Web.[12]
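Gruber's idea of an ontology as "a set of definitions of formal vocabulary" to which agents commit can be made concrete with a toy sketch. Everything named below (the `Ontology` and `Agent` classes, the terms "Document" and "Collection", the relation "member_of") is a hypothetical illustration rather than anything taken from [OAIS] or from Gruber's work; the only point is that a committed agent makes assertions solely in the shared vocabulary.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    """A shared vocabulary: defined terms plus the relations allowed among them."""
    terms: dict[str, str] = field(default_factory=dict)   # term -> definition
    relations: set[str] = field(default_factory=set)      # permitted relation names

    def define(self, term: str, definition: str) -> None:
        self.terms[term] = definition

    def allows(self, subject: str, relation: str, obj: str) -> bool:
        """True only if every part of the assertion is in the vocabulary."""
        return (subject in self.terms
                and obj in self.terms
                and relation in self.relations)


class Agent:
    """An agent that commits to an ontology by refusing off-vocabulary assertions."""

    def __init__(self, ontology: Ontology) -> None:
        self.ontology = ontology
        self.assertions: list[tuple[str, str, str]] = []

    def assert_fact(self, subject: str, relation: str, obj: str) -> bool:
        if not self.ontology.allows(subject, relation, obj):
            return False          # rejected: not expressible in the shared vocabulary
        self.assertions.append((subject, relation, obj))
        return True


# Hypothetical vocabulary, for illustration only.
library_terms = Ontology()
library_terms.define("Document", "a fixed expression of content")
library_terms.define("Collection", "a set of documents managed together")
library_terms.relations.add("member_of")

cataloguer = Agent(library_terms)
accepted = cataloguer.assert_fact("Document", "member_of", "Collection")
rejected = cataloguer.assert_fact("Document", "cites", "Collection")
```

Note what the sketch leaves out: it says nothing about how an agent stores or processes its assertions. That is consistent with an ontology being a vocabulary rather than a design.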
The introduction to DDQ held that, “Computers manipulate symbols that are surrogates for what they mean. A computer model is good if its pattern follows the pattern of what it stands for. … language consists of symbols taking meaning from how they are used.” We use reference models to identify what information we are conveying or services we are providing.[13]
My attention was drawn back to the Open Archival Information Systems Reference Model (OAIS) by a referee of a commissioned survey about business enterprises’ archiving. He castigated me for not discussing OAIS, supposing that I did not know of it. He might have been even more scathing had he known that I had noticed that business authors never referred to OAIS, and saw no reason to judge this inappropriate.
The incident stimulated a fresh reading of [OAIS]. The more I considered how OAIS was used in digital preservation articles, the more it puzzled me. I could not firmly determine whether the authors, who mostly were Research Libraries Group (RLG) affiliates, viewed OAIS as an ontology or were planning to use it as an architecture. Ontology and architecture each have a role in technical and procedural designs. [OAIS §1.1] clearly articulates an intention to provide an ontology. However, its diagrams illustrating relationships can mislead people into reading it as an architecture. It does cross the line in:
Though the OAIS reference model does not focus on these emerging techniques, it should provide architectural basis for the prototyping and comparison of these techniques. [OAIS, p.2-4]
Excerpts from unpublished draft recommendations[14] illustrate the confusion.
“The Open Archival Information Systems (OAIS) Reference Model, a draft ISO standard developed by the space data community with leadership from NASA and the European Space Agency, is gaining rapid acceptance as a framework for the basic technical architecture for digital repositories.”
“The OAIS functional model defines the core functions of a repository as administration, ingest (more commonly known as accession), archival storage, data management, access, and preservation planning. Its information model defines various types of information packages, which if implemented together with the functional model, provide a means for separating long-term storage of bits or data streams from the management of data and collections. This model has been used to build prototype persistent archives that can preserve data independent of any particular hardware and software configuration.”
The confusion led to A Bigger Problem Called “OAIS” in DDQ 1(1); this squib stimulated debates that included an OAIS author, but that did not settle the issue.
Consider the following fragment of a reference model for residences, a fragment cribbed with obvious modifications from the beginning of OAIS Ingest [OAIS §4.1.1.2].
A residence may contain one or more areas called Kitchens. The functions of a Kitchen are illustrated in:

Figure 1: Kitchen function (compare Ingest function [OAIS Figure 4-2])
The Receive Groceries function provides the appropriate storage capability and entrance to receive a shipment from the Grocer. Groceries may be delivered by truck, or fetched by the cook, into temporary storage convenient for unpacking. The Receive Groceries function will represent a legal transfer of ownership of the groceries, and may require that special controls be placed on the shipments. This function provides a receipt to the Grocer, which may include a request to send missing items.
The Quality Assurance function validates correct receipt in the unpacking area. This might include tasting a sample of each item, and the use of a log to record and identify any shortfalls.
The Prepare Meal function transforms one or more packages into one or more dishes that conform to culinary and health standards. This may involve boiling, frying, baking, or blending of contents of grocery shipments. The Prepare Meal function may issue recipe requests to a cookbook to obtain descriptions needed to produce the menu. This function sends sample dishes for approval to a critic, and receives back an appraisal.
The Generate Menu function extracts …
This fragment suggests how we might map the OAIS reference model onto our residence model. Each OAIS function would correspond to a residence area.
How much does this reference model help towards building a residence? It provides builders and eventual residents a shared vocabulary. However, each builder also needs instructions about what kind of residence to construct: a single family detached home, an apartment building, a military barracks, a college residence, or a prison? Just as our reference model says what it means to be a place to live (a residence), [OAIS] articulates what it means to be a place to hold information (a library or archive). What’s missing is an architecture.
Instructions for a builder should include dimensions, location, and many other factors. Such detail would not appear in our reference model, just as [OAIS] does not distinguish among a research library, a state government archive, a corporate archive, or a personal collection. Missing in each case is high level design that differentiates among structural alternatives, quantifies spaces, resources, and flows, describes materials and surface finishes, specifies utilities and safety factors, and so on.
How much qualitative and quantitative detail must an architecture express? The customer decides. He will often accept conventional levels and styles of description, but will also have his own ideas and emotions about what is important. The architecture would describe every aspect on which the customer insists. It would be an essential part of a prudent construction contract.
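The distinction can be sketched in code. In the hypothetical example below, the reference model is nothing more than the shared vocabulary of function names from the kitchen fragment, while an architecture commits to concrete choices and quantities. All class names, field names, and figures are invented for illustration. Two very different residences conform equally well to the same reference model, which is why the reference model alone cannot tell a builder what to build.

```python
from dataclasses import dataclass

# Reference model: only the shared vocabulary of functions
# (names taken from the kitchen fragment above).
KITCHEN_FUNCTIONS = frozenset(
    {"Receive Groceries", "Quality Assurance", "Prepare Meal", "Generate Menu"}
)


@dataclass
class KitchenArchitecture:
    """An architecture: concrete, quantified choices (all values hypothetical)."""
    residence_kind: str
    floor_area_m2: float
    meals_per_day: int
    equipment: dict[str, str]   # function name -> what realizes it

    def conforms_to(self, functions: frozenset[str]) -> bool:
        """True if every function named by the reference model is realized."""
        return functions.issubset(self.equipment)


family_home = KitchenArchitecture(
    residence_kind="single family detached home",
    floor_area_m2=12.0,
    meals_per_day=3,
    equipment={
        "Receive Groceries": "back door and pantry shelf",
        "Quality Assurance": "the cook's judgment",
        "Prepare Meal": "four-burner stove",
        "Generate Menu": "note on the refrigerator",
    },
)

prison = KitchenArchitecture(
    residence_kind="prison",
    floor_area_m2=400.0,
    meals_per_day=3000,
    equipment={
        "Receive Groceries": "secured loading dock",
        "Quality Assurance": "logged inspection station",
        "Prepare Meal": "steam kettles and conveyor ovens",
        "Generate Menu": "weekly printed schedule",
    },
)

both_conform = (family_home.conforms_to(KITCHEN_FUNCTIONS)
                and prison.conforms_to(KITCHEN_FUNCTIONS))
```

Both instances pass the conformance check, yet nothing in `KITCHEN_FUNCTIONS` would help a customer choose between them or write a construction contract.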
[OAIS] was first published in 1999, and has been only slightly refined since then. It is disturbing that we can find few published architectures for digital archives, and that those we find receive little attention from the preservation community.[15]
Articles talk about OAIS. Librarians express concern about rapidly disappearing digital content. Nevertheless, in three years, nobody seems to have taken the next steps prescribed by conventional engineering practice. The Tractatus Logico-Philosophicus anticipates an apt summary for [OAIS] as a context for preservation archiving.
“… the truth of the thoughts that are here communicated seems to me unassailable and definitive. I therefore believe myself to have found, on all essential points, the final solution of the problems. And if I am not mistaken in this belief, then the second thing in which the value of this work consists is that it shows how little is achieved when these problems are solved.” Ludwig Wittgenstein, TLP Introduction
Consensus as an Impediment to Progress
The OAIS-related problems raised by DDQ have to do with how it is misunderstood, rather than with the content of the ISO proposal. Acceptance of OAIS by the research library community is insufficiently forward-looking. Since OAIS first became popular about 3 years ago, little technical progress has occurred in the areas which it articulates.
Rapid progress is impeded by the library community’s emphasis on consensus, when consensus is used to squelch productive conversation, particularly between librarians and engineers.[16] The research library community seems to value consensus so highly as to blind it to other values, such as the productivity of professional debate. [Marcum]
This criticism does not imply that consensus is unimportant. Consensus is, in fact, critical when action would be ineffective or inefficient without agreement about carefully selected aspects. These often exist in service delivery situations, such as common behavior in many libraries from which readers benefit. It is extremely valuable that any of us can, without special training or assistance, use the card catalogue of almost any research library in the world. Similarly, in the emerging digital world, we need to agree on information interchange standards. However, what works well for service delivery can be an immense nuisance when it is not yet decided what service should be delivered and how best to deliver it.
Behavior that’s good for running today’s libraries can be an impediment to inquiring how to run future libraries. Research thrives on debate that includes candid criticism of proposed solutions.[17] Researchers see little point in writing or talking about what they agree with.[18] We make little progress by discussing favorably what's written. To see what we should do next or what research problems might exist, we must pay attention to what is not yet satisfactorily addressed.
I’ve encountered misuse of consensus in discussions of OAIS. Several times in the last year, the response to a technical objection or question did not address the point raised, but rather was along the lines of “there is community consensus for OAIS,” and it seemed that the respondent was surprised that this did not satisfy me.
That OAIS has the approval of many research librarians is unsurprising. After all, [OAIS] is an ontology for how librarians refer to what they already do, extended to digitally conveyed information. For excellent reasons, it is conservative—laying language groundwork for minimum disruption in library processes by the addition of digital media to physical information carriers. To the extent that professional librarians feel OAIS accomplishes minimum change, they will approve.
Furthermore, librarians are encouraged by OAIS consensus among their peers, because their community values consensus so highly. However, emphasizing consensus rather than encouraging debate sets the stage for insufficient attention to research opportunities and potential economies.[19]
Concern about OAIS-focus would be silly without attractive alternatives. Trustworthy, Durable Digital Documents (TDDD) in DDQ 1(2) suggests another way of looking at the preservation challenge; there may be further alternatives. However, with few services committed to any particular approach to digital preservation yet, we should consider questions before jumping to proposed solutions. Collectively, the right questions are likely to suggest a much less expensive solution than seems inherent in today’s dominant focus.
[OAIS] asks, "What reference model for a research library (people, resources, processes, ...) is appropriate for archiving?" TDDD asks, "What characteristics will make document representations useful into the indefinite future?"
Such different questions are unlikely to lead to the same answer. The TDDD question suggests a simpler solution that is compatible with likely solutions to the OAIS question.
Some future DDQ number will include more about TDDD. For the time being, however, it seems best to consider direct benefits of focusing on digital document structures. The common feature of the following points is that all specifications are in terms of input/output or boundary conditions—it’s a “black box” approach that says next to nothing about the internal workings of an archive.[20]
Ø Archive customers (information producers and consumers) see and care only about document characteristics[21] and how to find helpful documents. They are not interested in how remote machines accomplish what they ask for, only in how well the machines do so.[22]
Ø To specify static document properties and external properties of services (such as performance) is much easier than specifying how computers and service organizations should accomplish what they promise. Furthermore, we can easily test document properties and search performance. We would find it much more difficult to test the internal workings of an archive.
Ø Archive managers gain immense flexibility and freedom of action by committing only to input/output behavior. This flexibility will be immediately useful in acquiring or creating the technical components that accommodate idiosyncratic characteristics of each archiving institution.[23] It also clarifies what can be allowed to change with technology evolution.[24]
Ø Promising users and stakeholders only input/output specifications much reduces managers’ need to let outsiders meddle with the workings of archives. In the near future, we will show that external audits to establish trust can be much simplified from what [RLG2] suggests.
Ø It is, a priori, not obvious that archives in the style of current institutions will be either the most responsive or the most economical for born-digital content.[25] Requiring conformance only to “black box” properties[2016] will facilitate marketplace exploration and experimentation.
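The “black box” points above can be illustrated with a small sketch: a test that touches an archive only through its input/output boundary and checks one static document property, namely that the bits returned equal the bits deposited. The `Archive` interface, the `InMemoryArchive` stand-in, and the function names are assumptions made for this illustration; they do not describe any existing archive or standard.

```python
import hashlib
from typing import Protocol


class Archive(Protocol):
    """The only commitment: what goes in and what comes out (hypothetical interface)."""
    def ingest(self, content: bytes) -> str: ...
    def retrieve(self, doc_id: str) -> bytes: ...


class InMemoryArchive:
    """Stand-in implementation; a real archive could work in any way internally."""

    def __init__(self) -> None:
        self._store: dict[str, bytes] = {}

    def ingest(self, content: bytes) -> str:
        doc_id = hashlib.sha256(content).hexdigest()
        self._store[doc_id] = content
        return doc_id

    def retrieve(self, doc_id: str) -> bytes:
        return self._store[doc_id]


def round_trip_preserves_bits(archive: Archive, content: bytes) -> bool:
    """Black-box check: compare digests of deposited and retrieved content."""
    expected = hashlib.sha256(content).hexdigest()
    doc_id = archive.ingest(content)
    returned = archive.retrieve(doc_id)
    return hashlib.sha256(returned).hexdigest() == expected


ok = round_trip_preserves_bits(InMemoryArchive(), b"Minutes of the 1991 bridge review")
```

The test never inspects `_store`, so the archive manager remains free to replace storage media, software, or procedures without invalidating it.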
Since the challenges were articulated in 1996 [Garrett], many conferences have been held and many papers have been written on the topic. They include reminders of urgency, because irreplaceable and valuable digital content is allegedly disappearing.
Why is visible progress towards deployed digital preservation so slow in the United States? A deployment exists in the National Library of Australia. The Koninklijke Bibliotheek (KB) is making steady progress. Almost two years have elapsed since the Congress granted funding for a national digital preservation program managed by the Library of Congress. However, little seems to be happening,[26] except perhaps a higher pace of meetings[27] and broadly based studies that do not include the nitty-gritty of practical engineering.[28]
Is the problem in fact urgent? Is progress in fact slow? An eminent librarian once pointed out that “urgent” and “slow” have different meanings within the Washington beltway than they do to denizens of Silicon Valley.
Is it that the responsible managers believe that prompt action would risk massive wasted effort because unsolved technical problems exist for some kinds of data? If so, they should tell us specifically what these risks are and which data classes are affected. Alternatively, if non-technical risks are the effective impediments,[29] they should be specifically articulated for consideration by the best minds available.
For NARA, perhaps more than for any other American institution, digital preservation is mission-critical.[30] The sole big risk that haste might create is that a preservation format used today for each member of large collections might later prove to have a flaw needing human intervention for each affected document.[31] However, large classes of government records are almost surely represented by text files and are sufficiently uniform that metadata could be semi-automatically generated. This is true for most e-mail.[32] [Reference to Carlin and lawsuit.] Existing digital standards are surely sufficient to ensure their perpetual interpretability, and conventional computing center security measures can control authenticity as well as is done for records on paper. But action towards up-to-date technology is delayed pending research results.[33]
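As a rough indication of how little machinery semi-automatic metadata generation for e-mail might need, the sketch below reads the standard headers of a plain-text message with Python's standard `email` package. The sample message, the addresses, and the choice of metadata field names (`creator`, `title`, `date`, `identifier`) are hypothetical; a real archive would choose its own schema and would still need human review for anything the headers do not reveal.

```python
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime

# Hypothetical sample message; addresses use a reserved example domain.
RAW_MESSAGE = """\
From: Jane Clerk <jane.clerk@example.org>
To: records@example.org
Subject: Bridge inspection schedule
Date: Tue, 03 Sep 2002 10:15:00 -0700
Message-ID: <abc123@example.org>

The inspection is scheduled for next week.
"""


def extract_metadata(raw: str) -> dict[str, str]:
    """Derive basic descriptive metadata from the message headers alone."""
    msg = message_from_string(raw)
    sender_name, sender_addr = parseaddr(msg["From"] or "")
    date_header = msg["Date"]
    sent = parsedate_to_datetime(date_header).isoformat() if date_header else ""
    return {
        "creator": sender_addr,
        "creator_name": sender_name,
        "title": msg["Subject"] or "",
        "date": sent,
        "identifier": (msg["Message-ID"] or "").strip("<>"),
    }


metadata = extract_metadata(RAW_MESSAGE)
```

Because the headers are uniform across messages, the same function could be applied to a whole collection, leaving only exceptional items for manual cataloguing.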
Phased Deployment of Digital Preservation
Public discussions of digital preservation suggest that the big American institutions are focused on research, and are paying little attention to technology deployment issues that arise later in practice. Here, “later” means “logically later”, which need not imply “chronologically later”.
If disappearing data truly makes deployment urgent, creating an archive need not be a purely sequential process, but can be managed with concurrent activities. Solution integration and installation is likely to take one to three years even if all research problems were solved. The needed work would be mostly unaffected by different answers to the open questions, partly because some kinds of data encounter no open questions. The issue is how to decide what can be expedited. Careful, specific requirements analyses would help decisions.
Some things can be done quickly without much risk. Other things need careful thought.
A “just do it” approach[34] was adopted in digital preservation projects in Australia [Waugh] [Webb] and in the Netherlands [van der Werf]. Their much-admired projects seem to have avoided any risks of future costly rework.
Commercial Off-the-Shelf Software
Writings from the research library community that touch on technology omit important aspects. They include almost nothing about possible exploitation of commercial offerings,[35] not even arguments why the existing offerings might be unsatisfactory and what needs these do not meet. They articulate too little about their institutions’ needs for software engineers and enterprise planners to be helpful.[36]
Negative statements about commercial content management software are seldom accompanied by anything more specific than, “it seems too complicated”. Discussions of possible prototypes say little about how they will overcome the deficiencies of commercial software. Current enthusiasm for building prototype digital archives is likely to lead to requests for public funding. If so, the proposal referees should demand articulation of specific plans for solving the problems of prior software. For instance, it is all too easy to achieve simplicity by omitting scaling and reliability mechanisms that make production software relatively complicated.
Sustainability and Impact of Archiving Software
Today’s research goals for digital content preservation[37] should be methods that rapidly come into practice.[38] To be effective, projects must include credible plans for transferring their deliverables to organizations that commit to maintaining them for delivery to archives or to the public, as appropriate. For instance, a project whose output includes software tools needs to ensure that these tools survive usefully beyond the dissolution of the project team.
The nature of deliverables and of technology transfer targets will depend on the kind of research results. For instance, guidelines for classifying documents can only be effective if cataloguers follow them—they are expressions of normative human procedures. In contrast, some metadata can be automatically generated—the implementation would include computer programs. What’s appropriate for bringing software into productive use will not work for normative human procedures.
The needs of two archives will usually include significant differences. These might be differences in scale, in the types of ingestion data streams, in user community expectations, in organizational training, in software applications that must interface seamlessly with storage subsystems and catalog management software, and so on. It is unlikely that a few prepackaged solutions will be satisfactory or economically viable. To be useful, software deliverables must be components that can be assembled into enterprise-sensitive systems.
Some information scientists express enthusiasm for archive prototypes and pilots. While a few demonstrated prototypes are desirable, pilot installations will not be enough. Profound differences distinguish prototype/pilot versions and “industrial strength” versions suitable for integration into pre-existing environments with sufficient scaling, error handling, user education, and support infrastructure to satisfy different institutions.[39] “Commercial Off The Shelf” (COTS) software [King] differs from demonstration versions less in its functional features than in its robustness, in its ability to scale from very small to very large data collections,[40] and in having a support infrastructure for customer service.
To convert prototype software into robust packages suitable for customers that need sustained productivity, we can distinguish at least three conventional routes. A research project can work with commercial vendors to transfer responsibility for conversion to “industrial strength” and eventual customer support. It can attempt to launch “open source” software[41]. Finally, small packages can be delivered and maintained by research groups. The first two alternatives are suitable for server machines; the last works best for client machines (PCs and workstations).[42]
To be durable, “open source” software needs a committed user group of sufficient size and stability to provide on-going problem fixing, integration, functional refinement, and consulting services. It is not easy to establish such an interest group; doing so is likely to be achieved only for widely used components.[43] However, the cultural collection community is probably not large enough for this; it would need to attract many other communities to the tools that it wants.
The Australian and Dutch “just do it” digital preservation efforts produced written requirements analyses. Discussions in the EU-NSF Workgroup[1] suggest that such requirements analyses are few and far between among public and educational sector institutions, and that the specificity that they induce would be a helpful addition to the generality communicated by most of the available literature.
Although a formal requirements analysis is usually written only when I/T services or technology are about to be purchased, one can be extremely helpful to distinguish among today’s needs, needs that can wait for several years, and generic needs that are irrelevant to that institution. A requirements statement can expedite informing the institution what is available from existing tools and offerings, what needs to be integrated into the institutional environment, and which pertinent needs require research and development. It can also inform about specific training needed for the staff.
An informal inquiry sent to the DDQ address list in early September stimulated about 40 replies that identified a half-dozen interesting examples. Some time is needed to analyze what they teach; comment before DDQ 1(4) would be premature.
Elsevier Science and KB have agreed that KB will be the first official digital archive for Elsevier journals. KB will receive digital copies of 1,500 Elsevier journals in science, technology, and medicine. Elsevier is digitizing older journal issues going back to v.1 #1. See http://www.infotoday.com/newsbreaks/nb020903-2.htm.
The Swedish government issued a decree concerning the Royal Library’s work of acquiring, preserving, and making accessible the Swedish Internet. The Library had been collecting Web content since about 1996, but holding it in a dark archive because of intellectual property uncertainties. The decree authorizes access to archived content for visitors on library premises.
The GABRIEL multilingual Web site offers consistently structured information about 41 European national libraries, their printed and electronic collections, and their online catalogues. See http://www.kb.nl/gabriel/news/2002/contents/relaunch/history_public_launch20020709.html
A policy statement, entitled Preserving the Memory of the World in Perpetuity, was agreed by the International Federation of Library Associations and Institutions (IFLA) and the International Publishers Association (IPA). See http://www.ifla.org/V/press/ifla-ipa02.htm
The Conservation Information Network announced a Web site containing almost 200,000 bibliographic records on conservation. See http://www.bcin.ca.
MIT has just announced its plan for free online access to all its curriculum content. It hopes to start a world-wide university trend towards a profound impact on learning and education. The BBC report of the announcement is at http://news.bbc.co.uk/1/hi/technology/2270648.stm.
Digital preservation articles are becoming too numerous for most readers. DDQ recommends:
Ø YEA: the Yale Electronic Archive … on Digital Preservation Planning offers insights by an institution which has been considering key questions longer and more deeply than is common. See http://www.library.yale.edu/~okerson/yea/.
Ø Information Management: Challenges in Managing and Preserving Electronic Records is the U.S. General Accounting Office assessment of “the status and adequacy of NARA’s response to [preservation] challenges”. See http://www.gao.gov/cgi-bin/getrpt?GAO-02-586.
Ø The State of Digital Preservation: An International Perspective collects and refines presentations from an April 2002 meeting, giving an accurate status summary. See http://www.clir.org/pubs/abstract/pub107abst.html.
Good design is sometimes available inexpensively. You will probably appreciate the mobile laptop station depicted. Its SRP is $160. However, it is steadily available in local outlets for $40, and is often advertised at $30. Compare it to an $860 functional equivalent!

If you use your computer in the same room as a high-fidelity sound system, you may want to connect them for listening to distant Internet broadcast stations.[44]

What is it that a copyright protects? [Nimmer] analyzes a moot case about a published work whose last copy is destroyed by fire. If a work has once been published in tangible form, copyright protects the abstract pattern represented in the publication. In copyright law, what is essential about a work is a pattern inherent in its reproductive instances.
Alan Turing is the eponym for the most prestigious Computer Science award. Andrew Hodges’ authoritative biography, Alan Turing: the Enigma, is good reading, even for scientific laymen.
Can a machine think? For about a century, scholars have grappled with this question, sometimes in terms appealing to laymen and sometimes in ways comprehensible only to specialists. Justin Leiber’s An Invitation to Cognitive Science relates Turing's work in mechanical logic, Wittgenstein's philosophic development, and modern issues in the design and application of automation. It is not an easy book, but will reward its readers.
DDQ 1(3) owes much to John Bennett’s and John Swinden’s critically constructive comments. Discussions with Reagan Moore helped focus technical issues in this DDQ number, whose endnotes 5, 9, 12, and 13 are slightly amended comments from him.

Digital Preservation: Key Questions
Selection Criteria: What’s Worth Saving?
Semantics and Knowledge Management
The Meaning of “Meaning”: Ontologies
Consensus as an Impediment to Progress
Why So Slow Towards Practical Preservation?
Phased Deployment of Digital Preservation
Commercial Off-the-Shelf Software
Sustainability and Impact of Archiving Software
Koninklijke Bibliotheek (KB) Archives Elsevier Periodicals
Swedish Government Articulates Policy on Web Collecting
Renewed GAteway and BRIdge to Europe's National Libraries
International Policy Statement on Digital Preservation
Free Resources for Conservation Professionals
Recent Reports on Digital Preservation
[1] The EU and NSF have commissioned a workgroup to recommend questions for research funding. Such questions know no international borders, but European and American programs must be administratively independent.
[2] “…the first task in the effort is not to posit answers, but to frame questions and issues in such a way as to engage the many parties already working in various ways with digital information so that they can help us understand the relevant issues.” [Garrett page 7]
[3] The difficulties are intrinsic and apply equally to natural languages and digital expressions. See How Can We Use Wittgenstein’s Philosophy? in DDQ 1(2).
[4] The funding challenges are likely to continue, because more content will forever be generated than can be saved. Intellectual property rights (mainly copyright) issues include conflicting interests that will not be quickly resolved.
[5] Examples include the 2-Micron All Sky Survey (10 TB of data, 5 million images) and the NSF Digital Library (preservation of curricula modules). The projects are driven by the research communities that use the data.
[6] Among political issues that include international terrorism, global warming, hunger and illness in Africa, and world trade rivalries, it would be naïve to expect most taxpayers to know or care much about their personal risks associated with disappearing documents.
[7] Digital preservation is not the most daunting challenge to realization of lifelong health records. Patient privacy, information standards, and medical system infrastructure are more challenging.
[8] Details are blurred to protect privacy. The missing records were on paper, but we can easily imagine a similar digital scenario thirty years from now.
[9] Archivists use fonds to organize records in collections. The archivists have the challenge of preserving the semantic meaning of the terms that they use in the collections to support discovery of individual records. Some people argue that management of semantics is the archivists’ responsibility. The DDQ position is that the phrase “management of semantics” can be construed so broadly as to include all of scholarship, so that finer distinctions are essential before the underlying issue can be sensibly debated.
[10] Some future DDQ number will explain schema for the reader not already familiar with the idea.
[11] The only testable expression of what a collection should contain is a list of the members it should contain. Such a list contains no objective information about any unlisted members. What makes any other criterion for completeness subjective is illustrated by the [LW 39] discussion of “intention”.
[12] The METS standards community is looking at its
definitions of metadata, [seeking a mapping] from its representation of
compound records to the semantics used by OAIS …. A similar analysis [might be] needed to map from
the metadata attributes used to describe archival processes to the semantics
used by OAIS to manage preservation.
In both cases, the description of knowledge (relationships between semantic terms) is not presently described in OAIS. There is an opportunity to augment the OAIS description to add knowledge information packages (KIPs) to help define the context under which a fond is organized ….
[13] Information is created by applying a semantic
label to data. One can then name the
components of data sets, name features in data sets, name attributes that are
assigned to data sets.
An ontology specifies how the semantic labels are organized. We have multiple ways to construct ontologies. We can specify logical relationships between the semantic labels and create a concept map. We can specify procedural relationships between the digital entities, and create process maps or work flows. We can specify spatial relationships between the semantic labels and create atlases. The digital library community has been using ontologies to represent the logical relationships between semantic labels.
[14] This material was input to the EU-NSF WG mentioned in footnote 1. The bold marking is added by DDQ.
[15] [Cooper] does describe the novel aspects of a digital archive—that is, the aspects not inherent in commercial digital library offerings. Since this paper is constrained by the conventional style of a conference paper, it is not an architectural document.
[16] What follows is partly based on personal experiences for which candid descriptions would be indiscreet. My opinions on the topic are neither unique nor original. For instance, see [Marcum]. Eminent librarians have pointed out that the research library community does not like to move until it feels it has broad consensus.
[17] A frequent social problem is that some people do not distinguish between an attack on an idea and an attack on (the competence of) the person who expressed the idea—a so-called “ad hominem” attack. To me it seems that this confusion occurs much less frequently in the scientific and engineering professions than among (the limited number of) other professionals that I’ve encountered.
[18] In an engineering meeting, saying "We’re in violent agreement" is a signal to move on to a different topic.
[19] Understanding Computers and Cognition [Winograd] discusses this for computing service design, using the term “breakdown” in a way that can be traced back to the philosopher, Martin Heidegger.
[20] Engineers will recognize the familiar properties of “black box” specification, and will anticipate what follows. For anyone not familiar with the jargon, a black box is a piece of machinery whose properties will be discussed without “opening it up” to inspect and comment on its inner workings.
[21] Any information to be shared, including programs, can be packaged as a set of documents. This is partly why XML is taking over most information exchange, including even commands from one computer to another.
[22] When you use an automobile, you need to understand and control only what happens at the machine boundaries. It took many years for the automobile industry to hide the internal workings from drivers.
[23] As written, this says nothing either for or against sharing ideas, methods, technical components, or personnel training among institutions. In fact, it will allow institutions to choose which elements they adopt from each other.
[24] [OAIS §1.3] says that “establishing minimum requirements for an OAIS archive along with a set of archival concepts, will provide a common framework from which to view archival challenges.” Any formulation of “minimal requirements” can be mapped into input/output specifications plus internal needs of the archiving institution.
[25] This echoes an opinion expressed by Donald Waters, among others, in [CLIR 107], viz., “One of the surprising findings that the Mellon Foundation has made in monitoring these projects is that new organizations are likely going to be necessary to act in the broad interest of the scholarly community and to mediate the interests of libraries and publishers.”
[26] It could be the case that the real activity is hidden from the public (and this author) in quiet councils.
[27] For instance, see the [CLIR 107] report by Laura Campbell, Assoc. Librarian at the Library of Congress (LoC). Five years ago, LoC was much criticized for inadequate relationships with other institutions, and is therefore to be commended for activities that this report illustrates. However, talking and acting could be concurrent. Ms. Campbell writes, “Working with representatives of the Global Business Network, we created an agenda to bring before yet another group of industry experts. We talked about a timeframe of roughly 10 years. (It did not seem useful to go farther because we are struggling with what even the next three to five years might look like.) … Our planning with industry representatives has created a sense of urgency. We call it the "just do it" approach; its aim is to start collecting things before they are lost forever.”
[28] Titia van der Werf, KB, writes, “The national approaches that are now being started … can distract institutions from just going ahead and acting. Moreover, some organizations, national archives as well as national libraries, seem to be stuck in the requirements-specification stage and find it difficult to move forward to implementation, perhaps out of fear of making mistakes.” [CLIR 107]
[29] [Thibodeau 2] addresses only technical issues.
[30] See http://archiveseleanor.nara.gov/publications/prologue/spring_2001_archives_of_the_future.html, Building NARA's "Archives of the Future". NARA “has decided to build … an Electronic Records Archives (ERA) to preserve … digital government [records]. As a result, this pressing need to find a permanent method for long-term preservation of large quantities of electronic records is one of NARA's top priorities … .”
[31] Corrections that could be implemented with conversion programs would not pose a serious problem, because streaming even immense collections through a digital filter would be inexpensive on a per-document basis.
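A minimal sketch of such a streaming filter follows. The function names `convert_document` and `stream_convert`, and the sample correction (normalizing line endings), are hypothetical illustrations and not part of any archive product. Each document passes through the conversion exactly once, so the total cost grows only linearly with the size of the collection.

```python
from typing import Callable, Iterable, Iterator


def convert_document(text: str) -> str:
    # Hypothetical correction, chosen only for illustration:
    # normalize Windows and old Mac line endings to a single newline.
    return text.replace("\r\n", "\n").replace("\r", "\n")


def stream_convert(
    documents: Iterable[str],
    convert: Callable[[str], str] = convert_document,
) -> Iterator[str]:
    # Lazily apply the conversion to one document at a time, so the
    # whole collection never has to be held in memory.
    for doc in documents:
        yield convert(doc)


if __name__ == "__main__":
    collection = ["line one\r\nline two", "already clean\n"]
    for converted in stream_convert(collection):
        print(repr(converted))
```

Because `stream_convert` is a generator, the same loop would work whether the documents come from a short list or from a reader that walks an immense repository.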
[32] An article on State of Texas e-mail archiving appeared as this DDQ number was almost ready for release. See http://www.dlib.org/dlib/september02/galloway/09galloway.html.
[33] “The timetable for the various stages of the research, design, and construction of the ERA is fluid, because NARA is still doing research and development and planning, says Kenneth Thibodeau, the ERA program director at NARA. An initial system, he says, would not be able to handle the full load of records, nor would it have all the features NARA envisions for ERA, but it would be the basic system that could be expanded and improved.
“’For now, the ERA is still in the research stage. Various subsystems of the ERA are being tested along the way, ‘showing us new possibilities we hadn't thought of yet,’ Thibodeau says.” Loc. cit. footnote 30.
This should not be read to imply that NARA
has no digital collections. It has,
in fact, been collecting digital information for about 30 years. For portals, see the “Center for Electronic
Records Materials” paragraph at http://www.archives.gov/research_room/media_formats/electronic_records.html
[34] In thinking about how national models for digital archiving may develop, … return to the principles of responsibility … and the impact they have had in Australia.
§ “Everyone doesn’t have to do everything.”
§ “We don’t have to do everything at once.”
§ “Responsibility can be time constrained.” … our roles may be time constrained and effective exit strategies and succession plans are essential.
These principles … are only valid in the context of some other related principles:
§ “We may not all have to do everything, but someone has to do something.”
§ “Someone must be willing to take a lead on almost all steps.”
§ “In the last resort, someone must be willing to take responsibility for everything, even if it is only responsibility for a final decision that some information will be lost.” [Webb]
[35] Papers about the KB project and the Yale Electronic Archive are exceptions. The KB archive management software is an adaptation of IBM Content Manager that proves to be OAIS-compliant. [van der Werf]
[36] Arguably, this level of detail is mostly inappropriate for publications. However, certain kinds of detail do appear, such as metadata schema. Furthermore, inquiry suggests that the supporting institutions do not have the detail.
[37] This subsection is derived from material that I drafted for the EU-NSF Workgroup on Digital Preservation.
[38] This is in contrast to the goals for many
other research initiatives, which are intended to augment knowledge. To the limited extent that preservation is
for the research community, its role is as a practical tool, not per se.
[39] To make software affordable for any single customer, it must satisfy widely diverse customers. For instance, a toolkit designed and packaged for large research libraries will not be appropriate for small colleges without significant alterations, and is likely to cost too much for research libraries unless it attracts a large customer base.
Software vendors typically improve reliability, performance, and platform compatibility for 3 to 10 years after first versions are delivered to customers. This process is rarely suitable for anything but a commercial operation because it requires a customer service organization that accepts, analyzes, and responds to problems and requests for product adaptation. Providers typically have orderly procedures for deciding which improvements will please customers, phasing these into so-called software releases, performing extensive tests to be sure that the changes do not interfere with prior functionality, and teaching customers how to exploit the improvements.
[40] Product source code is typically 4 to 10 times larger than that of the prototype it is derived from. Much of the extra code handles exceptions—consequences of end user inputs not conforming to defined constraints.
Rarely anticipated during early usage are so-called “race conditions” and “deadlocks” that occur when many concurrent users create many execution threads competing for data. Absent a product support organization, by the time such problems emerge years after the offering was first deployed, no-one knows how to overcome them.
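For readers unfamiliar with the term, a race condition can be sketched in a few lines. The `Counter` class and the `run` helper below are hypothetical names invented for this illustration; they assume nothing about any real archiving product. The unsafe method may lose updates when threads interleave, while the locked method cannot.

```python
import threading


class Counter:
    """Shared counter with an unprotected and a lock-protected increment."""

    def __init__(self) -> None:
        self.value = 0
        self._lock = threading.Lock()

    def increment_unsafe(self) -> None:
        # Read and write are separate steps; another thread may run
        # between them, in which case one of the updates is lost.
        current = self.value
        self.value = current + 1

    def increment_safe(self) -> None:
        # The lock lets only one thread at a time read and write.
        with self._lock:
            self.value += 1


def run(method_name: str, threads: int = 8, per_thread: int = 10_000) -> int:
    # Start several threads that all increment the same counter.
    counter = Counter()
    increment = getattr(counter, method_name)

    def work() -> None:
        for _ in range(per_thread):
            increment()

    workers = [threading.Thread(target=work) for _ in range(threads)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return counter.value


if __name__ == "__main__":
    print("safe:  ", run("increment_safe"))
    print("unsafe:", run("increment_unsafe"))
```

The safe run always totals threads times per_thread. The unsafe run may or may not fall short on any given execution, which is exactly why such defects tend to surface only under heavy concurrent use.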
[41] See Matthew Broersma, Governments Need Open Source, ZDNet News, http://zdnet.com.com/2100-1104-955282.html, August 2002. "A new study has recommended that governments require the use of open-source software, fanning the flames of the increasingly heated debate over the place of open-source in public policy."
[42] Open-source tools for workstations are particularly popular. We can expect many to emerge for editing and rendering complex document formats.
An illustrative instance is the Versioning Machine being developed by Jose Chua, Amit Kumar, and Susan Schreibman at the Maryland Institute for Technology in the Humanities (MITH). It is intended to offer ways of seeing all the various forms of a text and, with these, the text beneath, within, or above these various forms. See http://mith2.umd.edu/products/ver-mach/index.html.
[43] An example of successful open-source software is the Apache web server package.
[44] See Bruce Fries, The MP3 and Internet Audio Handbook, ISBN# 1-928791-10-7. Chapter 10, Connecting Your PC to Your Stereo, is available at http://www.teamcombooks.com/mp3handbook/10.htm.