Digital Document Quarterly
Perspectives on Trustworthy Information
Volume 2, Number 3, 3Q2003
HMG Consulting, 20044 Glen Brae Drive, Saratoga, CA 95070
© 2003, H.M. Gladney
The cultural heritage community has characterized digital preservation as “urgent”, without identifying what technology is missing. We had hoped the Plan for the National Digital Information Infrastructure and Preservation Program (referred to as “The Plan” below) would fill this gap. Unfortunately, it does not.
“Digital technology … has spawned a surfeit of information that is extremely fragile, inherently impermanent, and difficult to assess for long-term value. … it is increasingly difficult for libraries to identify what is of value, to acquire it, and to ensure its longevity over time.
“Never has access to information that is authentic, reliable, and complete been more important, and never has the capacity of … heritage institutions to guarantee that access been in greater jeopardy. Recognizing the value that the preservation of past knowledge has played …, the U.S. Congress seeks … solutions to the challenges [of] … preserving digital information of cultural and social significance.” The Plan page 1
To a software engineer intending to contribute to digital preservation, the February 2003 Plan for the National Digital Information Infrastructure and Preservation Program is perplexing. Its generalities and ambiguities make it difficult to extract what engineers expect to find in a plan.[1]
We expect a plan to articulate concisely each objective, the resources needed to meet it, commitments to specific actions, a schedule for each technology or service delivery, and a prescription for measuring outcomes and quality.[2] If the plan is for a large project, we expect it to be expressed in portions that separate teams can address relatively independently, and we expect a plan document to exist for each team.[3] We further expect concise descriptions of the environment: business and social circumstances that the participants cannot substantially change. If an environmental factor is adverse, we expect the plan to indicate how the team will bypass or mitigate its effects. If the resources currently available are inadequate, we expect the plan to identify each shortfall. Finally, if the team has already worked on the topic, we expect its plan to list its prior achievements.
Engineers want questions that can be answered objectively by testable facts. They expect documents clear enough so that every participant and every qualified observer can understand what is committed and what work is not authorized, and can judge whether committed progress is being achieved.
However, The Plan identifies few technical specifics, no target dates, and few objective success measures. This is troubling for an initiative launched almost three years ago.
“… these problems are urgent; … action is needed now, not some time in the future.” The Plan page 3
“… the best strategy is to get into the learning loop as quickly … as possible. While it is impossible to know now what approach will be best, it is very realistic to make step-wise … progress …” The Plan Appendix page 230
The research library community has long voiced such expressions of urgency. We would like to know:
· What specifically is meant by “urgent”? What unrecoverable damage is being caused by the seemingly slow pace? [4]
· What technical needs does The Plan express that were unknown in 1996? [5]
· How can candidate content be graded from ‘can be handled well today’ to ‘not yet tractable’?
· How soon will the LoC “develop the components of the preservation architecture” (The Plan page 7)?
“To begin building the [National Digital] preservation infrastructure, the Library proposes a strategy for working on developing a network of participants and building the technical framework.” The Plan page 5
“This document reports … what has been learned from a variety of activities [and] proposes … actions … to begin practical applications and modeling … implementation of NDIIPP.” The Plan page 11
The Plan promises that “current information on the program’s status” will be posted on its Web page. However, the NDIIPP team has provided little information since The Plan was published in February 2003. Readers would be interested in answers to the following questions:
· What has LoC accomplished towards meeting the 2000 National Academies’ recommendations for improving its digital skills, resources, and internal infrastructure? [6]
· How will LoC respond to U.S. GAO guidelines for preserving digital records? [7]
· A National Academies committee has recommended actions for NARA’s Electronic Records Project.[8] What parts of this recommendation apply in principle, and how does LoC plan to address them?
“The vision of NDIIPP is to ensure the access over time to a rich body of digital content through the establishment of a national network of committed partners, collaborating in a digital preservation architecture with defined roles and responsibilities.” The Plan page 5
“The digital preservation infrastructure will be characterized by a complex network of relationships and dependencies … unknown in the world of print and analog … resources.” [9] The Plan page 17
“… there is a strong need to [clarify] roles, [to] offer flexibility, and to provide focusing … for institutions to [choose] if and how … to participate in … digital preservation.” The Plan Appendix page 230
The Plan repeatedly emphasizes consensus. The library community has excellent formal and informal collaborative structure—as strong as that of any professional community. We see pervasive tension between achieving consensus and helping each institution address its own priorities.
· Much consultation and consensus building occurs informally and through organizations such as ALA, DLF, IFLA, OCLC, RLG, SLA, and Internet and WWW standardization bodies. What’s missing?
· What tools and rules would enable each institutional or individual contributor to participate in the emerging information infrastructure with no more than minimal administrative overhead?
· Which questions and aspects of The Plan require consensus on standard practices? Which questions concern only methodology within each independent institution? [10]
These questions can, in part, be reframed as:
· What is a sufficient set of standards for digital information interchange and interoperability? How must we augment or extend existing standards?
“[NDIIPP plan] goals called for the following planning steps, all accomplished in the past 18 months: … developing a digital preservation architecture that establishes critical consensus on technical approaches.” The Plan page 14
“What most distinguishes the digital preservation context from the analog one now in place … is the sheer scale of it. It comprehends vastly larger amounts of information, … distributed in new venues to a larger and more heterogeneous user base.” The Plan page 17
“[In] the American Memory program, the Library of Congress led an effort to digitize more than 100 historical collections … [Its] more than 7 million items … are used daily by teachers, students, scholars, genealogists, [and] private citizens. The Digital Library Initiatives, sponsored by the [NSF], [DARPA], the [NLM], LoC, [NASA], and [NEH], fostered [R&D] for hundreds of digital libraries … [that] have evolved into critical research resources …. These resources need to be maintained … to protect several hundred millions of dollars invested to digitize, organize, and provide access.” The Plan Appendix page 209
The Plan is almost silent about what was learned from these Government investments.[11] It is silent about current content management[12] packages, even the open source packages favored by academia. Nor does it differentiate needed preservation functionality from what digital library technology has provided for five years and longer. What is the architecture for which consensus has been achieved? [13]
The Plan Appendix 9 reads like the start of a requirements statement. Inspection of current digital library offerings would reveal that most, if not all, have the layering it calls for. In fact, more elaborate layering is desirable to shield customers from disruption caused by unpredictable changes, and to allow a repository or an end user to integrate technology from competing vendors. Figure 1 suggests architecture implemented in readily available offerings.
Figure 1: Technology layering in "industrial strength" content management offerings: a solid line between layers depicts a standard interface; a fuzzy boundary ( …
· What does LoC require beyond what available “industrial strength” software already provides? [14]
· What digital preservation challenges are not addressed by the current digital library literature? [15]
· What is the NDIIPP plan to respond to “the sheer scale of it”? What can be learned about the architecture by talking to managers of the largest digital collections in service today? [16]
· Have the NDIIPP participants estimated how many documents are needed to make each kind of collection useful, how much the accession by the best current methods will cost, and the budgetary impacts and research priority implications of such estimates?
“Technology informs almost every aspect of long-term preservation. It is not widely believed that there will be a single solution or that solutions can be achieved solely through technological means. Technological complexities vary across formats, but there is consensus around [challenges listed in the appendix]. It is also important to begin working with material, both to capture valuable but highly ephemeral items and to test possible technical solutions.” The Plan Appendix 1 page 4
The Plan is written almost as if a digital information infrastructure did not already exist. In fact, what’s available to research workers is an amazing improvement over what they worked with a decade ago.[17] What seems to be missing is inexpensive preservation methodology.
To design a comprehensive infrastructure for information preservation, we must consider the entire communication channel from each information producer to each eventual consumer, asking:
· How can today’s authors and editors ensure that eventual consumers can interpret information saved today, or otherwise use it as intended?
· What provenance and authenticity information will eventual information consumers find useful?
· How can we make authenticity evidence sufficiently reliable, even for sensitive documents? [18]
· How can we make the repository network robust, i.e., insensitive to failures and proof against the loss of the pattern that represents any particular information object? [19]
· How can we minimize the library accession cost of each digital library holding?
· How can we motivate authors and editors to provide descriptive and evidentiary metadata as a by-product of their efforts, thereby shifting effort and cost from repository institutions? [20]
These questions focus on end-to-end relationships between each information producer and each eventual consumer, rather than on the design of repositories. Such questions might have appeared in The Plan, had it considered end user needs more than it does. Possible responses include whatever The Plan might evoke and also possibilities unlikely to emerge from the current NDIIPP plan process.
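One ingredient often proposed for the authenticity and robustness questions above is a message digest that is recorded when an object is accessioned and later rechecked against every replica. The following Python fragment is a minimal sketch of that idea only. It is not drawn from The Plan, and the function names and replica labels are hypothetical. A digest detects change but does not by itself show who produced a document; that would need a digital signature or similar evidence on top.

```python
import hashlib
from typing import Dict, List


def digest(data: bytes) -> str:
    """Return the SHA-256 digest of a byte string as a hex string."""
    return hashlib.sha256(data).hexdigest()


def intact_replicas(replicas: Dict[str, bytes], recorded_digest: str) -> List[str]:
    """Return the names of replicas whose bytes still match the recorded digest."""
    return [
        name
        for name, data in replicas.items()
        if digest(data) == recorded_digest
    ]


if __name__ == "__main__":
    # Hypothetical document and repository names, for illustration only.
    original = b"A document accessioned in 2003."
    recorded = digest(original)  # stored with the metadata at accession time

    replicas = {
        "repository_a": original,
        "repository_b": original[:-1] + b"?",  # simulated corruption
        "repository_c": original,
    }
    print("Replicas still intact:", intact_replicas(replicas, recorded))
```

Any replica that fails the check can be restored from one that passes, which is why the number of independent copies matters as much as the digest itself.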
It might surprise the reader that good responses are known for all these questions.[21] These answers are mostly not yet validated, and mostly unknown to the library community, much less accepted by that community. Nor is it clear that they are optimal. However, the peer examination and testing needed would be fairly straightforward tasks. The NDIIPP plan should challenge the research community to devise and “sell” solutions better than those already known.
“[T]here is no clear solution or set of solutions to meet the challenges of digital preservation. The unpredictability of technological development, … and [of] the global political environment … contribute to the challenge of plotting a course in the face of a wide range of possible futures.
“To that end, the Library undertook … to identify collectively the key driving forces and variables in the foreseeable future … to prepare possible futures for the Library …” The Plan page 19
“Most importantly, the [NDIIPP planning] process provided … just enough structure to focus attention, yet not so much structure to restrict options or discourage creative solutions. Such an approach is especially important in tackling a challenge that will require a high level of active collaboration with many diverse stakeholders …” The Plan Appendix page 232
If we agree that digital preservation is urgent, we should ask what progress can be made in two years. Failure to make substantial, publicly visible progress within five years of the beginning of NDIIPP funding would expose the Library of Congress to serious criticism.[22]
To revive a project that, from the outside, seems to be stalled in consultations yielding little new insight, DDQ recommends at least the following prompt actions vis-à-vis the technical challenges.
· List the “possible futures” in order of decreasing likelihood, so that the technical architecture can be checked for robustness in the face of every imaginable vicissitude.
· Publish a statement of technical requirements for preservation and access technology that would be used by the Library of Congress itself.[23]
· Identify the specific shortfalls of published know-how and available software offerings. What does NDIIPP require beyond what is available or at least known?
· Launch technical research as was promised in The Plan in February, calling for timely answers to the questions in the prior subsection.
Many reports call for economics research that will influence how preservation is accomplished. However, they do not clearly specify what information is wanted, or how soon this is needed.
Estimates might include: how much would it cost for a professional cataloguer to create, for each holding, the kind of metadata that the proposed METS standard calls for? How many digital objects must be preserved annually in each discipline to achieve reassuring coverage? What “feeds and speeds” will be needed for automatic WWW crawlers like the Internet Archive, and what will the annual cost of such a crawler be? Between 10 and 30 such parameters are likely to be sufficient for the major strategic and technical decisions.[24] Rough estimates are likely to suggest that some options are much better than others, helping focus both research and operational planning.
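A rough estimate of this kind takes only a few lines of arithmetic. The following Python fragment is an illustration, not a result: every figure in it (holdings per year, minutes of cataloguing per holding, hourly rate) is a hypothetical placeholder, not a number from The Plan or from any survey. It shows how low and high guesses for each parameter bound the annual cost.

```python
def annual_cataloguing_cost(holdings_per_year, minutes_per_holding, hourly_rate):
    """Rough annual cost of professional cataloguing.

    All three inputs are assumptions supplied by the caller.
    """
    hours = holdings_per_year * minutes_per_holding / 60
    return hours * hourly_rate


def cost_range(holdings_per_year, minutes_range, rate_range):
    """Return (low, high) cost bounds from low and high guesses for each input."""
    low = annual_cataloguing_cost(holdings_per_year, minutes_range[0], rate_range[0])
    high = annual_cataloguing_cost(holdings_per_year, minutes_range[1], rate_range[1])
    return low, high


if __name__ == "__main__":
    # Hypothetical placeholder values, chosen only to show the arithmetic.
    low, high = cost_range(
        holdings_per_year=100_000,
        minutes_range=(10, 30),
        rate_range=(30.0, 50.0),
    )
    print(f"Estimated annual cataloguing cost: ${low:,.0f} to ${high:,.0f}")
```

If even the low bound exceeds what an institution can spend, the option is ruled out without further research; if the bounds straddle the budget, that parameter is one worth measuring carefully.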
Why are such questions research issues? [25] Missing is justification that research will provide better guidance than quick, rough estimates.[26] We need to ask, “What can be accomplished before careful estimates are available, given that the research called for might take 2 to 5 years?”
In 1995, I was part of a small task force charged with recommending IBM’s business approach to digital library opportunities; the recommendation was to be presented to Lou Gerstner, the IBM Chairman.
Such recommendations usually require revenue estimates. We found digital library market size impossible to gauge because it had no precedents. However, we knew that IBM market entry made sense, because multimedia data would require much more storage hardware[27] than record-oriented data. The issues were more about timing and manner of creating new business than whether IBM would try. After discussing the problem for several days, we decided to omit marketplace projections.
Before the recommendation advanced to the Chairman’s office, other executives scrutinized it. One advised, in forceful terms, that we should not present a corporate recommendation without marketplace estimates, but he had no advice on how to make them.
So we improvised, reasoning that Mr. Gerstner understood research libraries well, as he served on the New York Public Library Board and was managing fund raising for its then-projected science and industry branch. We would not have been invited to recommend unless the potential was at least $1B p.a. Nobody would have believed an estimate greater than $10B p.a. So we guessed $3B, and divided this among product classes proportionally to the pattern for database management products. This nonsense filled about 1 page of our 10-page report, and satisfied the critical executives.[28]
Mr. Gerstner required a written recommendation about a week before a discussion meeting.[29] He entered the meeting carrying a copy of our report; when he opened it, we could see copious red ink. After commenting on two lesser matters, he continued, “… and on page 7 you make business projections. I don’t see how anybody can make projections for a business area that does not yet exist!” He then ceremoniously crossed out the offending page, and emphasized, “We’re going to enter this business because it is the right thing to do!”
IBM did that. Digital library was almost IBM’s only new development investment in 1994-5, a period in which prior difficulties forced a 50% reduction of the IBM workforce.[30] Today IBM Content Manager™ is a successful offering that is gradually being merged with IBM’s DB2™ database management offering.[31]
Prof. Jerry Saltzer commented on our D-Lib article, What Do We Mean by Authentic?, suggesting two improvements with which we agree:
“On authenticity of natural entities, there is [a] case that you didn't consider: a 400-year old wooden boat. A property of wooden boats is that, over time, every piece of wood eventually must be replaced. If the maintenance is done authentically, the replacer uses the same kind of wood and cuts it to the same specifications as the original. Some old boats are authentic, others are not. …
“I am skeptical about the use of the term ‘provenance’ as [you define it]. … The art historian's ‘provenance’ is the list of owners of the object …. [For] an unsigned painting, whether or not that painting is declared authentic depends partly on the existence of a complete (and authenticable) provenance. …
“… you use provenance primarily in the sense of origin, with the addition of keeping track of who might have made derivative versions. I would recommend using the word ‘origin’ for that concept, and reserving the word ‘provenance’ for the various intermediate handlers and transmitters of the signals.”
I have collected more than a thousand citations of work related to digital preservation.[32] Many of these citations include authors’ abstracts. This bibliography is available on request.[33]
Publication in refereed periodicals is slow relative to modern expectations and limited in the kinds of material supported. E-print archives support today’s R&D pace by rapid dissemination. The community that believes digital preservation to be urgent will welcome the appearance of ERPAePRINTS:
“The ERPAePRINTS Service is an Open Archive set-up for the Electronic Resource Preservation and Access Network (ERPANET) in conjunction with DAEDALUS to provide an ePrints preservation and access facility for the cultural and scientific heritage community.”
“We are living in a digital world. Computers now far outnumber office workers in many parts of the globe. We bank by phone, enjoy digitally mastered music, fax carry-out orders, and communicate with each other through keyboarded thoughts. One of the sure signs that the global village has a digital face is the high investment of money and competitive energy now being directed toward changing the Internet into the National Information Infrastructure. After only a few years of life, the World Wide Web is crowded with time-sensitive data, news summaries, chat, and multimedia entertainment. The electronic landscape changes so rapidly—and the lines between the old and the new seem drawn so sharply—that Wired magazine can refer to a four-year-old network service provider as a "dinosaur," and get this retort: "It's very funny that a petroleum-based product like a magazine can call an online service that has an integrated Web browser irrelevant." Nollinger[34]
“A BBC World Service Global Business program focusing on digital archiving will be broadcast world-wide and can also be listened to/downloaded from the WWW. It includes interviews with the BBC staff (film and sound archives), Glaxo Smith Kline (pharmaceuticals), Standard Life (insurance), NM Rothschild (Banking), and the Digital Archiving Consultancy. The broadcast lasts for approx 25 minutes and covers both drivers for and impediments to digital archiving in industry.” Neil Beagrie, 28th September
On June 29, Thomas Friedman, a New York Times columnist, wrote:
“Since 9/11 … one senses that many Americans are emotionally withdrawing from the world and that the world is drifting away from America. The powerful sense of integration …, the sense that the world was shrinking … to a size small, feels over now.
“The reality, though, is quite different. …, not only has the process of technological integration continued, it has actually intensified—and this will have profound implications. I recently [visited] the offices of Google … It is a mind-bending experience. You can actually sit in front of a monitor and watch a sample of everything that everyone in the world is searching for. …
“In the past three years, Google has gone from processing 100 million … to over 200 million searches per day. … only one-third come from inside the U.S. The rest are in 88 other languages. "The rate of the adoption of the Internet … is increasing, not decreasing," says Eric Schmidt, Google's C.E.O. …
“Says [an executive of] a new Wi-Fi provider: "If I can operate Google, I can find anything. And with wireless, … I will be able to find anything, anywhere, anytime. [That’s] why I say that Google, combined with Wi-Fi, is a little bit like God. God is wireless, God is everywhere and God sees and knows everything. … with one little Internet connection I can download anything from anywhere and I can spread anything from anywhere. That is good news for both scientists and terrorists, pro-Americans and anti-Americans.
“And that brings me to the point …: While we may be emotionally distancing ourselves from the world, the world is getting more integrated. … what people think of us, as Americans, will matter more, not less.”
Remember how the Web was going to bypass the poor? A 22nd August opinion suggested that it didn't, because “Access is there, awaiting the guidance—and desire—to use it.” In contrast, an August article suggests that “the simple binary notion of technology haves and have-nots doesn’t quite compute.” [35]
A 10th July Manchester Guardian article reported, “Paper records of births, deaths and marriages—the legal bedrock of individual identity—are to be phased out in England and Wales. Cradle-to-grave records will be stored on a new database—and the only proof of who you are will be digital.” It continued with, “It is not something the government wants to trumpet.”
The article quotes critics, including a representative of the British Library, who reminds the public, “At present, there's no way of guaranteeing continued access to and preservation of the digital version.”
The August number of Business 2.0 reports, “In just five years, the DVD has become the film industry’s biggest star.” 2003 DVD sales revenues are expected to exceed $11B; rental revenues are expected to exceed $5B. In contrast, box office ticket revenues will be about $10B.
DDQ readers will surely be aware of the lawsuits surrounding Linux™ and Unix™ offerings. DDQ offers interesting selections that you might have missed. On 7th July, InfoWorld’s Tom Yager wrote,
“SCO may indeed have a story to tell, but its chosen means of telling it is egregiously bad form. If IBM actually allowed System V code to leak into other operating systems, SCO would only need to identify the leaks. They would be removed overnight, and their removal would be accompanied by apologies and a check covering realistic damages. That appears to be what happened when Unix System Labs teamed with Novell to take the University of California, Berkeley to court, claiming that System V leaked into BSD Unix. USL/Novell proved three instances of leakage, which were promptly plugged. When it was Berkeley's turn at the podium, it identified mountains of … BSD code that was stripped of BSD's copyright text and pasted into System V. Oops! The plaintiffs quickly settled.”
On 5th August, BusinessWeek reported a Red Hat suit “charging SCO with conducting an ‘untrue and deceptive campaign’ designed to sabotage the market for the Linux operating system” and SCO’s retort that it isn't "trying to spread fear, uncertainty, and doubt to end users." “Instead, it has been ‘educating’ them on the risks of running Linux”. That is unconventionally forceful education, it seems to me! [36]
The 22nd September InfoWorld analysis of potential litigation outcomes might confuse readers by mixing patent law with copyright law. SCO vs. IBM alleges copyright infringement. InfoWorld’s hand wringing includes worries that IBM might use its patent position to suppress open-source software. That seems a long stretch, given the factual history to date.
Some people cling to the myth that scientific inquiry is a dispassionate search for orderly facts about the world. Michael White[37], David Salsburg[38], and James Gleick[39] provide contrary evidence:
“… rivalry continues to be the great motivator behind many scientific and technological advances. Scientists have come into conflict with their peers, governments, and … the church. White focuses on eight infamous scientific disputes that were catalyzed by personal, national, and industrial forces. Isaac Newton’s clashes with Robert Hooke resulted in Newton’s refusal to publish optical work for 30 years. The great physicist also had a fiery dispute with Gottfried Leibniz over who discovered calculus. … Other scientific arguments … existed between Charles Darwin and Richard Owen, Nikola Tesla and Thomas Edison, …”
Salsburg’s The Lady Tasting Tea describes controversies about statistics research before that became a recognized discipline. For instance, the reason we today speak of Student’s t-distribution is that its inventor, William Sealy Gosset, used “Student” as a nom-de-plume to protect his employment by the Guinness Brewing Company.
Gleick’s Chaos illustrates how the establishment has sometimes treated radical departure from narrow disciplinary orthodoxy before the new wisdom has completed its most interesting work, pointing out how closely this behavior is associated with the need to filter out poor work.
“[Thomas Kuhn] deflated the view of science as an orderly process of asking questions and finding their answers. He emphasized a contrast between the bulk of what scientists do, working on legitimate, well-understood problems within their disciplines, and the exceptional, unorthodox work that creates revolutions. Not by accident, he made scientists seem less than perfect rationalists.
“In Kuhn's scheme, normal science consists largely of mopping-up operations. Experimentalists carry out modified versions of experiments that have been carried out many times before. Theorists add a brick here, reshape a cornice there, in a wall of theory. It could hardly be otherwise. If all scientists had to begin from the beginning, questioning fundamental assumptions, they would be hard pressed to reach the level of technical sophistication necessary to do useful work. … a twentieth-century fluid dynamicist could hardly expect to advance knowledge in his field without first adopting a body of terminology and mathematical technique. In return, unconsciously, he would give up much freedom to question the foundations of his science.” Chaos, page 35
We encounter people who, arguing for "traditional" rigor, insist that for music and the arts we must apply authenticity criteria and methods that evolved to combat duplicity in diplomacy and finance. Raymond Leppard, the well-known British conductor, eloquently exposes how ridiculous this position can be.[40]
"The nineteenth century, in its preoccupation with man's upward progress, saw compromise as a blemish upon possible perfection and became ashamed of it. It was put aside as if, like original sin, it were best ignored, pretending, if it showed, that it didn't exist. All cults, religious and political as well as musical, tend to reject compromise as an unacceptable failing that mars the ideal, diminishes the particularity and weakens the message. It is the root cause of the fundamental unworkability of socialism, many of whose ideals are quite unexceptionable. Churchill is said once to have advised a fellow politician never to abandon his ideals but, equally, never to try to put them into practice.
"