Digital Document Quarterly

Perspectives on Trustworthy Information

Volume 2, Number 3, 3Q2003

 

 

 


HMG Consulting

20044 Glen Brae Drive

Saratoga, CA 95070

©  2003, H.M. Gladney

 

Digital Preservation

The cultural heritage community has characterized digital preservation as “urgent”, without identifying what technology is missing.   We had hoped the Plan for the National Digital Information Infrastructure and Preservation Program (referred to as “The Plan” below) would fill this gap.  Unfortunately, it does not.

The NDIIPP Plan: What’s Missing?

“Digital technology … has spawned a surfeit of information that is extremely fragile, inherently impermanent, and difficult to assess for long-term value. … it is increasingly difficult for libraries to identify what is of value, to acquire it, and to ensure its longevity over time.

“Never has access to information that is authentic, reliable, and complete been more important, and never has the capacity of … heritage institutions to guarantee that access been in greater jeopardy.  Recognizing the value that the preservation of past knowledge has played …, the U.S. Congress seeks … solutions to the challenges [of] … preserving digital information of cultural and social significance.” The Plan page 1

To a software engineer intending to contribute to digital preservation, the February 2003 Plan for the National Digital Information Infrastructure and Preservation Program is perplexing.  Its generalities and ambiguities make it difficult to extract what engineers expect to find in a plan.[1]

We expect a plan to articulate concisely each objective, the resources needed to meet it, commitments to specific actions, a schedule for each technology or service delivery, and a prescription for measuring outcomes and quality.[2]  If the plan is for a large project, we expect it to be expressed in portions that separate teams can address relatively independently, and we expect a plan document to exist for each team.[3]  We further expect concise descriptions of the environment, meaning the business and social circumstances that the participants cannot substantially change.  If an environmental factor is adverse, we expect the plan to indicate how the team will bypass or mitigate its effects.  If the resources currently available are inadequate, we expect the plan to identify each shortfall.  Finally, if the team has already worked on the topic, we expect its plan to list its prior achievements.

Engineers want questions that can be answered objectively by testable facts.  They expect documents clear enough that every participant and every qualified observer can understand what is committed and what work is not authorized, and can judge whether committed progress is being achieved.

However, The Plan identifies few technical specifics, no target dates, and few objective success measures.  This is troubling for an initiative launched almost three years ago.

What’s “Urgent” for the NDIIPP?

“… these problems are urgent; … action is needed now, not some time in the future.”   The Plan page 3

“… the best strategy is to get into the learning loop as quickly … as possible.  While it is impossible to know now what approach will be best, it is very realistic to make step-wise … progress …”   The Plan Appendix page 230

The research library community has long voiced such expressions of urgency.  We would like to know:

·        What specifically is meant by “urgent”?  What unrecoverable damage is being caused by the seemingly slow pace? [4]

·        What technical needs does The Plan express that were unknown in 1996? [5]

·        How can candidate content be graded from ‘can be handled well today’ to ‘not yet tractable’?

·        How soon will the LoC “develop the components of the preservation architecture” (The Plan page 7)?

What Has Been Achieved so Far?

“To begin building the [National Digital] preservation infrastructure, the Library proposes a strategy for working on developing a network of participants and building the technical framework.”  The Plan page 5

“This document reports … what has been learned from a variety of activities [and] proposes … actions … to begin practical applications and modeling … implementation of NDIIPP.”              The Plan page 11

The Plan promises that “current information on the program’s status” will be posted on its Web page.  However, the NDIIPP team has provided little information since The Plan was published in February 2003.  Readers would be interested in answers to the following questions:

·        What has LoC accomplished towards meeting the 2000 National Academies’ recommendations for improving its digital skills, resources, and internal infrastructure? [6]

·        How will LoC respond to U.S. GAO guidelines for preserving digital records? [7]

·        A National Academies committee has recommended actions for NARA’s Electronic Records Project.[8]  What parts of this recommendation apply in principle, and how does LoC plan to address them?

How Much Consensus is Needed?

“The vision of NDIIPP is to ensure the access over time to a rich body of digital content through the establishment of a national network of committed partners, collaborating in a digital preservation architecture with defined roles and responsibilities.”                                                          The Plan page 5

“The digital preservation infrastructure will be characterized by a complex network of relationships and dependencies … unknown in the world of print and analog … resources.” [9]                  The Plan page 17

“… there is a strong need to [clarify] roles, [to] offer flexibility, and to provide focusing … for institutions to [choose] if and how … to participate in … digital preservation.”                    The Plan Appendix page 230

The Plan repeatedly emphasizes consensus.  The library community has excellent formal and informal collaborative structure—as strong as that of any professional community.  We see pervasive tension between achieving consensus and helping each institution address its own priorities.

·        Much consultation and consensus building occurs informally and through organizations such as ALA, DLF, IFLA, OCLC, RLG, SLA, and Internet and WWW standardization bodies.  What’s missing?

·        What tools and rules would enable each institutional or individual contributor to participate in the emerging information infrastructure with no more than minimal administrative overhead?

·        Which questions and aspects of The Plan require consensus on standard practices?  Which questions concern only methodology within each independent institution? [10]

These questions can, in part, be reframed as:

·        What is a sufficient set of standards for digital information interchange and interoperability?  How must we augment or extend existing standards?

What Can Be Said About the NDIIPP Preliminary Architecture?

“[NDIIPP plan] goals called for the following planning steps, all accomplished in the past 18 months:  … developing a digital preservation architecture that establishes critical consensus on technical approaches.”   The Plan page 14

“What most distinguishes the digital preservation context from the analog one now in place … is the sheer scale of it.  It comprehends vastly larger amounts of information, … distributed in new venues to a larger and more heterogeneous user base.”                                                                    The Plan page 17

“[In] the American Memory program, the Library of Congress led an effort to digitize more than 100 historical collections … [Its] more than 7 million items … are used daily by teachers, students, scholars, genealogists, [and] private citizens. The Digital Library Initiatives, sponsored by the [NSF], [DARPA], the [NLM], LoC, [NASA], and [NEH], fostered [R&D] for hundreds of digital libraries … [that] have evolved into critical research resources ….  These resources need to be maintained … to protect several hundred millions of dollars invested to digitize, organize, and provide access.”          The Plan Appendix page 209

The Plan is almost silent about what was learned from these Government investments.[11]  It is silent about current content management[12] packages, even the open source packages favored by academia.  Nor does it differentiate needed preservation functionality from what digital library technology has provided for five years and longer.  What is the architecture for which consensus has been achieved? [13]

The Plan Appendix 9 reads like the start of a requirements statement.  Inspection of current digital library offerings would reveal that most, if not all, have the layering it calls for.  In fact, more elaborate layering is desirable to shield customers from disruption caused by unpredictable changes, and to allow a repository or an end user to integrate technology from competing vendors.  Figure 1 suggests the architecture implemented in readily available offerings.

Figure 1: Technology layering in "industrial strength" content management offerings.  A solid line between layers depicts a standard interface; a fuzzy boundary indicates a proprietary interface that might force the customer to choose both components from a single vendor.  A dashed line through a component indicates available distributed implementations, whose client portions and server portions might be at different geographic locations and execute on different computing platforms.

·        What does LoC require beyond what available “industrial strength” software already provides? [14]

·        What digital preservation challenges are not addressed by the current digital library literature? [15]

·        What is the NDIIPP plan to respond to “the sheer scale of it”?  What can be learned about the architecture by talking to managers of the largest digital collections in service today? [16]

·        Have the NDIIPP participants estimated how many documents are needed to make each kind of collection useful, how much the accession by the best current methods will cost, and the budgetary impacts and research priority implications of such estimates?

What Are the Technical Preservation Issues?

“Technology informs almost every aspect of long-term preservation.  It is not widely believed that there will be a single solution or that solutions can be achieved solely through technological means.  Technological complexities vary across formats, but there is consensus around [challenges listed in the appendix]. It is also important to begin working with material, both to capture valuable but highly ephemeral items and to test possible technical solutions.”   The Plan Appendix 1 page 4

The Plan is written almost as if a digital information infrastructure did not already exist.  In fact, what’s available to research workers is an amazing improvement over what they worked with a decade ago.[17]  What seems to be missing is inexpensive preservation methodology.

To design a comprehensive infrastructure for information preservation, we must consider the entire communication channel from each information producer to each eventual consumer, asking:

·        How can today’s authors and editors ensure that eventual consumers can interpret information saved today, or otherwise use it as intended?

·        What provenance and authenticity information will eventual information consumers find useful?

·        How can we make authenticity evidence sufficiently reliable, even for sensitive documents? [18]

·        How can we make the repository network robust, i.e., insensitive to failures and proof against the loss of the pattern that represents any particular information object? [19]

·        How can we minimize the library accession cost of each digital library holding?

·        How can we motivate authors and editors to provide descriptive and evidentiary metadata as a by-product of their efforts, thereby shifting effort and cost from repository institutions? [20]

These questions focus on end-to-end relationships between each information producer and each eventual consumer, rather than on the design of repositories.  Such questions might have appeared in The Plan, had it considered end user needs more than it does.  Possible responses include whatever The Plan might evoke and also possibilities unlikely to emerge from the current NDIIPP plan process.

It might surprise the reader that good responses are known for all these questions.[21]  These answers are mostly not yet validated, and mostly unknown to the library community, much less accepted by that community.  Nor is it clear that they are optimal.  However, the peer examination and testing needed would be fairly straightforward tasks.  The NDIIPP plan should challenge the research community to devise and “sell” solutions better than those already known.
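To make the robustness question above concrete, here is a minimal sketch of one commonly used building block: recording a cryptographic digest of each holding at accession, then auditing replicas held at several repositories and repairing damaged copies from an intact one.  It is not taken from The Plan or from the responses cited above.  The site names, the sample holding, and the helper functions `digest`, `audit`, and `repair` are hypothetical, chosen only for illustration.  Note that a bare digest detects accidental change only; evidence of authenticity would additionally require protecting the recorded digest itself, for example with a digital signature, which this sketch does not attempt.

```python
import hashlib
from typing import Dict


def digest(data: bytes) -> str:
    """Return the SHA-256 digest of a bit pattern as a hex string."""
    return hashlib.sha256(data).hexdigest()


def audit(replicas: Dict[str, bytes], recorded: str) -> Dict[str, bool]:
    """Report, per repository, whether its copy still matches the recorded digest."""
    return {site: digest(copy) == recorded for site, copy in replicas.items()}


def repair(replicas: Dict[str, bytes], recorded: str) -> Dict[str, bytes]:
    """Overwrite damaged copies with an intact one; fail loudly if none is left."""
    intact = [copy for copy in replicas.values() if digest(copy) == recorded]
    if not intact:
        raise ValueError("no intact replica survives")
    return {site: intact[0] for site in replicas}


# Hypothetical holding and repository names, for illustration only.
original = b"An archived document."
recorded = digest(original)  # computed at accession and stored separately
replicas = {
    "site_a": original,
    "site_b": original,
    "site_c": b"An archived docum3nt.",  # simulated damage at one site
}

status = audit(replicas, recorded)
repaired = repair(replicas, recorded)
print(status)
print(all(audit(repaired, recorded).values()))
```

The point of the sketch is only that the pattern survives as long as one replica remains intact and the recorded digest can be trusted; how many replicas, how often to audit, and who holds the digests are exactly the kinds of questions that peer examination and testing would need to settle.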

What Prompt Action is Possible?

“[T]here is no clear solution or set of solutions to meet the challenges of digital preservation.   The unpredictability of technological development, … and [of] the global political environment … contribute to the challenge of plotting a course in the face of a wide range of possible futures.

“To that end, the Library undertook … to identify collectively the key driving forces and variables in the foreseeable future … to prepare possible futures for the Library …”                               The Plan page 19

“Most importantly, the [NDIIPP planning] process provided … just enough structure to focus attention, yet not so much structure to restrict options or discourage creative solutions.  Such an approach is especially important in tackling a challenge that will require a high level of active collaboration with many diverse stakeholders …”                                                                                 The Plan Appendix page 232

If we agree that digital preservation is urgent, we should ask what progress can be made in two years.  Failure to make substantial, publicly visible progress within five years of the beginning of NDIIPP funding would expose the Library of Congress to serious criticism.[22]

To revive a project that, from the outside, seems to be stalled in consultations yielding little new insight, DDQ recommends at least the following prompt actions vis-à-vis the technical challenges.

·        List the “possible futures” in order of decreasing likelihood, to enable checking that the technical architecture is robust in the face of every imaginable vicissitude.

·        Publish a statement of technical requirements for preservation and access technology that would be used by the Library of Congress itself.[23]

·        Identify the specific shortfalls of published know-how and available software offerings.  What does NDIIPP require beyond what is available or at least known?

·        Launch technical research as was promised in The Plan in February, calling for timely answers to the questions in the prior subsection.

Not Everything Unknown Is a Research Issue

Many reports call for economics research that will influence how preservation is accomplished.  However, they do not clearly specify what information is wanted, or how soon this is needed.

Estimates might include: how much would it cost for a professional cataloguer to create, for each holding, the kind of metadata that the proposed METS standard calls for?  How many digital objects must be preserved annually in each discipline to achieve reassuring coverage?  What “feeds and speeds” will be needed for automatic WWW crawlers like the Internet Archive, and what will the annual cost of such a crawler be?  Between 10 and 30 such parameters are likely to be sufficient for the major strategic and technical decisions.[24]  Rough estimates are likely to suggest that some options are much better than others, helping focus both research and operational planning.

Why are such questions research issues? [25]  Missing is justification that research will provide better guidance than quick, rough estimates.[26]  We need to ask, “What can be accomplished before careful estimates are available, given that the research called for might take 2 to 5 years?”
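The kind of quick, rough estimate argued for above can be sketched in a few lines.  The example below treats each parameter as a low-to-high range and asks whether one option beats another even when the ranges are taken at their least favorable ends.  Every number in it is a made-up placeholder rather than a measured value, and the `Range` class and `clearly_cheaper` function are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """A rough estimate expressed as a low and a high bound (both positive)."""
    low: float
    high: float

    def __mul__(self, other: "Range") -> "Range":
        # For positive quantities, the product's bounds are the products of the bounds.
        return Range(self.low * other.low, self.high * other.high)


def clearly_cheaper(a: Range, b: Range) -> bool:
    """True when a's worst case still beats b's best case."""
    return a.high < b.low


# Every number below is a made-up placeholder, not a measured value.
objects_per_year = Range(1e5, 1e6)       # holdings accessioned annually
manual_cataloguing = Range(10.0, 50.0)   # USD per holding, professional cataloguer
author_supplied = Range(0.25, 1.0)       # USD per holding, metadata supplied by authors

manual_total = objects_per_year * manual_cataloguing
author_total = objects_per_year * author_supplied

print(f"manual cataloguing: ${manual_total.low:,.0f} to ${manual_total.high:,.0f} per year")
print(f"author-supplied:    ${author_total.low:,.0f} to ${author_total.high:,.0f} per year")
print("author-supplied clearly cheaper per holding:",
      clearly_cheaper(author_supplied, manual_cataloguing))
```

When two unit-cost ranges do not overlap, as with these placeholders, refining either estimate would not change the choice.  When they do overlap, that parameter is a candidate for more careful study.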

Allegory: Estimating for a New Business

In 1995, I was part of a small task force charged with recommending IBM’s business approach to digital library opportunities; the recommendation was to be presented to Lou Gerstner, the IBM Chairman. 

Such recommendations usually require revenue estimates.  We found digital library market size impossible to gauge because it had no precedents.  However, we knew that IBM market entry made sense, because multimedia data would require much more storage hardware[27] than record-oriented data.  The issues were more about the timing and manner of creating new business than whether IBM would try.  After discussing the problem for several days, we decided to omit marketplace projections.

Before the recommendation advanced to the Chairman’s office, other executives scrutinized it.  One advised, in forceful terms, that we should not present a corporate recommendation without marketplace estimates, but he had no advice on how to make them.

So we improvised, reasoning that Mr. Gerstner understood research libraries well, as he served on the New York Public Library Board and was managing fund raising for its then-projected science and industry branch.  We would not have been invited to recommend unless the potential was at least $1B p.a.  Nobody would have believed an estimate greater than $10B p.a.  So we guessed $3B, and divided this among product classes proportionally to the pattern for database management products.  This nonsense filled about 1 page of our 10-page report, and satisfied the critical executives.[28]

Mr. Gerstner required a written recommendation about a week before a discussion meeting.[29] He entered the meeting carrying a copy of our report; when he opened it, we could see copious red ink.  After commenting on two lesser matters, he continued, “… and on page 7 you make business projections.  I don’t see how anybody can make projections for a business area that does not yet exist!”  He then ceremoniously crossed out the offending page, and emphasized, “We’re going to enter this business because it is the right thing to do!”

IBM did that.  Digital library was almost IBM’s only new development investment in 1994-5, a period in which prior difficulties forced a 50% reduction of the IBM workforce.[30]  Today IBM Content Manager™ is a successful offering that is gradually being merged with IBM’s DB2™ database management offering.[31]

More on “Authenticity”

Prof. Jerry Saltzer commented on our D-Lib article, What Do We Mean by Authentic?, suggesting two improvements with which we agree:

“On authenticity of natural entities, there is [a] case that you didn't consider:  a 400-year old wooden boat.  A property of wooden boats is that, over time, every piece of wood eventually must be replaced.  If the maintenance is done authentically, the replacer uses the same kind of wood and cuts it to the same specifications as the original.  Some old boats are authentic, others are not. 

“I am skeptical about the use of the term ‘provenance’ as [you define it]. … The art historian's ‘provenance’ is the list of owners of the object ….  [For] an unsigned painting, whether or not that painting is declared authentic depends partly on the existence of a complete (and authenticable) provenance.

“… you use provenance primarily in the sense of origin, with the addition of keeping track of who might have made derivative versions.  I would recommend using the word ‘origin’ for that concept, and reserving the word ‘provenance’ for the various intermediate handlers and transmitters of the signals.”

Digital Library and Preservation Bibliography

I have collected more than a thousand citations of work related to digital preservation.[32]  Many of these citations include authors’ abstracts.  This bibliography is available on request.[33]

E-print Service Emphasizing Preservation

Publication in refereed periodicals is slow relative to modern expectations and limited in the kinds of material supported.  E-print archives support today’s R&D pace by rapid dissemination.  The community that believes digital preservation to be urgent will welcome the appearance of ERPAePRINTS:

“The ERPAePRINTS Service is an Open Archive set-up for the Electronic Resource Preservation and Access Network (ERPANET) in conjunction with DAEDALUS to provide an ePrints preservation and access facility for the cultural and scientific heritage community.”

News Reports

Timely 1995 News: “America Online” and the Information Infrastructure

“We are living in a digital world.  Computers now far outnumber office workers in many parts of the globe. We bank by phone, enjoy digitally mastered music, fax carry-out orders, and communicate with each other through keyboarded thoughts.  One of the sure signs that the global village has a digital face is the high investment of money and competitive energy now being directed toward changing the Internet into the National Information Infrastructure.  After only a few years of life, the World Wide Web is crowded with time-sensitive data, news summaries, chat, and multimedia entertainment.  The electronic landscape changes so rapidly—and the lines between the old and the new seem drawn so sharply—that Wired magazine can refer to a four-year-old network service provider as a "dinosaur," and get this retort: "It's very funny that a petroleum-based product like a magazine can call an online service that has an integrated Web browser irrelevant."”   Nollinger[34]

Late Breaking 2003 News: BBC Broadcast on Digital Archiving

“A BBC World Service Global Business program focusing on digital archiving will be broadcast world-wide and can also be listened to/downloaded from the WWW.  It includes interviews with the BBC staff (film and sound archives), Glaxo Smith Kline (pharmaceuticals), Standard Life (insurance), NM Rothschild (Banking), and the Digital Archiving Consultancy.  The broadcast lasts for approx 25 minutes and covers both drivers for and impediments to digital archiving in industry.” Neil Beagrie, 28th September

Is Google God?

On June 29, Thomas Friedman, a New York Times columnist, wrote:

“Since 9/11 … one senses that many Americans are emotionally withdrawing from the world and that the world is drifting away from America.  The powerful sense of integration …, the sense that the world was shrinking … to a size small, feels over now.

“The reality, though, is quite different.  …, not only has the process of technological integration continued, it has actually intensified—and this will have profound implications. I recently [visited] the offices of Google …  It is a mind-bending experience. You can actually sit in front of a monitor and watch a sample of everything that everyone in the world is searching for. 

“In the past three years, Google has gone from processing 100 million … to over 200 million searches per day.  … only one-third come from inside the U.S.  The rest are in 88 other languages. "The rate of the adoption of the Internet … is increasing, not decreasing," says Eric Schmidt, Google's C.E.O. 

“Says [an executive of] a new Wi-Fi provider: "If I can operate Google, I can find anything.  And with wireless, … I will be able to find anything, anywhere, anytime.  [That’s] why I say that Google, combined with Wi-Fi, is a little bit like God.  God is wireless, God is everywhere and God sees and knows everything. … with one little Internet connection I can download anything from anywhere and I can spread anything from anywhere. That is good news for both scientists and terrorists, pro-Americans and anti-Americans.

“And that brings me to the point …: While we may be emotionally distancing ourselves from the world, the world is getting more integrated.  … what people think of us, as Americans, will matter more, not less.”

The Digital Divide That Wasn’t

Remember how the Web was going to bypass the poor?  A 22nd August opinion suggested that it didn't, because “Access is there, awaiting the guidance—and desire—to use it.”  In contrast, an August article suggests that “the simple binary notion of technology haves and have-nots doesn’t quite compute.” [35]

Digital Citizens in the U.K.

A 10th July Manchester Guardian article reported, “Paper records of births, deaths and marriages—the legal bedrock of individual identity—are to be phased out in England and Wales.  Cradle-to-grave records will be stored on a new database—and the only proof of who you are will be digital.”  It continued with, “It is not something the government wants to trumpet.”

The article quotes critics, including a representative of the British Library, who reminds the public, “At present, there's no way of guaranteeing continued access to and preservation of the digital version.”

Video DVD Revenues

The August number of Business 2.0 reports, “In just five years, the DVD has become the film industry’s biggest star.”  2003 DVD sales revenues are expected to exceed $11B; rental revenues are expected to exceed $5B.  In contrast, box office ticket revenues will be about $10B.

SCO Linux Lawsuits

DDQ readers will surely be aware of the lawsuits surrounding Linux and Unix offerings.  DDQ offers interesting selections that you might have missed.  On 7th July, InfoWorld’s Tom Yager wrote,

“SCO may indeed have a story to tell, but its chosen means of telling it is egregiously bad form.  If IBM actually allowed System V code to leak into other operating systems, SCO would only need to identify the leaks.  They would be removed overnight, and their removal would be accompanied by apologies and a check covering realistic damages.  That appears to be what happened when Unix System Labs teamed with Novell to take the University of California, Berkeley to court, claiming that System V leaked into BSD Unix.  USL/Novell proved three instances of leakage, which were promptly plugged.  When it was Berkeley's turn at the podium, it identified mountains of … BSD code that was stripped of BSD's copyright text and pasted into System V.  Oops!  The plaintiffs quickly settled.”

On 5th August, BusinessWeek reported a Red Hat suit “charging SCO with conducting an ‘untrue and deceptive campaign’ designed to sabotage the market for the Linux operating system” and SCO’s retort that it isn't "trying to spread fear, uncertainty, and doubt to end users."  Instead, it has been ‘educating’ them on the risks of running Linux.  Unconventionally forceful education, it seems to me! [36]

The 22nd September InfoWorld analysis of potential litigation outcomes might confuse readers by mixing patent law with copyright law.  SCO vs. IBM alleges copyright infringement.  InfoWorld’s hand-wringing includes worries that IBM might use its patent position to suppress open-source software.  That seems a long stretch, given the factual history to date.

Reading Recommendations

Dispassionate Science and Disciplinary Orthodoxy

Some people cling to the myth that scientific inquiry is a dispassionate search for orderly facts about the world.  Michael White[37], David Salsburg[38], and James Gleick[39] provide contrary evidence:

“… rivalry continues to be the great motivator behind many scientific and technological advances. Scientists have come into conflict with their peers, governments, and … the church.  White focuses on eight infamous scientific disputes that were catalyzed by personal, national, and industrial forces.  Isaac Newton’s clashes with Robert Hooke resulted in Newton’s refusal to publish optical work for 30 years.  The great physicist also had a fiery dispute with Gottfried Leibniz over who discovered calculus. … Other scientific arguments … existed between Charles Darwin and Richard Owen, Nikola Tesla and Thomas Edison, …”

Salsburg’s The Lady Tasting Tea describes controversies about statistics research before statistics became a recognized discipline.  For instance, the reason that the t-distribution is known today as “Student’s” is that its inventor, William Sealy Gosset, used “Student” as a nom de plume to protect his employment by the Guinness Brewing Company.

Gleick’s Chaos illustrates how the establishment has sometimes treated radical departure from narrow disciplinary orthodoxy before the new wisdom has completed its most interesting work, pointing out how closely this behavior is associated with the need to filter out poor work.

“[Thomas Kuhn] deflated the view of science as an orderly process of asking questions and finding their answers.  He emphasized a contrast between the bulk of what scientists do, working on legitimate, well-understood problems within their disciplines, and the exceptional, unorthodox work that creates revolutions. Not by accident, he made scientists seem less than perfect rationalists.

“In Kuhn's scheme, normal science consists largely of mopping-up operations.  Experimentalists carry out modified versions of experiments that have been carried out many times before.  Theorists add a brick here, reshape a cornice there, in a wall of theory. It could hardly be otherwise.  If all scientists had to begin from the beginning, questioning fundamental assumptions, they would be hard pressed to reach the level of technical sophistication necessary to do useful work. … a twentieth-century fluid dynamicist could hardly expect to advance knowledge in his field without first adopting a body of terminology and mathematical technique.  In return, unconsciously, he would give up much freedom to question the foundations of his science.”   Chaos, page 35

Raymond Leppard: Authenticity in Music

We encounter people who, arguing for "traditional" rigor, insist that for music and the arts we must apply authenticity criteria and methods that evolved to combat duplicity in diplomacy and finance.  Raymond Leppard, the well-known British conductor, eloquently exposes how ridiculous this position can be.[40]

"The nineteenth century, in its preoccupation with man's upward progress, saw compromise as a blemish upon possible perfection and became ashamed of it.  It was put aside as if, like original sin, it were best ignored, pretending, if it showed, that it didn't exist.  All cults, religious and political as well as musical, tend to reject compromise as an unacceptable failing that mars the ideal, diminishes the particularity and weakens the message.  It is the root cause of the fundamental unworkability of socialism, many of whose ideals are quite unexceptionable.  Churchill is said once to have advised a fellow politician never to abandon his ideals but, equally, never to try to put them into practice.

"Of course, it would have been the most amazing revelation to have heard [Bach's cantata] Wachet auf in Leipzig under Bach's direction on 25 November 1731, but no amount of wishing will make it happen."

Leppard’s Authenticity in Music begins by reminding us that after the deaths of its composers, 16th- to 18th-century music was ignored until roughly the time of the First World War, and that widespread access to performances began only when radio advanced from an idiosyncratic hobby to a popular medium in about 1930, and burgeoned only when 33-rpm vinyl recordings became inexpensive.

If we nevertheless insist, as many people do, on extremely high standards in authenticity, we will find that today's technology makes them feasible and even economical to achieve for modern performances.  However, rigid notions of what those standards should be—notions grounded in supposed long tradition—are neither warranted by the facts nor practical.

Practical Matters

Scenario and Reaction

We all experience frustration with computing malfunction, with technology limitations, and with programs whose descriptions and functions do not correspond.  I was reminded that we tend to overlook how recent personal digital technology is by a friend’s e-mail:

Subject: Many steps backward, many steps forward and I am back to where I was—months ago!

Some more reasons why I hate computers. This time it is [printer] switches ….

In May I replaced my [brand X] multi-functional unit (printer, scanner, fax, telephone) with a …  laser printer and a [more recent multi-function unit].  The [new hardware] cost $250 less than the old unit, and [promised] lower per copy costs for black and white …. To connect these two printers, the storekeeper [recommended] an inexpensive switch ($16).

So I hooked it up and had several months of grief [that included “expert”, but flawed, advice until a] Techie I spoke with put his finger on the problem. The switch wasn't bi-directional.  Now why didn't I think about that!  After another visit to the store (an hour's drive away), and negotiating a refund on the first switch ...  I hooked up my bi-directional switch (cost $62) and re-installed the scanner software. 

So, now that my computer is working again who/what do I blame for all life's various ills?  Me?

DDQ offers an analogy.  Notice that my friend’s computer continued to work throughout the period of his aggravation. 

(a)    Digital technology is much more complicated than automobile technology (at least than automobile technology before it included a lot of digital technology).

(b)    Automobiles became consumer items around 1925; PCs became consumer items around 1990.

(c)    In 1938 you might have decided to upgrade the electrical subsystem of your Model A Ford.  To replace it, you went to your local discount automobile parts store, and bought a new generator and starter motor (made by some off-shore manufacturer), together with installation manuals.

(d)    You brought these home to install.  (The minimum-wage clerk at the discount parts store told you, "Anybody can install this stuff.  You’ll need only tools you already have at home.")

(e)    It took several days to discover that you needed more information.  That was delivered by snail-mail, after several tedious telephone calls to locate the manufacturer's representative.

(f)      Even this did not work, because (as you now appreciate), you had not carefully compared the new device literature with the automobile manuals.  The polarity of the purchased stuff was opposite from what the Ford machine required.  You needed some further cable gizmo to set this right!  

(g)    All this time, your automobile was quite useless.  Even its radio was not available to provide you information from the outside world.

On the other side of the issue, I was reminded of the following satire.[41]  If computers were cars:

A particular model year of MicroCar wouldn't be available until AFTER that year, instead of before.

Every time they repainted the lines on the road, every MicroCar owner would have to buy a new one or be left behind.

Occasionally your MicroCar would die for no reason, and you'd have to restart it, not where it was, but back in its garage. For some strange reason, you would just accept this.

You could only have one person at a time in your MicroCar, unless you bought a MicroCar '95 or a MicroCar NT, but then you'd have to buy more seats.

Sun Motors would make a car that was solar powered, twice as reliable, 5 times as fast, but only ran on 5% of the roads.

Where other cars had oil, alternator, gas, and engine warning lights, your MicroCar would have a single "General Car Fault" warning light.

People would get excited about the "new" MicroCar features, forgetting that they had been available in other brands for years.

We'd all have to switch to MicroCar Gas™.

New MicroCar seats would force everyone to have the same size butt.

When you got as far as your MicroCar would take you, you would find that it wasn't anywhere you wanted to be.

Internet Radio

It is perhaps of interest to those of you with full-time Internet connections (e.g., DSL) that, without additional expense, you can listen to international radio stations.  Good quality is available even over a 56 kbit/s telephone connection.

Being enthusiastic about classical music and liking to listen while I work, I currently link to WQXR, Deutschlandfunk, or Radio13 from Paris.  Each is accessible with four or five clicks (start the Windows Media Player, choose "Radio Tuner", select a station).  This program and its competitors seem to be merely specialized Web browsers; music services typically provide quick links to daily playlist and news Web pages.  Finding a station is easy with the player’s search/browse interface.

Until recently I used the RealOne player, but dropped it shortly after its vendor began to charge $10/month for a service that is available elsewhere free of charge.  Why the Windows Media Player?  Simply because it is the first player that I inspected after dropping RealOne, and it provides everything I want just now.

SW Tools and Web Resources

The resources mentioned in DDQ are the most promising selections among hundreds inspected.  Except as otherwise noted, DDQ mentions only very inexpensive tools.

Standby

If you want a Microsoft Windows PC to save power while maintaining itself and all active programs in their current state, use its standby or hibernation mode.  In these modes, the machine is also locked and insensitive to Internet attacks, even if it shares a LAN with an active Internet gateway.

Maintain Perfect Time

Dimension 4 provides periodic system clock synchronization with a remote reference clock, such as those provided by U.S. NIST.  It can run hidden in the background, correcting your PC clock periodically.
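For readers curious how such synchronization works in principle, here is a minimal sketch of the offset arithmetic used by NTP-style time protocols. It is not Dimension 4's code, and nothing above says which protocol Dimension 4 uses; the function name and the sample timestamps are hypothetical.

```python
def clock_offset_and_delay(t0, t1, t2, t3):
    """Estimate the local clock error from one request/reply exchange.

    t0: request sent      (read from the local clock)
    t1: request received  (read from the reference clock)
    t2: reply sent        (read from the reference clock)
    t3: reply received    (read from the local clock)

    Assumes the network delay is the same in both directions.
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2.0   # amount to add to the local clock
    delay = (t3 - t0) - (t2 - t1)            # round trip, minus server processing
    return offset, delay


# Hypothetical exchange: the PC clock is 2 s slow, each network hop takes
# 0.05 s, and the server spends 0.01 s preparing its reply.
offset, delay = clock_offset_and_delay(100.00, 102.05, 102.06, 100.11)
corrected_now = 100.11 + offset
print(f"offset {offset:+.3f} s, round-trip delay {delay:.3f} s")
```

A background utility would repeat such an exchange periodically and apply the offset to the system clock.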

Favorite Spam Filter

After reading many descriptions and trying three filters, I settled on SpamBayes, a free filter that "learns" as you reclassify incoming mail that was incorrectly sorted into ‘spam’, ‘ham’ and ‘not sure whether it's spam or ham’.  Classification is sensitive to personal preferences.  After two months of usage, I only rarely find a misclassification among the roughly 30 spams I receive daily.  In summary, SpamBayes works for me as well as the on-line descriptions and several PC magazines suggest.
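To illustrate what "learning" from reclassified mail means, the following is a deliberately simplified word-counting Bayesian classifier. It is a sketch of the general technique only, not SpamBayes's actual algorithm; the class name, the two cutoff values, and the training messages are all hypothetical.

```python
import math
from collections import Counter


class TinyBayesFilter:
    """Toy three-way mail classifier: 'spam', 'ham', or 'unsure'."""

    def __init__(self, ham_cutoff=0.2, spam_cutoff=0.9):
        # Hypothetical cutoffs; scores between them are reported as 'unsure'.
        self.ham_cutoff = ham_cutoff
        self.spam_cutoff = spam_cutoff
        self.messages = {"spam": 0, "ham": 0}
        self.counts = {"spam": Counter(), "ham": Counter()}

    @staticmethod
    def tokens(text):
        # Each distinct word counts once per message.
        return set(text.lower().split())

    def train(self, text, label):
        """Record a message whose correct label the user has supplied."""
        self.messages[label] += 1
        self.counts[label].update(self.tokens(text))

    def spam_probability(self, text):
        n_spam, n_ham = self.messages["spam"], self.messages["ham"]
        # Start from the prior odds, then add evidence from each word.
        log_odds = math.log((n_spam + 1) / (n_ham + 1))
        for tok in self.tokens(text):
            p_spam = (self.counts["spam"][tok] + 1) / (n_spam + 2)
            p_ham = (self.counts["ham"][tok] + 1) / (n_ham + 2)
            log_odds += math.log(p_spam / p_ham)
        log_odds = max(-50.0, min(50.0, log_odds))  # avoid overflow in exp()
        return 1.0 / (1.0 + math.exp(-log_odds))

    def classify(self, text):
        p = self.spam_probability(text)
        if p >= self.spam_cutoff:
            return "spam"
        if p <= self.ham_cutoff:
            return "ham"
        return "unsure"


spam_filter = TinyBayesFilter()
for text in ("cheap pills buy now", "buy cheap watches now"):
    spam_filter.train(text, "spam")
for text in ("meeting agenda for monday", "lunch on monday with tom"):
    spam_filter.train(text, "ham")

for text in ("buy cheap pills", "monday meeting agenda", "hello there"):
    print(text, "->", spam_filter.classify(text))
```

With this toy training set the three test messages come out as spam, ham, and unsure respectively. Each call to `train` with a corrected label shifts later scores, which is why the results depend on personal preferences.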

A comment on installation is in order.  I use the Microsoft Outlook™ mail client; for that, and perhaps only for that, SpamBayes installs without any program preparation.[42] 

The specification of at least one commercial product reads similarly to that of SpamBayes.  I currently see no reason to buy a spam filter, but am keeping an open mind because WWW gossip suggests that spam volume is increasing rapidly and that miscreants are working to bypass the most effective filters.

Timely News Feeds

Many newspapers update RSS (Really Simple Syndication) Web content feeds several times daily.  Excellent RSS readers are available.  I like FeedDemon.  A review accurately reflects my perception.
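An RSS feed is simply an XML document listing recent items, which is why readers are easy to build. The sketch below extracts headlines from a feed using only the Python standard library. The sample feed, its URLs, and the function name are hypothetical; a real reader would also fetch the feed over HTTP and cope with malformed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed, embedded so that the example runs offline.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Daily</title>
    <link>http://example.com/</link>
    <description>Hypothetical newspaper feed</description>
    <item>
      <title>First headline</title>
      <link>http://example.com/1</link>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>"""


def read_headlines(rss_text):
    """Return (title, link) pairs for every item in an RSS 2.0 feed."""
    channel = ET.fromstring(rss_text).find("channel")
    if channel is None:
        return []
    return [
        (item.findtext("title", default=""), item.findtext("link", default=""))
        for item in channel.findall("item")
    ]


headlines = read_headlines(SAMPLE_FEED)
for title, link in headlines:
    print(f"{title}: {link}")
```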

Home Computing Technology and Price Watch

Prices observed[43] since DDQ 2(2) appeared include:

Disk drive (ATA)       Western Digital 200 Gbyte Ultra DMA/100, 8.9 ms, 7200 rpm  $0.54  per Gbyte

Disk drive (external)  Maxtor 120 Gbyte                                           $1.08  per Gbyte

Disk drive (SATA)      Seagate 160 Gbyte ATA/150, 7200 rpm                        $0.90  per Gbyte

SATA/PCI adapter       generic                                                    $40    each

USB mobile drive       64 Mbyte                                                   $11    each

Acknowledgements

Once again, it is a pleasure to acknowledge that discussions with John Bennett, Tom Gladney, Raymond Lorie, and John Swinden were extremely helpful towards creating this DDQ number.



[1]     “Although many aspects of digital preservation have received attention since the mid-1990s, most of the presentations and papers on the subject have ended with little more than general comments about the complexity and expense of the tasks, and ambiguity about responsibilities and roles.”  Deanna Marcum in Research Questions for the Digital Era Library, Library Trends 51(4), 636-651, Spring 2003.

      The problem begins with ambiguous language.  Even the term “information infrastructure” needs to be defined, and perhaps partitioned into distinct concepts.  (See Christine Borgman, The invisible library: Paradox of the global information infrastructure, Library Trends 51(4), 652-674, Spring 2003.)  A careful glossary would be helpful.

      DDQ focuses on questions that technical experts can be expected to address concretely in 1-2 years of work.  Although this omits key aspects, we expect technology to help address non-technical challenges. 

[2]     This might seem impossible in a concise document.  However, a plan document can incorporate know-how by citation.  For clarity the critical plan statements should not become obscured by potentially lengthy technical detail.

[3]     The style described was used for the annual IBM Research Division Plan from about 1980 to about 1995.  This collected individual project plans, which each represented between 5 and 40 staff members, and was between 2 and 4 pages long.  Whether a project plan was quite specific or relatively vague depended on the stage of the work, and could be expected to change from year to year.  Each project plan listed each objective, resource, action commitment, target date, and outcome measure for the project.  The managers who reviewed each project plan focused on its list of measurable result commitments.

      Writing such a project plan was easier and quicker than writing a government grant proposal.  Permission to expend the resources identified was formal acceptance of such a project plan by the responsible management.  The project plans were open to all IBM Research Staff Members, whose inquiries and critical comments were usually welcome.

[4]     “Scenario planning brought together … creators, publishers and distributors, digital librarians, computer scientists, archivists and librarians, to consider the impact that key driving forces may have in the future development of the digital preservation infrastructure. The resulting views into possible futures informed later thinking about how to develop the network of partners and technology components to enable digital preservation.”  The Plan page 4

      This exercise ended a year ago.  However The Plan says little about what its scenarios taught about technology components.

[5]     “This document reports … on what has been learned from a variety of activities.”  (The Plan page 11)  For comparison, see J. Garrett et al., Preserving Digital Information: Report of the Task Force on Archiving of Digital Information, organized by The Commission on Preservation and Access and The Research Libraries Group, 1995-6.

[6]     Committee on an Information Technology Strategy for the Library of Congress, National Research Council, LC21: A Digital Strategy for the Library of Congress, July 2000.  Some LC21 recommendations specified target dates that have passed.

[7]     U.S. General Accounting Office, Information Management: Challenges in Managing and Preserving Electronic Records, GAO-02-586, June 2002.

[8]     Committee on Digital Archiving and the National Archives and Records Administration, Building an Electronic Records Archive at the National Archives and Records Administration (NARA): Recommendations for Initial Development, June 2003.

[9]     Some kinds of performance works are ephemeral and also hedged with administrative and legal measures that prevent a preservation agent from using optimal technology for collecting information that is not degraded from what is available in its data sources.  For instance, see The Plan pages 23-29.

[10]    The implementing technology must, obviously, be based on standards for information interchange and system interoperability.  Most of the software needed already exists and is being refined in support of business applications.

      Each institution’s collecting choices will continue to be relatively independent of other institutions’ choices.  Existing professional relationships provide sufficient co-ordination to avoid redundancy.  Effort to coordinate collection development further would be largely wasted.  See Gerald W. George, Difficult Choices: How Can Scholars Help Save Endangered Research Resources?  CLIR Report pub58, 1995.  ISBN 1-887334-43-2

[11]    These investments contributed to a large digital library literature. The bibliography announced below identifies more than 400 published articles touching on topics alluded to in The Plan.

[12]    “Content management” service is recent industry jargon for “digital library” service.

[13]    The Open Archival Information Systems (OAIS) Reference Model is primarily an ontology, rather than an architecture.  See A Bigger Problem Called “OAIS” in DDQ 1(1) and Use and Misuse of OAIS in DDQ 1(3).

[14]    IBM’s 1995 digital library requirements had several hundred statements, each specifying a feature set in sufficient detail to drive a development plan.  In the current situation, requirements specific to LoC would be both easier to write and more interesting to other institutions than requirements sufficiently general to cover what different institutions would want.  The latter kind of requirements statement is typically written by software vendors and refined over several years.

      The British Library has created digital infrastructure requirements documents, once with the assistance of the IBM U.K. Development Laboratory.    A more current and perhaps closer starting point might be NARA’s Electronic Records Archives Requirements Document, July 2003.

[15]    To distinguish digital preservation problems from digital library problems, suppose that neither material degradation nor technological obsolescence threatened stored information.  What digital library research would still be needed?

[16]    Current digital technology can manage collections far larger than LoC might be considering.  E.g., IBM Content Manager™ can serve documents from machines other than those that hold its catalog, which itself can be partitioned among machines.  This can be used to avoid performance bottlenecks even for an extremely large collection that it presents to users as a single library.  Collections that individually contain more than 100,000,000 holdings would be feasible; however, I do not know of any digital document collection this large.

      The size of collections will probably be limited by the human labor needed for accessions.  However, without estimates from institutions like the Library of Congress we cannot be confident about this guess.  Its validation would justify heavy investment in automating as much as possible of metadata creation and other accession tasks.

[17]    What is available to researchers with access to good research libraries as well as Internet resources is illustrative.  Today’s tools enable quality and speed much improved over what was possible a decade ago.  Any scholar can combine information from:

· Web pages of repositories and of information producers,

· frequent newsletters collecting print periodicals’ tables of contents,

· interest-group newsletters,

· commercial Web search services,

· on-line catalogs from a few great libraries and from local public libraries,

· occasional visits to a great library, and

· fast communication with colleagues around the world.

[18]    The Plan page 47 calls for an infrastructure that is “transparent and trustworthy”, but is silent about what is meant by “transparent”.  What balance is wanted between structure visibility and hiding from end users details that would be distracting?

[19]    If a work has once existed in tangible form, copyright protects the abstract pattern that appeared in that fixed copy.  I.e., what is essential about a work is a pattern inherent in its reproductive instances.  David Nimmer, Adams and Bits: of Jewish Kings and Copyrights, 71 S. Cal. L. Rev. 219-245, 1998.

[20]    “… preservation in the digital age must be considered at the time of creation. Preservation cannot be an activity relegated to the expertise of libraries and archives, but rather must be seen as intrinsic to the act of creation.”  The Plan page 52

[21]    However, see The Plan pages 23-29, which describes modern performance delivery methodology.  Non-technical barriers embedded in the channels that connect data sources with a public performance might impede what would be best practice in ideal circumstances.  For instance, this is likely to happen in a television broadcast created partly from ephemeral source data collected and linked by data-dependent or human decisions that are not recorded except implicitly in the performance itself.

      Ideally, capture for preservation would occur as a production side effect.  However, producers are motivated against providing access to preservationists or creating preservation copies themselves because the effort would increase costs without clear value to their employers.

      Capturing broadcast output would encounter both copyright barriers and signal degradation.

[22]    December 2005 will be five years from the beginning of NDIIPP funding.  “Substantial, publicly visible progress” includes at least funded technical research that promises quick progress and creation of large, publicly accessible collections of some “low hanging fruit.”  Both are feasible within two years.

[23]    A good software engineer, interviewing LoC staff and drawing on similar documents available from other organizations, could write a credible first draft in one month.  If this were published, it would elicit constructive technical criticism.

[24]    More parameters would be as likely to confuse as to illuminate decisions.

[25]    Michael K. Buckland, Five Grand Challenges for Library Research, Library Trends 51(4), 675-686, Spring 2003.

[26]    What’s meant here is an estimate within a factor of 3, as frequently such an estimate suggests the impracticality of approaches being considered.  Engineers and businessmen routinely make such estimates to speed urgent decisions.

      Optimistically, suppose that a cataloguer using a fill-in-the-form editor could create accessional metadata for each holding in 20 minutes and that the annual cost of such a librarian is $120,000.  Assume further that, after such metadata is available, the repository subsystem would automatically create a fully catalogued library entry, with all administrative “hooks” for access control and replication.  Such a librarian could prepare about 5000 entries annually, with each entry costing about $24.

      Suppose further that someone asked an authority in each of 100 disciplines, “How many digital objects need to be collected annually to create a useful collection for your discipline?”  Conducted telephonically, such a survey could be completed in 2 person-months.  Suppose the accumulated number is 100,000; this might be too large by a factor of 2 because of disciplinary overlaps, or too small by a similar factor because other people might feel that more comprehensive collections are needed.

      Accepting these numbers until better are available, we see that the annual cost of digital accessioning would be about $2,400,000.  This estimate would enable inquiry whether such a cost can be distributed among and absorbed by collaborating libraries, or whether it is critical to find ways to reduce the metadata creation costs.

      Note: the numbers above are untrustworthy, having an insufficient basis in fact.  The point being made is not about metadata creation costs, but rather that an inexpensive process can quickly provide financial estimates that are sufficiently precise for urgent management decisions.  A sensitivity analysis (answers to the question, “How would our decisions be changed if the estimate were 30% larger or smaller?”) would suggest whether or not economic research was warranted.
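      The arithmetic of this footnote, including the ±30% sensitivity question, can be reproduced in a few lines. This is a sketch only: every figure is one of the footnote's illustrative assumptions, not measured data, and the function and variable names are invented for the example.

```python
# All figures are the footnote's illustrative assumptions, not measured data.
ANNUAL_SALARY = 120_000          # dollars per cataloguer per year
MINUTES_PER_ENTRY = 20           # time to create metadata for one holding
ENTRIES_PER_CATALOGUER = 5_000   # entries per year, as the footnote rounds it
HOLDINGS_PER_YEAR = 100_000      # accumulated survey answer


def cataloguing_hours(entries, minutes_per_entry):
    """Hours of work implied by a yearly quota of entries."""
    return entries * minutes_per_entry / 60


def cost_per_entry(annual_salary, entries_per_year):
    """Labor cost of one catalogue entry."""
    return annual_salary / entries_per_year


def annual_accession_cost(holdings, annual_salary, entries_per_year):
    """Total yearly labor cost of cataloguing all new holdings."""
    return holdings * cost_per_entry(annual_salary, entries_per_year)


def sensitivity(baseline, percent_changes=(-30, 0, 30)):
    """Baseline cost recomputed for each percentage change in the estimate."""
    return {pct: baseline * (100 + pct) / 100 for pct in percent_changes}


hours = cataloguing_hours(ENTRIES_PER_CATALOGUER, MINUTES_PER_ENTRY)
entry_cost = cost_per_entry(ANNUAL_SALARY, ENTRIES_PER_CATALOGUER)
baseline = annual_accession_cost(
    HOLDINGS_PER_YEAR, ANNUAL_SALARY, ENTRIES_PER_CATALOGUER
)
scenarios = sensitivity(baseline)

print(f"hours per cataloguer: {hours:,.0f}")
print(f"cost per entry: ${entry_cost:,.2f}")
for pct, cost in scenarios.items():
    print(f"{pct:+d}%: ${cost:,.0f}")
```

      Under these assumptions each entry costs $24, the baseline is $2,400,000 per year, and the ±30% band runs from $1,680,000 to $3,120,000.  The quota of 5000 entries corresponds to roughly 1,667 hours of cataloguing per year.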

[27]    Storage hardware was considered a separate market from digital library software.  IBM procedures called for each business to be justified in itself.  So-called “drag” for associated business segments was not accepted as part of a business justification.

[28]    No-one ever inquired how we had constructed our pretty page of estimates!  (Nicely printed garbage is still garbage!)

[29]    The procedure and style were daunting to the team, which comprised one mid-level manager and three technical staff members.  Before the Gerstner meeting, we reviewed the recommendation in a meeting with two IBM Senior Vice Presidents and three divisional vice presidents.  The attendees at the Gerstner meeting were Mr. Gerstner, three IBM Sr. V.P.’s, one divisional V.P., and three task force members.

[30]    A sidelight is that IBM Research reduction in force was only 10%, as Mr. Gerstner viewed the Research Division as critical for restoring IBM health.  Today, IBM’s patent royalties illustrate that this was a good decision.

[31]    In IBM Research, the kind of relationship now emerging between IBM Content Manager and IBM DB2 was debated as early as 1989.  However, the digital library team believed that sufficient technology did not exist at that time to integrate blobs (“binary large objects”) into database management systems.  The most visible hindrance was a performance problem: software layering used for short records forced repeated data copying—to move a datum from a disk to a telecommunications port took seven copy steps.  To move blobs that way would, it was said, “freeze [the fastest computer of the time] to its tracks.”

[32]    To someone who would like to understand the field, this might seem a daunting amount of reading.  However, from the perspective of a software engineer, it contains much redundancy—much more than the primary literature of computer science and engineering.  Regrettably, many of its authors fail to identify what new information or insights their articles offer.

[33]    The zipped PDF version is about 1 Mbyte in length—too big to be convenient for my ISP service subscription.

[34]    Mark Nollinger, America, Online!  Wired 3.09, 158-61, 199-204, September 1995.  This includes, “America Online has been on a rocket ride, rapidly becoming the largest online service provider in the world.  Now it would like to morph into an ‘interactive service company’--before Microsoft and the Web eat its lunch.”

[35]    Mark Warschauer, Demystifying the Digital Divide, Scientific American 289(2), 42-47, August 2003.

[36]    SCO has sued IBM, asking for $1B in damages.  The only serious asset of SCO might be its expectation in this lawsuit!  Who wrote the content at issue is not always clear (partly since SCO declines to reveal which pieces of Linux it claims to own). 

      It is murky who owns what, and who is licensed to use which aspects of UNIX.  The ownership problem stems from the existence of many prior versions of UNIX.  Two versions of UNIX were created, one by AT&T and one by U.C. Berkeley, and these were licensed and/or sold to numerous companies, among which a former company called SCO figures.  That the name/trademark and offerings of SCO have changed hands more than once confuses the issues.  There may also be cross-contamination between Linux, created by uncompensated volunteers and protected by something called the GNU Public License, and the version of UNIX that SCO claims was improperly copied.  For reasons that I do not yet understand, this brings the validity of any offering under the GNU Public License into question.

      That's serious for two reasons: (1) much of the technology that runs the Internet is made available under the GNU Public License; and (2) if the GNU Public License becomes unusable, the donations (as public goods) of the volunteer army might dry up.  You know that software as "Open Source Software".

[37]    Michael White, Acid Tongues and Tranquil Dreamers: Eight Scientific Rivalries That Changed the World, Wm. Morrow, New Jersey, 2001.  ISBN 0-380-97754-0

[38]    David Salsburg, The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century, Freeman, New York, 2001.  ISBN 0-7167-4106-7  The “Student” episode is the subject of Chapter 3.

[39]    James Gleick, Chaos: making a new science, Penguin Books, 1988.  ISBN 014-009250-1

[40]    Raymond Leppard, Authenticity in Music, Faber Music, 1988.  ISBN 0-571-10088-0

[41]    I first saw this in 1997, but don’t know its origin.  If markets were perfect (a doubtful proposition), it would teach merely that consumers prioritize differently for computers than for automobiles.

      Some people might consider this a lampoon of Microsoft Corp.  They should recall that before the introduction of the IBM Personal Computer™ in 1981, what we today call a “personal computer” was called a “micro-computer”.

[42]    For other e-mail clients, compilation and linking of the Python source distribution is needed.  SpamBayes is distributed as source code to make it accessible for incompatible computing platforms.  As I have not tried this distribution form, I cannot say that it is easy enough for non-technical users.

[43]    The prices are mostly from San Jose Mercury News advertisements.  Better deals might be available from on-line shopping services.  To facilitate “level playing field” comparison, sales taxes and shipping costs are included in the estimates.