Digital Document Quarterly

Perspectives on Trustworthy Information

Volume 3, Number 3, 3Q2004

HMG Consulting

Saratoga, CA 95070

© 2004, H.M. Gladney

ISSN: 1547-8610

Digital Preservation

“While a writer has few readers, and no influence except on independent thinkers, the only thing worth considering in him is what he can teach us: if there be any thing in which he is less wise than we are already, it may be left unnoticed until the time comes when his errors can do harm.”        John Stuart Mill[1]

Research Needs Reconsidered

More than 18 months have elapsed since an EU- and NSF-sponsored work group (WG) completed its meetings about preservation research themes.[2]  Its 2003 conclusions asserted, “All the areas of research described here will produce results that will have a significant impact on the efficiency and effectiveness of digital preservation.”  Since then, the literature reports improved insights into the challenges.  In view of new funding for digital preservation research,[3] refinement of the WG conclusions—with more precise language and citations to directly pertinent prior art—is appropriate.[4]

In what follows, we take the view that ‘research’ has to do with unanswered questions and is different from software development, service deployment, and professional development[5] (even of an entire disciplinary community).

Re-examination of the pertinent technology has exposed large gaps between what is known and the tools that the cultural heritage community seems to want and the administrative conditions it apparently expects.  Some of the functional needs have already been addressed by plausible in-principle solutions, but are not yet represented by practical implementations that can be assessed by would-be users. 

Complete, turnkey software offerings do not exist, and commercial suppliers seem to have no plans to provide them, perhaps because they do not see a viable marketplace emerging in the near future.[6] 

We believe it technically feasible to provide software for much of what’s needed in two to three years, but do not know who might provide this software or maintain it over time.  Whether NSF funding should extend to product-quality software creation is a policy issue about which we choose to remain silent.

Prior DDQ numbers sketch related topics.  For instance, content management (a.k.a. ‘digital library’ or ‘repository’) software has been much refined since the first offerings appeared in 1993 and since two rounds of NSF digital library funding were completed.  It is represented by recent open source repository offerings,[7] and also by new commercial offerings on Linux platforms.[8]  Some of the latter seemingly address requirements raised by the WG.  It therefore seems appropriate to interpret ‘digital preservation research needs’ to include only challenges caused by deterioration of media, changes of digital representation, and fading human recall, and to exclude digital repository and information-finding needs that would occur even in a world free of information degradation over time.[9]

Recommendations about digital preservation practices sometimes trespass by offering socially unacceptable prescriptions about how specific academic communities should behave.  Such recommendations are flawed partly because they are ill informed and partly because people dislike being told what to do or how to do it.[10]  The boundary lines are fuzzy.

So that limited public funds for research or development are most effectively applied, it seems imprudent to expend them on topics that are being handled, or could best be handled, by private sector enterprises, or that belong to other NDIIPP initiatives.[11]  For such reasons, we recommend against NSF funding for storage technology (either materials or devices), for research into some kinds of software, such as database management systems and storage hierarchy components, or for deployment work.[12]

The numbering in the following table conforms to that in the WG report.  This table includes only compressed summaries of recommendations and concerns and does not tabulate those WG recommendations with which we agree.  Full text of each recommendation and a careful summary of our reasons for change are available in a DDQ 3(3) Appendix.[13]


| No. | WG recommendations (paraphrased) | Consider “What new knowledge is sought?” |
|---|---|---|
| 1 | Preservation Strategies: Emerging Research Domains | |
| 1A(1) | Elaborate existing repositories with a higher SW layer.  Test repositories for scalability. | Such a layer was demonstrated in IBM Digital Library in 1993, and today also occurs in other offerings.[14]  Except as addressed in recommendation 2, no generic scaling investigation is needed.[15] |
| 1A(2) | Create repositories for software needed for emulation and rescue of other content. | This is a deployment rather than a research recommendation.[16]  Engineering needed is addressed in 3A. |
| 1A(3) | Provide registries and repositories of format information needed for migrating information representations. | This seems to be primarily a deployment recommendation, rather than a research recommendation.  Furthermore, some such services have recently appeared.[17] |
| 1A(4) | Provide repositories of obsolete peripheral devices, repositories that include interfaces to use such devices attached to current machinery.[18] | DDQ 2(4): “It is easy to copy even large amounts of data from aging devices to their replacements with low error rates so that media risks are dwarfed by unrelated preservation risks.”[19]  No new knowledge is needed for this.  One workable interface is the Internet File Transfer Protocol.  The principal barriers are costs of service and of straightforward engineering design to create an FTP server for each class of device wanted. |
| 1B | Research into inexpensive and reliable archival media is required. | The low cost of and infrastructure for routine copying of files from aging to successor media, and the large cost of providing ‘archival media’ call this recommendation into question. |
| 1C | Create generic devices capable of reading diverse classes of media. | Infeasible. |
| 1D | Identify how the emergence of new storage devices will change digital entity encoding formats for content-based addressing and parallel processing. | No new research is needed.  Widely used architectures decouple the specifics of storage device design from the interfaces by which data are copied to/from computers’ main memories.  See text accompanying DDQ 2(3) Figure 1. |
| 1E | Develop formal descriptive language for digital objects’ behavior so that users can test correctness of actual behavior. | Funding should be limited to proposals that promise significant progress beyond program specification and semantics languages work done between about 1980 and 1995.[20] |
| 1F | Research agents and self-awareness among digital entities. | “Self-awareness among digital entities” is anthropomorphic nonsense.  It needs clarification into something clearly feasible. |
| 1G | Research accelerated aging of media, systems and software. | We disagree with public funding for such work. |
| 1H | Develop methodology to preserve the knowledge[21] inherent in digital entities and their interrelationships. | This seems to comprise two distinct needs: (1) behavioral issues addressed in 1E and (2) metadata capture for business object collections. |
| 2 | Re-engineer Preservation Processes (to reduce the human labor they require) | |
| 2D | Estimate the costs and efforts for large [preserved digital content] collections. | Is this really worthy of research funding?  It seems to require only routine software engineering attention. |
| 2E | Create methods and tools with which users can estimate the completeness of a collection. | The notion, ‘completeness of a collection’, is mostly subjective,[22] depending essentially on the purpose of the user and being, for authors, an issue of scholarly merit.  Absent specific suggestions of broadly useful and answerable questions, we believe this recommendation inappropriate. |
| 2F | Articulate the impact that new distributed storage strategies, such as grid storage, have on the naming, management, discovery, and delivery of digital resources. | This should not be supported by digital preservation program funding partly because such questions already need to be addressed by the proponents of the technologies in question, and partly because the questions are easily answered.[23] |
| 3 | Preservation of Systems and Technology | |
| 3A | Develop methods for preserving data stored with emerging formats. | The problem is solved in principle,[24] but software engineering work is needed. |
| 3B | Devise methods to preserve complex and dynamic data. | How to fix dynamic data has long been known.  Other sources of data complexity are covered by recommendation 3A. |
| 3G | Develop an understanding of repurposing of digital content in the expectation of changing markets. | This topic is much broader than digital preservation, and will depend on domain expertise beyond the bounds of information science, computer science, and librarianship. |

A Different View of the Research Challenges

Consider what someone a century from now might want of information stored today.  This person might be a scholar who wants to interpret our writings and to decide whether to trust them, a businessman who needs to guard against fraud, or an attorney surveying fiduciary records.  For some applications, information consumers will want, need, or even demand evidence that information used is authentic—what it purports to be, as represented by a firmly bound statement of provenance.  For every intended application, they will be disappointed by lost information that they learn once existed.  For every application, they will be disappointed by information that they can no longer read or otherwise use as they believe was originally intended.

In what follows, we try to emphasize objectively decidable aspects, separating these from subjective factors.  For any subjective factor, we believe it critical to identify whose decision is important.

Notice also that we emphasize end user needs—what people acting in well-defined roles might need or want to accomplish specific tasks—in contrast to the EU-NSF recommendations above, that are centered on how digital repositories might work.  In fact, as the reader will see in the description of Trustworthy Digital Object methodology, most of the new software needed for digital preservation is workstation client software rather than repository server software!

Figure 1: Information transmission channels, identifying human roles and intermediate object copies, 0 through 10, the names for document instance representations.

Figure 1 helps us discuss communication reliability challenges.  Since eventual users of preserved information might suffer harm or loss if they are misled, we pay attention to the potential distortions in the channel that transmits an input 1 to become a replica 9.[25]  This suggests the technical challenges of digital preservation—finding, demonstrating, and testing methods for:

·      Ensuring that a copy of every preserved document survives as long as it might interest someone;

·      Ensuring that consumers can use any preserved document as its producers intended, avoiding errors introduced by third parties that include archivists, editors, and programmers;

·      Ensuring that any consumer has the information to decide whether information received is sufficiently trustworthy for his use;

·      Hiding information technology complexity from end users (producers, archivists, and consumers);

·      Minimizing labor costs by automating clerical steps; and

·      Empowering editors to package information so as to relieve overloading of professional cataloguers.

For economic practicality, viable solution proposals must allow both repository institutions and also individual users to exploit already deployed and expected future technology[26] without disruption—technology offerings from third parties in an open market.  These must conform to software interface standards and conventions that permit “mix and match” from competing providers—standards and conventions that, over time, will be improved over today’s versions.

TDO Digital Preservation Progress

Thibodeau described the state of digital preservation know-how with: [27]

“The state of affairs in 1998 could easily be summarised:

·          proven methods for preserving and providing sustained access to electronic records were limited to the simplest forms of digital objects;

·          even in those areas, proven methods were incapable of being scaled to a level sufficient to cope with the expected growth of electronic records; and

·          archival science had not responded to the challenge of electronic records sufficiently to provide a sound intellectual foundation for articulating archival policies, strategies, and standards for electronic records.”

We believe that we know an in-principle solution to every technical problem alluded to and that much of this insight is documented in a form permitting objective and specific critiques.  Before indicating where this work can be found, let us point out that we could not have progressed without explicitly focused attention to three well-known elements of scientific and engineering methodology: (1) careful attention to the interplay between the objective (here, tools that could be brought to bear) and the subjective (human judgments, opinions, and intentions that cannot flourish in too tightly controlled circumstances); (2) focus on the actions of individual people, rather than on the abstractions that we call “-ities” (authenticity, integrity, quality, …); and (3) “divide and conquer” into manageable pieces that build on and allow other people’s contributions.  For digital preservation, we see the following topics that interact relatively lightly and that can therefore be almost independently handled.[28]

Figure 2: TDO structure

 I.      Some number of socially communicated languages and standards that are not themselves parts of the technical solution, but that are needed starting points.

II.      Packaging (encapsulation) of a work together with metadata that includes provenance documentation and articulation of the links (references) binding TDO pieces with each other and with external packages. External TDOs are essential context for correct interpretation and evaluation of any work. (See Figure 2.)

III.      Topic-specific ontologies provided and maintained by academic and other professional communities.

IV.      A blob-encoding scheme to represent each content piece in language that is insensitive to irrelevant and ephemeral aspects of its current environment, and that therefore protects what is essential from the ravages of technology obsolescence and fading human recall.

V.      Repositories (a.k.a. Digital Libraries or Content Managers) that store packaged works, and that provide search and access services whereby information consumers can find and obtain what interests them.

VI.      Replication mechanisms that protect against the loss of the last remaining copy of any work.

Some of our documentation (work in progress since mid-2002) is, or will soon be, available on-line in preprint form, as follows:

What Do We Mean by Authentic? has been published (D-Lib Magazine 9(7), July 2003).  It shows what vernacular meanings of ‘authentic’—meanings that are different for different object genres—have in common and how to construct the objective definition needed for preservation work.

Trustworthy 100-Year Digital Objects: Evidence After Every Witness is Dead has been published (ACM Trans. Office Information Systems 22(3), 406-436, July 2004).  It continues to be available from the ERPAnet preprint server.  Focusing on the second challenge above, it describes the structure and use of TDOs (Figure 2), including some key architectural elements that will be implemented in XML:

(i)     Each TDO contains its own world-wide eternal and unique identifier and its own provenance metadata and is cryptographically sealed;

(ii)    External references are also sealed together with the identifiers of their referents;

(iii)  A network of certification keys is grounded in published and frequently changed keys of trustworthy institutions.  Final sealing of a preserved document by such an institution creates durable evidence of its deposit date.
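The sealing idea in items (i) and (ii) can be illustrated with a minimal sketch.  It is not the TDO format: the cited paper calls for XML and a network of certification keys, whereas this sketch assumes JSON for brevity and uses a shared-key HMAC as a stand-in for an institution's public-key signature.  The field names and the functions `seal_package` and `verify_package` are hypothetical.

```python
import hashlib
import hmac
import json


def digest(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def _canonical(body: dict) -> bytes:
    """Serialize metadata deterministically so the seal is reproducible."""
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def seal_package(identifier, provenance, content, references, key):
    """Build a sealed package (hypothetical layout, not the TDO XML schema).

    references maps each referent's identifier to the digest of that
    referent, so links are sealed together with what they point to.
    """
    body = {
        "id": identifier,
        "provenance": provenance,
        "content_sha256": digest(content),
        "references": references,
    }
    # Assumption: HMAC stands in for a public-key signature in this sketch.
    seal = hmac.new(key, _canonical(body), hashlib.sha256).hexdigest()
    return {"body": body, "seal": seal}


def verify_package(package, content, key):
    """Check that neither the metadata nor the content changed after sealing."""
    expected = hmac.new(key, _canonical(package["body"]), hashlib.sha256).hexdigest()
    seal_ok = hmac.compare_digest(expected, package["seal"])
    content_ok = package["body"]["content_sha256"] == digest(content)
    return seal_ok and content_ok
```

Because the identifier, the provenance metadata, and the referents' digests all sit inside the sealed body, altering any one of them after sealing causes verification to fail.  A real deployment would replace the shared key with signatures that chain to the published keys of trustworthy institutions, as item (iii) describes.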

Trustworthy 100-Year Digital Objects: Durable Encoding for When It's Too Late to Ask has been submitted for publication in the form available from the ERPAnet preprint server.  It teaches a method of encoding any kind of data whatsoever to be forever useful. This method would be applied to most kinds of content blob called for in Figure 2.  Its key ideas include:

(iv)   That we can and must enable information producers to separate irrelevant environmental information from information essential to each producer’s intentions, encoding only this essential information.

(v)    That extended Turing-complete virtual machines can represent anything that can be written;

(vi)   And that such machines can themselves be described completely and unambiguously.
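To make points (v) and (vi) concrete, here is a toy illustration of a machine small enough to be specified completely in a few lines.  The instruction set is invented for this sketch and is far simpler than the extended virtual machine the paper envisions; it only shows how content can be carried as a program for a fully described interpreter.

```python
def run(program, max_steps=10_000):
    """Interpret a program for a tiny stack machine (hypothetical instruction set).

    PUSH n : push the integer n
    ADD    : pop b, pop a, push a + b
    DUP    : duplicate the top of the stack
    EMIT   : pop n and append chr(n) to the output
    JNZ t  : pop n; jump to instruction index t if n != 0
    HALT   : stop
    """
    stack, out, pc, steps = [], [], 0, 0
    while pc < len(program):
        steps += 1
        if steps > max_steps:
            raise RuntimeError("step limit exceeded")
        op, *args = program[pc]
        pc += 1
        if op == "PUSH":
            stack.append(args[0])
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "DUP":
            stack.append(stack[-1])
        elif op == "EMIT":
            out.append(chr(stack.pop()))
        elif op == "JNZ":
            if stack.pop() != 0:
                pc = args[0]
        elif op == "HALT":
            break
        else:
            raise ValueError(f"unknown instruction: {op}")
    return "".join(out)


# A preserved "document" expressed as a program that emits the text "AB".
program = [("PUSH", 65), ("DUP",), ("EMIT",), ("PUSH", 1), ("ADD",), ("EMIT",), ("HALT",)]
print(run(program))
```

The docstring is the whole machine specification.  A future reader who can implement those six rules can recover the content without any of today's software.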

Trustworthy 100-Year Digital Objects: Syntax and Semantics—Tension between Facts and Values has been submitted for publication in the form available from the ERPAnet preprint server.  It provides epistemological arguments justifying that the methods described in the immediately prior two papers do as much as mechanical methods theoretically can do towards preserving digital information, and that these methods attempt no more.  It further argues that the TDO methodology defines a quality standard against which any digital preservation method should be judged.

Trustworthy 100-Year Digital Objects: What's Meant?  Intentional and Accidental in Documents is half done.  We expect to post a preprint version on the ERPAnet server before the next DDQ number is published.  It will use early 20th-century philosophy to examine what information producers can do to minimize eventual readers’ misinterpretations, given that communication invariably confounds what it intends to convey with accidental information.

Request for critical reviews

The reader will surely notice that we point at no Web site for downloading software that would put the described ideas to work.  What we so far have are only limited prototypes.

In addition to the obvious administrative reasons for such a temporary shortfall, there is a compelling reason to "get it right".  The creation and use of a flawed preservation method would be accompanied by significant risk that the flaw(s) might not be discovered until many years later, and until after a large investment had been made into creating archival holdings that proved to have errors that sometimes distorted their meanings (for texts) or actions (for programs). 

We believe systematic errors to be of more concern than (mere) programming implementation errors.  Such systematic errors include questions that reach into epistemology—the philosophical theory of what is knowable, in contrast to what must forever remain questions of belief and/or taste.  We are therefore reluctant to build and release any portions of our projected solution until we believe that appropriate experts have examined and challenged our arguments.

We claim that correct TDO implementations:

·      Would allow preservation of any information that can be saved;

·      Would be as efficient as any competing solution (none has yet been proposed);

·      Could be brought into service without disrupting any repository service; and

·      Need not include any proprietary software.[29]

Therefore we request the most searching critical examination readers can provide of the work described, and communication of your views concerning our errors and omissions.  We would be happy with either private or public communication, and actually prefer public criticism over private.  “Getting it right” is simply too important for anything short of complete transparency.

Another Way to Make Documents Trustworthy

A remark whose source I do not recall (perhaps an Andrew Waugh article?) suggests a different method of making testable the authenticity of a preserved document.  If the same document has been independently stored in several individually credible repositories, its eventual consumer can test that the supposedly independent instances are sufficiently similar.

For this to be proof against fraud, there must be accessible unforgeable evidence that the document’s producer himself delivered each instance to a credible independent repository, rather than that a single deposited instance was copied among repositories.  This might be made verifiable by the firm binding of each repository’s credible assertion that it surely received its instance from the producer rather than from some third party—a provenance certificate for its holding.
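The consumer's comparison step might be sketched as follows.  The sketch assumes that "sufficiently similar" means bit-identical, which is stricter than the text requires, and it says nothing about the harder problem of proving that each repository received its instance from the producer.  The function name `compare_instances` is hypothetical.

```python
import hashlib
from collections import Counter


def compare_instances(instances):
    """Compare copies of one document retrieved from several repositories.

    instances maps a repository name to the bytes obtained from it.
    Returns (majority_digest, agreeing, dissenting), where the two lists
    hold repository names in sorted order.
    """
    digests = {name: hashlib.sha256(data).hexdigest() for name, data in instances.items()}
    majority_digest, _ = Counter(digests.values()).most_common(1)[0]
    agreeing = sorted(n for n, d in digests.items() if d == majority_digest)
    dissenting = sorted(n for n, d in digests.items() if d != majority_digest)
    return majority_digest, agreeing, dissenting
```

A non-empty list of dissenting repositories tells the consumer only that some copy differs, not which copy is faithful; the provenance certificates described above are what would settle that question.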

Any reader who cares to do so can surely work out the details whereby a repository can test, prove, and certify that the provider of a document copy is also its producer.[30]

Faintly Ironical

Suzy Palmer, Editor-in-Chief of Microform and Imaging Review, recently circulated a call for comments on an Association of Research Libraries (ARL) report, Recognizing Digitization as a Preservation Reformatting Method.  Its prefatory statement included, “Over the past several years, libraries have moved towards using digitization as an additional method for reformatting endangered and fragile paper-based materials to both preserve and provide access to library collections.”

Of course we believe the ARL move reasonable.  We nevertheless see the announcement as faintly ironical.  The irony is created by a preservation context replete with published hyperbole about digital documents being relatively fragile compared to paper-based documents.[31]

Query: What Was New in Digital Library?

A possibility for some future DDQ number is a description of the seminal architectural ideas behind digital libraries.  Samples of the insights that we have in mind are:

·        The “unit of work” notion for integrity of database transactions.  (I learned this from the IBM Research designers of the first relational database prototypes.)

·        That it would be necessary to combine file servers with database servers to obtain acceptable repository performance.  (I learned this in 1987 from David Choy, who had considered database system designs in the light of how IBM’s OS/MVS passed character strings between subroutines.)

·        IBM’s Data Links technology that permits a database management system to assume administrative control of files without requiring any change of existing programs that use and modify them.[32]  (This was invented in about 1993 by Luis-Felipe Cabrera.)

These examples illustrate that I am most familiar with IBM Research contributions.  I am concerned that I might overlook ideas from other sources, and plan not to publish the prospective article until I am confident that any blindness is remedied.  I therefore request readers’ suggestions of the seminal ideas that enable current and future digital library design.

Linux and Open-Source

LinuxWorld and Software Selection

Since I hope to escape the Microsoft near-monopoly some day, I have several times attended the annual LinuxWorld trade show in San Francisco.  I have yet to find what I’m looking for.

The August show included SW components of potential interest for every kind of document management service.  The resources and skills devoted to scaling, performance, economy, and reliability of repository components seem to be far greater than can be funded by NDIIPP, making it essential to design preservation solutions that leverage what others are already working on.

There is a mismatch—a semantic dissonance—between the language and expectations of many digital preservation community spokespersons and those of the technology vendors (e.g., with respect to ‘scaling’ in the research recommendations above).  Current emphasis among technology vendors is on components, whereas cultural depositories want customizable “solutions”.

Part of today’s commercial response is to offer “services”.  For instance, roughly half of IBM’s 2004 revenue will be from contract services, a business sector that hardly existed 5 years ago.[33]  This phenomenon contributes to another cultural mismatch: academic libraries are not emotionally, practically, or financially prepared to use such outside services, even though they do not seem to have sufficient internal skills for the middleware component of digital repository services.

What was offered at the LinuxWorld trade fair was confusing in the sense that I saw no broadly accepted model by which the components offered could be assembled into solutions.  Perhaps this is a passing problem, with “middleware” models yet to be invented and standardized, as has occurred repeatedly in the history of EDP refinement of lower component layers.  Several trade fair booths urged the adoption of layer interface standards.

Linux Desktops and Laptops: Has Their Time Come?

Red Hat recently announced a new desktop Linux variant, including corporate support, the GNOME interface, and the Evolution PIM.  The company expects the cost to be about $70 per desktop per year.

In “Linux on desktop gaining OS race,” the SJMN technology columnist Dan Gillmor writes, “Linux may be just fine for a second desktop system at home.  But for corporate road-warriors it's still not quite ready.”  See also a Linux Journal article and an ACM Queue article, “Desktop Linux: Where Art Thou?”

BusinessWeek and Ziff-Davis reporters visited the LinuxWorld trade show looking for Linux laptops and desktops.  Ziff-Davis sums up their disappointment with, “Early on this week, we thought this year's LinuxWorld would be a desktop lovefest. Alas, it appears we were too optimistic, …  so we'll have to wait even longer for a real Linux-based competitor to Windows.”

The near-monopoly that Microsoft enjoys on the desktop is not entirely the result of technical prowess.  On 2nd September, Linspire circulated a memorandum describing financial tie-ins by which Microsoft pushed Dell™ executives to quash initiatives to make Linux available on its PC offerings.

GIMP for Windows

PC Magazine tested the Windows version of the Linux image handling package, the GIMP.  It reports painless installation and execution under Windows, but that “in terms of feature breadth and ease of use, GIMP can't compete with … Adobe Photoshop Elements 2 and Jasc's Paint Shop Pro 8.”

Linux Statistics[34]

1.    Worldwide server sales, in billions of dollars, 1Q04: $11.5 B

2.    Year-to-year revenue growth in server sales: 7.3%

3.    Year-over-year server unit shipment growth: 22.4%

4.    Windows server shipment growth: 26.5%

5.    Linux server shipment growth: 46.4%

6.    Number of consecutive quarters of double-digit Linux server revenue growth: 7

7.    Thousands of Linux-certified practitioners at IBM: 3

8.    Thousands of people at IBM that have "some kind of Linux exposure": 12

News and Commentary

Incredible Shrinking Chips; New Chipset Features

Intel says it will debut a 70-megabit memory chip with 35-nanometer transistors in 2005.

PC Magazine and eWeek suggest that two new Intel chipsets (921 and 925) will transform bus architectures, storage, and performance, and report the details in a series of articles.[35]

How Difficult Business Decisions are Made

In 1983, Sony and Philips were negotiating a joint standard for what became the CD that we all know.[36]  A final decision remained: whether the audio sampling rate would be 44.1 kHz or 36 kHz.  It was agreed that each disc needed to hold 72 minutes of audio, because Beethoven's Ninth Symphony was that long.  Philips favored 36 kHz (ironically), partly because it matched a telecom standard for easy music downloading and transferring.  Sony preferred a 44.1 kHz sampling rate because it accommodated the upper reaches of human hearing (~20,000 cycles/sec).

The decision was made in Hawaii.  With arguments running into recreational time, Bjorn Blutgen of Philips and Toshi Doi of Sony took to surfboards still bickering.  One challenged the other to a surfing match: whoever fell off the board first would lose.  The Dutchman lost, so we share CDs sampled at 44.1 kHz.
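The technical stakes of that surfing match can be checked with a little arithmetic.  By the sampling theorem, a sampling rate can represent frequencies only up to half its value.  The storage estimate below assumes stereo audio with 16-bit samples, which the account above does not state; the function names are ours.

```python
def nyquist_limit_hz(sampling_rate_hz):
    """Highest frequency representable at a given sampling rate."""
    return sampling_rate_hz / 2


def pcm_bytes(minutes, sampling_rate_hz, channels=2, bytes_per_sample=2):
    """Uncompressed audio size in bytes.

    Assumption: stereo, 16-bit samples (not stated in the anecdote).
    """
    return minutes * 60 * sampling_rate_hz * channels * bytes_per_sample


for rate in (36_000, 44_100):
    limit_khz = nyquist_limit_hz(rate) / 1000
    size_mb = pcm_bytes(72, rate) / 1e6
    print(f"{rate} Hz: frequencies up to {limit_khz} kHz, 72 minutes = {size_mb:.0f} MB")
```

At 36 kHz the ceiling is 18 kHz, short of the roughly 20 kHz upper limit of hearing, while 44.1 kHz reaches 22.05 kHz.  Under the stated assumptions, the higher rate costs about 762 MB for 72 minutes instead of about 622 MB.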

Internet speed record

The Internet speed record, announced at the Spring 2004 Internet2 meeting, was for transmitting data over nearly 11,000 kilometers (California to Switzerland) at an average speed of 6.25 gigabits per second—nearly 10,000 times faster than a typical home broadband connection.
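The comparison implies a figure for the "typical home broadband connection" that is easy to recover.  The sketch below does that arithmetic and, as an assumed example, estimates how long a one-gigabyte file would take at each speed.

```python
RECORD_BPS = 6.25e9      # record speed from the announcement, bits per second
SPEEDUP = 10_000         # "nearly 10,000 times faster"

# Home connection speed implied by the comparison.
implied_home_bps = RECORD_BPS / SPEEDUP


def transfer_seconds(size_bytes, bits_per_second):
    """Time to move size_bytes at a steady rate, ignoring protocol overhead."""
    return size_bytes * 8 / bits_per_second


one_gigabyte = 1e9  # assumed example file size, in bytes
print(f"implied home speed: {implied_home_bps / 1000:.0f} kilobits per second")
print(f"record link: {transfer_seconds(one_gigabyte, RECORD_BPS):.2f} s")
print(f"home link:   {transfer_seconds(one_gigabyte, implied_home_bps) / 3600:.2f} h")
```

The implied home speed is 625 kilobits per second.  At that rate a gigabyte takes about three and a half hours, against 1.28 seconds on the record-setting link.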

IBM to transfer code to open-source group

IBM announced that it has contributed half a million lines of software to an open source software group, in order to make it easier to write Java applications.  (The more Java applications that are written, the more potential uses there are for IBM's WebSphere platform for managing applications.)

Reading Recommendations

“Obviousness is always the enemy to correctness.”                                                       Bertrand Russell[37]

Monk’s Bertrand Russell: the spirit of solitude [38]

The colloquial phrase ‘to be philosophical about’ means to accept difficulties and misfortunes with objective equanimity, or stoically (after the Greek philosophical school).  The idiom might not be applicable to philosophers, who tend to be passionate about their professional challenges.

Monk’s life of Bertrand Russell is penetrating and highly critical of one of the last century’s most influential intellectual figures.  Most of it is about Russell’s emotional life, which included two marriages, many love affairs, and fears of madness.  Detail about what most people hold private is fed by confidences expressed in Russell’s lifetime output of approximately 40,000 letters.

It is, however, for Monk’s description of the interaction between Russell’s emotions and his philosophical work that we recommend this biography.  Ludwig Wittgenstein is also portrayed, as raging and often doubtful about his own work, and as remarkably direct in criticizing Russell’s thinking.

Dauben’s Biography of Georg Cantor

Mathematicians of the latter half of the 19th century were troubled by interrelated notions of continuity (between any two points of a line there are other points), infinity (some collections are uncountable), and infinitesimals (essential to differential calculus).  That we no longer face these challenges is a debt we owe to Georg Cantor.  His professional life was difficult because insights that we now take for granted shook metaphysical notions of reality.  Dauben’s biography[39] is a readable account of the controversies, as its introduction suggests:

“For anyone concerned with intellectual history, in fact, the development of Cantorian set theory may be regarded as a microcosm in which the nature of the creation and development of a significant new idea of science may be studied.  It provides a model that is ideal in many respects.  Cantor's revolution of the mathematical infinite was created almost single-handedly, in the space of a few years.  Original opposition and rejection of his work, not only by mathematicians, but by philosophers and theologians, eventually gave way to acceptance by some and to wholly new theories and domains of study undertaken by others.”

The Social Side of Documents

DDQ is biased towards technical and philosophical topics about information and its carriers—documents.  However, we attempt to keep strongly in mind social, economic, and practical aspects, and are assisted by two books emanating from the now-extinguished Xerox PARC—books good enough to recommend to readers.  Seely-Brown and Duguid[40] point out that:

“… good design is very hard to do.  It is easy and understandable to make fun of bad technologies.  It is not easy to make good ones.  Given the difficulties of design, however, it is important not to misrepresent the task it faces.  Too often, information technology design is poor because problems have been redefined in ways that ignore the social resources that are an integral part of this socialization process.  By contrast, successful design usually draws on these social resources, even while helping them change.”

Their account of Alexander Graham Bell’s difficulties with persuading customers that the telephone was an advance over telegraph technology, and how he eventually succeeded, might illustrate current challenges of persuading people about the validity of any digital preservation solution.[41]

David Levy’s Scrolling Forward[42] does not pretend to say much that’s new, but eloquently reminds us that digital documents are more similar to than different from older documentary forms.

Access to NARA Databases

With more than 400 files containing millions of online records, NARA’s Access To Archival Databases is a U.S. information source worth looking at.

Wikis

According to Wikipedia, a wiki is “a … hypertext document collection that gives users the ability to add content, as on an Internet forum, but also allows that content to be edited by other users.”  See Dave Mattison’s article about wikis.[43]

Practical Matters

Home Personal Computer Reliability

The computer industry communicates its priorities in promotional materials—advertisements and press releases.  For mainframes and server arrays these industry priorities seem to be reliability, performance, and functionality, in decreasing priority.  For personal computers, the industry priorities are reversed; functionality is advertised more vigorously than performance, and reliability is hardly advertised at all.[44]

Home PC users who depend heavily on their machines for productive use[45] should remedy shortfalls in reliability with the additional acquisitions sketched below, illustrated by my own practice.  The additional expenditure might be as great again as that for a primary PC installation, but need not exceed that.  Since a very capable PC can be had for $1,000 (and much less than that in Silicon Valley), doubling the expenditure is mere prudence for anyone whose usage is important and whose time is valuable.

What do I do?  Four things: (1) use separate logical disk drives for operating system, for application programs, and for my own files, (2) avoid testing risky upgrades on the system I use for writing, (3) make automatic and frequent file backups onto an external hard disk drive (HDD), and (4) copy my work onto optical disks every few weeks.

I have a second PC system ready to boot with either an approximate image of my Microsoft Windows/2000™ system (later versions of Windows do not offer enough to persuade me to upgrade) or a Linux system that I am using to evaluate an eventual switch to that base.  If my main system fails seriously, as has twice happened in two years, it might take a few days to a few weeks to diagnose and perhaps repair the problem.  So I promptly treat the second PC as my main system.  Since all files except those of the operating system are backed up on an external HDD that can be reattached in moments, it takes only a few minutes’ work to bring the second system up-to-date.

Whenever it becomes time to upgrade a system (every three or four years), I buy a “bare bones” successor and move the durable peripheral devices (HDDs, monitor, and other external stuff) to the new acquisition.  I never acquire the very latest or fastest offerings, because their bugs have not been flushed out by customer usage and because they command premium prices.  With this approach, I could today replace my secondary system for $400. 

Prices of external HDDs have recently dropped to only small premiums over internal devices.  I am using a 120 Gbyte IOGear drive that I obtained for $65 (tax included) several months ago.  I like the backup and recovery software that IOGear provided without extra charge.  I have configured it so that every two hours it copies every file that has changed but not yet been backed up, and so that it keeps up to three versions of any file.  Since 120 Gbyte is more than twice as much space as I currently need, I expect this device to suffice for several years.
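The backup policy just described (copy whatever has changed since its last backup, keep at most three versions) can be sketched in a few lines of Python.  This is only an illustration under stated assumptions, not the vendor's software, whose internals are not described here: the function name backup_changed_files, the per-file version folders, and the use of modification times to detect change are all hypothetical choices.  The two-hour schedule is assumed to come from an external scheduler and is not shown.

```python
import shutil
from pathlib import Path

MAX_VERSIONS = 3  # the policy in the text: keep up to three versions of any file


def backup_changed_files(source: Path, target: Path,
                         max_versions: int = MAX_VERSIONS) -> list:
    """Copy each file under `source` whose latest backup is out of date.

    Hypothetical layout: every source file gets a folder under `target`
    holding its versions, each named by the file's modification time in
    nanoseconds.  `target` is assumed to lie outside `source`, and
    `max_versions` is assumed to be at least 1.
    Returns the source files copied during this pass.
    """
    copied = []
    for src in sorted(p for p in source.rglob("*") if p.is_file()):
        versions_dir = target / src.relative_to(source)
        versions_dir.mkdir(parents=True, exist_ok=True)
        stamp = src.stat().st_mtime_ns
        existing = sorted(versions_dir.iterdir(), key=lambda p: int(p.name))
        if existing and int(existing[-1].name) >= stamp:
            continue  # unchanged since the newest backup
        new_version = versions_dir / str(stamp)
        shutil.copy2(src, new_version)
        copied.append(src)
        existing.append(new_version)
        # Discard the oldest versions beyond the retention limit.
        for old in existing[:-max_versions]:
            old.unlink()
    return copied
```

A scheduler would call backup_changed_files(Path("my_files"), Path("external_disk/backup")) every two hours; both paths are placeholders.  Keeping versions by modification time rather than by content hash is the cheaper choice, at the cost of missing a change that leaves the timestamp untouched.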

Whenever I think my data have changed enough so that my work could be seriously delayed by a fire,[46] I copy what’s at risk onto an optical disk that I store in a safety deposit box.  Writable DVD drives are becoming inexpensive, so that when my writable CD drive failed a few weeks ago, I purchased a DVD replacement.  I bought primarily on price, considering only manufacturers with good reputations, paying $75.  I found DVD-R blanks for $0.38 each.  That a DVD holds 4.5 Gbyte (compared to 700 Mbyte on a CD) was attractive because my working files (including Web downloads and digital photographs) currently require about 6 Gbyte.
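The arithmetic behind that preference is easy to check.  The short Python sketch below uses only the figures quoted above (6 Gbyte of working files, 4.5 Gbyte per DVD, 700 Mbyte per CD, $0.38 per DVD-R blank); the function name disks_needed is merely illustrative, and decimal units (1 Gbyte = 1000 Mbyte) are assumed.

```python
import math


def disks_needed(data_gbyte: float, capacity_gbyte: float) -> int:
    """Number of whole disks required to hold data_gbyte of files."""
    return math.ceil(data_gbyte / capacity_gbyte)


WORKING_FILES_GBYTE = 6.0   # figure quoted in the text
DVD_CAPACITY_GBYTE = 4.5    # figure quoted in the text
CD_CAPACITY_GBYTE = 0.7     # 700 Mbyte, assuming decimal units
DVD_BLANK_PRICE = 0.38      # dollars per DVD-R blank, from the text

dvds = disks_needed(WORKING_FILES_GBYTE, DVD_CAPACITY_GBYTE)
cds = disks_needed(WORKING_FILES_GBYTE, CD_CAPACITY_GBYTE)
print(f"{dvds} DVDs (${dvds * DVD_BLANK_PRICE:.2f} in blanks) versus {cds} CDs")
```

Under these figures a full copy fits on two DVDs, whereas the same files would need nine CDs.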

Top 25 PC Utilities for 2004

PC Upgrade magazine provides[47] recommendations of PC utilities as indicated in the first four columns of the following table.  The final two columns indicate whether or not I use each recommended offering and provide a hopefully helpful comment.

| Category | Product | Source | List price | Use? | Comment |
|---|---|---|---|---|---|
| Anti spyware | McAfee Antispyware | McAfee | $39 | No | I use Webroot Spy Sweeper |
| Anti virus | Norton AntiVirus 2004 | Symantec | $49 | Yes | And am satisfied |
| CD ripper | CDex | SourceForge | Free | No | I have no need |
| Data encryption | PGP Personal Desktop | PGP Corp. | $50 | No | I have no need |
| Data recovery | Undelete 4.0 Home Edition | Executive Software | $29 | No | Frequent automatic backup makes this unnecessary |
| Desktop organizer | Multi Desktop 2004 | Gamers Tower | $24 | No | Interesting enough to look into |
| Disk copier | Norton Ghost | Symantec | $70 | Yes | I have it, but seldom use it |
| Disk partitioning | PartitionMagic 8.0 | Symantec | $69 | Yes | Infrequent use, but it’s great to have! |
| Disk maintenance/backup | Driver Magician | Gold Solution Software | $29 | No | |
| DVD burning software | DVD Copy 2 Platinum | Intervideo | $79 | No | An equivalent Nero package was included as part of my DVD-writer purchase. |
| Favorites organizer | Bookmark Buddy | Edward Leigh | $29 | No | PowerDesk and Firefox do it for me. |
| File backup | AutoSave | V Communications | $29 | No | Iomega backup came w/o extra cost with external disk |
| File compression | StuffIt Deluxe 8.5 | Aladdin Systems | $39 | No | Zip tool comes with PowerDesk w/o extra cost |
| File management | PowerDesk | V Communications | $39 | Yes | And I love it!  Includes several extra key utilities. |
| File search | dtSearch Desktop | dtSearch Corp. | $199 | No | PowerDesk search tool serves me well. |
| File transfer | WinSCP | SourceForge.net | Free | No | I have no need |
| Firewall | ZoneAlarm Pro | Zone Labs | $40 | No | A firewall is embedded in my Enternet router |
| Image file viewer | ACDSee 6.0 Power Pack | ACD Systems | $79 | No | PowerDesk provides me this function w/o extra cost |
| Pop-up blocker | Google Toolbar | Google | Free | Yes | |
| Remote control | I’m InTouch | 01 Communique | $100/year | No | I have no need |
| System restoration | Set Point | Easy Desk Software | $15 | No | But I will look into it |
| Task manager | Process Explorer | SysInternals | Free | No | But I will look into it |
| Utility suite | Norton SystemWorks 2004 | Symantec | $69 | No | Other tools installed |
| Windows enhancement | PowerToys | Microsoft | Free | No | Only for Win/XP |

 



[1]     Mill, John Stuart, The positive philosophy of Auguste Comte, Henry Holt and Co., 1873,  page 6. 

[2]     See "Invest to Save"—the Report and Recommendations of the NSF-DELOS Working Group on Digital Archiving and Preservation, 2003.

[3]     As part of the NDIIPP Plan, the National Science Foundation (NSF) has called for preservation R&D proposals in Digital Archiving and Long-Term Preservation (DIGARCH), NSF RFP 04-592.  This RFP targets technical questions.  Other NDIIPP initiatives exist, or are being considered, for aspects such as collection building, deployment, and professional education.  See, for instance, an NDIIPP initiative to stimulate partnership networks.

[4]     The 2003 recommendations included few prior art citations, perhaps because of the limited time available for WG deliberations.

[5]     Some authors have emphasized library and archive community self-education.  The WG recommendations do not carefully distinguish such needs from research issues, i.e., do not distinguish ‘what many people do not know’ from ‘what is not known’.

[6]     The number of cultural heritage repositories ready for practical action is small compared to the number of commercial customers for content management solutions.  The special technical requirements of the cultural repositories seem to be nowhere articulated in terms to which software engineers can respond.

[7]     See, for instance, descriptions of Cornell’s FEDORA and of MIT’s DSpace repository software packages.

[8]     This includes the IBM DB2 Content Manager, which has been ported from its prior IBM OS/MVS and Microsoft Windows versions to Linux platforms.  Most of the source code of all its platform versions is common, making available a decade’s efforts towards reliability, scaling, and performance.

      Nothing in DDQ should be construed as a recommendation of any offering over its competitors.  We have not made careful comparisons, as they would be possible only in the context of specific customers’ requirements statements.  That IBM products are mentioned more than other vendors’ offerings simply reflects that I am most familiar with these—particularly the content management and digital security offerings on which I worked before retiring from IBM Research.

[9]     Document preservation and cultural collection management ([CLIR/LC], [RLG 2]) overlap only incompletely.  See DDQ 1(2).

[10]    This is a socially sensitive and perhaps controversial issue.  In IBM Research and probably much more widely, part of this issue is so widely known and so troublesome that it has its own acronym—NIH, for ‘not invented here.’  All too often, it is associated with territorial jealousies and unwarranted ad hominem attacks that include, “You don’t really understand the problem!”

[11]    Another NDIIPP funding initiative, “The Archive Ingest and Handling Test (AIHT), is designed to identify, document and disseminate working methods for preserving the nation's increasingly important digital cultural materials, as well as to identify areas that may require further research or development”.

[12]    For instance, to research a new storage material to the point at which development managers begin to consider it seriously for product development might cost $50M; furthermore, industrial laboratories have deeper pertinent experience and expertise than is common in academia.

      Consider WG recommendation 1G.  Were any materials or device work to be undertaken in academia, the likely venues would be materials science and engineering departments.  The WG did not include representation from these communities.

[13]    DDQ uses the first person plural to indicate consensus among DDQ advisors.  As we are not 100% confident about our opinions, we invite critical reactions.  If a reader so desires, DDQ will publish his/her concise criticisms.

      We do not intend an absolute denial of any recommendation.  In some cases it might be possible to identify missing essential knowledge.

[14]    DDQ 3(2) Figure 2 suggests how institution-specific access control and also Stanford’s LOCKSS service should be integrated within repository middleware.

[15]    To choose the configuration for each specific enterprise repository, performance estimation and modeling is likely to be necessary.  This cannot be done without a careful requirements analysis that includes quantitative document traffic and user request estimates.  Making such estimates is conventionally part of the planning for acquiring a computing installation and bringing it into service.  I.e., while we believe scalability research is not at this time needed for digital preservation, we are confident that scalability design is needed for each specific repository installation.

[16]    In a 7th Jun 2003 listserv posting, Cal Lee wrote: “Best to have many copies of the analog bootstrapping tool distributed around the world, rather than depending on one or two copies that have been written to long-term media.  For an application of this sort of approach, see the Long Now Foundation's Rosetta Project.”

[17]    Darlington, Jeffrey. PRONOM—A Practical Online Compendium of File Formats, RLG DigiNews 7(5), October 2003.  See also JSTOR’s JHOVE service for format-specific identification, validation, and characterization of digital objects, and Clausen, Lars. Handling File Formats, May 2004.

[18]    The Computer History Museum is well positioned to undertake such work, if it were inclined to do so.

[19]    120mm optical disks will increase in density from current DVD capacity (under 5 Gbyte) to hold 1 terabyte per platter, making the kind of copying alluded to even cheaper than when we estimated it for hospitals. (HMG et al., Can Hospitals Afford Digital Imagery? in R.G. Jost, editor, Medical Imaging VIII: PACS Design and Evaluation, Proc. SPIE 2165, 613-628, (Feb. 1994)). This will occur in stages; the next has begun.  Both disks and disk drives will steadily become cheaper, eventually to consumer-level prices.  I.e., the industry is poised for large improvements without large new research initiatives.  That, and comparable magnetic disk technology, is the likely ‘winner’ for the next 20 years, if not for longer, partly because these technologies will not require billion-dollar factory retooling.

      Suppliers will not make available each higher optical storage density until the market for its predecessor is shrinking.  (I.e., such changes are determined by market behavior more than by technology availability.)  For the CD to DVD transition, this took about a decade.  The transition is beginning to optical disks with capacity about five times that of DVDs.  At every stage, consumers (and libraries) will be offered easy and inexpensive means to copy existing data onto subsequent media, and application access to the new copy will be as easy as it had been for the superseded technology.

      My IBM Research advisor on optical and holographic storage technology suggests that, although some people’s hope for a long-lived storage medium is well known, the industry consensus is that nobody will provide it in the foreseeable future.

[20]    Gordon, Michael J. The Denotational Description of Programming Languages: An Introduction, Springer-Verlag, New York, 1979; Bjørner, Dines.  Jones, Cliff B.  Formal Specification and Software Development, Prentice-Hall, 1982.

[21]    See DDQ 3(2), Preserving Knowledge.

[22]    Certain kinds of missing information might be objectively obvious, e.g., software for rendering a saved digital object.

[23]    On distribution, see papers on Stanford’s LOCKSS.  About naming, see DDQ and Safeguarding Digital Library Contents and Users: a Note on Universal Unique Identifiers, D-Lib Magazine, (April 1998).  Solutions for naming and identifier choice are today addressed in tutorial material.  See, for instance, an ERPANet background on persistent identifiers.

[24]    Lorie, Raymond A. A Methodology and System for Preserving Digital Data, JCDL 2002.

[25]    No producer knows the risks that users of his information might face.  The marginal cost of a most reliable solution is not much greater than that of any solution at all.

[26]    There is today keen competition and much innovation in the software layers that implement storage subsystems and database management tools.  Scarcely a week goes by without new offerings with improved scaling, reliability, security, and cost—aspects much mentioned as needed for digital repositories.  We believe it both highly desirable and readily feasible that digital preservation solutions should not impede repositories’ access to such enhancements.

[27]    Kenneth Thibodeau, Knowledge and action for digital preservation: Progress in the US Government, Proceedings of the DLM-Forum 2002 Workshop, @ccess and preservation of electronic information: best practices and solutions 175-9, 2002. 

[28]    We would appreciate a comment from any reader who believes this list to be incomplete or otherwise not as good as it might be.

      Some of the listed topics themselves can, and of course should, be further partitioned within solution design activities.

[29]    This does not mean that all prior existing file formats will be accessible, because some vendors deliberately make essential information inaccessible.  E.g., for Microsoft Word, to separate the parts of the MS software that would need to be preserved in order to perpetuate "look and feel" of a '.doc' file from portions of MS Office and MS Windows, one would need interface specifications that we believe Microsoft has not revealed and will be most reluctant to make available.

[30]    For realistically suspicious consumers, the solution must be proof against independent misbehavior by anyone, including any repository employee.  Recall Cliff Lynch’s emphasis of Gustavus Simmons’ phrase “pervasive deceit” in Authenticity and Integrity in the Digital Environment: An Exploratory Analysis of the Central Role of Trust, CLIR Reports 92, pp 32-50, 2000.   

[31]    See Digital Storage Media as Durable as Paper in DDQ 2(4).

[32]    Details of the IBM DB2 Data Links Manager are available at http://www-3.ibm.com/software/data/db2/datalinks/ and documentation to which this web page refers.

[33]    An example of services that need to come from somewhere is the kind of storage and network configuring illustrated by Fierros in Doing More with More, DB2 Magazine 9(3), 32-39, 2004.  (This citation illustrates only a few concerns among many.)

[34]    Excerpted from Linux Journal, Sept. 2004

[35]    It's a New Era in Desktops: testing of the chips and new desktop systems from Dell, Falcon Northwest, HP, Velocity Micro.

      A look at every aspect of the new chipset, from the PCI Express Bus architecture to Intel's Intel Matrix Storage Technology.

      A visual tour of the chips, architecture, motherboards, installation kits and more.

      What's Grantsdale Missing?

[36]    Adapted from John C. Dvorak’s PC Magazine column, 18th May 2004.  The story is attributed to Richard Bruno, a Philips CD project manager.

[37]    From Mathematics and Metaphysicians, reprinted in Sullivan, Arthur, Logicism and the philosophy of language: selections from Frege and Russell, Broadview Press, 2003.  ISBN 0-7923-4653-X

[38]    Monk, Ray. Bertrand Russell: the spirit of solitude, Simon and Schuster, 1996.  ISBN 0-684-82802-2 

[39]    Dauben, Joseph Warren.  Georg Cantor: his mathematics and philosophy of the infinite, Harvard, 1979.   ISBN 0-674-34871-0

[40]    Brown, John Seely.  Duguid, Paul. The Social Life of Information, Harvard U.P., 2002.  ISBN 1-578-51708-7

[41]    Of course, the issues and eventual social implications of the telephone were far grander than we believe the questions around digital preservation to be.

[42]    Levy, David M. Scrolling Forward: Making Sense of Documents in the Digital Age, General, 2003.  ISBN 1-559-70648-1

[43]    Searcher 11(4), 32-48, April 2003.

[44]    A formerly vigorous market segment, workstations, has been almost squeezed out of existence by the last few years’ improvements in PC performance.  This might be the most important reason why Sun Microsystems is in jeopardy.

[45]    Since I have no experience with PCs for entertainment, DDQ is intentionally silent about them.

[46]    Just as I was completing this DDQ number, a friend communicated that his rural home had suffered a near-miss lightning strike.  The impact point was close enough for the thunder crash to induce tears of fright in a 10-year-old child.  Control and communication circuits in his up-to-date home suffered damage costing about $1000 to repair.  His home computer survived, perhaps because it was behind a power surge protector.

[47]    PC Upgrade July/August 2004, pp. 96-105.