Digital Document Quarterly

Perspectives on Trustworthy Information

Volume 3, Number 3, 3Q2004

 

 

 

HMG Consulting

Saratoga, CA 95070

©  2004, H.M. Gladney

 

ISSN: 1547-8610

Digital Preservation

“While a writer has few readers, and no influence except on independent thinkers, the only thing worth considering in him is what he can teach us: if there be any thing in which he is less wise than we are already, it may be left unnoticed until the time comes when his errors can do harm.”        John Stuart Mill[1]

Research Needs Reconsidered

More than 18 months have elapsed since an EU- and NSF-sponsored work group (WG) completed its meetings about preservation research themes.[2]  Its 2003 conclusions asserted, “All the areas of research described here will produce results that will have a significant impact on the efficiency and effectiveness of digital preservation.”  Since then, the literature reports improved insights into the challenges.  In view of new funding for digital preservation research,[3] refinement of the WG conclusions—with more precise language and citations to directly pertinent prior art—is appropriate.[4]

In what follows, we take the view that ‘research’ has to do with unanswered questions and is different from software development, service deployment, and professional development[5] (even of an entire disciplinary community).

Re-examination of the pertinent technology has exposed large gaps between what is known and the tools that the cultural heritage community seems to want and the administrative conditions it apparently expects.  Some of the functional needs have already been addressed by plausible in-principle solutions, but are not yet represented by practical implementations that can be assessed by would-be users. 

Complete, turnkey software offerings do not exist, and commercial suppliers seem to have no plans to provide them, perhaps because they do not see a viable marketplace emerging in the near future.[6] 

We believe it technically feasible to provide software for much of what’s needed in two to three years, but do not know who might provide this software or maintain it over time.  Whether NSF funding should extend to product-quality software creation is a policy issue about which we choose to remain silent.

Prior DDQ numbers sketch related topics.  For instance, content management (a.k.a. ‘digital library’ or ‘repository’) software has been much refined since the first offerings appeared in 1993 and since two rounds of NSF digital library funding were completed.  It is represented by recent open source repository offerings,[7] and also by new commercial offerings on Linux platforms.[8]  Some of the latter seemingly address requirements raised by the WG.  It therefore seems appropriate to interpret ‘digital preservation research needs’ to include only challenges caused by deterioration of media, changes of digital representation, and fading human recall, and to exclude digital repository and information-finding needs that would occur even in a world free of information degradation over time.[9]

Recommendations about digital preservation practices sometimes trespass by offering socially unacceptable prescriptions about how specific academic communities should behave.  Such recommendations are flawed partly because they are ill informed and partly because people dislike being told what to do or how to do it.[10]  The boundary lines are fuzzy.

So that limited public funds for research or development are most effectively applied, it seems imprudent to expend them on topics that are being handled, or could best be handled, by private sector enterprises, or that belong to other NDIIPP initiatives.[11]  For such reasons, we recommend against NSF funding for storage technology (either materials or devices), for research into some kinds of software, such as database management systems and storage hierarchy components, or for deployment work.[12]

The numbering in the following table conforms to that in the WG report.  This table includes only compressed summaries of recommendations and concerns and does not tabulate those WG recommendations with which we agree.  Full text of each recommendation and a careful summary of our reasons for change are available in a DDQ 3(3) Appendix.[13]


Each entry below gives the WG recommendation (paraphrased), followed by our response under the heading Consider “What new knowledge is sought?”

1  Preservation Strategies: Emerging Research Domains

1A(1)
WG recommendation: Elaborate existing repositories with a higher SW layer.  Test repositories for scalability.
Consider: Such a layer was demonstrated in IBM Digital Library in 1993, and today also occurs in other offerings.[14]  Except as addressed in recommendation 2, no generic scaling investigation is needed.[15]

1A(2)
WG recommendation: Create repositories for software needed for emulation and rescue of other content.
Consider: This is a deployment rather than a research recommendation.[16]  Engineering needed is addressed in 3A.

1A(3)
WG recommendation: Provide registries and repositories of format information needed for migrating information representations.
Consider: This seems to be primarily a deployment recommendation, rather than a research recommendation.  Furthermore, some such services have recently appeared.[17]

1A(4)
WG recommendation: Provide repositories of obsolete peripheral devices, including interfaces for using such devices attached to current machinery.[18]
Consider: DDQ 2(4): “It is easy to copy even large amounts of data from aging devices to their replacements with low error rates so that media risks are dwarfed by unrelated preservation risks.” [19]  No new knowledge is needed for this.  One workable interface is the Internet File Transfer Protocol.  The principal barriers are costs of service and of straightforward engineering design to create an FTP server for each class of device wanted.

1B
WG recommendation: Research into inexpensive and reliable archival media is required.
Consider: The low cost of and infrastructure for routine copying of files from aging to successor media, and the large cost of providing ‘archival media’ call this recommendation into question.

1C
WG recommendation: Create generic devices capable of reading diverse classes of media.
Consider: Infeasible.

1D
WG recommendation: Identify how the emergence of new storage devices will change digital entity encoding formats for content-based addressing and parallel processing.
Consider: No new research is needed.  Widely used architectures decouple the specifics of storage device design from the interfaces by which data are copied to/from computers’ main memories.  See text accompanying DDQ 2(3) Figure 1.

1E
WG recommendation: Develop formal descriptive language for digital objects’ behavior so that users can test correctness of actual behavior.
Consider: Funding should be limited to proposals that promise significant progress beyond program specification and semantics languages work done between about 1980 and 1995.[20]

1F
WG recommendation: Research agents and self-awareness among digital entities.
Consider: “Self-awareness among digital entities” is anthropomorphic nonsense.  It needs clarification into something clearly feasible.

1G
WG recommendation: Research accelerated aging of media, systems and software.
Consider: We disagree with public funding for such work.

1H
WG recommendation: Develop methodology to preserve the knowledge[21] inherent in digital entities and their interrelationships.
Consider: This seems to comprise two distinct needs: (1) behavioral issues addressed in 1E and (2) metadata capture for business object collections.

2  Re-engineer Preservation Processes (to reduce the human labor they require)

2D
WG recommendation: Estimate the costs and efforts for large [preserved digital content] collections.
Consider: Is this really worthy of research funding?  It seems to require only routine software engineering attention.

2E
WG recommendation: Create methods and tools with which users can estimate the completeness of a collection.
Consider: The notion, ‘completeness of a collection’, is mostly subjective,[22] depending essentially on the purpose of the user and being, for authors, an issue of scholarly merit.  Absent specific suggestions of broadly useful and answerable questions, we believe this recommendation inappropriate.

2F
WG recommendation: Articulate the impact that new distributed storage strategies, such as grid storage, have on the naming, management, discovery, and delivery of digital resources.
Consider: This should not be supported by digital preservation program funding partly because such questions already need to be addressed by the proponents of the technologies in question, and partly because the questions are easily answered.[23]

3  Preservation of Systems and Technology

3A
WG recommendation: Develop methods for preserving data stored with emerging formats.
Consider: The problem is solved in principle,[24]  but software engineering work is needed.

3B
WG recommendation: Devise methods to preserve complex and dynamic data.
Consider: How to fix dynamic data has long been known.  Other sources of data complexity are covered by recommendation 3A.

3G
WG recommendation: Develop an understanding of repurposing of digital content in the expectation of changing markets.
Consider: This topic is much broader than digital preservation, and will depend on domain expertise beyond the bounds of information science, computer science, and librarianship.

A Different View of the Research Challenges

Consider what someone a century from now might want of information stored today.  This person might be a scholar who wants to interpret our writings and to decide whether to trust them, a businessman who needs to guard against fraud, or an attorney surveying fiduciary records.  For some applications, information consumers will want, need, or even demand evidence that information used is authentic—what it purports to be, as represented by a firmly bound statement of provenance.  For every intended application, they will be disappointed by lost information that they learn once existed.  For every application, they will be disappointed by information that they can no longer read or otherwise use as they believe was originally intended.

In what follows, we try to emphasize objectively decidable aspects, separating these from subjective factors.  For any subjective factor, we believe it critical to identify whose decision is important.

Notice also that we emphasize end user needs (what people acting in well-defined roles might need or want to accomplish specific tasks), in contrast to the EU-NSF recommendations above, which are centered on how digital repositories might work.  In fact, as the reader will see in the description of Trustworthy Digital Object methodology, most of the new software needed for digital preservation is workstation client software rather than repository server software!

Figure 1: Information transmission channels, identifying human roles and intermediate object copies, 0 through 10, the names for document instance representations.

Figure 1 helps us discuss communication reliability challenges.  Since eventual users of preserved information might suffer harm or loss if they are misled, we pay attention to the potential distortions in the channel that transmits an input 1 to become a replica 9.[25]  This suggests the technical challenges of digital preservation—finding, demonstrating, and testing methods for:

·      Ensuring that a copy of every preserved document survives as long as it might interest someone;

·      Ensuring that consumers can use any preserved document as its producers intended, avoiding errors introduced by third parties that include archivists, editors, and programmers;

·      Ensuring that any consumer has the information to decide whether information received is sufficiently trustworthy for his use;

·      Hiding information technology complexity from end users (producers, archivists, and consumers);

·      Minimizing labor costs by automating clerical steps; and

·      Empowering editors to package information so as to relieve overloading of professional cataloguers.

For economic practicality, viable solution proposals must allow both repository institutions and also individual users to exploit already deployed and expected future technology[26] without disruption—technology offerings from third parties in an open market.  These must conform to software interface standards and conventions that permit “mix and match” from competing providers—standards and conventions that, over time, will be improved over today’s versions.

TDO Digital Preservation Progress

Thibodeau described the state of digital preservation know-how with: [27]

“The state of affairs in 1998 could easily be summarised:

·          proven methods for preserving and providing sustained access to electronic records were limited to the simplest forms of digital objects;

·          even in those areas, proven methods were incapable of being scaled to a level sufficient to cope with the expected growth of electronic records; and

·          archival science had not responded to the challenge of electronic records sufficiently to provide a sound intellectual foundation for articulating archival policies, strategies, and standards for electronic records.”

We believe that we know an in-principle solution to every technical problem alluded to and that much of this insight is documented in a form permitting objective and specific critiques.  Before indicating where this work can be found, let us point out that we could not have progressed without explicitly focused attention to three well-known elements of scientific and engineering methodology: (1) careful attention to the interplay between the objective (here, tools that could be brought to bear) and the subjective (human judgments, opinions, and intentions that cannot flourish in too tightly controlled circumstances); (2) focus on the actions of individual people, rather than on the abstractions that we call “-ities” (authenticity, integrity, quality, …); and (3) “divide and conquer” into manageable pieces that build on and allow other people’s contributions.  For digital preservation, we see the following topics that interact relatively lightly and that can therefore be almost independently handled.[28]

Figure 2: TDO structure

 I.      Some number of socially communicated languages and standards that are not themselves parts of the technical solution, but that are needed starting points.

II.      Packaging (encapsulation) of a work together with metadata that includes provenance documentation and articulation of the links (references) binding TDO pieces with each other and with external packages. External TDOs are essential context for correct interpretation and evaluation of any work. (See Figure 2.)

III.      Topic-specific ontologies provided and maintained by academic and other professional communities.

IV.      A blob-encoding scheme to represent each content piece in language that is insensitive to irrelevant and ephemeral aspects of its current environment, and that therefore protects what is essential from the ravages of technology obsolescence and fading human recall.

V.      Repositories (a.k.a. Digital Libraries or Content Managers) that store packaged works, and that provide search and access services whereby information consumers can find and obtain what interests them.

VI.      Replication mechanisms that protect against the loss of the last remaining copy of any work.

Some of our documentation (work in progress since mid-2002) is, or will soon be, available on-line in preprint form, as follows:

What Do We Mean by Authentic? has been published (D-Lib Magazine 9(7), July 2003).  It shows what vernacular meanings of ‘authentic’—meanings that are different for different object genres—have in common and how to construct the objective definition needed for preservation work.

Trustworthy 100-Year Digital Objects: Evidence After Every Witness is Dead has been published (ACM Trans. Information Systems 22(3), 406-436, July 2004).  It continues to be available from the ERPAnet preprint server.  Focusing on the second challenge above, it describes the structure and use of TDOs (Figure 2), including some key architectural elements that will be implemented in XML:

(i)     Each TDO contains its own world-wide eternal and unique identifier and its own provenance metadata and is cryptographically sealed;

(ii)    External references are also sealed together with the identifiers of their referents;

(iii)  A network of certification keys is grounded in published and frequently changed keys of trustworthy institutions.  Final sealing of a preserved document by such an institution creates durable evidence of its deposit date.
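Elements (i) and (ii) can be illustrated with a minimal sketch.  It is not the TDO format from the paper: the field names, the JSON serialization (the paper speaks of XML), and the function names seal_tdo and verify_tdo are all assumptions made for illustration.  HMAC with a shared key stands in for the public-key signature that a trustworthy institution would actually apply, and the certification key network of (iii) is not modeled at all.

```python
import hashlib
import hmac
import json


def canonical_bytes(record: dict) -> bytes:
    """Serialize deterministically so identical records always hash identically."""
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def seal_tdo(identifier: str, provenance: dict, content: bytes,
             references: list, key: bytes) -> dict:
    """Bundle identifier, provenance, content digest, and references, then seal them.

    ASSUMPTION: HMAC-SHA256 is only a stand-in for an institutional
    public-key signature; the field names are hypothetical.
    """
    body = {
        "id": identifier,
        "provenance": provenance,
        "content_sha256": hashlib.sha256(content).hexdigest(),
        # Each reference is sealed together with its referent's identifier.
        "references": references,
    }
    seal = hmac.new(key, canonical_bytes(body), hashlib.sha256).hexdigest()
    return {"body": body, "seal": seal}


def verify_tdo(tdo: dict, content: bytes, key: bytes) -> bool:
    """Check the seal over the metadata and the digest of the content."""
    expected = hmac.new(key, canonical_bytes(tdo["body"]), hashlib.sha256).hexdigest()
    seal_ok = hmac.compare_digest(expected, tdo["seal"])
    content_ok = tdo["body"]["content_sha256"] == hashlib.sha256(content).hexdigest()
    return seal_ok and content_ok
```

Because the identifier, the provenance record, and the referent identifiers all sit inside the sealed body, changing any one of them (or the content itself) makes verify_tdo return False.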

Trustworthy 100-Year Digital Objects: Durable Encoding for When It's Too Late to Ask has been submitted for publication in the form available from the ERPAnet preprint server.  It teaches a method of encoding any kind of data whatsoever to be forever useful. This method would be applied to most kinds of content blob called for in Figure 2.  Its key ideas include:

(iv)   That we can and must enable information producers to separate irrelevant environmental information from information essential to each producer’s intentions, encoding only this essential information.

(v)    That extended Turing-complete virtual machines can represent anything that can be written;

(vi)   And that such machines can themselves be described completely and unambiguously.
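Points (v) and (vi) may be easier to picture with a toy example.  The sketch below is not the virtual machine defined in the paper; its instruction set is invented here, and it is deliberately too small to support any claim of Turing completeness.  What it does show is that a machine's entire specification can fit in a few lines of documentation, and that content can then be stored as a program for that machine rather than in an environment-specific format.

```python
def run(program):
    """Interpret a program for a tiny, fully specified stack machine.

    Complete instruction set (hypothetical, for illustration only):
      ("push", n)  push integer n
      ("dup",)     duplicate the top of the stack
      ("add",)     pop b, pop a, push a + b
      ("sub",)     pop b, pop a, push a - b
      ("emit",)    pop a code point and append its character to the output
      ("jnz", t)   pop a value; jump to instruction t if it is nonzero
      ("halt",)    stop and return the output
    """
    stack, out, pc = [], [], 0
    while True:
        op, *args = program[pc]
        pc += 1
        if op == "push":
            stack.append(args[0])
        elif op == "dup":
            stack.append(stack[-1])
        elif op in ("add", "sub"):
            b, a = stack.pop(), stack.pop()
            stack.append(a + b if op == "add" else a - b)
        elif op == "emit":
            out.append(chr(stack.pop()))
        elif op == "jnz":
            if stack.pop() != 0:
                pc = args[0]
        elif op == "halt":
            return "".join(out)
        else:
            raise ValueError(f"unknown instruction: {op}")


# A stored "document": a loop that emits code points 65, 66, 67.
PROGRAM = [
    ("push", 65),  # 0: start at 'A'
    ("dup",),      # 1: loop entry
    ("emit",),     # 2: output the current character
    ("push", 1),   # 3
    ("add",),      # 4: advance to the next code point
    ("dup",),      # 5
    ("push", 68),  # 6
    ("sub",),      # 7: zero once the code point reaches 68
    ("jnz", 1),    # 8: repeat until then
    ("halt",),     # 9
]

print(run(PROGRAM))  # prints "ABC"
```

A future reader who has only the docstring and the stored program could rebuild the interpreter on whatever hardware then exists and recover the same output.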

Trustworthy 100-Year Digital Objects: Syntax and Semantics—Tension between Facts and Values has been submitted for publication in the form available from the ERPAnet preprint server.  It provides epistemological arguments justifying that the methods described in the immediately prior two papers do as much as mechanical methods theoretically can do towards preserving digital information, and that these methods attempt no more.  It further argues that the TDO methodology defines a quality standard against which any digital preservation method should be judged.

Trustworthy 100-Year Digital Objects: What's Meant?  Intentional and Accidental in Documents is half done.  We expect to post a preprint version on the ERPAnet server before the next DDQ number is published.  It will use early 20th-century philosophy to examine what information producers can do to minimize eventual readers’ misinterpretations, given that communication invariably confounds what it intends to convey with accidental information.

Request for critical reviews

The reader will surely notice that we point at no Web site for downloading software that would put the described ideas to work.  What we have so far are only limited prototypes.

In addition to the obvious administrative reasons for such a temporary shortfall, there is a compelling reason to "get it right".  The creation and use of a flawed preservation method would be accompanied by significant risk that the flaw(s) might not be discovered until many years later, and until after a large investment had been made into creating archival holdings that proved to have errors that sometimes distorted their meanings (for texts) or actions (for programs). 

We believe systematic errors to be of more concern than (mere) programming implementation errors.  Such systematic errors include questions that reach into epistemology—the philosophical theory of what is knowable, in contrast to what must forever remain questions of belief and/or taste.  We are therefore reluctant to build and release any portions of our projected solution until we believe that appropriate experts have examined and challenged our arguments.

We claim that correct TDO implementations:

·      Would allow preservation of any information that can be saved;

·      Would be as efficient as any competing solution (none has yet been proposed);

·      Could be brought into service without disrupting any repository service; and

·      Need not include any proprietary software.[29]

Therefore we request the most searching critical examination of the work described that readers can provide, and communication of their views concerning our errors and omissions.  We would be happy with either private or public communication, and actually prefer public criticism over private.  “Getting it right” is simply too important for anything short of complete transparency.

Another Way to Make Documents Trustworthy

A remark whose source I do not recall (perhaps an Andrew Waugh article?) suggests a different method of making testable the authenticity of a preserved document.  If the same document has been independently stored in several individually credible repositories, its eventual consumer can test that the supposedly independent instances are sufficiently similar.

For this to be proof against fraud, there must be accessible unforgeable evidence that the document’s producer himself delivered each instance to a credible independent repository, rather than that a single deposited instance was copied among repositories.  This might be made verifiable by the firm binding of each repository’s credible assertion that it surely received its instance from the producer rather than from some third party—a provenance certificate for its holding.

Any reader who cares to do so can surely work out the details whereby a repository can test, prove, and certify that the provider of a document copy is also its producer.[30]
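The consumer-side comparison might be sketched as follows.  This is an illustration under stated assumptions, not a worked-out protocol: the function name check_agreement and the quorum parameter are invented here, exact byte equality (via SHA-256 digests) stands in for the looser test of “sufficiently similar”, and the provenance certificates that guard against a single instance being copied among repositories are not modeled.

```python
import hashlib
from collections import Counter


def digest(data: bytes) -> str:
    """SHA-256 digest of one repository's copy, as a hex string."""
    return hashlib.sha256(data).hexdigest()


def check_agreement(copies: dict, quorum: int):
    """Compare copies fetched from several repositories.

    `copies` maps a repository name to the bytes it returned.
    Returns (agreed_digest, dissenting_repositories), or
    (None, all_repositories) when fewer than `quorum` copies agree.
    """
    if not copies:
        raise ValueError("no copies to compare")
    digests = {repo: digest(data) for repo, data in copies.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes < quorum:
        return None, sorted(digests)
    dissenters = sorted(repo for repo, d in digests.items() if d != winner)
    return winner, dissenters
```

For example, with three repositories of which two return identical bytes and quorum=2, the function returns the shared digest together with the name of the one repository whose copy differs.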

Faintly Ironical

Suzy Palmer, Editor-in-Chief of Microform and Imaging Review, recently circulated a call for comments on an Association of Research Libraries (ARL) report, Recognizing Digitization as a Preservation Reformatting Method.  Its prefatory statement included, “Over the past several years, libraries have moved towards using digitization as an additional method for reformatting endangered and fragile paper-based materials to both preserve and provide access to library collections.”

Of course we believe the ARL move reasonable.  We nevertheless see the announcement as faintly ironical.  The irony is created by a preservation context replete with published hyperbole about digital documents being relatively fragile compared to paper-based documents.[31]

Query: What Was New in Digital Library?

A possibility for some future DDQ number is a description of the seminal architectural ideas behind digital libraries.  Samples of the insights that we have in mind are:

·        The “unit of work” notion for integrity of database transactions.  (I learned this from the IBM Research designers of the first relational database prototypes.)

·        That it would be necessary to combine file servers with database servers to obtain acceptable repository performance.  (I learned this in 1987 from David Choy, who had considered database system designs in the light of how IBM’s OS/MVS passed character strings between subroutines.)

·        IBM’s Data Links technology that permits a database management system to assume administrative control of files without requiring any change of existing programs that use and modify them.[32]  (This was invented in about 1993 by Luis-Felipe Cabrera.)

These examples illustrate that I am most familiar with IBM Research contributions.  I am concerned that I might overlook ideas from other sources, and plan not to publish the prospective article until I am confident that any blindness is remedied.  I therefore request readers’ suggestions of the seminal ideas that enable current and future digital library design.

Linux and Open-Source

LinuxWorld and Software Selection

Since I hope to escape the Microsoft near-monopoly some day, I have several times attended the annual LinuxWorld trade show in San Francisco.  I have yet to find what I’m looking for.

The August show included SW components of potential interest for every kind of document management service.  The resources and skills devoted to scaling, performance, economy, and reliability of repository components seem to be far greater than can be funded by NDIIPP, making it essential to design preservation solutions that leverage what others are already working on.

There is a mismatch—a semantic dissonance—between the language and expectations of many digital preservation community spokespersons and those of the technology vendors (e.g., with respect to ‘scaling’ in the research recommendations above).  Current emphasis among technology vendors is on components, whereas cultural depositories want customizable “solutions”.

Part of today’s commercial response is to offer “services”.  For instance, roughly half of IBM’s 2004 revenue will be from contract services, a business sector that hardly existed 5 years ago.[33]  This phenomenon contributes to another cultural mismatch: academic libraries are not emotionally, practically, or financially prepared to use such outside services, even though they do not seem to have sufficient internal skills for the middleware component of digital repository services.

What was offered at the LinuxWorld trade fair was confusing in the sense that I saw no broadly accepted model by which the components offered could be assembled into solutions.  Perhaps this is a passing problem, with “middleware” models yet to be invented and standardized, as has occurred repeatedly in the history of EDP refinement of lower component layers.  Several trade fair booths urged the adoption of layer interface standards.

Linux Desktops and Laptops: Has Their Time Come?

Red Hat recently announced a new desktop Linux variant, including corporate support, the GNOME interface, and the Evolution PIM.  The company expects the cost to be about $70 per desktop per year.

In “Linux on desktop gaining OS race”, the San Jose Mercury News (SJMN) technology columnist Dan Gillmor writes, “Linux may be just fine for a second desktop system at home.  But for corporate road-warriors it's still not quite ready.”  See also a Linux Journal article and an ACM Queue article, “Desktop Linux: Where Art Thou?”

BusinessWeek and Ziff-Davis reporters visited the LinuxWorld trade show looking for Linux laptops and desktops.  Ziff-Davis sums up their disappointment with, “Early on this week, we thought this year's LinuxWorld would be a desktop lovefest. Alas, it appears we were too optimistic, …  so we'll have to wait even longer for a real Linux-based competi