Digital Document Quarterly

Perspectives on Trustworthy Information

Volume 3, Number 1, 1Q2004

HMG Consulting

Saratoga, CA 95070

© 2004, H.M. Gladney

ISSN: 1547-8610

Some problems are so complex that you have to be highly intelligent and well informed just to be undecided about them.  (Attributed to Laurence J. Peter)

Digital Preservation

Preserving Office Records

Until now, DDQ preservation discussion has focused on works of individual authorship—cultural works.  It has almost ignored preservation of office records constituting business transaction audit trails that are often legally required.  The practices surrounding office records are different from those for cultural works.  For instance, what the (U.S.) National Archives and Records Administration (NARA) is emphasizing differs significantly from what the Library of Congress (LoC) is working towards.

What follows focuses on semantic and technical factors, ignoring other differences between office records and cultural works.  NARA digital record management plans are influenced by the following factors:

•  The content of business archives is evidence of the quality of its source agencies’ work.

•  Losing almost any archival collection would have readily identified legal and practical consequences.

•  Governmental collections are mostly not encumbered by third party copyright.

•  Since preservation is mandated, funding for NARA’s digital archive is relatively secure.

•  The cost of creating each office record tends to be much less than that for each cultural work.

Consider current activity at NARA and at LoC.  It should not be surprising that what is urgent in managing office records might be different from what is urgent for cultural works.  The topics and language of articles by NARA and its San Diego Supercomputer Center (SDSC) research partner[1] are so different from those in other digital preservation literature that some readers might find their papers difficult to understand.[2]  How can NARA be ready to acquire a large-scale digital archive system[3] when LoC seems far from understanding what digital preservation service it might need? [4] 

This puzzled me even though an explanation had been visible for three years.  Perhaps my focus on preserving cultural works blinded me to the different circumstances of office record collections.  The information flow of typical cases (Figure 1 and Figure 2) is sufficient to explain quite different technical emphases and software solutions.  Consider just two aspects among many: the prior history of a typical accession into a long-term repository and the tension between the content of each accessioned object and what that content is intended to convey.

For commercial and national archives, the accession unit will be a collection of related office records (e.g., correspondence between a foreign office and an embassy), records that individually derive historical context from their siblings and from collection metadata.  Each collection member is a ‘record’ in the sense meant by professional archivists, being information about a specific historical event whose context is communicated by metadata and by the member’s position among and relationship to siblings.  The metadata include format and content rules that often antecede individual records and that might include business control statements such as retention rules.

In contrast, a typical research library holding is a work of individual authorship that some professional cataloguer accessioned into the collection without much accompanying evidence of its historical significance or its relationships with other holdings.  Historical information and relationships are typically added to library contents to only a limited extent by a library employee (a cataloguer) and perhaps more comprehensively by scholars years later.[5]

Figure 1: Information flow for digitally preserved office records[6]

Figure 1 illustrates that each NARA accession is likely to be a collection that has been bounded by articulated rules and procedures refined over several years by a government agency, and that has been subject to administrative control and curation similar to that provided by archives.  The purposes and structure of each agency collection are likely to be documented, and the accessioning archivists will almost surely have the opportunity to collaborate with agency administrators to refine metadata, to enhance ontologies, to determine the bounds of each collection, and to augment information about the collection's significance.[7]  Individual office records are likely to be similar to other office records in the same collection, and such details are likely to be understood by records administrators.[8]  Research libraries are not likely to be provided similar information by the authors or editors of cultural works.

The author of a cultural work (Figure 2) is likely to want to convey original conceptual structures and complex relationships with prior works.  Much of the author's effort will have gone into representing mental constructs in ways that help readers form similar ones.  In turn, a diligent reader will want to tease the author’s ideas from the written representation, even though this reader cannot converse with the author.

In contrast, people rarely care as much what an authoring bureaucrat thought as they do about the relationships of each record to other office records and to the agency’s objectives.  The written representation tends to be more important than authors’ intentions.  In some cases, the author’s thoughts about his output are administratively pre-empted by the content.  For instance, in contract litigation, the written words have unconditional priority over what the agreeing parties might have intended.

The specific words (symbols) used in office record collections are of interest, particularly if each is used similarly wherever it occurs.  This last is likely to be encouraged by agency glossaries.  Similar jargon might occur even without administrative encouragement because employees share culture.  Furthermore, the typical size of NARA accession units is larger than that of research library holdings.  For such simple reasons, an office record collection is likely to have many more occurrences of each pattern than any cultural work, and the number of relationship instances within such a collection is likely to be much greater than that within or between cultural works.

Figure 2: Information flow for preserved intellectual works[6]

Such circumstances tend to make ontological analysis interesting for office collections and suggest why ‘knowledge management’ is a high priority in SDSC investigations, whereas research librarians have long been mired in discussions about standardization of terms of reference and subject categories.[9]

Prior DDQ numbers have described how to ensure perpetual usefulness of individual digital objects and related authenticity evidence.  We believe the methods they describe will be useful also for office records, but do not yet understand the issues sufficiently to assert that it will be so.  Part of our difficulty is that the NARA/SDSC publications[10] provide neither examples of the knowledge rules they allude to nor collection examples that help the reader guess what these rules might be.

Ironic?

Pessimism about digital preservation seems to infect the research library and archive community.  Recent writings describe the circumstances as ‘ironic’, along the lines illustrated by the boldface type in:   

“The problem faced by [those] who aim to preserve history by preserving [digital] records is that [they] … may be as ephemeral as messages written in the sand at low tide …  It is ironic that the primitive technology of ancient times has produced records lasting hundreds of years, while today’s advanced electronic world is creating records that may become unreadable in a few years’ time.[11]

“The correct interpretation of records has always required knowledge of the language in which they were written, and sometimes of other subjects too ….  Fortunately enough of this knowledge has survived that we can make sense of most of the records that have come down to us. …  Just as interpretation of the 1086 Domesday Book depends on the dictionaries and grammars for medieval Latin painstakingly compiled by long-dead scholars, interpretation of contemporary electronic records … will only be possible if the necessary methods and tools are … preserved now.”  [Darlington]

Nobody has given persuasive reasons why ‘ironic’ might be apt.  We are left to guess why people say so.

The putative statistic on which 'ironic' is based involves an unreasonable comparison, viz., the fact that some old paper documents have survived, compared to the fact that some digital documents might not survive.  Consider the following historical and economic factors.

(1)     A plausible comparative statistic is ‘storage effectiveness’—integrating content amount over time.  Some paper has stored about 3000 characters per page for 500 years, i.e., for a 300-page book the retention has been about 5*10**8 character-years.  A hard disk drive (of roughly the same size, weight and price as a book) can be counted on to save about 100 gigabytes for at least 5 years, i.e., about 5*10**11 character-years.  With this measure, magnetic technology is 1000 times more effective as a storage medium than is paper.  (‘1000 times’ is a conservative estimate.[12])

(2)     Today's digital preservation quality measures are more rigorous (roughly <1 undetectable character error in 10**10) than those that have been or are still now expected for documents stored on paper.

(3)     It being unnecessary, we are unwilling to work as hard on modern information as did the “long-dead scholars” who “painstakingly” compiled ancient dictionaries and grammars.

(4)     Technology offerings respond to markets.  The marketplace has not asked for long-term retention.[13]  Instead, what people have asked of digital technology is fast search, fast access from a distance, and immense capacity—qualities neither expected of nor delivered by paper.

(5)     We began to share digital objects only 20-30 years ago.  Over roughly 2000 years society has built an immense infrastructure and invested heavily in education for using paper.  It's hardly surprising that such infrastructure and education have not yet been matched by digital equivalents—especially not for applications for which no widely expressed demand exists.[14]

(6)     Rather than long-term preservation, the commercial market for records management apparently wants controls and automation for discarding records as soon as the law permits and internal needs have been satisfied.[15]

(7)     In-depth professional discussions of digital preservation started only about five years ago.  Plausible solutions for the technical components have been identified in prior DDQ numbers[16] and elsewhere.

(8)     The Internet Archive is saving a significant fraction of Web-accessible data.  Its Recall search service has indexed the text of over 10**10 pages.[17]  An IBM Research service called WebFountain™ has gathered a 500-terabyte database for analysis.[18]  There is little a priori reason to doubt that such collections can be made to survive forever.
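
The ‘storage effectiveness’ comparison in item (1) is simple enough to check by hand, but a short sketch makes the arithmetic explicit.  The figures are the round numbers quoted in item (1); the function name and the treatment of one byte as one character are assumptions made only for illustration.

```python
# Storage effectiveness = amount of content x retention time (character-years).
# All inputs are the article's round estimates, not measurements.

def storage_effectiveness(characters: float, years: float) -> float:
    """Return the character-years retained by a storage medium."""
    return characters * years

# Paper: about 3000 characters per page, a 300-page book, kept for 500 years.
paper = storage_effectiveness(3000 * 300, 500)

# Disk: about 100 gigabytes (assumed here to be 100e9 characters), kept 5 years.
disk = storage_effectiveness(100e9, 5)

ratio = disk / paper
print(f"paper: {paper:.1e} character-years")
print(f"disk:  {disk:.1e} character-years")
print(f"disk/paper ratio: {ratio:.0f}")
```

The unrounded paper figure is 4.5*10**8 character-years, which item (1) rounds to 5*10**8; the ratio therefore comes out slightly above 1000, consistent with the ‘1000 times’ estimate.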

Arguably, digital preservation lags digital access because society values rapid gratification over enduring value.  Perhaps Darlington calls the situation ‘ironic’ because National Archives personnel want quick gratification of their priority—durable copies.  If so, that would be ironic!

Progress in Digital Preservation Services

The Open Archives Initiative

Participation in the Open Archives Initiative (OAI) seems to be growing.  If you have not been following this, DDQ recommends “Using OAI … Differently”, “The Expanding World of OAI”, and also:

•  The specification of the OAI Protocol for Metadata Harvesting (OAI-PMH);

•  A Guide to Institutional Repository Software describing systems (ARNO, CDSware, DSpace, Eprints, Fedora, i-Tor, and MyCoRe) intended to allow an institution to implement an OAI-compliant repository without resorting to in-house technical development;[19]

•  OAISTER from the University of Michigan, a collection of freely available, previously difficult-to-access, academically-oriented digital resources that are easily searchable by anyone.  As of 6th February 2004, this held 3,016,267 records from 267 institutions;

•  The RoMEO Project (Rights MEtadata for Open archiving), a project to investigate the rights issues surrounding the 'self-archiving' of research; and

•  A position on Priorities for OAI Community.
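
For readers who have not seen OAI-PMH in use, a harvesting request is an ordinary HTTP query against a repository's base URL.  The sketch below only builds such a URL and does not contact any server.  The endpoint is a made-up placeholder and the helper name is hypothetical; the query parameters (verb, metadataPrefix, set, from) and the oai_dc metadata format are those defined by the OAI-PMH specification.

```python
from urllib.parse import urlencode

# Hypothetical endpoint for illustration; substitute a real repository's base URL.
BASE_URL = "https://repository.example.org/oai"

def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL (helper name is illustrative)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec      # optional: restrict the harvest to one set
    if from_date:
        params["from"] = from_date    # optional: only records changed since this date
    return f"{base_url}?{urlencode(params)}"

print(list_records_url(BASE_URL, from_date="2004-01-01"))
```

A harvester would fetch this URL, parse the XML response, and repeat the request with the resumptionToken that the repository returns until the list is exhausted.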

PRONOM: File Format Information

The Public Record Office PRONOM service collects information about the file formats of electronic data, about the software products required to create, render, and migrate objects in these formats, and about the vendors that support them.[20]  The PRONOM database is Web-accessible and produces reports in various formats.  It currently documents ~550 file formats, ~250 software products, and ~100 vendors.

Preserving Personal Pictures and Records

For the home computing enthusiast, PC Magazine suggests “ways to ensure that the contents of your discs are readable down the road and to set up a backup plan to keep your archives safe.” [21]  These guidelines seem sufficient for preserving digital photographs and personal data for 25 years or longer, i.e., at least until current preservation research matures into practical offerings.  Surely research libraries can work out practical equivalents for their scales and environments!

Epistemology and Software

Gödel's Mathematical Proof that God Exists

Attempts to prove the existence of God reach back at least to St. Thomas Aquinas.  A February 2004 Google search for Web pages containing (“existence of God”+“proof”) yielded 49,100 ‘hits’, including a Web page identifying over 300 'proofs'.  Bertrand Russell wrote:[22]

“Intellectually, the effect of mistaken moral considerations upon philosophy has been to impede progress to an extraordinary extent.  I do not myself believe that philosophy can either prove or disprove the truth of religious dogmas, but ever since Plato most philosophers have considered it part of their business to produce "proofs" of immortality and the existence of God.  They have found fault with the proofs of their predecessors—Saint Thomas rejected Saint Anselm's proofs, and Kant rejected Descartes’—but they have supplied new ones of their own.  In order to make their proofs seem valid, they have had to falsify logic, to make mathematics mystical, and to pretend that deep-seated prejudices were heaven-sent intuitions.”

A Web page details Kurt Gödel's 1970 Ontological Argument, a mathematical proof that caused a stir among Gödel's colleagues.  Consider an abbreviated version:[23] 

Axiom 1

(Dichotomy) A property is positive if and only if its negation is negative.

Axiom 2

(Closure) A property is positive if it necessarily contains a positive property.

Theorem 1

A positive property is logically consistent (i.e., possibly it has some instance.)

Definition 1

Something is God-like if and only if it possesses all positive properties.

Axiom 3

Being God-like is a positive property.

Axiom 4

Being a positive property is (logical, hence) necessary.

Definition 2

A property P is the