Digital Document Quarterly

Perspectives on Trustworthy Information

Volume 1, Number 1, 1Q2002

 

 

 

HMG Consulting

20044 Glen Brae Drive

Saratoga, CA 95070

(408) 867-5454

©  2002, H.M. Gladney

 

 

 

Whenever misleading information might cause serious business damage or loss, its users should rely only on sources known to be reliable.  Information obtained through any channel not known to assure authenticity is not dependable enough for critical decisions.  For instance, the World Wide Web is not reliable for important facts; they should be obtained or confirmed through other channels.  For applications where secrecy is important, people should be skeptical about e-mail that comes from unverified sources or that is not sealed for confidentiality.

Information consumers do not yet seem much concerned by the time and energy needed to make prudent authenticity and provenance checks.  That might be because relatively little information is critical and because deliberate misinformation is still rare in digital documents.  It might also be that consumers are not being careful enough.  Sooner or later, however, people will become concerned with fraud risks.

As business, government, and cultural records migrate from paper to digital media, the importance of digital archives will increase.  Currently, enterprises considering creating and managing repositories know that a stored document might be important 5 to 100 years later, and that technical obsolescence might by then make it irretrievable in any meaningful way.  For instance, pharmaceutical development records must be held until the risk of lawsuits subsides, many years after the products are sold. [1]  For reasons that DDQ will gradually expose, doing this safely and inexpensively is not general practice today.

Starting with the current number and extending over others, DDQ will address the core challenges of digital preservation: data integrity and information trustworthiness.  Data integrity: if we save a digital record, this must be done so that the record can surely be used as intended, even if this occurs only many years after the record was generated, and even if its author is no longer available to explain missing details.  Information trustworthiness: we need to structure each saved document so that its eventual user can determine whether its content is sufficiently trustworthy for his or her purposes.

Introduction to Digital Preservation

DDQ will suggest the magnitude of the technology changes driving the current Information Revolution.  One consequence is that almost all business records and intellectual works are now created in digital form, even if they are later to be published on paper or audio-visual media, or realized in some material medium (such as buildings, highways, and machine tools).  Some digital records cannot be converted to print form without losing information.[2]

Currently available software does not include good tools for saving digital originals in the face of certain hardware and software obsolescence.  Most of what’s created is likely to become unusable in about 10 years.  If all that stuff is worth writing in the first place, surely some of it is worth saving! [HMG]  Starting below and continuing in its next number, DDQ will sketch opinions about parts of this concern. 

This first DDQ number suggests why taxpayers should pay attention.

·         Why is digital preservation suddenly urgent?  The U.S. Government recently granted a great deal of money to support it.  However, the needed technology and infrastructure are not in place.

·         What kinds of challenge need to be addressed?  The challenges include legal, policy, organizational, managerial, educational, and technical aspects.  Although much has been written since, the best analysis is [Garrett].  Perhaps the most difficult challenge is selection of what to save. 

·         Among these challenges, what are the technical components?  Only one fundamental problem impeded digital archiving until recently—how to preserve information through technology changes.  This was recently solved, but the correctness and practicality of the solution are still to be demonstrated.  The other challenges are merely engineering and solution deployment issues.

·         What is wrong with the proposed direction of prominent research libraries?  The published proposals focus on how archives should be managed, instead of on what’s needed to avoid losing documents and needed to make saved documents trustworthy.

·         Why might citizens be concerned?  The Library of Congress is to manage the appropriation mentioned above.  However, its ability to manage digital information prudently is called into question by a recent National Academies study.

Of the several dimensions of digital preservation, DDQ will focus on the technical aspects.  We will need several issues to do that justice. 

Why Pay Attention to Digital Preservation?  U.S. Government Support

In late 2000, little-known legislation provided generously for digital preservation work to be managed by the Library of Congress. [Approp]  If you inspect this legislation, you’ll see that the funds will be between $25M and $175M, depending on how successful the Library is in raising philanthropic grants.  In the best of worlds, this would be cause for rejoicing among people concerned for the cultural heritage.  However, essential technical methods are not yet agreed on, much less available, and the National Academies study [LC21] suggests that the Library lacks essential skills.

For research libraries that usually have only small discretionary funds, $100M is a very large sum.  Arguably, however, the most efficient technology for the purpose is neither proven nor even widely known in the professional community. 

As big as the risk might be that $100M is spent inefficiently, there is a larger risk—that the public learns, after the fact, that the project went forward using inferior technology; that would discourage future funding.

An Example of Technical Risk: “Trusted Systems”

The kind of misunderstanding that puts expenditures at risk is illustrated by Attributes of a Trusted Digital Repository, a Research Libraries Group proposal posted for criticism 9 months ago.  It builds on a careful analysis of an idea called the “Trusted Computing Base (TCB).”  TCB architecture was designed in the 1970’s for defense applications, rather than for delivery of mostly public information.  Using it is likely to lead the library community into adopting systems, infrastructure and internal methodology far more expensive than needed to achieve the objectives tabulated in the cited study, [RLG].

Why is this so? 

Warning bells should have been heard immediately, because TCB architecture was devised for military intelligence applications that must forestall security risks unlikely in civilian library applications.  A TCB is intended for reliable execution of arbitrary programs whose results cannot be independently validated, except perhaps by further expensive calculations.  Digital archives face no such risk.  Any digital security person would know this,[3] so an early question is, “Why were no experts on computing security consulted?”[4]

The reader might easily see for himself why TCB architecture is inappropriate.  A military surveillance system needs to be proof against infiltration that modifies its software to hide targets such as ground forces in Afghanistan.  In contrast, archives merely regurgitate what they were fed; they have only two critical kinds of output: reproductions of stored documents and search results.  A well-known security technology, called message authentication codes, would allow users and auditors to validate such outputs inexpensively.  I.e., TCB is expensive overkill.
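As a minimal sketch of how inexpensive such validation can be, the following Python fragment seals a document with a message authentication code and later checks a delivered reproduction against that seal.  The function names, the sample key, the sample document, and the choice of HMAC with SHA-256 are assumptions made for illustration; none is taken from a cited proposal.  An HMAC relies on a secret key shared by sealer and checker, so a public deployment would more likely use digital signatures, which let anybody check a seal without being able to forge one.

```python
import hashlib
import hmac


def seal(document: bytes, key: bytes) -> str:
    """Return a hex message authentication code for the document bytes."""
    return hmac.new(key, document, hashlib.sha256).hexdigest()


def is_authentic(document: bytes, key: bytes, tag: str) -> bool:
    """Recompute the code and compare it in constant time with the stored one."""
    return hmac.compare_digest(seal(document, key), tag)


# Hypothetical key and document, for illustration only.
KEY = b"example-key-held-by-the-sealing-institution"
original = b"Minutes of the board meeting, 14 March 2002."
tag = seal(original, KEY)

faithful_copy = bytes(original)
altered_copy = original.replace(b"14 March", b"15 March")

print(is_authentic(faithful_copy, KEY, tag))  # True
print(is_authentic(altered_copy, KEY, tag))   # False
```

The check costs one hash computation per delivered document, which suggests why the validation can be cheap compared with certifying an entire computing system.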

A Bigger Problem Called “OAIS”

The Digital Library Federation (DLF)[5] is pursuing design and implementation based on a model called the Open Archival Information System (OAIS), but nobody seems to be asking whether this is the best available approach.  Like TCB architecture, OAIS comes from outside the research library community; it comes from space agency laboratories and furthermore was not intended by them to be an architecture, much less a technical design.[6]  The objectives, organization, management, and infrastructure of space agencies are radically different from those of research libraries.  OAIS might be an appropriate starting point, but it seems risky to accept this without inquiry.  Yet, among a score or more publications planning for OAIS use for research libraries, the question is not even asked, much less answered.

OAIS focuses on the processes within an archive rather than the archival product or service as seen by clients.  What archive users want is trustworthy information.  They do not particularly care whether any particular library service is trustworthy.[7]  Managing an enterprise to high standards is difficult.  It is even more difficult to demonstrate to auditors[8] and users that internal procedures are sufficient to ensure trustworthy output—particularly if trustworthiness might depend on internal procedures from the time when a document was deposited until when a user needs it 50 years later!

Managing the procedures inside an archive might be necessary to achieve reliability in an archive based on paper, but it is neither necessary nor adequate to achieve reliability for digital documents.  Proof of digital document authenticity and provenance can be achieved by message authentication codes signed by trusted institutions and included in the metadata integral to each document delivery.  E.g., while I would value a reliable British Library signature certifying the authenticity and provenance of some document I retrieved, I won’t much care whether the replica I received came from a British Library repository or from somewhere else.
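One way to picture authenticity evidence traveling with the document, rather than residing in the repository, is sketched below.  The envelope layout, the field names, and the key are hypothetical, and a keyed HMAC from the Python standard library stands in for the institutional signature; a real service would presumably use public key signatures so that any reader could verify a seal.

```python
import hashlib
import hmac
import json


def canonical_bytes(metadata: dict) -> bytes:
    """Serialize metadata deterministically so identical fields seal alike."""
    return json.dumps(metadata, sort_keys=True, separators=(",", ":")).encode("utf-8")


def make_delivery(content: bytes, provenance: dict, key: bytes) -> dict:
    """Bundle provenance and a content digest, then seal the bundle."""
    metadata = {
        "provenance": provenance,
        "content_sha256": hashlib.sha256(content).hexdigest(),
    }
    seal = hmac.new(key, canonical_bytes(metadata), hashlib.sha256).hexdigest()
    return {"metadata": metadata, "seal": seal}


def verify_delivery(content: bytes, delivery: dict, key: bytes) -> bool:
    """Check the seal over the metadata, then the content against its digest."""
    metadata = delivery["metadata"]
    expected = hmac.new(key, canonical_bytes(metadata), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, delivery["seal"]):
        return False
    digest = hashlib.sha256(content).hexdigest()
    return hmac.compare_digest(digest, metadata["content_sha256"])


# Hypothetical key, document, and provenance claims, for illustration only.
KEY = b"hypothetical-certifying-institution-key"
document = b"Text of a deposited report."
delivery = make_delivery(
    document,
    {"certified_by": "Example Library", "deposited": "2002-01-15"},
    KEY,
)

# A replica relayed by some other repository verifies just as well.
relayed = json.loads(json.dumps(delivery))
print(verify_delivery(document, relayed, KEY))  # True

# Tampering with the provenance claim breaks the seal.
relayed["metadata"]["provenance"]["certified_by"] = "Someone Else"
print(verify_delivery(document, relayed, KEY))  # False
```

Nothing in the verification step asks where the bytes were stored in the meantime, which is the point of the paragraph above.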

The DLF projects centered on OAIS include further problems of omission.  Some were communicated to the authors of [RLG] months ago, but they have not yet responded to these criticisms.  It would be pointless to detail them here; the problems above are sufficient grounds for concern.

What Can Be Done About Such Problems?

Identifying problems is not enough.  Each problem identification should be balanced by a solution proposal.  We’ll take up examples in the next DDQ number.  For the moment, we merely sketch the forthcoming approach.

DDQ will propose focusing on making digital documents sufficiently durable and sufficiently trustworthy for archival purposes.  Much of the necessary technology is also needed for document interoperability between differing computer systems; that technology is progressing rapidly and seems to be fundamentally sound.  Making documents trustworthy will also depend on message authentication codes, which in turn depend on public key cryptography and secure key-exchange infrastructure.  All these technologies are well tested in practice and by expert evaluations.  The fundamental problem alluded to above, preserving information against technology obsolescence, has been solved by Lorie [Lorie], who is building a demonstration vehicle.

If you trust the assertions in the prior paragraph, you’ll understand that a large part of what is still needed is to assemble the technical components mentioned into cohesive and easily used tools.  Such work should include prototypes that test and demonstrate that everything needed is present, feasible, economical, and convenient for archive managers and their clients.  Suitable pilot projects could be executed, including essential assurance tests, in about 18 months.  If the work were started promptly, deployment could begin on the time scale that the U.S. Government effort probably hopes for.[9]  However, no appropriate project has been funded, much less started.

When Was the Information Revolution?

"Prediction is very difficult, especially if it's about the future." — attributed to Niels Bohr, Nobel laureate in Physics.

 

The newspapers and trade press have for many years talked about the “Information Revolution”.  When did it occur, or is it yet to come?

Drucker[10] suggests that in IT (Information Technology), the “T” part has occurred and the “I” part is yet to happen.  He further suggests that we are currently experiencing a fourth Information Revolution. [Drucker]  Changes as disruptive as those that followed the inventions of engraving and of printing with movable type in the 15th century do not seem to have yet happened in this fourth Information Revolution.

As background for guessing what might unfold, Drucker reminds us of the history of printing.  He includes productivity estimates that depend partly on changing who did the work (monk-scribes were replaced first by printers and later by publishers).  Direct comparison would be difficult, because no corresponding modern social change has yet occurred and because no credible productivity estimates are available for knowledge workers. 

Instead DDQ starts below to describe the pace of technology change that is driving our own history.  We’ll see that, for the foreseeable future, the technologies are expected to improve at roughly the same pace as they did in the past decade. 

Physicists have a rule-of-thumb that we perceive an order-of-magnitude change[11] as a qualitative change.  The expected declining hardware prices may actually precipitate institutional and social changes.[12]  I.e., disruptive social change might still occur.  Some professional librarians seem uneasy about possible changes.[13]  They certainly have little reason to be confident that research libraries and archives will flourish in their current form, as you may come to understand from the discussion of digital preservation that will start below.  I’ve heard that enrollment in university library science departments has fallen, and that many of these are restyling themselves as faculties of information science.[14]  I.e., DDQ will discuss technology changes in order to alert its readers to possible social changes.

Understanding hardware technology changes has another purpose: deciding which system designs will be economical and perhaps optimal.  This number of DDQ will start a protracted discussion of digital preservation with an approach that might not be optimal today.  However, the expected hardware price changes will be seen to make certain proposed design elements practical in the very near future.  Specifically, using document storage space generously is becoming attractive.

How Quickly is Technology Changing?

The hardware price changes most easily understood are those of components that move almost directly from factories to consumers.  Reliable predictions exist for minicomputers.[15]  In contrast, telecommunications prices are difficult to track and predict because communication services are deployed through massive, multi-year infrastructure projects.[16]  Falling prices of persistent magnetic storage—hard disk drives (HDDs)—will revolutionize how documents are stored, organized, and used.

Cheaper HDDs tend to affect document management shortly after they become available in the marketplace.  What’s happening is summarized in a graph showing how much storage space $200 will buy.  Why $200?  Since desktop PCs sell for between $400 and $2000, nobody can save much money by reducing storage sizes below what about $200 will buy.  That’s partly because the manufacturing cost of HDD moving parts, electronics, and packaging is insensitive to recording density improvements.  The marketplace effect is that PC vendors steadily increase the storage capacity of their PC offerings.

Figure 1: How Much Disk Drive Storage Space Will $200 Buy?

Figure 1 shows steady cost/effectiveness improvement at 28% per annum[17], starting in 1991.  My former colleagues at the IBM Almaden Research Center, the leading laboratory for this technology, are confident that the same pace will continue for another five years, and are optimistic that the trend will continue for an additional five years.[18] 
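To get a feel for what a steady 28% per annum means, the short Python sketch below compounds it over five and ten years and computes the doubling time.  It assumes that “28% per annum” means the storage bought per dollar multiplies by 1.28 each year; the function names are illustrative, and no values are read from Figure 1 itself.

```python
import math


def growth_factor(annual_rate: float, years: float) -> float:
    """Cumulative improvement after compounding annual_rate for the given years."""
    return (1.0 + annual_rate) ** years


def doubling_time(annual_rate: float) -> float:
    """Years needed for a quantity growing at annual_rate to double."""
    return math.log(2.0) / math.log(1.0 + annual_rate)


RATE = 0.28  # the per-annum improvement quoted for Figure 1

print(round(doubling_time(RATE), 1))      # 2.8 years to double
print(round(growth_factor(RATE, 5), 1))   # 3.4 times after five years
print(round(growth_factor(RATE, 10), 1))  # 11.8 times after ten years
```

Under this reading, ten years at the quoted pace compounds to slightly more than an order of magnitude, the threshold at which the physicists' rule-of-thumb says a quantitative change is perceived as a qualitative one.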

By 2006, $200 will buy over one terabyte (= 1000 gigabytes = 1,000,000 megabytes), whose capacity is suggested by the following table:

Roughly how much can be stored in one terabyte?

Content                     Size per item         Capacity of one terabyte
Feature movies              4 Gigabyte each       250 films
Television-quality video    2 Gigabyte/hour       500 hours
CD music                    560 Megabyte/disk     1800 hours
Medical X-rays              10 Megabyte each      100,000 pictures
Scanned color images        1 Megabyte each       1,000,000 images
Scanned B/W documents       50,000 bytes/page     20,000,000 pages
Encoded text pages          3300 bytes/page       300,000,000 pages
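The table's entries follow from simple division, as this small Python check suggests.  It uses decimal units (one terabyte taken as 10^12 bytes, as in the definition above); the dictionary keys are merely labels chosen for this sketch.

```python
TERABYTE = 10**12  # bytes, decimal units as in "1000 gigabytes"

# Size of one item in bytes, taken from the table.
ITEM_SIZES = {
    "feature movies": 4 * 10**9,
    "hours of television-quality video": 2 * 10**9,
    "music CDs": 560 * 10**6,
    "medical X-rays": 10 * 10**6,
    "scanned color images": 1 * 10**6,
    "scanned B/W pages": 50_000,
    "encoded text pages": 3_300,
}


def items_per_terabyte(item_size_bytes: int) -> int:
    """Whole number of items of the given size that fit in one terabyte."""
    return TERABYTE // item_size_bytes


for label, size in ITEM_SIZES.items():
    print(f"{items_per_terabyte(size):>13,}  {label}")
```

The CD row comes out at 1,785 disks; the table's 1800 hours corresponds to rounding that figure and assuming about one hour of music per disk.  Likewise, the text row computes to about 303 million pages, which the table rounds to 300,000,000.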

Your reaction might be, “What would I ever do with that much space?”  The corresponding industry concern is, “How can we keep existing factories profitably busy?”  I first heard that question from an IBM executive in 1980.  Since then, capacity increases have been matched by new applications, and the dollar size of the storage industry has increased.  Although nobody can be sure that interesting applications will grow as fast as industry capacity, I’ve heard no pessimistic projections recently.

For DDQ, one important conclusion is that consumers need concern themselves little with how much space documents take.  Software designers should allocate generous document and catalog sizes when doing so reduces human administration, improves performance, or provides attractive new services.

Basis in Philosophy, Especially Wittgenstein’s Thinking

A family member suggested Ludwig Wittgenstein: The Duty of Genius [Monk] because of a personal connection with its author.  This outstanding biography gradually drew me into Wittgenstein’s thinking.  I was surprised to find that it helped with software design, even suggesting a formal patent claim. 

I should not have been surprised.  Computers manipulate symbols that are surrogates for what they mean.  A computer model is good if its pattern follows the pattern of what it stands for.[19]  Wittgenstein’s ideas[20] illustrate with example after example the relationship of language to meaning, teaching that language consists of symbols taking meaning from how they are used.

DDQ will show philosophy to be helpful with digital document quality issues, starting with the distinction between trustworthy and trusted for stored documents.  “Trusted” seems related to confusion about what automatic machinery and clerical procedures can provide, and what must be left to human judgment.[4]

In some later number, DDQ will deal with “trustworthy” and “trusted” carefully, arguing that much written about “trusted systems” is flawed for reasons illustrated by Wittgenstein’s analysis of “intends”. [LW 39, lecture 2]  For the moment, it’s enough to suggest that whatever we do to preserve digital documents probably should satisfy criteria defined by professional archivists.[21]  Two excerpts capture what the managers and auditors of archives strive for:

The concepts of reliability and authenticity as expressed in archival discourse are posited on a direct connection between the word and the world and are rooted, both literally and metaphorically, in observational principles.  A reliable record is one that is capable of standing for the facts (fact: a thing done; an action performed or an incident happening; an event or circumstance; an actual occurrence; an actual happening in a time or space or an event mental or physical; that which has taken place.  A fact is either a state of things, that is, an existence, or a motion, that is, an event.  Black’s Law Dictionary, 6th ed. (St. Paul, 1990).) to which it attests.  Reliability thus refers to the truth-value of the record as a statement of facts and is assessed in relation to the proximity of the observer and recorder to the facts recorded.  An authentic record is one that is what it claims to be and that has not been corrupted or otherwise falsified since its creation.  [MacNeil, p.40]

  the methods for assessing the truth-value of records as evidence are rooted in a particular way of looking at the world and in a particular conception of records as a kind of testimony about that world.  The criteria they establish for determining what counts as true are themselves the product of historical, cultural, and political choices …                                        [MacNeil, p.45]

Notice subtle but critical shifts between the world of facts and the world of values.  A record should be “capable of standing for the facts”, i.e., it should be isomorphic[19] to some facts.  However, what is worth recording is “the product of historical, cultural, and political choices”, which is an issue of values.  Every value judgment is a potential source of mistrust.  A good archival system will avoid value judgments whenever possible and will transparently show its mechanisms, leaving the remainder as value claims that the eventual user of each stored datum needs to accept or reject according to her own values and needs.

 

PC Tips for Document Management

Everything under the sun has been said before.  However, since nobody listened …  — author unknown[22]

For my own computers, I purchase hard disk drives only from IBM, Quantum, and Seagate because the industry press has for 5 years consistently rated their reliability as excellent.  The best price I have seen for consumer-grade hard disk drives (HDDs) is $1.11 per Gigabyte ($89 for an 80 GB Seagate HDD).

Serious readers and writers should consider acquiring the Opera Browser (http://www.opera.com/) and the SnagIt “screen scraper” (http://www.techsmith.com/).  I chose Opera after trying out several browsers, including extensive use of Internet Explorer and the Netscape Browser.  Opera’s on-line advertisement is trustworthy; I particularly like one time-saving feature: Opera overlaps the downloading of big files with the user’s think time.  I have not tried SnagIt competitors, because SnagIt pleases me.  Each of these products permits extensive trial use, and is inexpensive even if you decide to buy its deluxe version.

Feedback Requested!

If DDQ says something for which you would like a better explanation, please let me know by e-mail.

The glossary is a work in progress.  If you want a definition that’s missing, please let me know that too.

Above all, if you disagree with a conclusion or a position that I take, raise an objection.  We are both likely to learn from that and from whatever response I attempt.



[1]        The implications are illustrated by U.S. regulation 21 CFR 11 issued by the Food and Drug Administration in 1997.  21 CFR 11 specifies authenticity criteria for digital records in lieu of paper records.

[2]        For instance, a printed copy of this document will not and cannot include the hypertext linking signaled by colored text in the on-line version, and breaks the Internet connections that let you call up cited work.

[3]        A Google search on (TCB + computing) exposes several Trusted Computing Base articles in its top 10 hits, including an article Is the Trusted Computing Base Concept Fundamentally Flawed?

[4]        In fairness to the authors of [RLG], it’s arguably not a mistake originating in their professional communities.  The notion of “trusted systems” comes from computer scientists [Stefik], and is thought misleading by many computer scientists.

[5]        The Digital Library Federation (DLF) is a consortium of 28 of the best-known research libraries of the United States and is influential much more broadly.  See http://www.diglib.org/about.htm.

[6]        Future DDQ numbers might explore OAIS at greater length than this first one.  The OAIS source document defines it to be a “reference model” and explicitly states, “This reference model does not specify a design or implementation.”  The role of a reference model is to establish a common language and picture (“ein Bild” in [LW 21]) in order to minimize misunderstandings among designers—precisely the focal problem of Wittgenstein’s 1939 lectures. [LW 39]  Nevertheless, the WWW now has an “OAIS implementers group”, but no architecture or design document seems to be published.

[7]        This does not imply that librarians and other information service agents can ignore repository trustworthiness, which might reflect on the reputation of their institutions.

[8]        The librarians and archivists would probably not be pleased by outsiders inspecting their working procedures, particularly if they understood that the objectives could be met without such external interference.

[9]        As far as I know, no schedule has been published.

[10]       Drucker’s The Age of Discontinuity, published in 1968, is remarkably prescient about IT and business evolution that occurred over the next 30 years.

[11]       This may not be true for a parameter whose scale is imperceptible to human beings.  We might hardly notice a 10-fold change in the size of the universe, but would be devastated by a 10-fold change in the size of the sun.  Our sensitivity to some parameters is extreme, viz., the importance of a change of 1°C in the mean temperature of the earth.

[12]       DDQ comments on economic and social factors are pertinent only to the wealthy North American and European countries.

[13]       E.g., see the Digital and Traditional Libraries discussion thread in the http://infoserv.inist.fr/wwsympa.fcgi/arc/diglib archive.

[14]       Physical scientists joke, “Anything with ‘science’ in its name ain’t one.”  Certainly “computer science” is an engineering discipline.  It draws on physical sciences and mathematics, but that doesn’t make it a science any more than civil engineering is a science.  What “information science” is or might become is unclear.

[15]       The economics of electronic components used in other machinery is more difficult to track because it’s obscured by business practices that include consulting, software, and other services.

[16]       Just how difficult it is to track and estimate costs and prices in communications is illustrated by the agonies of the fiber-optic network companies.  According to the trade press, they are vastly overbuilt and much of the fiber will never be used, but simply be left to rot in the ground.  Arguably, even those best positioned to know costs, applications, and markets “got it wrong.”  However, a fair appraisal would take into account the distortions caused by “winner take all” markets.

[17]       Notice that the vertical axis is logarithmic; exponential growth plots as a straight line on a semi-logarithmic chart.

[18]       Thompson and Best [Thompson] provide a good projection of the technical factors behind what Figure 1 conveys.

[19]       Jargon for this includes sentences like, “Information is represented by an XML document.”  A model that faithfully represents what’s intended is said to be isomorphic to the facts.

[20]       Wittgenstein’s own writings are few and short, but this is compensated for by the works of his disciples and interpreters.

[21]       Archivists’ criteria are remarkably similar to Certified Public Accountants’ criteria.

[22]       Although I had thought this came from Voltaire, I have been unable to verify that.  I’d much appreciate someone telling me who first said it.