Digital Document Quarterly

Perspectives on Trustworthy Information

Volume 1, Number 2, 1Q2002

 

©  2002, H.M. Gladney


HMG Consulting

20044 Glen Brae Drive

Saratoga, CA 95070

(408) 867-5454


The current issue continues the DDQ 1(1) focus on preservation of digital documents, a topic discussed in an increasing number of reports, lectures, and workshops.  Most of these merely convey the existence and nature of the challenge; a smaller set focuses on remedies based on collection management and institutional procedures that probably need to change. 

In contrast, DDQ emphasizes technical measures centered on representation of individual documents,[1] responding to basic interests of consumers—that they can prudently decide whether or not information received is sufficiently trustworthy for how it is to be used.[2]

We believe that:

     Sufficient fundamentals are known for making any digital document survive forever in useful form.  However, considerable engineering effort will be needed to provide practical and economical prescriptions for repository managers and end users.[3]

     Basing institutional practices on the solution for individual documents will achieve economies not inherent in the current research library consensus direction. 

     Achieving trust by audit trails built into digital documents will be surer and cheaper than seeking this objective by defining institutional procedures certified by outside committees.

     Current preservation research insufficiently emphasizes well-known engineering practices that include forestalling potential failures, objective assertions about system properties and document authenticity, and technology that helps any end user test such assertions.

     Current uncertainties and confusions can be reduced by clear distinctions between assertions that are facts and those that are opinions.  Early 20th-century philosophy teaches the extent to which such distinctions are possible and the pertinent limits of language.  To proceed without this foundation is hazardous.[4]

The reader might correctly expect careful exposition of these issues to be both lengthy and tedious, because the topic at hand is preserving forever whatever kinds and amounts of information people might choose.  Doing so reliably requires that we anticipate and forestall every likely failure.[5]  However, while the engineering needed for a comprehensive and sound solution is likely to seem slow and expensive, the importance and applicable breadth of reliable solutions more than justify the anticipated effort.  In fact, accomplishing quality design before investing heavily in preservation will pay for itself many times over.[6] 

DDQ avoids burdening any reader with more detail than (s)he cares for by a top-down, divide-and-conquer exposition that parallels common engineering practice for large artifact design.  DDQ 1(2) enlarges on this.

Recent Reports

Acquiring a clear and comprehensive grasp of digital preservation can be time-consuming.  The literature is voluminous and highly redundant,[7] but it must be inspected if one wants to catch everything new and significant.  However, a few recent reports, considered together, convey the state of the art and also articulate the dominant opinion of how to proceed. 

An NSF scientific computing proposal is on a different topic, but requires critical attention.

(1) U.S. National Strategy for Digital Preservation

The National Digital Information Infrastructure and Preservation Program was established by Congress to "develop a national strategy to collect, archive, and preserve the burgeoning amounts of digital content, especially materials that are created only in digital formats, for current and future generations."  As an initial step, the Library of Congress scheduled a series of meetings with groups from the technology, business, entertainment, academic, legal, archival, and library communities.  Background papers were commissioned to give "environmental scans."  These papers, collected as Building a National Strategy for Digital Preservation: Issues in Digital Media Archiving [CLIR/LC], are available on the new NDIIPP web site. 

(2) U.K. Cedars Final Report

The CEDARS project (1998-2002 at Leeds, Oxford, and Cambridge Universities) aimed to provide best-practice guidance for long-term digital preservation.  In May it concluded by publishing:

       The Cedars Guide to Intellectual Property Rights

       The Cedars Guide to Preservation Metadata

       The Cedars Guide to Digital Collection Management

       The Cedars Guide to Digital Preservation Strategies

       The Cedars Guide to the Distributed Archiving Prototype

These reports collect British opinion on digital preservation, but contain little that is not better expressed elsewhere.

(3) Glasgow Effective Records Management

[Currall] teaches aspects of digital repository service in ways that are ready to be applied, and can therefore be recommended.  This report discusses:

       identifying the roles and responsibilities of records creators;

       creating, using, disseminating, and eradicating digital records;

       training users in digital record management practice;

       improving retrieval speed for documents and elements within them;

       increasing accuracy for items within a document and individual document choice from a collection; and

       reducing organizational risk from unmanaged records.

(4) Prototype Digital Archive Technology at SDSC

The U.S. National Archives and Records Administration has for about two years sponsored prototype archival storage management development at the San Diego Supercomputer Center.  There’s enough meat in this that I have not yet absorbed its teachings and implications.  [Ludasch] will interest the technically-inclined reader.

(5) “Trusted Digital Repositories …”

DDQ 1(1) commented critically [HMG2] on the mid-2001 draft of a Trusted Digital Repositories report [RLG], partly because its reliance on “Trusted Computing Base (TCB)” ideas would lead to expensive overkill.  The final 68-page report [RLG2] corrected this, but did not respond to other problems communicated in October:

     Certification of archive procedures[8] is inherently more expensive and less reliable than basing an archive on stored objects that individually embed their own audit trails;

     Technology cost trends favor data replication and other storage design elements not considered, such as not requiring the centers of trust to be the only repositories; and

     [RLG] did not consider how many documents might deserve archival preservation—a number potentially so large that procedures proposed might be unaffordable.

The implications of the word “trusted” are taken up below.  Beyond that, since the issues raised by [RLG 2] are too tricky for my analysis to be complete, I’ve posted a slightly edited version of the October critique. The interested reader might think through all this himself while we consider how to continue.[9]

(6) NSF Scientific Computing Proposal

In April, an NSF Advisory Panel released Revolutionizing Science and Engineering through Cyberinfrastructure.  This first draft has flaws, such as insufficiently identifying what will be accomplished with the $650M it calls for.[10]  Even though the panel requested comments by 1st May, it is likely to address criticism received later.  In view of the amount sought from taxpayers, you may wish to inspect the proposal and comment.

(7) Promising Items Just Received

Towards a Continuing Access and Digital Preservation Strategy … is a U.K. call for comment. 

By e-mail I received a 2001 NASA ESDIS Data Center Best Practices and Benchmark Report. 

OAI has released the Protocol for Metadata Harvesting v.2.

What is the Information Revolution?

DDQ 1(1) considered When Was the Information Revolution?  We might further wonder what changes have happened, are happening, or will happen, and how they should influence decisions for digital archives.  Four years ago, [Neal] ventured an academic librarian’s perspectives; shortened, reorganized, and adapted for the passage of time and a somewhat different viewpoint, a subset of his 25 cataclysmic changes in information exploitation is:

·         Individualized expanding power to access, communicate, and analyze information.

·         Office tools automating the mechanical aspects of writing and enabling self-publication.

·         Knowledge management technology making data gathering and management valuable.

·         Hypertext reducing the tedious drudgery of following bibliographic references.

·         Virtual reality enabling experiential entertainment, education, and engineering.

·         Digital storage enabling storage and search of vast multimedia resources.

·         Networks expanding personal communication and information sharing.

·         Push transmission enabling massive customized communication to select audiences.

·         Cellular communication freeing us from constraints of location.

·         Intellectual property made valuable and more at risk by easy copying and distribution.

·         Security and encryption providing us all with tools to control information access.

·         Lowered barriers easing market entry and experimentation for innovative services.

·         Self-service (such as ATM banking) for enterprise efficiency and customer convenience.

·         Outsourcing permitting us all to shed routine tasks in order to focus on core competencies.

·         Partnerships with more cooperation and sharing as essential for success and efficiency.

·         Concentration in which organizations absorb each other seeking market advantage.

·         Student population shifts to include social groups that have not been much involved.

·         Global awareness with rapid communication internationalizing all aspects of life.

·         Political and social volatility enhanced by easier, faster information use and misuse.

These factors are stimulating organizations to plan new services.  Enterprise survival is often an issue.

The list reflects the writings, talks and informal conversation of librarians and scholars.  Is it in any sense complete?  Probably not.  For instance, it does not hint at induced employment shifts.

If any such perception[11] will be used to guide policy and action, it merits careful investigation to estimate its economic significance and influence on decisions, e.g., for managing libraries [Wolf].  We have too many opportunities and not enough skills and resources.  We need to decide which designs have greatest leverage and will most please users—both end users and service personnel.  We need to make choices and would do this better with more quantitative estimates than have been published.

While quantitative estimates are needed, even the qualitative nature of Neal’s list merits more inquiry.  For instance, for the Information marketplace, Neal wrote, “The information as commodity revolution is increasingly viewing data and its synthesized products, knowledge, as articles of commerce and sources of profit rather than property held in common for societal good.”  This can be read as a social critique demanding remedy, and some people read it so with evident emotion.  However, this interpretation should be balanced by recognition that the information business is the livelihood of an increasing fraction of the population and that the information easily available without payment is also increasing rapidly.[12]

Fundamentals

“A good archival system will avoid value judgments whenever possible and will publicly expose its mechanisms, leaving the remainder as value claims that the eventual user of each stored datum needs to accept or reject according to her own values and needs.” [DDQ 1(1)]  We are seeking an archival document representation scheme that can express everything that admits unambiguous, objective, and testable representation—and nothing more, leaving to other language all values and judgments that must forever remain debatable.[13]  Whether or not a document will be trusted is such a value judgment.

Testable is critical.  Every important property of each archived document and of each archiving system component must be explicitly specified (asserted), and the archived data must include whatever is needed to test these assertions.[14]  These assertions and test procedures—what certified public accountants call audit trails and audit tools—must be firmly bound to each document.  The assertions will nearly always need to identify “whodunit”, and the “whodunit” information will itself be documents or document portions that require their own audit trails.  This collection of documents and procedures is what we mean by external evidence of authenticity and provenance, and in combination with internal evidence[15] is what can make the document trustworthy.  An eventual user will trust a document if such associated evidence is accessible, reliable, and sufficient for the use that (s)he intends to make of the document content.

This might seem a prescription for endless tests that themselves must be tested.  How can we avoid recursive explosion—a chain reaction that blows up in our faces?  We can share each individual fact and test among many objects and can end each recursion with facts that are widely known and trusted.[16]  The technical mechanisms will use identifiers, references, pointers, links, and XML namespaces, and the document representations will be isomorphic to directed acyclic graphs rather than merely trees.[17]  DDQ starts to describe the technology below.
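The way shared facts and trusted end points stop the recursion can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the evidence graph, the node names, and the function is_supported are hypothetical and are not part of any DDQ design.  Each assertion lists the assertions it depends on; a check succeeds only if every chain of evidence ends in a fact taken to be widely known and trusted, and each shared node is checked once.

```python
from typing import Dict, FrozenSet, List, Set

# Hypothetical evidence graph: each assertion lists the assertions it relies on.
# "institution-key" is shared by several chains, so the structure is a directed
# acyclic graph rather than a tree.
EVIDENCE: Dict[str, List[str]] = {
    "report.xml": ["author-signature", "schema-registration"],
    "photo.tiff": ["author-signature"],
    "author-signature": ["institution-key"],
    "schema-registration": ["institution-key"],
}

# Facts assumed to be widely known and trusted; every recursion must end here.
TRUSTED_ROOTS: Set[str] = {"institution-key"}


def is_supported(node: str, cache: Dict[str, bool],
                 path: FrozenSet[str] = frozenset()) -> bool:
    """Return True if every chain of evidence from `node` ends in a trusted root."""
    if node in TRUSTED_ROOTS:
        return True
    if node in cache:
        # A shared assertion is tested once and the result reused.
        return cache[node]
    if node in path:
        raise ValueError(f"evidence graph contains a cycle at {node!r}")
    deps = EVIDENCE.get(node, [])
    # An assertion with no supporting evidence that is not a root is unsupported.
    result = bool(deps) and all(
        is_supported(dep, cache, path | {node}) for dep in deps
    )
    cache[node] = result
    return result


if __name__ == "__main__":
    shared_cache: Dict[str, bool] = {}
    for doc in ("report.xml", "photo.tiff", "unregistered.doc"):
        print(doc, is_supported(doc, shared_cache))
```

The cache is what keeps the cost from exploding: "author-signature" supports two documents here but is examined only once.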

How Can We Use Wittgenstein’s Philosophy?

Who is to decide what evidence is important and what tests are worth applying?  That has to be the end user—the person for whom the document in question is preserved.  This will invariably be an economic decision—a weighing of the cost of evidence testing against the costs of a wrong decision.  A role of technology is to minimize these costs.

For this we want the soundest possible foundation and believe we can do no better than a basis in Wittgenstein’s work.  He provided a nexus in the theory of language and logic (and therefore computing), arguably completing inquiries by Bertrand Russell, Immanuel Kant, Ernst Mach and many lesser-known central European philosophers, and setting the stage for later work by Rudolf Carnap, Kurt Gödel, Alan Turing, and John von Neumann.  The famed Wiener Kreis analyzed the Tractatus Logico-Philosophicus (TLP) twice, and later investigation rarely reached backward earlier than TLP.

We might want a concise and complete prescription for communicating meaning, but trying to compress into a few words what hundreds of philosophical writings have not completed would be foolish.  Nevertheless, a focus on Wittgenstein’s notion of a rule may help towards understanding how his teachings apply to digital preservation methodology.[18]  We cannot convincingly say why Wittgenstein’s teachings are the best available ground, but we can show what use we make of them.  To evaluate DDQ analyses, the reader should look at how they use thinking taught by Wittgenstein.  Specifically:

     We are sensitized to the limits of language and the consequences of misunderstanding communications expressed in words.  In particular, potential misunderstandings between different professions figure prominently.

     We look at how authors use key words, and try to decide whether each sentence is a statement of a fact or an opinion.

     Such distinctions help us stay within the engineering role, avoiding inadvertently infecting technical solutions with value judgments that clients might find inappropriate.

     The limits of what can be automated or specified as clerical tasks are rules that expand into a finite number of steps for any particular objective.

     LW’s discussions of language suggest that we cannot articulate meaning any better than is accomplished by current ontologies (reference models) and the RDF language.

The critical distinction is perhaps best expressed by Paul Engelmann’s metaphor:

"Positivism holds—and this is its essence—that what we can speak about is all that matters in life, whereas Wittgenstein passionately believes that all that really matters in human life is precisely what, in his view, we must be silent about.  When he nevertheless takes immense pains to delimit the unimportant [i.e., the scope and limits of ordinary language], it is not the coastline of that island which he is bent on surveying with such meticulous accuracy, but the boundary of the ocean."    [Janik, p.191]

Although we cannot define the boundary in words, we can illustrate it by tabulating word pairs:

Island (terra firma) | Ocean | Comments
objective | subjective | We can relate any subjective assertion to some objective one, e.g., “DDQ writes that the digital preservation literature is highly redundant.”
Natural Philosophy | Ethics | Until academic topics became highly differentiated early in the 20th century, these were the major categories of learning.
physics, mathematics, logic | metaphysics | The Greek meta- means beyond.  I.e., metaphysics is the part of philosophy that deals with value judgments.
facts | Values and opinions | Unfortunately, value has many fundamentally different meanings; here the sense of “value judgments” is intended.
die Darstellung → representation | die Vorstellung → idea | English translations of Wittgenstein’s die Darstellung are sometimes misleading.
evidence | opinion | evidence comes from the Latin evidens (visible, clear, evident); also from ex + vident (according to + what they see)
(Many more word pairs might appear here.)
form or syntax | meaning or semantics | In modern markup languages, XML deals with form and structure, RDF with meaning.
action | intention | [LW 39, II] treats “intention” carefully.
intensional | extensional[19] | Sets and sequences are communicated either intensionally, “(3n+1) where n is a natural number”, or extensionally, “1, 4, 7, 10, …”.
“Prove it!” | “Let your light so shine before men, that they may see your good works, …”[20] | Wittgenstein’s view about what cannot be said, but must be shown, is foreshadowed in the Bible.
(Still more word pairs might appear here.)
rules, procedures | intuitions | Whether a man “knows or not is simply a question of whether he does it as we taught him; it’s not a question of intuition”. [LW 39, II]
possibly computable | surely not computable | The technical word in mathematics and computer science is decidable rather than computable.
trustworthy | trusted | The creator of an object can make it trustworthy, but only its end user can judge whether or not it is to be trusted.
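The intensional/extensional pair in the table has a direct counterpart in programming, sketched here in Python.  The names are illustrative only, and the sketch assumes that the natural numbers start at 0, which is what makes the rule 3n+1 produce the listed sequence.

```python
from itertools import count, islice
from typing import Iterator, List


def intensional() -> Iterator[int]:
    """The sequence given by its rule: 3n + 1 for n = 0, 1, 2, ..."""
    return (3 * n + 1 for n in count(0))


# The same sequence given extensionally: its members are simply listed,
# which for an infinite sequence can only ever be a finite prefix.
extensional: List[int] = [1, 4, 7, 10]

first_four = list(islice(intensional(), 4))

if __name__ == "__main__":
    print(first_four == extensional)
```

The rule is a finite text that stands for unboundedly many members; the list conveys only as much as is written down.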

Imagine having to write an essay comparing the value of Shakespeare’s Hamlet to that of Lucasfilm’s Star Wars.  Your statements might oscillate from side to side of the boundary.

Originators, editors, and librarians can make a preservation document trustworthy (deserving of trust) by binding metadata and test procedures to it reliably, i.e., so that the binding is firm over time and not itself susceptible to undetectable modification.  Whether or not the document will be trusted by its eventual user will depend both on how well those who prepare it choose and bind the metadata and tests and also on the user’s judgment.

Trust, Trusted, Trustworthy 

The very naming of [RLG 2], “Trusted Digital Repositories”, is troublesome because it sets a trap for the unwary—a sort of false advertising.  “Trusted” should not be used when what is meant is “trustworthy”. 

We might have passed silently over this point were it not for pervasive confounding of objective and subjective aspects of digital preservation.  Trust is of fundamental importance in document delivery services, not only for scholarly work [Lynch], but also for business transactions that will grow to include pharmaceutical development records and perhaps even personal medical records.[21]  Some business applications are tempting fraud opportunities.

The [RLG2] authors did not invent sloppy use of trusted.  The term appeared in 1975 in the Trusted Computing Base (TCB)—a processor kernel whose behavior could not be improperly altered by any executing process—intended for critical control and cryptographic services.[22]  The designers called the kernel trusted because they also built the trusting entity—an operating system that needed the TCB to meet defense security criteria.  This operating system was itself protected by the kernel against modification, and was certified as secure by teams other than its implementers. [NCSC2]  To meet its objectives, the secure operating system needs to trust the kernel; i.e., we can say the TCB is trusted by the operating system.

In contrast, the relationship between a digital repository and its users cannot be engineered to ensure that the repository is trusted by its users—only that it might deserve their trust.

By 1990 people had apparently forgotten why trusted made sense only within systems that included the user as a controlled component.  Xerox used the terminology for network printers that were supposed to enforce rules for sensitive documents.  As far as I know, these machines were flawed and failed in the marketplace, perhaps because the intended customers understood the problem.  But the misuse of trusted persisted and was amplified in [Stefik] and related works; Stefik did not take up an offer to respond to criticism raised in 1997 [HMG3].

“Trusted Systems” are not necessarily trustworthy, and trustworthy systems are not necessarily trusted by their intended users.  Misleading language impedes achieving the trust a repository needs to be effective.  An eloquent argument for correct English usage in this is:

`I don't know what you mean by "glory,"' Alice said.

Humpty Dumpty smiled contemptuously. `Of course you don't—till I tell you.  I meant "there's a nice knock-down argument for you!"'

`But "glory" doesn't mean "a nice knock-down argument,"' Alice objected.

`When I use a word,' Humpty Dumpty said in rather a scornful tone, `it means just what I choose it to mean—neither more nor less.'

`The question is,' said Alice, `whether you can make words mean so many different things.'

`The question is,' said Humpty Dumpty, `which is to be master—that's all.'

                                        Lewis Carroll, [Carroll, p.213]

Lewis Carroll was fully aware of the profundity in Humpty Dumpty's whimsical discourse on semantics.  … the point of view known in the Middle Ages as nominalism; the view that universal terms do not refer to objective existences but are [merely] … verbal utterances.  [This] view was skillfully defended by William of Occam and is now held by almost all contemporary logical empiricists.

Even in logic and mathematics, where terms are usually more precise than in other [disciplines], enormous confusion often results from a failure to realize that words mean "neither more nor less" than what they are intended to mean.   

On the other hand, if we wish to communicate accurately, we are under a kind of moral obligation to avoid Humpty's practice of giving private meanings to commonly used words.

                                                                                                                    Martin Gardner, [Carroll, p.213, note 11].

Digital Document Preservation

Like the [RLG 2] authors, our long-term objective is “to reach consensus on the characteristics and responsibilities of … digital repositories for large-scale, heterogeneous collections held by cultural organizations.”  We believe that everything essential to trustworthy and durably useful preservation of decidable[23] information is known in principle.  The rest of the work needed is engineering reduction to economical practice.

Digital document preservation overlaps only partially with the collection management thinking that is the focus of [OAIS] and writings that depend on [OAIS] (Figure 1).

Figure 1: Digital document preservation and cultural collection management ([CLIR/LC], [RLG 2]) are topics that overlap only incompletely. [24]

Any document or blob can be represented by a bit-stream—a sequence of 0’s and 1’s.  Computer programs are specialized documents.  Collection preservation will be achieved if we:

·         Save the bits so that somewhere a copy survives and that copy can be found.

·         Ensure that the bits can be interpreted.

·         Make the bits trustworthy by reliably associating sufficient metadata.

·         Include library content lists among the set of saved documents.

If these requirements are met, digital libraries can be constructed or reconstructed.
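The first requirement, that somewhere a copy survives and can be found, can be illustrated with a small Python sketch.  It assumes that a SHA-256 digest of each bit-stream was recorded at archiving time and is kept apart from the replicas; the function names and that arrangement are assumptions made for illustration, not part of the proposal.

```python
import hashlib
from typing import Iterable, Optional


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of a bit-stream."""
    return hashlib.sha256(data).hexdigest()


def recover(replicas: Iterable[Optional[bytes]],
            recorded_digest: str) -> Optional[bytes]:
    """Return the first replica whose digest matches the recorded one.

    A replica of None models a repository that is lost or unreachable.
    """
    for copy in replicas:
        if copy is not None and sha256_hex(copy) == recorded_digest:
            return copy
    return None


if __name__ == "__main__":
    original = b"archived bit-stream"
    digest = sha256_hex(original)
    # One repository lost, one holding a corrupted copy, one intact.
    replicas = [None, b"archived bit-strean", original]
    print(recover(replicas, digest) == original)
```

A bare digest detects accidental damage; protecting against deliberate substitution needs the authenticity evidence discussed later in this issue.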

For simple document types, extra work to ensure long-term interpretability is not essential, merely cost-effective.  Document collections can contain sufficient redundancy for digital archeology (rescuing content from obsolete technology) when the content is wanted.  Choosing to prepare a document for retention is an economic decision that depends on the expected number of retrievals and on whether one is willing to spend effort on behalf of unknown future beneficiaries.  For archivists, digital archeology is almost a “do nothing” tactic that leaves most of the work to whoever is interested in each saved document.

Computer program preservation cannot depend on digital archeology because programs rarely contain sufficient redundancy. 

We believe that progressive dissection of the above solution components will expose no unsolved problem.  However, we will not be sure of this until dissection is sufficiently advanced, and we will not be able to persuade everyone we would like to persuade.  The question is whether the technology for every kind of information is covered without further invention. 

Absent an orderly dissection and analysis to sufficient depth, mistakes or shortfalls in document schema or handling procedures might not be discovered until the damage they cause can no longer be corrected, risking information loss.  Due care would include written reliability and security analysis of schemas and programs—analysis more careful than is common.[25]  Ideally, every risk would be examined and minimized.  Complete analysis is unlikely.  Even so, careful analysis will be tedious and at times arcane.

 “Top Down” Engineering

The circumstances as well as the technology lend themselves to top-down analysis.

     The topic is large, embracing all kinds of information; some kinds call for prompt treatment, and new kinds are likely to be designed.

     Designing well for each kind of information before its instances are archived will minimize overall costs, i.e., up-front design is cost-effective for designs that are used.

     Different communities will be interested in different branches of the analysis, and in different depths; almost nobody will want to examine everything at the same time.

     Integration of one’s own work with components independently designed by others is important; top-down design facilitates engineering trade-offs to adopt the best features.[26]

     We understand only poorly which questions people will want answered and for which they will regard the answers as obvious.[27]  A top-down approach facilitates dialogue about whatever portions may interest people.

An option is writing aphoristically like Wittgenstein in TLP,[28] inviting challenges to any listed proposition.  In that spirit, DDQ will welcome criticism of anything it includes.

Cultural Chasm[29]

Top-down analysis is commonly used in the information technology professions, but might not be a comfortable method in the other professions concerned with digital preservation. 

Effective communication, not easy for abstract topics even between people who know each other well, is made extraordinarily difficult between professions by differences of jargon,[30] of expectations, of conventional forms and manners, and of value priorities.  Research librarians and their close associates seem to value consensus extraordinarily highly, and often seem uncomfortable with open debate.[31]

In contrast, scientists and engineers tend to value their personal sense of correct and elegant design[32] far more highly than consensus within any community.  They often seek and welcome vigorous debate.  In fact, they value exposing unproven propositions to criticism and believe this practice contributes to progress.  It works well when criticisms are directed at ideas rather than at the people that voice them.[33]  For instance, an IBM Research colleague often tells lecturers, “That’s stupid!”  His outburst is rarely taken amiss because everyone understands that it is directed at a statement, not the speaker, and because his objection usually has merit.

It is trite to point out that efficient progress in digital libraries would depend on the best use of the knowledge and skills of several professions, and that cooperation across disciplines is far less than would be desirable.  Current preservation literature insufficiently reflects a typical engineering focus—identification of potential failure sources and rendering them harmless.  Communication difficulties are likely to contribute to inefficient use of taxpayer dollars.

Trustworthy, Durable Digital Documents

DDQ 1(1) hinted at how we propose to preserve a digital document.  A draft [HMG5] sketches the proposal, which is mostly an application of well-known technology available today, viz.,

1)      The set of blobs that represent a work are XML-packaged with registered schema, possibly extended by yet-to-be registered archival schema.

2)      Blobs are encoded in machine-independent representations.  For simple data types, including at least ASCII-text files, this is accomplished with (ISO-)standard representations.  For complex data and programs, [Lorie] teaches how to do this.[34]

3)      Two kinds of identifier are needed: URIs and identifiers that each denote all the versions and closely related “stuff” that comprise a single work. [HMG4]

4)      Authenticity and provenance evidence starts with public key message authentication.

5)      Key management uses the Web-of-Trust model, grounded in keys that widely trusted institutions publish.  Each institution periodically chooses a new key pair and destroys its prior private key; it also publishes its procedures for protecting and using private keys and permits infrequent unannounced external audit of these procedures.

By embedding necessary information in each document, we achieve end-to-end security and also efficiency in the sense of minimized bureaucratic overhead and network traffic.  The Web-of-Trust design avoids most of the risks associated with better-known key management schemes. [Gerck]  The overall scheme shifts the locus of required trust management from procedures to protect documents to procedures to protect private keys, a task that management can control tightly and that is therefore relatively inexpensive.
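Items 1 and 4 of the list above can be sketched together in Python: blobs packaged in XML with an authentication value embedded in the same package.  This is a minimal sketch, not the [HMG5] design.  The element and attribute names (package, blob, authentication) are hypothetical, since no schema is given here, and HMAC-SHA256 with a shared key stands in for the public key message authentication the proposal calls for, because the Python standard library offers no public-key signatures.

```python
import base64
import hashlib
import hmac
import xml.etree.ElementTree as ET
from typing import Dict


def _tag(blobs: Dict[str, bytes], key: bytes) -> str:
    """Authentication value over all blob names and contents, in sorted order."""
    mac = hmac.new(key, digestmod=hashlib.sha256)
    for name in sorted(blobs):
        encoded_name = name.encode("utf-8")
        # Length prefixes keep ("ab", "c") distinct from ("a", "bc").
        mac.update(len(encoded_name).to_bytes(8, "big") + encoded_name)
        mac.update(len(blobs[name]).to_bytes(8, "big") + blobs[name])
    return mac.hexdigest()


def package(blobs: Dict[str, bytes], key: bytes) -> bytes:
    """Wrap blobs in a (hypothetical) XML package with an embedded tag."""
    root = ET.Element("package")
    for name in sorted(blobs):
        blob = ET.SubElement(root, "blob", {"name": name, "encoding": "base64"})
        blob.text = base64.b64encode(blobs[name]).decode("ascii")
    auth = ET.SubElement(root, "authentication", {"algorithm": "HMAC-SHA256"})
    auth.text = _tag(blobs, key)
    return ET.tostring(root, encoding="utf-8")


def verify(packaged: bytes, key: bytes) -> bool:
    """Recompute the tag from the packaged blobs and compare with the stored one."""
    root = ET.fromstring(packaged)
    blobs = {
        b.get("name", ""): base64.b64decode(b.text or "")
        for b in root.findall("blob")
    }
    claimed = root.findtext("authentication", default="")
    return hmac.compare_digest(_tag(blobs, key), claimed)


if __name__ == "__main__":
    demo_key = b"demo-key"  # illustrative only; a real scheme would use key pairs
    pkg = package({"text.txt": b"hello", "image.bin": b"\x00\x01"}, demo_key)
    print(verify(pkg, demo_key))
```

Because the evidence travels inside the package, a recipient can test it without consulting the originating repository.  With a true public-key signature the verifier would need only the signer's published key rather than a shared secret.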

Criticisms of DDQ

Silence, Inference, and Implication

One reader inferred more than I implied about the limits of what DDQ would treat.  The misunderstanding arose from my March e-mail announcing DDQ as treating, "... document qualities that can be … managed … semi-automatically—properties such as …  trustworthiness, and so on.  ...  DDQ will be mostly silent on qualities that human beings must provide, such as interesting content and selection into collections."

This reader inferred an opinion that subjective judgments were unimportant.  In this case at least, it is dangerous to infer from silence any such view (or much else, for that matter).  In fact, DDQ will try to follow LW’s example: reticence on subjective judgments.  TLP is difficult reading partly because it deals only with what can be said logically.[35]

The [Tractatus Logico-Philosophicus] deals with the problems of philosophy and shows, as I believe, that the method of formulating these problems rests on the misunderstanding of the logic of our language.  Its whole meaning could be summed up somewhat as follows: What can be said at all can be said clearly; and whereof one cannot speak, thereof one must be silent.        Ludwig Wittgenstein, 1918.[36]

OAIS Issues and Responses[37]

An anonymous international set of OAIS developers sent the following critique to the NSF-EU workgroup:[9]

We wish to provide some perspectives on statements made by H.M. Gladney in a document, called 'Digital Document Quarterly' distributed to the EU-NSF workshop members.  These comments will be largely confined to the material in the section he has titled 'A Bigger Problem Called "OAIS"'.

He makes the claim that the OAIS comes from outside the research library community, and in particular from space agency laboratories, and therefore is suspect.  However he is apparently unaware of the history of the OAIS development, including the participation from traditional archives and libraries, as well as scientific data centers.  Further, the OAIS has been reviewed by many organizations and has been adopted by them because they found it relevant.  Even if the OAIS had been developed only by space agencies, this would not be a reason to invoke the 'not invented here' syndrome.  We all use many standards that we did not develop ourselves because we find them relevant.

He is right that the OAIS is not an architecture, nor a technical design. It provides a conceptual model to aid in comparing and contrasting archive operations and data.  This is not to say that it may or may not suggest elements of a possible architecture or design to archive developers.  Implementations that have some close parallels with OAIS functional or information modeling views are neither good nor bad architectures/designs on this basis alone.  Similarly they should not be criticized on this point alone, in contrast to his implications.

He says that the organization of space agencies is not the same as that of a research library.  This is obvious, but irrelevant.  A relevant comparison is the organization of scientific data centers and that of research libraries, both of which exist within space agencies.  There is a large overlap in the types of data that need preservation, including documents, among these archives.  In fact it can be argued that significant preservation distinctions between these types of archives are rapidly blurring.

OAIS does have a focus on processes within the archive, but it also has a focus on information modeling.  It is not clear he is aware of this latter material.  The OAIS takes the view that appropriate processes and information modeling are pre-requisites to preserving information and therefore to serving Consumers with the information they desire.  In contrast, he states that the preservation process has been solved, although not demonstrated to be practical (so it is not yet solved in reality), and therefore internal archive procedures are not relevant.  No justification for this position is given.[1]

The spirit of this note is welcome, because it seems precisely right and because the time for technical debate is before large expenditures are incurred and before organizations take entrenched positions.  The parts of this note that I agree with influence the current DDQ number.  A necessarily brief reaction to other points follows:

1.       The DDQ section alluded to does not criticize [OAIS], but rather how it is being used by DLF members.

2.       DDQ did not suggest that OAIS was irrelevant, but rather that it was insufficient for how DLF seems to be using it.  Of course people find [OAIS] relevant,[38] since it is an ontology for library jargon and everyone agrees that common vocabulary is important.  But this justifies neither looking no further nor accepting everything in [OAIS] without question.  Consider [Marcum]’s reminder not to accept consensus opinion merely because of its source.

3.       DLF writings seem to use OAIS as an architecture and to contemplate design without having written architecture.  This may be caused by the fuzzy distinction between a reference model[39] and an architecture.

4.       DDQ in fact wrote that “the objectives, organization, management, and infrastructure are … different,” which is highly relevant if [OAIS] is interpreted as a design.  There is significant risk that design appropriate within NASA, for instance, is unaffordable at a university.  Assuming otherwise without analysis is imprudent.  Some design aspects will transfer easily, and others will not.  For instance, consider the ingestion processes.  The most important NASA holdings contain information generated by NASA, and there is both incentive and opportunity for the NASA scientists and archivists to collaborate face-to-face to ensure faithful ingestion; a similar opportunity would be rare at a university.

5.       Sure, some distinctions are blurring.  However, it is those that are not blurring that need attention.

6.       The suggestions that I was unaware are simply incorrect.  The facts are just the opposite; the concern is that, notwithstanding the number of people involved and the resources expended, certain efficiency and design issues have not been addressed.

7.       It is true that DDQ 1(1) gave little justification.[1]  DDQ 1(2) makes a significant start.  It seems curious that early drafts of the justification required here are characterized by some computer science critics as too obvious to merit publication!  I.e., the hints in DDQ 1(1) are all that some engineers would need to develop a design similar to the one that [HMG5] sketches.

News

Pay attention to litigation based on the (U.S.) Digital Millennium Copyright Act (DMCA), which many people think goes too far in support of the entertainment industry.  See expressions of disagreement, e.g. by the Electronic Frontier Foundation.  The issues are exacerbated by the Hollings Bill; for instance, see the Wired Magazine article.

Copy prevention tracks at the outer edge of some music CDs can be nullified simply by inking with a felt tip pen.  Reacting to a 20th May Reuters report, NewsForge suggested that publishing the circumvention was itself a DMCA violation.  We can expect further litigation.

Technical Tips

Digital Hardware and Software

Digital hardware prices continue to drop.  You can now buy a good digital camera—the SiPix StyleCam Blink—for $37.  CD-R disks cost 13¢ each.  The best price I’ve seen for a 3.5” hard disk drive is $0.91/Gb,[40] but I prefer the IBM 80Gb Deskstar[41] offered for $89.

For some years I’ve occasionally borrowed a digital camera to capture pages of books that I could not take home from university libraries.  Since I had liked the borrowed camera, an Olympus 600-DL, I purchased the Olympus E100-RS, which can now be had for $400.  Its images of two facing pages of 9” x 6” books have excellent legibility.

I’m pleased with this camera’s easy handling and features: 10x optical zoom with jitter stabilization, macro as close as 10 cm., automatic focusing and exposure control, fast upload to any PC, and more.  If you are about to purchase a camera, you might find it helpful to use its specifications as a benchmark.  Look at what 15 frames/second can do for sports sequences.

Among many free utility programs, I can recommend FileSnoop, a Windows directory browser that shows essential information about any file and rapidly replays text and multimedia files.

Tip for U.S. Income Tax Reporting

You may be relieved that you’ve finished your 2001 income tax reporting; however, the annoyance will begin afresh next year.  If you have been employing a professional tax service whose cost (more than $500 the last time we hired one) you’d like to avoid, doing so can be easy if your financial pattern this year will be similar to last year’s.

You had to do most of the work yourself anyway: organizing the paperwork for the tax preparer.  That can be eased by numbering and filing each transaction record promptly and noting numbers in a spreadsheet.  With that, your 2001 tax returns, and a $25 computer program (e.g. TaxCut or TurboTax), the task should take only a couple of hours.  Start by downloading the spreadsheet form I’ve been using.

Reading Recommendations

Making digital document representations trustworthy depends on cryptography.  A good first read is David Kahn’s The Codebreakers, followed by Simon Singh’s The Code Book.  Neither presumes technical expertise; both are technically sound and well-told tales.

An enjoyable start into Wittgenstein is Ray Monk’s biography.  Janik and Toulmin’s Wittgenstein’s Vienna conveys a broader perspective on Wittgenstein’s work than any British text.  It reminds us that, from about 1850 until about 1930, Vienna was one of Europe’s most important intellectual centers[42] and we are using thoughts begun there.

Acknowledgements

DDQ 1(2) owes much to John Bennett’s and John Swinden’s critically constructive comments and discussion of draft versions.  Discussions with Margaret Hedstrom helped focus DDQ on points of community interest.



[1]     DDQ will continue to include controversial assertions without immediately articulating the justifications, counter-arguments or answers to those counter-arguments.  This is to help the reader see woods rather than trees.  Detailed arguments and technical exposition must be left to other reports, which will be cited whenever possible.

[2]     If this is achieved, no-one will care whether or not the institutions are trustworthy or sound.  I.e., institutional trustworthiness is not an end in itself, but a means to this more fundamental objective.

[3]     Part of this will be integration of a careful selection from a plethora of inexpensive commercial and free tools, such as XML manipulators that DDQ plans to discuss soon.

[4]     This is not to say that everyone needs to read philosophy, or even to understand the conclusions, but rather that what is proposed should be grounded in ideas associated with Rudolf Carnap, Kurt Gödel, Karl Kraus, Ernst Mach, Felix Mauthner, Ludwig Wittgenstein and later scholars.  This grounding should be made obvious so that those who care can readily inspect it for themselves.

[5]     We might choose less ambitious objectives, but do not know a way that significantly simplifies the work needed.

[6]     It would be incorrect to infer a long delay before digital archiving would be prudent, because a divide-and-conquer strategy can address easy cases long before the engineering is ready for more difficult kinds of data.

[7]     Editorial practice and technical community expectations strongly discourage repeating without attribution ideas already published.  The library management literature does not seem similarly disciplined.

[8]     This is coupled to the Reference Model for an Open Archival Information System [OAIS] and concern with how this is being applied.

[9]     The NSF and EU have commissioned a work group to recommend research agendas, reporting in 4Q02.  Among its members listed below, Hedstrom and Kenney participated in preparing [RLG 2].

         Kevin Ashley, University of London

         Birte Christensen-Dalsgaard, Statsbiblioteket Denmark

         Wendy Duff, University of Toronto

         Henry Gladney, HMG Consulting

         Margaret Hedstrom, University of Michigan

         Claude Huc, Le Centre National d'Etudes Spatiales, France

         Anne Kenney, Cornell University

         Reagan Moore, San Diego Supercomputer Center

         Erich Neuhold, Fraunhofer Institute, Darmstadt

         Seamus Ross, University of Glasgow

         Titia van der Werf, Koninklijke Bibliotheek, the Netherlands

[10]    I submitted a critique privately to the Advisory Panel.

[11]    Such perceptions, even when they are consensus opinion, can mislead us.  An occurrence inside IBM is illustrative.  For a prospective business that nearly everyone regarded with optimism, market estimates were scarce and new estimates would have been expensive and late.  That problem seemed resolved by an unexpected market-consulting publication—a respected report whose estimates supported internal opinions.  We were happy to pay $2000 for the report, and launched product development.  Months later, an analyst wondered about the consultant’s estimates, and telephoned him to inquire how they had been developed.  The answer:  “Oh, I telephoned Mr. X at IBM HQ and asked how big he thought the market would be.”

      Almost a decade later, we see that the guesses were accurate, but would still prefer knowing better the risk associated with each decision.

[12]    The problems are not new.  “… libraries originally were created to deal with … information scarcity …  Now there is too much rather than too little.  … much of the information that is overwhelming everyone is of poor quality or of little value.” [Scepanski 1966]  Personal experience suggests that both old and new writings are more easily accessible than ever before.

[13]    This is emphatically not intended to suggest that the objectively represented facts are more important than the value judgments and opinions.  The distinction is an example of the boundary between logic and ethics that, arguably, was the central objective of Wittgenstein’s Tractatus Logico-Philosophicus (TLP) [Janik], a distinction that Engelmann’s metaphor communicates eloquently.

[14]    Such testing should be possible without either permission from or assistance by archiving institutions.

[15]    An example of internal evidence: a document's mention of the correct date of some event is evidence that the document was written later than that date.

[16]    Suppose that the New York Times (NYT) annually changed the public/private key pair with which it signed its digital news, that the Library of Congress published the full set of the NYT public keys, and that each of many people frequently used these keys to test NYT articles.  Then any key in the published set would be widely known and trusted in the sense intended here.  (Trusted is correct usage here, because you could ask any of the users, “Do you trust that key set (for testing NYT articles)?”)

[17]    In due course, DDQ will introduce these and other mechanisms to the extent that the exposition demands.  (See the glossary footnote associated with bottom-up.)  In the meantime, any reader wanting immediate elucidation might consult an elementary computer science textbook.

[18]    “Although I myself have completed only finitely many sums in the past, the rule [for addition] determines my answer for indefinitely many new sums that I have never previously considered.  This is the whole point of the notion that in learning to add I grasp a rule: my past intentions regarding addition determine a unique answer for indefinitely many new cases in the future.”  [Kripke 84, pp.7-8]

[19]    Here I do not know how best to assign intensional and extensional to the objective and subjective columns.

[20]    New Testament, Matthew 5:16.

[21]    The longitudinal patient record—a complete medical history starting with birth—has been some people’s dream for more than a decade, but is unlikely to be realized before we produce practical privacy and security measures that address the issue identified in these paragraphs.

[22]    A general purpose computer cannot meet Department of Defense security criteria unless it includes such specialized hardware.  This made the topic one for commercial vendors, with certification procedures and bureaucracy [NCSC1] as prerequisites for computer sales. 

[23]    A proposition is said to be decidable if there exists a bounded procedure for determining whether the proposition is true or false.  (A procedure or algorithm is said to be bounded if it can be completed in a finite number of steps for any valid input data whatsoever.)  We extend the notion to information: information is decidable if its trustworthiness can be determined by a bounded procedure.

[24]    This kind of figure, called a Venn diagram (after John Venn, 1834-1923), depicts set relationships.  The sizes of its ellipses convey nothing; their overlaps illustrate set overlaps.

[25]    That the rationale, objectives, and methods for this are much the same as those for system security analysis and for business auditing is no accident.  We can exploit the immense body of theory and practice that has been developed and refined in the last 30 years and can anticipate that next to no further invention will be needed.

[26]    For instance, metadata work that has just been published in [NISO], [OCLC], and [HMG5] can be brought together without compromise.

[27]    Computer science colleagues have reacted along the lines of, “That’s obvious!  Why are you wasting time talking about it?” to topics that auditors have asked me to explain further.  For the preservation workgroup alluded to in note 9, my current opinion is that there are no research issues, but I expect that my colleagues will not agree.

[28]    “One of the few philosophical writers who impressed [Wittgenstein] from early on was Georg Christoph Lichtenberg.  Lichtenberg, an eighteenth-century professor of natural philosophy at Göttingen, had been admired by [Karl] Kraus and was a major influence on [Ernst] Mach too.  Lichtenberg's writings became very popular among Viennese intellectuals at the turn of the century.  Even more than Schopenhauer, he set the aphoristic style of philosophizing that became fashionable at this period, of which the aphorisms of the Tractatus are only one illustration.  He wrote about both theoretical physics and the philosophy of language, indeed, in a spirit which (as [Von] Wright has said) shows ‘a striking resemblance to Wittgenstein.’” [Janik, p.176]

[29]    Participation in a 1996 panel discussion stimulated this line of thinking.  The topic, estimated prospects of Documents in the Digital Culture, was addressed by 4 social scientists and liberal arts representatives seated at a table facing 4 scientists and engineers similarly seated.  First comments by each participant alternated across the gap.  I.e., the organizers structured the panel as a debate between what C.P. Snow called The Two Cultures.

      Each social scientist began along the lines, “My scientific colleague talked about …, a topic for which we must consider the relationship with …, which itself cannot be understood without [the following broad context].”  The more the speaker progressed from a narrow topic to a very broad spectrum, the more discomfort we saw among the scientists, who fought down urges to interrupt the speaker’s thrust.

      The style of each scientist was along the lines, “The previous speaker dealt with …, a topic too broad for me to say anything specific about.  I’ll deal with [thus and such] a small piece.”  Implicit in this was confidence that others would address and integrate similarly small pieces, possibly not until years later, and that the whole would amount to a large addition to the state of the art.  The further the speaker progressed towards solving a small problem segment, the more discomfort we saw among the social scientists.

[30]    The purpose of reference models like [OAIS], ontologies, and computer languages like RDF is to reduce the effects of misunderstood meaning of words.

[31]    The opinions in this section are informed by personal experiences that were not always happy and personal mistakes associated with insufficient awareness of the difficulties alluded to.

[32]    High intelligence is needed, but is not enough to survive and be happy in a technical research career.  You also need unusual self-confidence and tolerance for long periods of solitary work when next to no-one appreciates what you are trying to do or is optimistic about the chance of success.

[33]    The popular press is more interested in spectacular ad hominem attacks.  See White’s Acid Tongues and Tranquil Dreamers for celebrated failures to observe the politenesses.  In fact, much has recently been made of a controversial 10-minute incident half a century ago involving Ludwig Wittgenstein and Karl Popper!  [Edmonds]

[34]    Which data types are simple in the sense intended and which are complex is yet to be decided.

[35]    The problem on which Wittgenstein embarked … was that of constructing a general critique of language capable of showing, at one and the same time, both that logic and science had a proper part to play within ordinary descriptive language, by which we produce a representation of the world analogous to a mathematical model of physical phenomena, and that questions about "ethics, value and the meaning of life," by falling outside the limits of this descriptive language, become—at best—the objects of a kind of mystical insight, which can be conveyed by "indirect" or poetical communication.  [Janik, p.191]

[36]    Preface to Tractatus Logico-Philosophicus.  The boldface is not in the original, but rather added by DDQ.

[37]    For readers’ convenience, I’ve built hyperlinks from the critique to my responses.

[38]    Had no-one found [OAIS] relevant, it would have died quietly and I would not have raised the issues at hand!

[39]    I believe a reference model to be much the same as an ontology: an articulation of terms of reference and their interrelationships.

[40]    Fry’s Electronics has offered a Western Digital 120 Gb 5400 rpm drive for $109. 

[41]    IBM Model 120GXP, 8.5 ms. seek, 7200 rpm, 2Mb buffer.

[42]    Its best-known names include Alban Berg, Ludwig Boltzmann, Johannes Brahms, Rudolf Carnap, Sigmund Freud, Ernst Mach, Gustav Mahler, Arnold Schönberg, and Arthur Schopenhauer; these men mostly knew each other.