Digital Document Quarterly

Perspectives on Trustworthy Information

Volume 6, Number 4, 4Q2007

HMG Consulting

Saratoga, CA 95070

© 2007, H.M. Gladney

ISSN: 1547-8610

Digital Preservation

The abstract of a new paper, Economics and Engineering for Preserving Digital Content, reads:

Progress towards practical long-term preservation seems stalled.  We preservationists cannot afford unique technology, but must exploit marketplace offerings.  Macro economic facts suggest shifting most preservation work from repository institutions to their users.

Prior publications describe conceptual solutions for all known challenges of preserving a single object, but do not deal with software development or collection scaling.  Much of the software needed is available.  It has, however, not yet been selected, adapted, integrated, or deployed for digital preservation.  Tools for daily work can embed packaging for preservation without much burdening their users.

We describe a practical strategy for detailed design and implementation.  Document handling is complicated by human sensitivity to communication nuances.  Our engineering section therefore suggests how project managers can master the many pertinent details.

A Discussion about Digital Trustworthiness

Literature of the digital preservation community continues to pose questions already answered in computer science and software engineering literature.  An example is an Oct. 17 posting by Helen Tibbo (HT) to the MOIMS Repository Audit and Certification blog.  My answer to the questions it poses seems to have been rejected by the blog manager, and my inquiry about why has gone unanswered, so DDQ excerpts HT’s posting and follows it with my reply:

… First, what is the goal of [the TRAC] standard? …  Indeed, what is the purpose of audit and certification?  Is it not to give contributors and users of repository materials confidence that what is deposited will remain essentially as it is over time, that any changes are documented, and that the materials will remain available, accessible, and understandable?  

Even the highest level of certification will not ensure digital longevity and authenticity, anymore than best practices in analog repositories will ensure that no objects go missing or that none are defaced in some way.  None of this is providing certainty; only risk projections that provide confidence. 

My greater concern is in identifying the elements of any list such as TRAC that are the most likely to indicate either a repository that is likely to fail or one that is likely to succeed.  Back to the restaurant analogy: I am more concerned when hot food is not kept hot and cold food is not kept cold than things like food being stored on the floor or even roaches.  One is much more likely to become sick from food that was left at room temperature than if a bug crawled on it.  What are those food temperature equivalents for the preservation likelihood of digital objects? 

This HT posting ends with questions and concerns that DDQ addresses below.  Essential background includes that:

(1) It is too early to expect consensus about social norms and tools for evidence about digital documents.  This topic first attracted attention in R&D literature about a decade ago.  In contrast, society has used documents on paper as evidence for more than a century.  

(2) Every specifically identified technical requirement for digital document trustworthiness has been addressed.[1] 

(3) The meaning and attributes of any document depend on an unbounded set of other documents.  Because of unboundedness, no unequivocal assertion of context or trustworthiness of any document is possible. 

(4) One cannot address the issues properly without some specification of the authenticity risks at hand.  The likelihood of deliberate falsification of academic and cultural literature is small (except by totalitarian governments), as is the potential damage to readers.  In contrast, it is often tempting to falsify legal and financial documents to damage unsuspecting victims.

HT asks: "Isn't part of the issue for us that it is much harder for contributors/users to make the trustworthiness decisions in the digital realm than in the analog?"

DDQ responds: There is little reason to believe that digital documents present unusually difficult problems.  In the paper world, trustworthiness decisions depend on conformance to socially-accepted practices that evolved over several centuries, with considerable trial and error.   An equivalent process for digital documents has hardly begun, and has certainly not progressed to choosing, implementing, and accepting widely understood standard practices.  Thus there is, today, next to no reduction to practice upon which users can rely.

HT asks: "So it is not "proof" of authenticity that we can ask for but rather the track record of behaviors that provide us confidence that the repository will continue to follow good practice in the future."

DDQ responds: One must start with a clear notion of what is meant by 'proof'.  One possible meaning is the standard used in criminal judgments, along the lines of "beyond reasonable doubt".  Meeting such a standard certainly depends on, among other things, HT's "track record of behaviors".

HT comments: "My greater concern is in identifying the elements of any list such as TRAC that are the most likely to indicate either a repository that is likely to fail or one that is likely to succeed."

HMG comments: It is very difficult to define procedures (other than certain document-oriented procedures that can be executed in contributor/user machines) to ensure the authenticity of every holding of a repository.  Even if such procedures were known, it would be very difficult to ensure that a repository conformed to them with no more than small discrepancies.  Nor can interim audits demonstrate that failures have not occurred.  Moreover, it would be burdensome for end users to judge which procedures/audits a document had been protected by, and whether or not these procedures met their own risk minimization objectives.  Finally, it is unlikely that many end users would have the expertise or patience to make the implied investigations.

Security Problems from Insider Errors

An RSA Security survey found that many grievous security problems originate in employee carelessness or ignorance.   Anyone who wonders why I am skeptical about “trusted digital repositories” might consider the likelihood that no mishap threatens archive contents held for a century or longer!

Canadian Digital Preservation Strategy

In 2005, Library and Archives Canada (LAC) initiated a dialog about a Canadian Digital Information Strategy (CDIS).  Its consultations culminated in a 2006 National Summit.  A broad consensus emerged, leading to CDIS development.  A chapter devoted to digital preservation is summarized with:

·       Conduct a national appraisal of digital information priorities for long-term retention and preservation, and accelerate capture accordingly.

·       Develop a distributed network of Trusted Digital Repositories with responsibility to capture, manage, preserve and provide access to Canada's digital information assets

·       Foster Canadian R&D that advances the goals of better managing, sustaining and providing access to digital information, and contribute research outcomes to the global effort.

·       Develop new workplace skills capacity for digital information management and preservation.

·       Raise the public and political profile of digital preservation issues.

This conforms to what other nations’ archivists are writing, including accepting all the flawed notions implied by “Trusted Digital Repositories.”  It illustrates a weakness of most committee reports: reduction to bland consensus that exhibits no original thinking whatsoever.  It also fails to identify any specific action commitments.  Once again, I am disappointed with words from the archival community!

Epistemology

Semantics

G.E. Moore's deliberations about the "meaning" of goodness led to the conclusion that "the good" is an idea that must be understood on its own terms.  The meaning of this expression cannot be captured by a redefinitional formula of the format "The good is that which . . . (is happiness promoting, conducive to the greatest good for the greatest number, or the like)."  For a definitional clarification of [it] is predestined to futility.  The philosophical quest (going back to Socrates) for the definition of this and similar terms is totally misguided.  To be sure, this check is not a complete defeat for the clarificatory project.  For we must distinguish between definitions and explanations.  Explaining what "the human good" comprises as a viable project—friendship, enjoyment, etc.—is one sort of thing.  But defining what "the human good" as such means is something quite different—and indeed something that is infeasible.

Rescher[2]

Academic literature and the I/T trade press are replete with articles suggesting that the Semantic Web will remedy readers’ confusions with what authors mean.[3]  Hyperbole should be balanced by sobriety about the limitations of communication, starting with what we might mean by “semantics”.

We find explanations such as the following excerpt from Wikipedia:

[T]here is no capability within the HTML itself to unambiguously assert that, say, item number X586172 is an Acme Gizmo with a retail price of €199, or that it is a consumer product.  Rather, HTML can only say that the span of text "X586172" is something that should be positioned near "Acme Gizmo" and "€199", etc.  … There is also no way to express that these pieces of information are bound together in describing a discrete item, …

The semantic web addresses this shortcoming, using the descriptive technologies Resource Description Framework (RDF) and Web Ontology Language (OWL), and the data-centric, customizable Extensible Markup Language (XML).  … [M]achine-readable descriptions enable content managers to add meaning to the content, i.e. to describe the structure of the knowledge we have about that content. This way the machine can process knowledge itself, instead of text, …
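
The binding that the excerpt describes can be sketched with subject-predicate-object triples, the data model underlying RDF.  The following is a minimal illustration in plain Python rather than real RDF syntax; the predicate names (`type`, `name`, `retailPrice`, `currency`), the class label `ConsumerProduct`, and the helper `describe` are hypothetical, chosen only for this example.

```python
# Minimal sketch of RDF-style triples in plain Python (not real RDF syntax).
# Predicate names and the class label are hypothetical, for illustration only.
from collections import defaultdict

triples = {
    ("X586172", "type", "ConsumerProduct"),
    ("X586172", "name", "Acme Gizmo"),
    ("X586172", "retailPrice", "199"),
    ("X586172", "currency", "EUR"),
}

def describe(subject, graph):
    """Collect every predicate/object pair asserted about one subject."""
    description = defaultdict(set)
    for s, p, o in graph:
        if s == subject:
            description[p].add(o)
    return dict(description)

item = describe("X586172", triples)
print(item["name"], item["retailPrice"], item["currency"])
```

Unlike adjacent spans of HTML text, the triples state explicitly that the name, the price, and the product class all describe one item.  What the strings themselves denote is still left to whoever reads or programs against them.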

The quotation from Rescher reminds us that an attempt at precise definition would be futile.  (The Cambridge Encyclopedia of Philosophy includes 13 distinct explanations.)  Without expecting that any reader will accept it entirely, my thinking usually starts with the notion that semantics is a set of relationships between language strings (words, phrases, sentences, …) and the object or situation each represents—not a relationship between two language strings, i.e., not a definition![4] 

What the Wikipedia entry teaches us is that the Semantic Web mechanisms can add two ingredients to HTML and XML objects: thesaurus entries and graphs communicating structure.  What can these convey, and what is necessarily still missing?  They can narrow a reader’s possible suppositions about what an author might have intended.  However, neither conveys certainty about what is being communicated. 

Thesaurus entries can help a reader reject wildly incorrect supposed meanings.  But the poor reader still cannot know with certainty that his understanding is closer to an author’s intention than what he understood from the author’s own words.

One might hope that a reader could compare an author’s graph expressing structure to his own notion to verify supposed meaning.  Unfortunately, its help is limited in much the same way as that of thesaurus entries.  One can readily detect that a wildly incorrect ontology does not correspond to a published ontology.  However, demonstrating that two ontologies are equivalent is computationally hard in general; for expressive ontology languages the reasoning required is intractable in the worst case, so the comparison can be infeasible for large ontologies.
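
The asymmetry between rejecting and confirming can be illustrated with a toy example.  The sketch below, in plain Python, treats an ontology as a set of labelled edges and tests whether two such graphs match under some renaming of their terms.  This is a naive brute-force check chosen for illustration, not a method drawn from the Semantic Web literature; the function names and the sample terms are hypothetical.

```python
# Toy sketch: compare two small "ontologies" given as sets of
# (subject, relation, object) edges. Names and data are hypothetical.
from itertools import permutations

def terms(graph):
    """All terms appearing as subject or object, in sorted order."""
    return sorted({t for s, _, o in graph for t in (s, o)})

def relation_profile(graph):
    """Sorted list of relation labels: a cheap fingerprint of the graph."""
    return sorted(r for _, r, _ in graph)

def equivalent_up_to_renaming(a, b):
    """True if some one-to-one renaming of a's terms turns a into b."""
    ta, tb = terms(a), terms(b)
    # Cheap rejection: wildly different graphs fail here immediately.
    if len(ta) != len(tb) or relation_profile(a) != relation_profile(b):
        return False
    # Expensive part: up to n! candidate renamings for n terms.
    for candidate in permutations(tb):
        rename = dict(zip(ta, candidate))
        if {(rename[s], r, rename[o]) for s, r, o in a} == set(b):
            return True
    return False

author = {("Gizmo", "is_a", "Product"), ("Product", "has", "Price")}
reader = {("Widget", "is_a", "Goods"), ("Goods", "has", "Cost")}
other = {("Widget", "is_a", "Goods"), ("Cost", "has", "Goods")}

print(equivalent_up_to_renaming(author, reader))
print(equivalent_up_to_renaming(author, other))
```

With three terms the loop tries at most 6 renamings; with 20 terms the worst case is 20!, about 2.4 × 10^18.  Smarter algorithms exist, and real ontology languages add logical axioms that this toy ignores, so the sketch only conveys why rejection is cheap and confirmation is not.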

The structures alluded to above are what today’s authors usually mean by “ontologies”.  (A century ago, “ontologies” had a quite different meaning!)  Martin Hepp’s recent article[5] identifies basic inadequacies of current engineering for ontology exploitation.

Startling Facts

Erma Bombeck’s Writers Workshop shares the following startling estimates, and other related facts:

1/3 of high school graduates never read another book for the rest of their lives.

42 percent of college graduates never read another book after college.

80 percent of U.S. families did not buy or read a book last year.

70 percent of U.S. adults have not been in a bookstore in the last five years.

57 percent of new books are not read to completion.

70 percent of books published do not earn back their advance.

70 percent of the books published do not make a profit.

News

Short Takes

Carnegie Mellon University's digital library collection has exceeded 1.5M books.

A spate of e-mails suggests that many universities have ceased thesis deposits at UMI Microfilm, or are considering doing so.

Hormel has lost a spam lawsuit!

Standardization of PDF 1.7 has been approved as ISO 32000.

Investigators at UCSC are developing a tool to measure the trustworthiness of any Wikipedia page.

A new ACM periodical, Transactions on Knowledge Discovery from Data, is seeking research papers on information discovery and analysis.

This year’s desktop Linux survey attracted twice as many respondents as last year’s.  More than 50% favor Ubuntu distributions.  See also eWeek’s technical review.

Newspapers and governments began altering photographic images long before digital photography made this kind of fraud easy.  A website, Top 15 Manipulated Photographs, illustrates this.

IBM’s “Many Eyes” project is an experiment on the power of human visual intelligence to find patterns.  You can explore it online.

Hard drives are often taken for granted, but they're the main repository for all our digital data.  Western Digital tries to get the hard drive a little more respect in its Editor's Day event.  Work towards higher density data storage continues vigorously.  Notwithstanding two decades in which we saw 2x improvements roughly every 15 months, another 20x still seems feasible.
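
The growth figures in the hard-drive item imply a simple calculation.  The sketch below, in Python, assumes exactly 20 years of steady doubling every 15 months; both numbers are round approximations taken from that item, the variable names are illustrative, and the projection assumes (hypothetically) that the historical pace continues.

```python
import math

years = 20                 # "two decades"
months_per_doubling = 15   # "2x improvements roughly every 15 months"

doublings = years * 12 / months_per_doubling   # number of doublings so far
past_gain = 2 ** doublings                     # cumulative density improvement

# Assumption: the same pace continues. How long would a further 20x take?
future_doublings = math.log2(20)
future_years = future_doublings * months_per_doubling / 12

print(f"{doublings:.0f} doublings, about {past_gain:,.0f}x so far")
print(f"another 20x would need about {future_years:.1f} more years")
```

On these assumptions, two decades correspond to 16 doublings, a gain of roughly 65,000x, and a further 20x is a little more than four additional doublings.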

What is WiMAX?  Will We Ever Have It?

Worldwide Interoperability for Microwave Access (WiMAX) is a technology standard for radio transmission of large amounts of digital data.  Compared to today's WiFi radio links, it will increase signal range from a few hundred yards to about 20 miles and deliver data more than 10 times more quickly.  It is also an economical alternative to coaxial cable and telephone lines for bringing broadband Internet access to homes and businesses.  WiMAX backers plan to build it into TVs, notebook PCs, and smart phones.

Few technologies have been as widely hyped and as broadly anticipated as WiMAX.  But for several years it has been expected “next year.”  A