Digital Document Quarterly

Perspectives on Trustworthy Information

Volume 7, Number 1, 1Q2008

HMG Consulting

Saratoga, CA 95070

©  2007, H.M. Gladney

 

ISSN: 1547-8610

 

An extrapolation of its present rate of growth reveals that in the not too distant future Physical Review[1] will fill bookshelves at a speed exceeding that of light.  This is not forbidden by relativity, since no information is being conveyed.

Attributed to Rudolf Peierls, 1961

In case DDQ readers have not noticed, the newsletter now relies on links to other people’s Web contributions much more heavily than it did in its early years.  I hope each reader will find a few of the linked Web pages particularly interesting.

Open Access

Scanning to Create On-Line Books

Microsoft® has announced a massive book digitization initiative.  Amazon® and Google® projects had been announced earlier.

·       Perhaps the earliest massive book digitization project was the Carnegie-Mellon Million Book Project.  It now provides on-line access to 1.5 million titles.

·       In December 2004, Google announced digitization partnerships with the Libraries of Harvard, Stanford, U. Michigan, and Oxford, as well as with the New York Public Library.  “Users searching with Google will see links in their search results page …  Clicking on a title delivers a page where users can browse the full text of public domain works and brief excerpts and/or bibliographic data of copyrighted material.”  A concise description is available for Google’s enhanced catalog of the world's books.

·       In June 2007, Amazon announced its project to capture and distribute hard-to-find books.  Its partners will provide rare and inaccessible books in return for a revenue share.  The partners are the libraries of Emory University, the University of Maine, and two cities—Toronto and Cincinnati.  Reproductions will be sold.

·       Some research libraries have rebuffed commercial offers to scan their books.  These libraries have instead joined the Open Content Alliance, a not-for-profit effort aimed at making works broadly available.

·       Microsoft has released Live Search Books and a position statement on copyright.  Not everybody is pleased with a related Microsoft/Library of Congress deal.

At the same time, public libraries are adapting to the Internet.  For balanced commentary, see Paul Courant, Scholarship and Academic Libraries (and their kin) in the World of Google, First Monday 11(8), 2006.

For managing my personal collection and also bibliographic records from collections such as those just described, I have found the CollectorZ Book Collector utility ($40) convenient.

Scholarly Material On-Line

The Oxford Internet Institute will use the Oxford Research Archive to make its research output available.  Its announcement stated, “By having our outputs permanently and securely archived by the University, we are confident that it will significantly increase the visibility and dissemination of our work.”  Harvard University will also disseminate its faculty's scholarly writings in an open archive.  These Oxford and Harvard decisions will probably persuade other research universities to create similar offerings.

The Just Free Books website identifies hundreds of websites providing free content access.

In 2007, the IEEE Xplore® collection added over 90,000 documents including all of Proceedings of the IEEE and IEEE Computer Magazine, historical content from over 40 IEEE titles, and selected engineering publications dating back to 1913.[2]

The Massachusetts Institute of Technology has long made its course materials widely available.  More recently an MIT physics professor's lectures have become a Web hit.

Carl Malamud is making available free online copies of every U.S. Supreme Court decision and Court of Appeals ruling since 1950, 1.8 million rulings in all.[3]

Marginalization of Research Librarians

The details are difficult to predict, but surely every information access action that can be automated will be.  This prospect, together with the digital preservation literature, drew my attention to the potential marginalization of research librarians.

Twenty years ago my first tool for finding information was research library catalogs; today it is a Google search of on-line resources.  Although online search tools do index library catalogs, most useful “hits” seem to come from other sources.  Within limits dictated by copyright law, the online collections just described will replace my visiting nearby libraries.  Of course, a few research librarians are helping achieve this convenience.  However, it will reduce our future need for help from their colleagues.  Simultaneously, work on search semantics will reduce how long it takes me to find the material I want.

These changes, and others that I cannot anticipate, are inevitable.  Similar economic circumstances have eliminated entire professions, such as stenography.  The only at-risk professions that will survive are those whose members invent replacement services for which they are uniquely qualified.

Digital Preservation

Subject-based and institutional digital repositories are increasingly being hailed as the preferred means for safeguarding the future accessibility of digital information.

Nestor Newsletter[4]

Who is doing the hailing?  Anybody other than institutional repository staffs?

Papers questioning aspects of digital preservation, particularly the “Trusted Digital Repository” approach, have begun to appear, together with what may be the beginnings of attention to a viable alternative: helping information creators prepare digital documents suited to a long future.  See:

·       M. Seadle and E. Greifeneder, In archiving we trust: Results from a workshop at Humboldt University in Berlin, First Monday 13(1), January 2008.

·       R. Harvey, So where’s the black hole in our collective memory? Digital Preservation Europe, 2007, for which comments are solicited.

·       J.A. Smith and M.L. Nelson, Creating Preservation-Ready Web Resources, D-Lib Magazine 14(1/2), 2008.

·       G. Shankaranarayanan and A. Even, The Metadata Enigma, Comm. ACM 49(2), 88-94, 2006.

Carlos Oliveira’s 2007 presentation, Digital Curation and Preservation: Funders’ Perspectives, shows the European Commission's current bias towards repository-based solutions, somewhat mitigated by its call for “Redefinition of roles … of institutions (libraries, museums, archives, universities) in charge of creating, collecting, organising, preserving and providing access to knowledge-bearing objects.”

However, achieving long term preservation of sensitive content by improved repository procedures is almost surely infeasible.  Straining to transform cultural repositories to be “Trusted Digital Repositories” cannot be a complete preservation solution because it cannot guarantee that stored documents have not been feloniously modified.  (At least, no-one has shown how a public access digital repository can be reliably secured for periods of many years, much less for a century or longer.)  What’s more, its objectives are not even the best objectives available.[5]

When a sculptor wants to please future generations, he crafts in stone or pours molten bronze into short-lived molds.  In ancient custom, statues were mounted in open spaces accessible to many people.  But custodians long ago learned that longevity was favored by moving artifacts into churches, museums, and palaces.  To provide access, they mounted replicas on outdoor pedestals, and sometimes poured additional copies for sharing with a far-flung public.[6] 

A lesson is evident.  Almost everybody, including this author, was gulled by the objective "digital preservation."  History teaches that the most effective objective might be to make objects (material and digital) durable when they are first created or, more precisely, when they are first shared, and to exploit many-fold replication in already-available repositories, even if these repositories are not hardened against mischief.
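To make the replication argument concrete, here is a minimal sketch, not anything DDQ or the cited papers prescribe, of how a reader might cross-check copies of one document fetched from several unhardened repositories.  The function names (sha256_hex, audit_replicas) and the repository labels are hypothetical, and the majority rule is an assumption chosen for illustration.  Agreement among replicas does not by itself prove authenticity; a digest signed by the creator when the document is first shared would be a stronger test.

```python
import hashlib
from collections import Counter


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a document's bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def audit_replicas(copies: dict[str, bytes]) -> tuple[str, list[str]]:
    """Compare copies of one document held by several repositories.

    `copies` maps a (hypothetical) repository name to the bytes it returned.
    Returns the digest shared by a strict majority of the copies, plus the
    sorted names of repositories whose copy differs from that digest.
    Raises ValueError if no digest is held by a strict majority.
    """
    if not copies:
        raise ValueError("no copies supplied")
    digests = {name: sha256_hex(data) for name, data in copies.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes * 2 <= len(copies):
        raise ValueError("no strict majority among replicas")
    suspects = sorted(name for name, d in digests.items() if d != winner)
    return winner, suspects


if __name__ == "__main__":
    # Three illustrative replicas; one has been altered.
    copies = {
        "repo_a": b"original text",
        "repo_b": b"original text",
        "repo_c": b"altered text",
    }
    digest, suspects = audit_replicas(copies)
    print("majority digest:", digest)
    print("divergent replicas:", suspects)
```

With enough independent replicas, altering a document undetected would require changing most copies at once, which is why replication can compensate for repositories that are not individually hardened.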

A Software Engineering Contribution

ACM periodicals have published few articles addressing digital preservation.  Since I believe that software engineers are not as aware of or engaged in preservation as would be socially valuable, I was encouraged by the recent appearance of The Provenance of Electronic Data,[7] which argues that “Users must know whether they have confidence in their applications' electronic data; it must therefore be accompanied by its provenance that describes the process that led to its production.”

Unfortunately, the laudable purpose of the article is not matched by its content, which is less about provenance as I understand it than about tools for recording measurement circumstances.  Our undergraduate science tutors emphasized that our laboratory notebooks needed to document every circumstance important to indicating precisely what was being measured.  Where computers are used as scientific or clinical tools, part of this process can be automated.  Tools for this are what the article describes.

DDQ readers are likely to be misled, as I was, by the article’s unusual use of the word “provenance” and by the close relationship it claims between this work and the conventional usage of “provenance”, as summarized in the article’s final sentence:  “In the same way scholars can appreciate works of art by studying their documented history, users would be able to gain confidence in electronic data thanks to provenance queries.”  I prefer to stay with the traditional meaning of “provenance” and its connection with notions of authenticity.

What most interests me in The Provenance of Electronic Data is its reference implementation for tools to record clinical measurements.  One of the previewers of DDQ emphasized that “the article doesn't do justice to the amount of work that has been done” and recommended attention to an EU Provenance Project architecture web page.

For “provenance”, the Concise Oxford English Dictionary has “1. the place of origin or earliest known history of something. 2. a record of ownership of a work of art or antique.”  The Online Free Dictionary definition is similar.  A current Wikipedia entry starts with “Provenance is the origin or source from which something comes, and the history of subsequent owners (also known in some fields as chain of custody).  The term is often used in the sense of place and time of manufacture, production or discovery.”

The preceding paragraphs might be seen as a merely pedantic quibble over the meaning of a common word.  Some of my friends who regularly preview DDQ and sometimes recommend changes had valid criticisms of drafts and still have concerns about what appears above.  Indeed, one recommends avoiding controversy by simply withdrawing all DDQ mention of The Provenance of Electronic Data.  However, other current conversations emphasize that rapid progress towards practical digital preservation requires more effective collaboration between archivists/librarians and scientists/engineers than we have achieved so far.  Such collaboration would be enhanced by as much common vocabulary as is feasible without inhibiting thinking about subtle topics.  Shared meanings for key words such as “provenance” are critical.

If the meaning adopted by Moreau et al. were used for works of art, the provenance of a painter’s work might include chemical description of the paints used and other technical factors.  If such topics are dealt with at all, it is in descriptive and critical articles or books about the painter,[8] not in the provenance information of any painting.

On the other hand, a philosopher of language might remind us of the fuzzy boundary between identifying an artifact’s source and describing this source.  Even so, a modern author should stay as close as possible to traditional meanings, or describe and justify his divergences when he cannot conform.

All this has immediate, practical consequences.  Librarians and archivists have worked for many years to define shared and extensible schemas for metadata that include provenance information, such as the Metadata Encoding and Transmission Standard (METS).  A recent step is the appearance of a practical prescription.  The authors of The Provenance of Electronic Data should consider how to bridge from their work to the archivists’ more mature work.

More generally, the authors and other members of the technical community should consider how they might contribute to cross-disciplinary collaboration to achieve broadly-based, convenient long-term digital preservation.

U.S. Preservation Initiatives

A Blue Ribbon Task Force on Sustainable Digital Preservation has been funded by the National Science Foundation and the Mellon Foundation, with collaboration of the Library of Congress, JISC, CLIR, and NARA.  Its two-year mission is “to develop a viable economic sustainability strategy to ensure that today’s data will be available for further use, analysis and study.”  Task force members have been named.

Funding for sustainable storage development is available by way of an NSF Cyberinfrastructure RFP.

DDQ readers might be interested in pertinent commercial initiatives.  Network-attached storage servers are offered as one answer to new laws and regulations governing long-term data storage.

Segmented, parallel architectures are improving the capacity, performance, manageability, and cost of file storage.  Digital repositories can achieve significant improvements in data delivery to computing elements and in capacity scaling.  Many white papers are available.  Readers might also be interested in replication software such as the HiT DBMoto offering.

The PREMIS Data Dictionary for Preservation Metadata, version 2.0 (a revision of the 2005 Final report of the PREMIS Working Group) has just been released.  This document specifies metadata for implementation in multifarious repositories, supported by guidelines for its creation, management and use, and oriented toward automated workflows.  Enhancements include expanded rights metadata, more detailed treatment of significant properties and preservation level, and extensibility mechanisms for several metadata units.  XML schemas for implementation have been drafted.

A recent step is the appearance of a practical prescription, Guidance for using PREMIS with METS.[9]

Other Preservation Initiatives

Readers might take interest in the Province of Alberta Children’s Services information preservation project—a rare example of a state/provincial level initiative with good political/bureaucratic support.  Garth Clarke will describe the project at the forthcoming Canadian Archivists 2008 Conference.

Microsoft is currently investigating MS Word® enhancements for managing embedded metadata, doing so along lines suggested in DDQ long ago.  These are intended to make it easy for authors to embed metadata.  They will also enable commentators to access and edit journal and article metadata with XML using the NLM format.

Microsoft is also considering infrastructure for journal templates that can encapsulate semantics, help authors to annotate with metadata, and enhance structure and content validation prior to publication submission.  According to Lee Dirks of Microsoft, a Blue Ribbon TF member, they believe that enabling the capture of metadata early in the authoring cycle will lead to a improv