Digital Document Quarterly

Perspectives on Trustworthy Information

Volume 7, Number 2, 2Q2008

 

 

 


HMG Consulting

Saratoga, CA 95070

©  2008, H.M. Gladney

 

ISSN: 1547-8610

Open Access: OCLC and Google to Share Book Information

OCLC and Google have agreed to exchange book discovery data.  Google will link from Google Book Search to WorldCat, which will drive traffic to online library services.  Google will also share digitized book data.  WorldCat will represent OCLC member library collections and link books scanned by Google.  A user who finds a book in Google Book Search will be able to use WorldCat to find local library copies.

Archiving and Long-Term Digital Preservation (LDP)

Recent correspondence about archiving reminds me how difficult it is to communicate precisely.  Writing is more difficult than conversation because no listener can signal confusion that a speaker might promptly correct.  This challenge has been particularly evident in writing DDQ 7(2).  Even though what follows has been repeatedly edited with advisors’ help, I am not as confident as I would like to be that readers will infer what I intend.  The difficulty is even greater for documents in long-term storage (Figure 1).

Figure 1: Simplified version of a model used in Preserving Digital Information[1]

One can reduce the communication difficulty by providing careful definitions and contextual information.  However, this remedy creates its own hazards: lengthy explanations that try readers’ patience, blizzards of detail that obscure central points, and seeming pedantry.

Such difficulties hamper community attempts to design information sharing tools, a current emphasis in digital library literature.  Different authors use even well-known terms, such as “archiving”, differently.  Partly for this reason, I understand only imperfectly what the Blue Ribbon Task Force on Sustainable Digital Preservation and Access (abbreviated “BRTF” below) includes within its “sustainability” scope, or precisely which questions this group intends to answer.

As background for what follows, some DDQ terms of reference need to be explained.  By “archiving”, DDQ means digital content management needed to ensure ready access to reliable records both immediately and in the distant future.  It is useful to partition this into topics which, in good information system design, are only lightly coupled:

(a)     Management prior to repository ingestion.  This portion of digital object management is more evident for bureaucratic records than it is for cultural and scholarly works.[2]  Bureaucratic records are typically generated, formatted, and managed to conform to pre-existing rules.  Controls are less formal for other data.  For concise reference, DDQ will allude to this portion as DocPrep;

(b)     Core digital library services, being the functionality defined by a two-year-old interface standard,[3] JSR 170.  DDQ will allude to this portion as DocSS (as suggested by Figure 2 in DDQ 5(2));

(c)     Near-term repository management, including all aspects of ingestion, curation, cataloging, access provision and business controls, and storage management—everything needed for content user services now and for roughly ten years.  Typically this implements higher level digital repository services that rely on one or more DocSS instances.  DDQ will allude to this portion as DocArch (see the second largest box in Figure 2 in DDQ 5(2), where it is labeled “Archival Store”);

(d)     Long-term digital preservation, which is taken to be all measures required and/or undertaken to mitigate digital object unreliability caused by ravages of time, including human misfeasance, fading human memory, and technological obsolescence.  DDQ has already called this LDP.

A common feature of these partitions is that each focuses on tools that handle target content directly.  An emerging software category addresses

(e)     Assisting human managers of repository institutions for planning their work, managing selection into collections, and signaling execution deadlines.  For examples, see EU Planets tools below.

DocPrep is important for bureaucratic records management, as in the U.S. government, but might be of little interest to scholarly and cultural repositories.  DocSS is mature, with many COTS and open source offerings, so that new R&D projects for this component would be sensible only for specific enhancements, such as performance, scaling, or reliability improvements.  DocArch implementations are likely to differ for different kinds of institutions; for instance, small colleges might have different needs than the University of California.  They might also require extensive parametric customization for institutional preferences and for coupling to document manipulation tools.  Nevertheless, DocArch ideas seem mature.

A Note on Partitioning into Software Modules

Among reasons for treating LDP as a distinct partition is the fact that it can be developed and connected to the other components without much changing their implementations or disrupting installations that use them.  The components (a) through (e) are high level partitions of archiving services.  Each of these should be composed of several smaller lightly coupled components.  Such partitioning into lightly coupled components is particularly helpful when the components have different maturities and different portability among installations.

What do we mean when we say that software modules are “lightly coupled”?  We mean that a programmer responsible for one module can change its implementation without impacting coupled modules and without consulting with the programmers responsible for these other modules.  The key is more or less formal agreements on the syntax and semantics of the interfaces that each module makes available to or uses from coupled modules.  So-called APIs (application programming interfaces) are an agreement form that is useful between acquainted programmers.  Interface standards, such as JSR 170 (Content Repository for Java™ technology API), are more formal interface specifications.  An interchange convention for sharing data objects via communications links, such as the protocol for Object Reuse and Exchange (ORE), is still another form.
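As a minimal sketch of light coupling, assuming a deliberately tiny interface, the Python fragment below has a client function that depends only on an agreed interface, so that one implementation can be swapped for another without touching the client.  The names ContentStore, MemoryStore, LoggingStore, and archive_report are hypothetical and chosen for illustration; they are not taken from JSR 170 or from any DDQ software.

```python
from typing import Dict, List, Protocol


class ContentStore(Protocol):
    """Hypothetical interface agreed between module owners (not JSR 170)."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class MemoryStore:
    """First implementation: keeps objects in a dictionary."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


class LoggingStore:
    """Replacement implementation: same interface, also records stored keys."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}
        self.log: List[str] = []

    def put(self, key: str, data: bytes) -> None:
        self.log.append(key)
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


def archive_report(store: ContentStore, name: str, text: str) -> bytes:
    """Client module: uses only the agreed put/get interface."""
    store.put(name, text.encode("utf-8"))
    return store.get(name)
```

Calling archive_report(MemoryStore(), "r1", "hello") and archive_report(LoggingStore(), "r1", "hello") returns the same bytes.  The programmer who owns the store can change its internals freely, provided the syntax and semantics of put and get are preserved.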

Content Partitioning in Preservation Discussions

Figure 2: Workflow for bureaucratic documents

 

Digital archiving literature seems to be partitioned: articles about bureaucratic record handling (Figure 2), articles about managing cultural and scholarly works (Figure 3), a beginning of articles about personal information,[4] and perhaps further partitions.  Articles about one partition seldom cite those in the others.  For instance, there seems to be little practical connection between work on ERA at the National Archives and Records Administration (NARA) and that on NDIIPP at the Library of Congress.  To some extent, this is justified because formal rules and human roles are significantly different in the different partitions.  An unfortunate side effect is little attention to synergism that could reduce the cost of tools and enhance information sharing between partitions.


Figure 3: Workflow for cultural documents

Portico and Ithaka’s survey of about 1000 U.S. library directors identifies another partition, electronic periodicals.  At the same time, A Comparative Study of e-Journal Archiving Solutions has appeared.  It makes evident striking differences between electronic periodicals and other documents.  The treatment of e-periodicals is dominated by intellectual property law considerations.  The authenticity of saved periodicals is unlikely to be a big issue because the material is not a tempting target for felonious modification and because any interesting periodical is likely to be saved by many autonomous libraries.  The topics discussed in the study suggest that today’s urgent issues for e-periodicals have more to do with near-term archiving (less than 50 years) than with long-term archiving (more than 100 years).

LDP literature is more difficult than it might otherwise be because different communities display different notions of worthwhile research.  If a computer scientist can describe how to satisfy a service requirement, he would say it is not a proper research topic.  In contrast, the U.S. NDIIPP plan reflects a common view that a research topic exists for any information management need unsupported by available software.[5]  In IBM Research corridors in the 1980s, the boundary between research and practical engineering was called “SMOP”—“a simple (or small) matter of programming.”  This did not necessarily mean that the task being discussed was either uncomplicated or inexpensive.  Instead it meant that computer scientists knew answers to its difficult questions, allowing most of the work to be passed to a software development team.  Patent law wording is apt; one cannot obtain protection for an artifact or process design “obvious to someone versed in the state of the art”.

How to Speed Up LDP Progress

[T]here has been relatively little discussion of how we can ensure that digital preservation activities survive beyond the current availability of soft-money funding; or the transition from a project's first-generation management to the second; or even how they might be supplied with sufficient resources to get underway at all.  (Lavoie[6])

The Blue Ribbon Task Force on Sustainable Digital Preservation and Access (BRTF) has been described by the Director of the NSF Office of Cyberinfrastructure as “the only group I know of that is chartered to help us understand the economic issues surrounding sustainable repositories … ”.  The BRTF web site declares one objective to be “a research agenda … [for] economic sustainability of digital information”.  As suggested by the Lavoie quotation, this will surely include recommendations on how repository institutions can be funded and also how their running expenses can be minimized.  Will it also be within the BRTF scope to suggest how research and development of LDP tools can be made more efficient and effective than is currently the case?

It seems to me that LDP progress would be accelerated if participants would engage in more sharing of reusable modules than I am aware of.  Certainly, they often refer to “modular architecture”.  By copy of this DDQ number I am asking readers to tell me about any open source LDP code they know of.  I will also write to the larger LDP projects to inquire.  DDQ 7(3) will publish the information I receive.

I believe that digital preservation research funded by taxpayers has been very wasteful, partly as a consequence of poor scholarship.  Authors seem to pay little attention to what is in the literature.  What needs to be said is perhaps controversial, but it is nevertheless under consideration as a theme of DDQ 7(3).  The problem is illustrated by a JCDL 2008 paper.

When I first saw the paper A Data Model and Architecture for Long-term Preservation,[7] I wondered if it described a special case of TDO methodology.[1]  Since this was not clear to me, and is still not entirely so, I e-mailed its authors that I could not see what novelty their paper conveyed and requested clarification.  After two weeks without an answer to this question, I annotated a copy with notes about apparent problems, prior work, and missed opportunities.  I sent this to the authors, repeating my question.  That netted a response mentioning end of term workload and reminding me of copyright limitations.  The authors have yet to react to the points communicated.

Why don’t I merely ignore this paper?  It’s an example of much wasteful work—wasteful because authors don’t build forward from prior work—even authors from prestigious institutions such as the University of California.  What’s just as disturbing is that JCDL referees fail to detect problems such as those illustrated by the example.  Because the problem seems to be widespread, DDQ 7(3) will analyze what I see, expanding on this and other examples.  To illustrate it, I am making my critique of the example available to anyone who requests it by e-mail.

On a positive note, JCDL 2008, in which the criticized paper was presented, contains several papers whose ideas might prove helpful for semi-automatic creation of metadata called for in the TDO architecture.[8]  Also note that the Bibliographical Center for Research is inviting prompt critical comment on its CDP Imaging Best Practices draft document.  (The announced 13th June deadline is “soft”.)

NARA’s Electronic Records Archives (ERA)

The National Archives and Records Administration (NARA) summarizes its public commitment by, “ERA will be a comprehensive, systematic, and dynamic means for preserving virtually any kind of electronic record, free from dependence on any specific hardware or software. …  ERA will support the National Archives mission by making it easy for the public and government officials to discover, use, and trust the records of our government”.[9]  Presumably this includes LDP as defined above.

NARA is overwhelmed by digital information, facing huge increases in both electronic records and classified records, according to Congressional testimony by National Security Archive director Tom Blanton.  Blanton summarizes his problem list with,

[T]he National Archives and Records Administration is a tiny agency with … overwhelming challenges.  NARA’s entire operation ($404 million …) is about equal to the cost of a single Marine One helicopter ($400 million) in the planned fleet of 28 … intended to serve the President and senior officials.

He recommends:

Congress should order NARA and the agencies to re-engineer agency relationships so they create archive-ready records, not just records that NARA has to re-process down the line.  The proposed bill H.R. 5811 would make a good start on this challenge, but we need to go further, …

Compare a recommendation in Economics and Engineering for Preserving Digital Content, quoted below.

Blanton emphasizes difficulties caused by classified records, which are peculiar to government data.  Of more interest to most DDQ readers might be NARA’s Electronic Records Archives (ERA) project,[10] whose largest expenditure is a Lockheed Martin (LM) contract for about $300M.

Reports of a May 14th U.S. Senate hearing and some private rumors led me to wonder whether the ERA project was experiencing serious difficulties.  So I drafted some harsh paragraphs for this DDQ number and shared them in a letter to Dr. Weinstein, the Archivist of the United States, inquiring whether they were appropriate.  I promptly received a very responsive letter from the Director of the ERA Program Office, Kenneth Thibodeau.  As well as answering my specific current concerns, it pointed me at NARA ERA documentation that (for unknown reasons) I had not previously found and the 2004 ERA RFP.  It also emphasized that COTS products figured prominently in the upcoming LM delivery.  It further explained the reason for a separate implementation for Presidential files.  Requirements of the Presidential Records Act differ from those of the Federal Records Act.  It will take me some time to absorb the two requirements sets.

I am still uneasy about how well ERA will meet its objectives, but have no evidence for this unease.  The original LM delivery commitment had been September 2007; actual delivery is expected this month (June 2008).  Since such delays are common for big software, this delay does not itself worry me.  We’ll see whether there is reason for DDQ to comment in some future number.

Limitations of OAIS

In view of the many archiving articles that cite OAIS as a sort of “good housekeeping sign of approval”, readers might be interested in a critical look at how OAIS is used.  Alexander Egger has written about shortcomings of the model.[11]

E-spionage Threats and “Trusted Digital Repositories” (TDR)

Enthusiasts for the TDR approach[12] might not believe repeated assertions that it depends on unrealistic assumptions.  One such is that a stored object can be protected for decades or longer against felonious modification.  This is called into question by a BusinessWeek probe of attacks on America's most sensitive computing resources.[13]  Even strongly guarded information has exposures.  Another doubtful assumption is that improper modifications can reliably be detected by repository audits.

TDR enthusiasts might argue that they intend to manage only information that nobody will want to attack.  But how can they decide which information is an attractive target and which not?  Do they propose one method of archiving for cultural and scholarly documents and other, yet-to-be proposed methods for sensitive business, government, and private information such as their personal medical records?

I know only two ways to demonstrate information authenticity many decades after it was created.  One exploits public key cryptography.[14]  The other compares copies in autonomous dark archives with publicly accessible copies.  The dark archives must provide extraordinary protection for dark copies’ integrity.[15]  Is it prudent to consider either possibility as fail safe?  I don’t!
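The second approach, comparing copies, can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the function names and the archive labels are hypothetical, SHA-256 is my choice of digest, and real dark archives would exchange digests over protected channels rather than hold all copies in one process.

```python
import hashlib
from typing import Dict, List


def digest(data: bytes) -> str:
    """Return the SHA-256 digest of a copy as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def disagreeing_archives(public_copy: bytes,
                         dark_copies: Dict[str, bytes]) -> List[str]:
    """Name the dark archives whose copy differs from the public copy.

    dark_copies maps a (hypothetical) archive label to the bytes it holds.
    An empty result means every dark copy matches the public one.
    """
    expected = digest(public_copy)
    return sorted(
        name for name, copy in dark_copies.items()
        if digest(copy) != expected
    )
```

A non-empty result shows only that copies disagree.  Deciding which copy is the authentic one still depends on how well each dark archive has protected its holdings, which is why the method is not fail-safe.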

Is the TDO method correct and complete as described?  I think so, but don't know.  Repeated invitations to challenge its methodology have induced no plausible criticism.  Is there some better method for validating object authenticity than the TDO method?  None has been proposed.

EU Preservation Program (Planets) Software

In January, the Planets project[16] announced a set of LDP tools to be made available, including:

·       The Planets Preservation Planning Tool (Plato), to help organizations move from requirements assessment to action planning;

·       Two emulators, Dioscuri for simulating a practical computer environment and a Universal Virtual Computer (UVC)[17] for environment independent information representation;

·       A Preservation Characterisation Registry to identify characteristics of digital materials that are candidates for LDP;

·       The XCEL significant property extraction tool working on text, image, sound and some other formats;

·       A testbed, which is a controlled software environment for digital preservation experiments; and

·       A Planets Interoperability Framework for integrating Planets tools and services into a preservation system.  This is extensible to integration of third party tools and services.

iRODS—a New Archival Repository Implementation

The San Diego Supercomputer Center has announced that it is developing a new archival repository implementation that it calls the integrated Rule-Oriented Data System (iRODS), with its own documentation website.

Cost of Long-term Digital Preservation (LDP)

Digital preservation literature has paid too little attention to content-addressed storage technology (CAS).  CAS platforms are disk–based, object–oriented storage systems designed for the long–term retention of data that is not intended to be changed.
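The core idea of content addressing can be illustrated with a minimal in-memory sketch in Python.  It assumes that an object's address is the SHA-256 digest of its bytes; the class name ContentAddressedStore is hypothetical, and nothing here describes how any commercial CAS platform is actually built.

```python
import hashlib
from typing import Dict


class ContentAddressedStore:
    """Toy CAS: the address of an object is the digest of its content."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        """Store data and return its address; identical data is stored once."""
        address = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(address, data)
        return address

    def get(self, address: str) -> bytes:
        """Fetch data, checking that it still matches its address."""
        data = self._blobs[address]
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("stored content no longer matches its address")
        return data
```

Because the address is derived from the content, changed content yields a different address, which suits data that is not intended to be changed.  The check in get also detects silent corruption of a stored object.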

LDP cost considerations should include on-going data center costs associated with power and cooling.  An EPA report on data center energy usage observes that data storage devices contribute the highest power consumption growth rate and the highest overall power consumption.  Richard Moore’s 13th slide of a San Diego Supercomputer Center presentation[18] at an NDIIPP meeting summarizes this issue.

Questions about Sustainable LDP and Access

Lavoie’s The Fifth Blackbird[6] provides hints about the agenda and likely outcomes of the Blue Ribbon Task Force on Sustainable Digital Preservation and Access (BRTF).  When I first read it, my reaction was positive.  This optimism faded as I re-read it and discovered missing ideas.

·       The Fifth Blackbird portrays LDP economics as a funding problem, but suggests no solution.  The difficulties seem tiny compared to concerns at the beginning of the Great Depression.  Keynes’ 1930 reaction discussed knowledge as capital and emphasized cost-reducing technology.[19]  Couldn’t the BRTF seek technical ideas that make its funding concerns fade to insignificance?

·       Which of the archiving partitions will be the primary foci of the BRTF?  Which work components are viewed as most costly?  What is the relationship between cost issues and funding strategies?

·       The Fifth Blackbird strains to separate economic from technical issues, and consequently pays too little attention to technology’s potential for mitigating challenges.  Engineers, particularly those working in the for-profit sector, vigorously seek cost reduction; continued rapid progress will change content management immensely.  Innovations have often changed people’s roles, sometimes even eliminating professions.  When did you last talk to a stenographer?