Digital Document Quarterly
Perspectives on Trustworthy Information
Volume 7, Number 2, 2Q2008
HMG Consulting
© 2008, H.M. Gladney ISSN: 1547-8610
OCLC and Google have agreed to exchange book discovery data. Google will link from Google Book Search to WorldCat, which will drive traffic to online library services. Google will also share digitized book data. WorldCat will represent OCLC member library collections and link to books scanned by Google. A user who finds a book in Google Book Search will be able to use WorldCat to find local library copies.
Recent correspondence about archiving reminds me how difficult it is to communicate precisely. Writing is more difficult than conversation because no listener can signal confusion that a speaker might promptly correct. This challenge has been particularly evident in writing DDQ 7(2). Even though what follows has been repeatedly edited with advisors’ help, I am not as confident as I would like to be that readers will infer what I intend. The difficulty is even greater for documents in long-term storage (Figure 1).
One can reduce the communication difficulty by providing careful definitions and contextual information. However, this remedy creates its own hazards: lengthy explanations that try readers’ patience, blizzards of detail that obscure central points, and seeming pedantry.
Such difficulties hamper community attempts to design information sharing tools, a current emphasis in digital library literature. Different authors use even well-known terms, such as “archiving”, differently. Partly for this reason, I understand only imperfectly what the Blue Ribbon Task Force on Sustainable Digital Preservation and Access (abbreviated “BRTF” below) includes within its “sustainability” scope, or precisely which questions this group intends to answer.
As background for what follows, some DDQ terms of reference need to be explained. By “archiving”, DDQ means digital content management needed to ensure ready access to reliable records both immediately and in the distant future. It is useful to partition this into topics which, in good information system design, are only lightly coupled:
(a) Management prior to repository ingestion. This portion of digital object management is more evident for bureaucratic records than it is for cultural and scholarly works.[2] Bureaucratic records are typically generated, formatted, and managed to conform to pre-existing rules. Controls are less formal for other data. For concise reference, DDQ will allude to this portion as DocPrep;
(b) Core digital library services, being the functionality defined by a two-year-old interface standard,[3] JSR 170. DDQ will allude to this portion as DocSS (as suggested by Figure 2 in DDQ 5(2));
(c) Near-term repository management, including all aspects of ingestion, curation, cataloging, access provision and business controls, and storage management—everything needed for content user services now and for roughly ten years. Typically this implements higher level digital repository services that rely on one or more DocSS instances. DDQ will allude to this portion as DocArch (see the second largest box in Figure 2 in DDQ 5(2), where it is labeled “Archival Store”);
(d) Long-term digital preservation, which is taken to be all measures required and/or undertaken to mitigate digital object unreliability caused by ravages of time, including human misfeasance, fading human memory, and technological obsolescence. DDQ has already called this LDP.
A common feature of these partitions is that each focuses on tools that handle target content directly. An emerging software category addresses
(e) Assisting human managers of repository institutions in planning their work, managing selection into collections, and signaling execution deadlines. For examples, see the EU Planets tools below.
DocPrep is important for bureaucratic records management, as in the U.S. government, but might be of little interest to scholarly and cultural repositories. DocSS is mature, with many COTS and open source offerings, so that new R&D projects for this component would be sensible only for specific enhancements, such as performance, scaling, or reliability improvements. DocArch implementations are likely to differ for different kinds of institutions; for instance, small colleges might have different needs than the
Among reasons for treating LDP as a distinct partition is the fact that it can be developed and connected to the other components without much changing their implementations or disrupting installations that use them. The components (a) through (e) are high level partitions of archiving services. Each of these should be composed of several smaller lightly coupled components. Such partitioning into lightly coupled components is particularly helpful when the components have different maturities and different portability among installations.
What do we mean when we say that software modules are “lightly coupled”? We mean that a programmer responsible for one module can change its implementation without impacting coupled modules and without consulting with the programmers responsible for these other modules. The key is more or less formal agreements on the syntax and semantics of the interfaces that each module makes available to or uses from coupled modules. So-called APIs (application programming interfaces) are an agreement form that is useful between acquainted programmers. Interface standards, such as JSR 170 (Content Repository for Java™ technology API), are more formal interface specifications. An interchange convention for sharing data objects via communications links, such as the protocol for Object Reuse and Exchange (ORE), is still another form.
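As an illustration of light coupling, here is a minimal Python sketch. The interface `ContentStore`, the two implementations, and the client function `archive_report` are hypothetical names invented for this example; they are not taken from JSR 170 or from any DDQ software. The client depends only on the agreed interface, so either implementation can be swapped in without touching the client.

```python
from typing import Protocol


class ContentStore(Protocol):
    """Hypothetical agreed interface between two modules (not JSR 170 itself)."""

    def put(self, name: str, data: bytes) -> None: ...

    def get(self, name: str) -> bytes: ...


class DictStore:
    """One implementation: keeps content in a dictionary."""

    def __init__(self) -> None:
        self._items: dict[str, bytes] = {}

    def put(self, name: str, data: bytes) -> None:
        self._items[name] = data

    def get(self, name: str) -> bytes:
        return self._items[name]


class ListStore:
    """A replacement implementation with entirely different internals."""

    def __init__(self) -> None:
        self._pairs: list[tuple[str, bytes]] = []

    def put(self, name: str, data: bytes) -> None:
        # Drop any earlier version of this name, then append the new one.
        self._pairs = [(n, d) for n, d in self._pairs if n != name]
        self._pairs.append((name, data))

    def get(self, name: str) -> bytes:
        for n, d in self._pairs:
            if n == name:
                return d
        raise KeyError(name)


def archive_report(store: ContentStore, text: str) -> bytes:
    """Client module: uses only the interface, never the store's internals."""
    store.put("report.txt", text.encode("utf-8"))
    return store.get("report.txt")


# The same client code works unchanged with either implementation.
first = archive_report(DictStore(), "quarterly figures")
second = archive_report(ListStore(), "quarterly figures")
```

The programmer of `ListStore` can change its internals freely, provided `put` and `get` keep the agreed syntax and semantics.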
Figure 2: Workflow for bureaucratic documents
Digital archiving literature seems to be partitioned into articles about bureaucratic record handling (Figure 2), articles about managing cultural and scholarly works (Figure 3), a beginning of articles about personal information,[4] and perhaps further partitions, with articles about one partition seldom citing those in the others. For instance, there seems to be little practical connection between work on ERA at the National Archives and Records Administration (NARA) and that on NDIIPP at the Library of Congress. To some extent, this is justified because formal rules and human roles are significantly different in the different partitions. An unfortunate side effect is little attention to synergism that could reduce the cost of tools and enhance information sharing between partitions.
Figure 3: Workflow for cultural documents
Portico and Ithaka’s survey of about 1000 U.S. library directors identifies another partition, electronic periodicals. At the same time, A Comparative Study of e-Journal Archiving Solutions has appeared. It makes evident striking differences between electronic periodicals and other documents. Their treatment is dominated by intellectual property law considerations. The authenticity of saved periodicals is unlikely to be a big issue because the material is not a tempting target for felonious modification and because any interesting periodical is likely to be saved by many autonomous libraries. The topics discussed in the study suggest that today’s urgent issues for e-periodicals have more to do with near-term archiving (less than 50 years) than with long-term archiving (more than 100 years).
LDP literature is more difficult than it might otherwise be because different communities display different notions of worthwhile research. If a computer scientist can describe how to satisfy a service requirement, he would say it is not a proper research topic. In contrast, the U.S. NDIIPP plan reflects a common view that a research topic exists for any information management need unsupported by available software.[5] In IBM Research corridors in the 1980s, the boundary between research and practical engineering was called “SMOP”: “a simple (or small) matter of programming.” This did not necessarily mean that the task being discussed was either uncomplicated or inexpensive. Instead it meant that computer scientists knew answers to its difficult questions, allowing most of the work to be passed to a software development team. Patent law wording is apt; one cannot obtain protection for an artifact or process design “obvious to someone versed in the state of the art”.
[T]here has been relatively little discussion of how we can ensure that digital preservation activities survive beyond the current availability of soft-money funding; or the transition from a project's first-generation management to the second; or even how they might be supplied with sufficient resources to get underway at all. Lavoie[6]
The Blue Ribbon Task Force on Sustainable Digital Preservation (BRTF) has been described by the Director of the NSF Office of Cyberinfrastructure as “the only group I know of that is chartered to help us understand the economic issues surrounding sustainable repositories … ”. The BRTF web site declares one objective to be “a research agenda … [for] economic sustainability of digital information”. As suggested by the Lavoie quotation, this will surely include recommendations on how repository institutions can be funded and also how their running expenses can be minimized. Will it also be within the BRTF scope to suggest how research and development of LDP tools can be made more efficient and effective than is currently the case?
It seems to me that LDP progress would be accelerated if participants would engage in more sharing of reusable modules than I am aware of. Certainly, they often refer to “modular architecture”. By copy of this DDQ number I am asking readers to tell me about any open source LDP code they know of. I will also write to the larger LDP projects to inquire. DDQ 7(3) will publish the information I receive.
I believe that digital preservation research funded by taxpayers has been very wasteful, partly as a consequence of poor scholarship. Authors seem to pay little attention to what is in the literature. What needs to be said is perhaps controversial, but is nevertheless under consideration as a theme of DDQ 7(3). The problem is illustrated by a JCDL 2008 paper.
When I first saw A Data Model and Architecture for Long-term Preservation,[7] I wondered if it described a special case of TDO methodology.[1] Since this was not clear to me, and is still not entirely so, I e-mailed its authors that I could not see what novelty their paper conveyed and requested clarification. After two weeks without an answer to this question, I annotated a copy with notes about apparent problems, prior work, and missed opportunities. I sent this to the authors, repeating my question. That netted a response mentioning end of term workload and reminding me of copyright limitations. The authors have yet to react to the points communicated.
Why don’t I merely ignore this paper? It’s an example of much wasteful work (wasteful because authors don’t build forward from prior work), even authors from prestigious institutions such as the
On a positive note, JCDL 2008, in which the criticized paper was presented, contains several papers whose ideas might prove helpful for semi-automatic creation of metadata called for in the TDO architecture.[8] Also note that the Bibliographical Center for Research is inviting prompt critical comment on its CDP Imaging Best Practices draft document. (The announced 13th June deadline is “soft”.)
The National Archives and Records Administration (NARA) summarizes its public commitment by, “ERA will be a comprehensive, systematic, and dynamic means for preserving virtually any kind of electronic record, free from dependence on any specific hardware or software. … ERA will support the National Archives mission by making it easy for the public and government officials to discover, use, and trust the records of our government”.[9] Presumably this includes LDP as defined above.
NARA is overwhelmed by digital information, facing huge increases in both electronic records and classified records, according to Congressional testimony by National Security Archive director Tom Blanton. Blanton summarizes his problem list with,
[T]he National Archives and Records Administration is a tiny agency with … overwhelming challenges. …
He recommends:
Congress should order
Compare a recommendation in Economics and Engineering for Preserving Digital Content, quoted below.
Blanton emphasizes difficulties caused by classified records, which are peculiar to government data. Of more interest to most DDQ readers might be NARA’s Electronic Records Archives (ERA) project,[10] whose largest expenditure is a Lockheed Martin (LM) contract for about $300M.
Reports of a May 14th U.S. Senate hearing and some private rumors led me to wonder whether the ERA project was experiencing serious difficulties. So I drafted some harsh paragraphs for this DDQ number and shared them in a letter to Dr. Weinstein, the Archivist of the
I am still uneasy about how well ERA will meet its objectives, but have no evidence for this unease. The original LM delivery commitment had been September 2007; actual delivery is expected this month (June 2008). Since such delays are common for big software, this delay does not itself worry me. We’ll see whether there is reason for DDQ to comment in some future number.
In view of the many archiving articles that cite OAIS as a sort of “good housekeeping sign of approval”, readers might be interested in a critical look at how OAIS is used. Alexander Egger has written about shortcomings of the model.[11]
Enthusiasts for the TDR approach[12] might not believe repeated assertions that it depends on unrealistic assumptions. One such is that a stored object can be protected for decades or longer against felonious modification. This is called into question by a BusinessWeek probe of attacks on America's most sensitive computing resources.[13] Even strongly guarded information has exposures. Another doubtful assumption is that improper modifications can reliably be detected by repository audits.
TDR enthusiasts might argue that they intend to manage only information that nobody will want to attack. But how can they decide which information is an attractive target and which not? Do they propose one method of archiving for cultural and scholarly documents and other, yet-to-be proposed methods for sensitive business, government, and private information such as their personal medical records?
I know only two ways to demonstrate information authenticity many decades after it was created. One exploits public key cryptography.[14] The other compares copies in autonomous dark archives with publicly accessible copies. The dark archives must provide extraordinary protection for dark copies’ integrity.[15] Is it prudent to consider either possibility as fail safe? I don’t!
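The second approach, comparing a publicly accessible copy with copies held in autonomous dark archives, can be sketched roughly as follows. This is only an illustration under assumptions: the function names are hypothetical, SHA-256 is chosen merely as an example digest, and each dark archive is assumed to report the digest of its own copy.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 digest of a copy, as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def matches_dark_copies(public_copy: bytes, dark_digests: list[str]) -> bool:
    """True only if at least one dark archive reported a digest and every
    reported digest equals the digest of the publicly accessible copy."""
    public = digest(public_copy)
    return len(dark_digests) > 0 and all(d == public for d in dark_digests)


# Example: two dark archives each hold an unaltered copy of the original.
original = b"Minutes of the board meeting"
reported = [digest(original), digest(original)]

unaltered_ok = matches_dark_copies(original, reported)
altered_ok = matches_dark_copies(original + b" (amended)", reported)
```

Agreement shows only that the copies are identical to one another. It says nothing if the dark copies themselves were altered, which is why their integrity needs extraordinary protection.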
Is the TDO method correct and complete as described? I think so, but don't know. Repeated invitations to challenge its methodology have induced no plausible criticism. Is there some better method for validating object authenticity than the TDO method? None has been proposed.
In January, the Planets project[16] announced a set of LDP tools to be made available, including:
· The Planets Preservation Planning Tool (Plato), to help organizations move from requirements assessment to action planning;
· Two emulators, Dioscuri for simulating a practical computer environment and a Universal Virtual Computer (UVC)[17] for environment independent information representation;
· A Preservation Characterisation Registry to identify characteristics of digital materials that are candidates for LDP;
· The XCEL significant property extraction tool working on text, image, sound and some other formats;
· A testbed, which is a controlled software environment for digital preservation experiments; and
· A Planets Interoperability Framework for integrating Planets tools and services into a preservation system. This is extensible to integration of third party tools and services.
The
Digital preservation literature has paid too little attention to content-addressed storage technology (CAS). CAS platforms are disk-based, object-oriented storage systems designed for the long-term retention of data that is not intended to be changed.
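The central idea of content addressing, that an object's address is derived from a digest of its content, can be sketched in a few lines of Python. This is a toy in-memory illustration, not a description of any CAS product. The class name `ContentAddressedStore` is hypothetical, and SHA-256 is an assumed digest choice.

```python
import hashlib


class ContentAddressedStore:
    """Toy in-memory store whose addresses are digests of the stored content."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        """Store the content and return its address (SHA-256 hex digest)."""
        address = hashlib.sha256(data).hexdigest()
        # Identical content always maps to the same address,
        # so storing it a second time changes nothing.
        self._blobs.setdefault(address, data)
        return address

    def get(self, address: str) -> bytes:
        """Return the content, checking that it still matches its address."""
        data = self._blobs[address]
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("stored content no longer matches its address")
        return data


store = ContentAddressedStore()
address = store.put(b"scanned page 17")
page = store.get(address)
```

Because the address is a function of the content, any change to a stored object is detectable on retrieval, which suits data that is not intended to be changed.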
LDP cost considerations should include on-going data center costs associated with power and cooling. An EPA report on data center energy usage observes that data storage devices contribute the highest power consumption growth rate and the highest overall power consumption. Richard Moore’s 13th slide of a
Lavoie’s The Fifth Blackbird[6] provides hints about the agenda and likely outcomes of the Blue Ribbon Task Force on Sustainable Digital Preservation and Access (BRTF). When I first read it, my reaction was positive. This optimism faded as I re-read it and discovered missing ideas.
· The Fifth Blackbird portrays LDP economics as a funding problem, but suggests no solution. The difficulties seem tiny compared to concerns at the beginning of the Great Depression. Keynes’ 1930 reaction discussed knowledge as capital and emphasized cost-reducing technology.[19] Couldn’t the BRTF seek technical ideas that make its funding concerns fade to insignificance?
· Which of the archiving partitions will be the primary foci of the BRTF? Which work components are viewed as most costly? What is the relationship between cost issues and funding strategies?
· The Fifth Blackbird strains to separate economic from technical issues, and consequently pays too little attention to technology’s potential for mitigating challenges. Engineers, particularly those working in the for-profit sector, vigorously seek cost reduction; continued rapid progress will change content management immensely. Innovations have often changed people’s roles, sometimes even eliminating professions. When did you last talk to a stenographer?