Digital Document Quarterly
Perspectives on Trustworthy Information
Volume 3, Number 4, 4Q2004
HMG Consulting
© 2004, H.M. Gladney ISSN: 1547-8610
In most DDQ numbers, I have emphasized objective over subjective comments, identified the latter as such, and suggested objective reasons for their occurrence. In the current number, I have not always followed this style, so that some opinions are not obscured by lengthy justifications.
Readers’ opinions would be welcome, particularly contrary opinions accompanied by objective hints of their merits.
I have seldom seen ‘creeping democracy’ used to describe what is happening in the availability, usage, and generation of information. Yet it strikes me as an apt descriptor of transforming opportunities for anyone who chooses to grasp them.
The number of people who read, execute, create, and update information is large and growing. It is larger than ever before both absolutely and as a population fraction. This is only partly because of the amazing decrease in information costs.[1] It is also because citizens are better educated and have more time for discretionary activities than ever before.
Resources for whose effective use many people have needed the help of specialists are increasingly accessible to almost anyone.[2] Technology and economics are changing the roles and methods of most knowledge workers and of all enterprises. End users’ dependencies on professional mediation will continue to decrease.
Even personal digital libraries will become practical within a decade.[3]
Consider a different digital repository structure[4]—different than the one receiving most published attention—enabled by Internet pervasiveness, by massive affordable storage, and by software for file sharing without central servers or databases, software for a small group’s digital library that is more attuned to its needs than a university library,[5] and software for automatic bit-string replication.[6] A group library package that combined these technologies with rule-guided management routines could be deployed over thousands of small computers to replace the repository services of archival institutions.[7] Each group would specify the access and replication rules for the digital objects that it provides. Little further human administration would be needed.[8]
Progress towards a digital commons seems economically inevitable, because of the growth of the pool of skilled and affluent volunteers,[9] and because of the nature of technology development—particularly of software development. What we mean by ‘the nature of software development’ can be understood with the assistance of a technology layering illustration. At any moment, deployed digital technologies consist of layers from the most basic and most general services to an application-specific layer. What software engineers do in support of knowledge workers can be characterized as: (1) identifying certain human actions as “merely clerical”; (2) choosing a frequently used subset for automation; (3) generalizing this to be broadly useful; and (4) implementing the generalization as a new software layer.
Although the broad areas of progress seem obvious, the specifics are difficult to predict. From a distant perspective the change process appears chaotic. Many attempts are unsuccessful, not so much because of technical flaws as because the implemented offerings do not appeal to large numbers of users. This is partly because better offerings appear at roughly the same time. The whole process is subject to a sort of Darwinian selection.[10] Continuing rapid progress is likely.
Progress towards a digital commons has been accelerating. Perhaps this is because key economic thresholds have been attained: an immense Internet and millions of WWW users, hard disk drives so inexpensive that it no longer much matters how much storage space is required, a good nucleus of publicly available information digitally represented, and so on. 2004 developments that probably introduce massive imminent changes include:
(1) Emerging competition in inexpensive search services, which are extending to desktop search and services tailored for particular communities (such as Google for scholars). Probably only a fraction of known search techniques have been exploited (the ACM Special Interest Group on Information Retrieval has been active for a quarter century, and its technical literature is huge).
(2) Massive collections of content not encumbered by intellectual property constraints, particularly the Internet Archive collection and Google’s recently announced book digitization project.
(3) Open Courseware offered by the Massachusetts Institute of Technology.
(4) Advocacy group activity; see the Center for the Digital Future and the Creative Commons.
(5) Inexpensive streaming media services for news and for music.[11]
(6) Beginnings of “grass roots” news services; see WikiNews. Dan Gillmor, technology writer for the San Jose Mercury News, recently published on the theme.[12]
(7) Increasing efforts to make running servers “as easy as turning on a faucet”.
Internet activities are not decreasing people’s enthusiasm for traditional libraries. Our town library is probably typical; serving about 30,000 people, it seems to be occupied by 50-100 patrons at any time, and twice that number after school hours. The SJ Mercury News published the following 2002-3 statistics for the
· 5.4 million patron visits—more than the combined attendance at San Francisco Giants’ and Oakland As’ baseball games.
· 13.5 million loans, nearly triple the number of 1994-95.
· 2.1 million holdings purchased since 1994-95.
· 400 computers with Internet access available to the public.
Free availability of digital content continues to be resisted by the music and film industries. Little of what they are fighting to protect seems attractive to me. I would prefer a different topical area to try such serious issues.[13]
The growing digital document flood exacerbates a reader’s challenge: separating wheat from chaff. We have not seen, but would welcome, tools that automatically create a quality measure for each tested document. Such tools would have to work with whatever document forms impinge on prospective readers and be responsive to each reader’s personal rules defining quality estimates. Although semantic judgement is mostly beyond what automatic tools can accomplish, much is possible by analysis of the content and accompanying metadata.[14]
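As an illustration only, the idea of reader-defined quality rules applied to content and metadata might be sketched as follows. Everything here is an assumption made for the example: the `Document` shape, the metadata keys (`author`, `refereed`), the three rules, and the weights are hypothetical, and a real tool would need far richer analysis.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Document:
    """Hypothetical stand-in for a tested document and its metadata."""
    text: str
    metadata: dict = field(default_factory=dict)


# A rule is any test a reader cares about; it passes or fails for a document.
Rule = Callable[[Document], bool]


def has_named_author(doc: Document) -> bool:
    return bool(doc.metadata.get("author"))


def cites_sources(doc: Document) -> bool:
    # Crude content analysis: look for bracketed numeric citations such as [12].
    return re.search(r"\[\d+\]", doc.text) is not None


def is_refereed(doc: Document) -> bool:
    return doc.metadata.get("refereed") is True


def quality_estimate(doc: Document, weighted_rules: List[Tuple[Rule, float]]) -> float:
    """Fraction of the reader's total rule weight that the document earns (0.0 to 1.0)."""
    total = sum(weight for _, weight in weighted_rules)
    if total <= 0:
        raise ValueError("rule weights must sum to a positive number")
    earned = sum(weight for rule, weight in weighted_rules if rule(doc))
    return earned / total


# One reader's personal rules; another reader could supply a different list.
reader_rules = [(has_named_author, 1.0), (cites_sources, 1.0), (is_refereed, 2.0)]

preprint = Document("Storage costs fall steadily [1].", {"author": "A. Writer"})
anonymous_note = Document("Trust me.", {})

preprint_score = quality_estimate(preprint, reader_rules)
anonymous_score = quality_estimate(anonymous_note, reader_rules)
```

The rules are kept separate from the scoring function so that each reader can substitute a personal list without touching the tool itself.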
We need not speculate further. We believe imminent changes inevitable. Archival institutions would serve their interests best by confronting the possibilities squarely and participating in molding the future.
Much effort has been expended on metadata schema for describing library and archives records.[15] This effort is called into question by Dick Bulterman’s Is It Time for a Moratorium on Metadata? (IEEE Multimedia 5(12), Dec. 2004.)
Bulterman’s point is that the effort is not matched by use of the schemes defined. Since I have been chastised by ACM referees for inadequately citing descriptive metadata literature, this article was particularly amusing. I believe that even non-specialists will find it both instructive and entertaining.
I hasten to remind readers that metadata might be needed for purposes beyond creating search indices, including information required for digital preservation.
Much digital content that claims copyright protection is also candidate for long-term preservation. The content for copyright protection has been discussed by Nimmer[16]: if a work has been represented in tangible form, copyright protects the abstract pattern represented.
This much is enough in principle, but not in practice if the content or ownership of a copyright is contested in litigation. The content issue is the distinction between the pattern and accidental information that is part of the published instance; we are working towards an article analyzing this issue to suggest copyright owners’ measures.[17] The ownership issue is evidence—an audit trail that has been reliably protected against misrepresentation. This can be handled by metadata exploiting the following notions.[18]
(1) A digital representation models something other than itself. It models a pattern.
(2) Any model has both intentional features—the pattern—and accidental features. The copyright can be asserted to include the accidental features. The distinction between intentional and accidental is a matter of author's intent, which is undiscoverable by others except to the extent that the author articulates it, i.e., creates objective facts corresponding to what was subjective.[19]
(3) The pattern is eligible for copyright protection. It is protected in instances other than that in which it was fixed to establish the copyright claim. Copyright registration is not required.
(4) Schema are also models, and may themselves require further models that explain them by reminding readers how the words are used.[20] In modern jargon, the explications of schema are called "reference models" or "ontologies".
(5) Syntactic intentions can be conveyed with XML. Roughly one hundred XML schema definitions have been agreed on, e.g., MathML for mathematics and XBRL for business reporting. More are being considered for standardization.
(6) Semantic intentions can be conveyed by a knowledge management language. A prominent contender is RDF (Resource Description Framework). Linear RDF syntax looks like XML—you must look closely to distinguish one from the other.
(7) RDF segments can be embedded in XML documents, as can any bitstream.
(8) XML is today's wrapper of choice for creating "complete" bundles.
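Notions (5) through (8) can be made concrete with a small sketch: an XML wrapper that carries an embedded RDF description alongside an arbitrary bitstream. This is not a format proposed in DDQ. The `bundle` and `payload` element names, the use of Dublin Core properties, the base64 encoding, and the `urn:example:` identifier are all assumptions chosen for illustration; only the RDF and Dublin Core namespace URIs are standard.

```python
import base64
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("dc", DC_NS)


def make_bundle(payload: bytes, about: str, title: str, creator: str) -> ET.Element:
    """Wrap a bitstream and an RDF description of it in one XML element."""
    bundle = ET.Element("bundle")  # hypothetical wrapper element

    # Semantic intentions: an RDF segment embedded in the XML document.
    rdf = ET.SubElement(bundle, f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(rdf, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": about})
    ET.SubElement(desc, f"{{{DC_NS}}}title").text = title
    ET.SubElement(desc, f"{{{DC_NS}}}creator").text = creator

    # Any bitstream can ride along once it is encoded as text.
    content = ET.SubElement(bundle, "payload", {"encoding": "base64"})
    content.text = base64.b64encode(payload).decode("ascii")
    return bundle


def read_payload(bundle: ET.Element) -> bytes:
    """Recover the original bitstream from a bundle."""
    return base64.b64decode(bundle.find("payload").text)


original = b"\x00\x01 any bitstream"
bundle = make_bundle(original, "urn:example:doc-1", "Sample record", "A. Author")

# Serialize, then parse again, as a recipient of the bundle would.
xml_text = ET.tostring(bundle, encoding="unicode")
restored = ET.fromstring(xml_text)
restored_title = restored.find(
    "rdf:RDF/rdf:Description/dc:title", {"rdf": RDF_NS, "dc": DC_NS}
).text
```

Printing `xml_text` shows why the linear RDF syntax is easy to mistake for ordinary XML: the description is just namespaced elements nested inside the wrapper.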
Presentation slides are available from an October 2004 workshop intended “to move forward archival and records management theory and find innovative ways to further develop fundamental principles of both disciplines.” Ken Thibodeau provided a current view of NARA digital archiving activities and thinking, and Seamus Ross provided the European cultural counterpart. Slides on similar topics are available from a U.K. forum in the same month.
Information about the DSpace digital repository can be had from the MIT Libraries. It seems to be geared for large educational institutions. In contrast, the reader interested in digital libraries for small groups might find Greenstone from New Zealand’s University of Waikato interesting.
The reader might find it instructive to compare high-level structural depictions of digital library software—a picture originating in our 1993 IBM Digital Library design[21] extended to accommodate LOCKSS-style replication, a recent DSpace picture, and other similar pictures—asking what their similarities and differences teach.
An accompanying Framework document relates a range of standards and best practice guidelines in all aspects of record keeping. TNA intends to implement both specifications in 2005, and solicits comments. The documents are available via The National Archives website.
… putting effective large-scale systems to actually carry out digital preservation activities. This means attention to social, economic, legal, and organizational as well as technical aspects of the digital preservation problem. Over the years, there has been a lot of focus on a magic bullet technology for digital preservation. Personally, I don’t believe one exists. We’ve seen various proposals on magic bullets, e.g., inscribing information on nickel-based storage that can be read in 10,000 years. It doesn’t get us very far when talking in terms of complex interactions and enormous databases. Emulation isn’t a magic bullet either—though I think it’s a useful tool in the toolbox of digital preservation techniques and technologies. [Lynch, RLG DigiNews 8(4), Aug. 2004]
The passage seems to imply that, because no single technique has been offered to provide demonstrably sound and complete digital preservation methodology, no comprehensive solution exists. The latter is simply incorrect.
As suggested by “toolbox of digital … technologies”, normal technical methodology begins by partitioning a challenge into pieces whose solutions can easily be combined. Hopefully, a few pieces will suffice. Apart from work alluded to in DDQ, no complete toolbox has been proposed.
The digital preservation literature is mostly from authors associated with research libraries and archives. As prior numbers of DDQ have discussed, this literature seems to assume that a collection of appropriately managed institutional repositories will solve the problem. Several years of discussion have not produced a viable suggestion of how that can be accomplished.
This institutional theme reappears in the October 2004 DPC Forum speakers’ comments. Anybody who believes the preservation solution is to be found in a method of managing digital repositories might profitably consider the following propositions and, for each that (s)he judges true, what it implies:
(1) Digital repository service is a different challenge than digital preservation.[22]
(2) Few computer scientists have been persuaded to think through digital preservation.
(3) The cultural heritage community has failed to work across disciplinary boundaries.[23] The "not invented here" syndrome is rife.
(4) Research libraries and archival institutions will not be leaders in determining their own digital futures unless they achieve significant prior changes in their internal attitudes, skills, and methods.
(5) The effective cost of deployed digital technology will continue to decrease exponentially, e.g., ~28% annually for persistent storage space.[24] Personnel costs will continue to increase.
(6) Implementing a digital preservation solution (such as TDO methodology) within existing information infrastructure can make non-technical problems (social and organizational) vanish and be seen not to have been problems at all![25] This is feasible without disrupting existing digital repositories. Much effort and money can be saved by eliminating certain current activities.
(7) LOCKSS (from Stanford) and Silverback/Tapestry
(from
(8) The boundary of a collection is a subjective choice. Any information collection will contain references to information that is not part of the collection.
(9) Correct rendering (for human consumption) of a collection member is likely to be dependent upon the correctness of other information objects, some of which might not be in the collection. Even if an object is protected so that its bitstream source is known to be authentic, changes in the objects on which its rendering depends might mislead its human user. For sensitive objects, this poses a security risk.
In early 2004, the NDIIPP managers requested comments on the Version 0.2 updates to the NDIIPP Technical Architecture (NTA hereafter—the Preliminary Architectural Proposal that was Appendix 9 of the 2003 NDIIPP Plan document).[26] Since I thought they would prefer a private discussion over a public one, on 19th May I wrote to Martha Anderson at the Library of Congress, with a copy to Laura Campbell. More than a dozen attempts to talk to one or the other have been ignored—not behavior appropriate for public officials spending your, and my, tax dollars.
Since I believe the criticisms in this letter should be acted on, but are being ignored, the next paragraphs reproduce the core of the May letter.[27]
The NTA documents are without basic formal qualities that software engineers expect. They are over a decade out of date. They attempt design. What little they specify has been incorporated in commercial software offerings since 1993, and Open Source software offerings since about 2000. Such offerings provide everything NTA calls for, and much more essential to repository institutions. Furthermore, the v.0.2 document is a step backwards; instead of pushing towards standards and conventions needed for inter-institutional collaboration, it retracts parts of what [the] 2003 Appendix 9 called for. Specifically:
(1) An NDIIPP objective is to help “the various stakeholders to be able to collaborate on long-term digital preservation.” This can be achieved only through digital objects that share a form for interchange between institutions and between each institution and individual clients (both content submitters and library readers). However, the NTA documents give information interchange scant attention, focusing instead on repository structure that is important to each individual institution, but that institutions do not need to share.
They need to address format and protocol conventions that allow sharing without hampering each institution’s autonomy unduly. How to do this is known; specific details need to be worked out, but doing so is a routine software engineering exercise that includes negotiation of protocol and document representation standards.
(2) The NTA documents contain next to nothing addressing the needs of individual clients, the intended beneficiaries. They are almost silent about authors’ and readers’ needs.
(3) NTA is written as if no digital library technology existed. It ignores the extensive literature on digital preservation. It ignores progress on information interchange in the commercial world, progress that includes much that the cultural collection community will surely use to accomplish NDIIPP objectives. It even ignores standards development to which the Library is a major contributor (e.g., METS and MODS).
It is ironic that, representing the thinking of a professional community that works to preserve reference material for scholarship, NDIIPP publications seemingly make use of no prior work, as if no worthwhile prior work existed.
(4) NTA fails to distinguish between digital repository and digital preservation. The former topic is well developed, with software offerings that have been refined for about a decade. NTA should presume such offerings adequate for NDIIPP except for shortfalls that the plan specifically identifies. However, the NTA authors seem not to have looked at existing repository software.
(5) Avoiding the consequences of technological obsolescence and imperfect human (community) memory, that is, digital preservation, can and should be treated as a focal topic. However, NTA is almost silent about the preservation challenge! Proposals exist, but are ignored by NTA. Also ignored is a year-old EU/NSF study of digital preservation research needed, even though some of its authors are among the NDIIPP advisors …. How can an NDIIPP proposal ignore digital preservation?
(6) The NTA documents suggest that no statement of (technical) requirements has been written by the NDIIPP team, even though more than three years have elapsed since NDIIPP was funded! Parts of the v.0.2 update begin to read as such a statement of requirements, but they provide only a tiny fraction of what is needed.
(7) Software layering called for in the NTA documents is simplistic. More elaborate layering is essential. It is also provided in all content management software offerings that I know.
Layering is alluded to (without naming it) in section 3 (“Core Characteristics”) of the v.0.2 update in sentences that include “hopelessly bloated”. How to avoid bloating is known in routine software engineering practice. In early 2001, Deanna Marcum asked me to write an analysis of commercial know-how pertinent to NDIIPP. Delivered in August 2001, this report addressed layering and other NDIIPP technical needs better, in my opinion, than the NTA documents. I am disappointed that my work has been ignored with all the rest.
The foregoing list of NTA document weaknesses is incomplete, but should serve to convey why colleagues and I are deeply disappointed by the NDIIPP technical component. We cannot help but feel that the expertise of the software engineering community has not been effectively exploited. We do not know why this is the case, but feel that such omissions must be corrected if the NDIIPP is to be effective and to use public funds efficiently. [H.M. Gladney to NDIIPP managers, 19th May 2004]
As far as I know, the NDIIPP managers have neither refuted these criticisms nor acted to fix the alleged problems. Public comment by DDQ readers might help persuade them to make desirable improvements.
A recent New York Times front page article, Even Digital Memories Can Fade (10 November 2004), erroneously asserted that “[t]he problem of preserving digital photos and other electronic records for future decades confounds even the experts.” The false assertion, “no one has figured out how to preserve these electronic materials for the next decade, much less for the ages” is echoed by over 400 Web search “hits” (as of 28th December), even though we have published a technical solution. The columnist, Ms. Katie Hafner, cited only three Washington Beltway insiders and three apparent amateurs in the topic. My letters, first to Hafner and then to the NYT editors, have been ignored, perhaps because discussion of a problem attracts readers more than that of a solution!
In addition to Raymond Lorie’s publications that began in 2001, a paper in ACM Trans. Info. Sys. in July, and half a dozen readily available preprints, three more papers have been submitted to periodicals with high refereeing standards.[28] Trustworthy 100-Year Digital Objects: Durable Encoding for When It's Too Late to Ask (joint with Raymond Lorie) has passed a first round of criticism by ACM Transactions on Information Systems referees; a slightly amended version was sent to the editor in October. The Koninklijke Bibliotheek (The Netherlands) has deployed a pilot of the virtual machine technology this article communicates.[29]
Preserving Digital Records: A Method Guided by Scientific Philosophy was submitted to Archivaria in November, after its editor rejected a prior submission because it was more technical than Archivaria readers were accustomed to. The new version is written for professional archivists and research librarians, and pays special attention to issues raised in prior Archivaria issues.
Principles for Digital Preservation, submitted to the Communications of the ACM (the periodical that reaches all ACM members) in November, has survived a first round of critical reviews.[30] This overview of TDO methodology had to conform to strict Comm. ACM limits: not more than 3000 words with 12 citations—not easy for a topic with many important details.[31]
DDQ has invited critiques of this work for about 18 months. No substantial technical problems have been identified to us.
The next step is to build a prototype of client workstation tools that ordinary users[32] will find convenient for packaging TDOs sealed with authenticity certificates, for extracting TDO payloads, and for inspecting evidence contained in TDO certificates and in the digital objects cited by the TDO.
There is little doubt that everything needed is technically feasible. Prototyping will have two objectives: (1) demonstrating that the technical complexities can be hidden so that ordinary users find the package convenient, and (2) as a step towards pilot installations. Since creating such a prototype will be more than a single person can do in a reasonable time, we have applied for funding for part of the work.
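To give a feel for the three user operations named above (packaging with a seal, extracting the payload, and inspecting evidence), here is a deliberately simplified sketch. It is not the TDO format and not the planned prototype. It assumes a JSON package, and it uses a shared-key HMAC from the Python standard library as a stand-in for the public-key authenticity certificates that the TDO papers describe; all function and field names are hypothetical.

```python
import base64
import hashlib
import hmac
import json


def _canonical(record: dict) -> bytes:
    # Sorted keys give a stable byte form for this simple record.
    return json.dumps(record, sort_keys=True).encode("utf-8")


def seal(payload: bytes, provenance: dict, key: bytes) -> str:
    """Package a payload with provenance evidence and a seal over both."""
    record = {
        "payload": base64.b64encode(payload).decode("ascii"),
        # Recorded so that a reader can inspect and recompute it later.
        "payload_sha256": hashlib.sha256(payload).hexdigest(),
        "provenance": provenance,
    }
    # ASSUMPTION: HMAC stands in for a certificate-based digital signature.
    tag = hmac.new(key, _canonical(record), hashlib.sha256).hexdigest()
    return json.dumps({"record": record, "seal": tag}, sort_keys=True)


def verify_seal(package: str, key: bytes) -> bool:
    """Check that neither the payload nor the provenance has been altered."""
    outer = json.loads(package)
    expected = hmac.new(key, _canonical(outer["record"]), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, outer["seal"])


def extract_payload(package: str, key: bytes) -> bytes:
    """Return the payload, refusing to do so if the seal does not verify."""
    if not verify_seal(package, key):
        raise ValueError("seal does not match the record")
    record = json.loads(package)["record"]
    return base64.b64decode(record["payload"])


demo_key = b"demo-key"
package = seal(b"record bytes", {"creator": "A. Author"}, demo_key)

# Falsifying the provenance invalidates the seal.
tampered = package.replace("A. Author", "B. Forger")
```

A shared key only shows the mechanics; with it, anyone able to verify a seal could also forge one, which is why a real design would rely on public-key signatures and certificates.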
Reviewing the Trustworthy Digital Objects papers in retrospect suggests that their scope and constraints are nowhere stated as clearly and concisely as readers might want. The limitations include:
(1) TDO methodology addresses only the technical portions of digital preservation requirements.
(2) TDO methodology focuses on the most difficult anticipated cases[33] for which preservation might be wanted—file types for which perfect rendering is most difficult (probably computer programs) and records for which chicanery (record or provenance falsification) is tempting and can create immense risks for legitimate users. For relatively simple file types and for records not associated with large risks, other mechanisms than those we describe might be more economical.[34]
(3) TDO methodology begins with the observation that good repository software offerings have existed for some years. Some are almost adequate for their obvious role as part of any digital preservation solution, needing at most small extensions for long-term content.[35]
(4) In keeping with (3), TDO methodology deals only with methods for ensuring that bit-strings survive forever, that files remain useful forever, that eventual readers can test document authenticity, that ordinary people can create and can use durable records, and that the preservation toolkit can be installed and used without disrupting other software that people have chosen.
We emphasize that the TDO core is a representation and packaging methodology for digital objects, in contrast to most attempts towards a preservation solution.[36] We believe enabling long-lived repositories to be only a secondary objective—a means towards the proper objective, preserving digital objects.
The July - December 2004 issue of the DPC/PADI’s What's New in Digital Preservation bulletin reports recent work. It should be consulted by anyone working in the area.
BusinessWeek reports collaboration to catch phishers.[37] Big enterprises from Citibank to AOL will share data about ID-robbing cyberscams and boost government efforts to catch the malefactors.
Paul Horn, IBM Director of Research, has collaborated in a Physics Today article discussing physics research in support of information technology.[38] It suggests reasons why I am confident that we will enjoy at least another decade of amazing price/performance improvements in digital devices.
Any serious student of 20th-century philosophy should look at Alberto Coffa’s The Semantic Tradition from Kant to Carnap (Cambridge University Press, 1991). For those who have already read Wittgenstein’s work and logical empiricist tracts, it provides insight into the problems these authors confronted.
In contrast, Bryan Magee’s Talking Philosophy: Dialogues with Fifteen Leading Philosophers (Oxford UP and BBC, 1978) has broad appeal, particularly to readers not familiar with the topic but wanting an educated layman’s understanding of the issues and why these are universally important. This book originates in 1976 BBC television dialogues between Bryan Magee and fifteen outstanding thinkers. It makes even the most difficult ideas accessible to the general reader.
The DDQ 1(1) extrapolation of desktop disk prices from figures published by the New York Times suggests that $200 would buy 300-400 gigabytes in 2004. Recent offers such as a 120GB Western Digital DMA/100 HDD for $70 validate the 2 ½ year-old estimate. We expect its semi-logarithmic graph will project well for at least two more years, and that in 2007 you will be able to buy a persistent terabyte for $200!
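The arithmetic behind this check can be worked through in a few lines. The sketch below uses only figures quoted in this number: the 120GB for $70 offer, the $200 budget, and the roughly 28% annual decline in storage cost cited earlier. Treating 2004 to 2007 as exactly three years of steady exponential decline is a simplifying assumption, and the function names are illustrative.

```python
def gb_for_budget(budget_usd: float, offer_gb: float, offer_price_usd: float) -> float:
    """Capacity a budget buys at the per-gigabyte price implied by one offer."""
    return budget_usd * offer_gb / offer_price_usd


def project_capacity(gb_now: float, annual_price_decline: float, years: float) -> float:
    """Capacity the same budget buys later if the price per GB falls steadily."""
    return gb_now / (1.0 - annual_price_decline) ** years


def required_annual_decline(gb_now: float, gb_target: float, years: float) -> float:
    """Annual price decline needed for the budget to reach a target capacity."""
    return 1.0 - (gb_now / gb_target) ** (1.0 / years)


# 2004: what $200 buys at the quoted 120GB-for-$70 price (roughly 343 GB,
# inside the 300-400 GB range extrapolated in DDQ 1(1)).
gb_2004 = gb_for_budget(200, 120, 70)

# 2007: the same $200 at a 28% annual price decline (roughly 920 GB).
gb_2007 = project_capacity(gb_2004, 0.28, 3)

# Decline needed to reach a full terabyte (1000 GB) by 2007 (roughly 30% a year).
needed = required_annual_decline(gb_2004, 1000, 3)

print(f"2004: {gb_2004:.0f} GB, 2007: {gb_2007:.0f} GB, needed decline: {needed:.1%}")
```

On these assumptions the terabyte-for-$200 projection needs prices to fall slightly faster than 28% a year, which is well within the uncertainty of reading a trend line from a semi-logarithmic graph.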
Are you constantly short of screen space? Would you often like to invoke an application using an icon that is tucked behind an open window? Do you often hunt for Windows utilities? If so, you might like WINUSCON (Windows User Console), which collects application links into a tabbed pane that runs as an ordinary application. Download free from http://www.matirsoft.com.
I am trying out new software, TrustyFiles from RazorPop™, which promises a good single interface to multiple peer-to-peer file-sharing networks.
If you consult a supplier’s service technician, he is likely to ask about the PC configuration within which your problem occurred. Free tools provide complementary detailed information about the hardware and software of your PC. I’ve used BelArc Advisor for about two years; its output is elegant. The newer WinAudit includes specifics about the applications running when it was executed. You can view its audit report on screen as well as save it in text, web page, XML and spreadsheet formats.
I strongly recommend storing the outputs of these applications every few months, or when you know your PC configuration to have changed significantly, keeping the prior reports. This will cost only a few moments, and might save you significant time and aggravation if a reliability problem occurs.
Some corporations replace employees’ machines roughly every 4 years, because doing so is less expensive than maintaining old hardware and software. I recommend that home users do the same.
However, I just now recommend delaying any pending replacement until Intel’s new chipsets (e.g., Intel® 915G for mainstream users) become available in PCs at non-premium prices (perhaps in mid-2005). Any performance problems that you are experiencing (such