Digital Document Quarterly

Perspectives on Trustworthy Information

Volume 3, Number 2, 2Q2004

 

 

 

HMG Consulting

Saratoga, CA 95070

©  2004, H.M. Gladney

 

ISSN: 1547-8610

 

“It is with philosophy as with religion: men marvel at the absurdity of other people's tenets, while exactly parallel absurdities remain in their own, and the same man is unaffectedly astonished that words can be mistaken for things, who is treating other words as if they were things every time he opens his mouth to discuss.  No one … will deny that the mistaking of abstractions for realities pervaded speculation all through antiquity and the middle ages.  The mistake was generalized and systematized [by] Plato. The Aristotelians carried it on.  Essences, quiddities, virtues residing in things, were accepted as a bonâ fide explanation of phenomena.  Not only abstract qualities, but the concrete names of genera and species, were mistaken for objective existence.”  John Stuart Mill[1]

Abstract

The main section below, Topics in Digital Preservation of Knowledge, argues that the technical component of digital preservation research and development should focus on the design and management of digital objects ‘hardened’ for durability.  It leads to today’s most difficult conceptual question: “How can today’s information producers represent their output so that its eventual consumers might be able to understand the meanings that these producers intend to convey?”

To suggest how to proceed towards practical digital preservation, this section combines prior DDQ material with analyses of economic projections and technical trends.  For research into answering the difficult conceptual question, we believe that the soundest foundation is early 20th century theories of empirical knowledge.  We identify seminal works that collectively seem a sufficient basis—works by Wittgenstein, Carnap, Quine, Popper, and Nimmer.  We recommend specific ideas from these sources as starting points for research into preserving knowledge.

Basic Topics in Digital Preservation of Knowledge

Information scientists concerned with digital preservation seem to have focused on repository functionality and management.  In contrast, DDQ has consistently focused on preserving documents, partly because digital library technology is well understood and presents no conceptual preservation issues.  My colleagues and I believe the focus on repositories, rather than on preserved objects that repositories manage, to be misplaced and not in the best interests of archival institutions or their professional staffs.

Our architectural focus is driven by economic trends and deployed information network characteristics.  Since these apparently do not much influence most other digital preservation thinking, we sketch a subset below and suggest why they drive us in directions more fully articulated in our Trustworthy 100-Year Digital Objects (TDO) papers.  Since the Trusted Digital Repositories[2] and the [U.S.] National Digital Information Infrastructure and Preservation Program (NDIIPP) reports represent Digital Library Federation consensus and identify a funded technical plan,[3] we take these as the articulation of the focus we believe misplaced.

The reader might be surprised at the length and depth of our analysis of design imperatives sketched below.  Our care and the reader’s critical attention are mechanisms for avoiding, or discovering and repairing, systematic errors.  They are driven by (1) software engineering precepts suggesting that high quality is favored by spending relatively more time on design and less on implementation and deployment; (2) recognition that careful design is inexpensive compared to lifetime utilization costs; and (3) expectation that errors embedded in preserved digital objects might not be noticed by their eventual consumers and, if noticed, not be correctible by them.

We cannot be confident that our preservation method is sound without a profound understanding of what we mean by ‘knowledge’.  We need a thorough analysis of preservation objectives that includes an answer to, “Precisely what is it that creators want to communicate to future generations?”  This is not in the sense of selecting which documents we want to preserve, but rather in a sense suggested by Levy.[4]  The best basis we know for confidence in methodology is certain early 20th century analytical thinking.

Some Economic Trends and Technical Imperatives

"Prediction is difficult.  Especially about the future."  —attributed to Niels Bohr

We are in the midst of widespread changes[5] in how people interact with information, how it affects their lives, and how information will be managed in a networked world.  The information science literature about digital preservation pays less attention to economic factors and technical trends than to examining how current paper-based repository methods can be adapted to a digital world.

The shift of information search from library services to private sector services might be a harbinger for further disintermediation.[6]  For instance, academic faculty members and private individuals often provide superb information organization and deliver this directly to consumers.[7]  Librarians and library institutions might believe disintermediation undesirable—socially as well as for their professional and institutional futures.  If so, they need to start leading part of the information revolution rather than merely following.

Some trends are well known, at least in the sense that they are often mentioned in the business literature.[8]   Some whose strategic consequences bear thinking about are: [9]

·          The number of people with education, leisure, and interest in reading and writing is much larger than it has ever been, even as a fraction of the total population, and is growing.[10]

·          Many people younger than 30 years tend to be more comfortable with digital technology, and more skilled in its use, than most people 50 and older. The latter group includes most of the decision makers in libraries and archives.

·          Digital technology is becoming affordable in lesser-developed countries, some of whose people are becoming Internet users, particularly in China[11] and India.

·          The amount of digital information that might be preservation-worthy is growing rapidly.  The many estimates[12] suggest that the portion represented by research library collections is small and shrinking.

·          The number of contributors to information management and search technology is much greater than the digital staff of traditional libraries and archives, and growing.

·          The information services industry is changing rapidly to exploit the Internet[13] and to provide scaling to very large digital object collections.

·          Any document is potentially linked to many documents of other kinds.  We cannot partition the world’s collections into unconnected partial collections.  For instance, we can neither define an impervious boundary between cultural documents and business records, nor segregate picture collections from pure text files.

·          Every document contains references to other documents that are essential to its interpretation and provenance evidence.[14]  These references might not be explicit.[15]  We represent such references as citations, or links, or pointers.

·          The content of any collection or individual document is the result of some individuals’ subjective choices.  As a consequence of this and the prior points, there is no structural distinction between a document collection (a.k.a. ‘library’) and an individual document.[16]

·          The information quality and evidence of authenticity that people expect has increased steadily since early in the 20th century (when radio broadcasts and music recordings became popular).

Figure 1: How Much Storage will $200 Buy?  The 2004 point has been added to the NY Times chart.

·          Automation is now inexpensive compared to human labor.[17]  For instance, see Figure 1.  It is reasonable to plan a home computer with a terabyte of storage!

·          Information consumers, information producers, and information service providers will not change their tools to accommodate digital preservation, except for very modest upwards-compatible modifications.  The provider who plausibly promises the least disruption will win.

·          Many applicable technical specialties are highly refined, with their own extensive and deep literature, and active interest groups.  For instance, information retrieval is represented by ACM SIGIR.

Librarians have been thorough in investigating what history teaches about preservation.[18]  They might balance that by similar care in looking forward.

Some Preservation Requirements and Their Consequences

Archivists have more than once changed their collective opinion about what information representations are worth preserving.  Levy and O’Toole suggest that it is time for another change, without specifying the nature of the change precisely.[19]

Large-scale digital preservation will be affordable only if we automate every human processing step that can be replaced by a machine procedure.  However, we should not preclude any human intervention based on human judgement and values.

The literature suggests practical urgency because older digital content is already being lost.  A second urgency is that metadata—provenance information needed to convert a document into an archival record—are best created and packaged with the active participation of each document’s creator(s).

Protocol and data representation standards for information interchange are a key focal topic for digital preservation.[20]  At some level, all documents being interchanged must share structural schema.

To avoid troublesome ambiguities of reference, we must assign a unique reference name (a.k.a. ‘identifier’) to each digital object.  We often find it useful to assign more than one name to an object.[21]
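The naming rule can be sketched as a small data structure: each object receives one unique identifier, and any number of further names may be bound to it, provided that no name ever refers to two objects.  The following Python sketch is illustrative only; the class `NameRegistry`, its method names, and the alias string are our assumptions, not part of any cited standard, and random UUIDs are just one convenient way to mint unique names.

```python
import uuid
from typing import Dict


class NameRegistry:
    """Hypothetical registry mapping every known name to exactly one object."""

    def __init__(self) -> None:
        # Maps each name (primary identifier or alias) to an object identifier.
        self._object_for_name: Dict[str, str] = {}

    def new_object(self) -> str:
        """Mint a unique identifier and register it as the object's first name."""
        object_id = uuid.uuid4().urn
        self._object_for_name[object_id] = object_id
        return object_id

    def add_alias(self, alias: str, object_id: str) -> None:
        """Bind an additional name, refusing to rebind it to a different object."""
        if object_id not in self._object_for_name:
            raise KeyError(f"unknown object: {object_id}")
        existing = self._object_for_name.get(alias)
        if existing is not None and existing != object_id:
            raise ValueError(f"name already refers to another object: {alias}")
        self._object_for_name[alias] = object_id

    def resolve(self, name: str) -> str:
        """Return the object identifier that a name refers to."""
        return self._object_for_name[name]


registry = NameRegistry()
report = registry.new_object()
registry.add_alias("ddq-example-alias", report)  # hypothetical second name
print(registry.resolve("ddq-example-alias") == report)
```

Refusing to rebind an existing name is the design choice that keeps references unambiguous while still allowing several names per object.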

Integrating digital networks from lightly coupled components has been common for more than a decade.  For content management the accepted infrastructure components that need to accommodate long-term preservation include:


·          File storage management,

·          File replication,

·          Primary catalog management in relational databases (DBMSs),

·          Search index management,

·          Search engines,

·          File formats conforming to international standards,

·          Metadata conforming to international standards,

·          Access control and digital rights management services,

·          A document storage subsystem binding files and catalog records (see Figure 2), and

·          A document manager layer in which all local customization is implemented.


Figure 2: Relationships of components of a digital document repository.  In contrast to the usual usage of ‘trusted’ in “Trusted Digital Repositories”, the usage here is correct.[22]

System layering[23] is essential to partition technical responsibilities, to enable software porting across hardware and operating system platforms, and to permit customization wanted by different institutions and sometimes by individual users.  Figure 2 suggests some of the layering and some of the functional components of content management services.

OAIS permits differences between a ‘Submission Information Package (SIP)’ and its corresponding ‘Archival Information Package (AIP)’ and ‘Distribution Information Package (DIP)’.  However, as can be inferred from Figure 3, to ensure that the document representation that a consumer receives is independent of the network path by which it reaches him,[24] each DIP needs to be identical to its corresponding SIP.  Repository clients (producers and consumers) will not care how AIPs are represented.

Figure 3: Digital object paths from producer to consumer.  Copies of a particular object might reach the consumer by paths that he cannot control and that might be different from time to time.
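Whether a delivered package is identical to the one submitted can be tested mechanically by comparing message digests.  The following Python sketch is a minimal illustration, assuming that each package is available as a byte string; the function names and sample data are hypothetical, and the choice of SHA-256 is ours rather than anything OAIS prescribes.

```python
import hashlib


def package_digest(package_bytes: bytes) -> str:
    """Return the SHA-256 digest of a package as a hexadecimal string."""
    return hashlib.sha256(package_bytes).hexdigest()


def dip_matches_sip(sip_bytes: bytes, dip_bytes: bytes) -> bool:
    """True when the delivered package is bit-identical to the submitted one."""
    return package_digest(sip_bytes) == package_digest(dip_bytes)


# Hypothetical example: the same object delivered by two network paths.
sip = b"<object>preserved content</object>"
dip_via_path_a = bytes(sip)                              # unchanged copy
dip_via_path_b = sip.replace(b"preserved", b"altered")   # modified in transit

print(dip_matches_sip(sip, dip_via_path_a))  # True
print(dip_matches_sip(sip, dip_via_path_b))  # False
```

A digest comparison of this kind shows only that two copies agree; it says nothing about who produced the object, which is the subject of the key validation discussion later in this number.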

Repository institutions should work to encourage content producers to submit objects already packaged for preservation to share preservation costs, to exploit producers’ knowledge and competence, and to mitigate the challenges of scaling to large collections.

Preserving Knowledge

“Documents are talking things. … The brilliance of writing is the discovery of a way to make artifacts talk, coupled with the ability to hold that talk fixed, so that a fixed, stable message can be carried through space and time.  It is something that documents do well and people by and large don't.  It is not that we are incapable of performing in such a manner … but it is not of our essence to do so. Yet it is exactly of the essence of documents, a defining characteristic.”                                                               David Levy[25]

Figure 4: Data, information, knowledge, and understanding[26]

Assuming that what we want to preserve is knowledge, we might start by agreeing what we mean by ‘knowledge’.  Popper’s ‘World 3’ definition (see below) is particularly apt, and consistent with modern articulations such as that suggested by Figure 4.

Beyond that, what we know in principle about the technical parts of digital preservation includes:

·          How to protect information packages from being lost.

·          How to package information so that its eventual users can reliably test its trustworthiness.

·          How to encode information so that it can be rendered reliably.[27]  In this context, ‘rendering’ includes execution of computer programs.

An open engineering challenge is illustrated by word processor documents whose users want preservation of all possible renderings.  Saving ‘.doc’ (e.g.) files is not enough, since the renderings are articulated by vendor software that includes operating system components and other vendors’ device drivers.  Extracting and saving the necessary programs is made difficult by vendor secrecy.

The most difficult previously expressed digital preservation objective[28] is “ensuring that information consumers can read or otherwise use each preserved object as completely as its producers intended.”  Accomplishing this is, in principle, impossible for at least some data types.  A prudent revision of the challenge is, perhaps, “how can producers today represent preserved information to minimize each eventual consumer’s misunderstandings of what these producers intended to convey?”

What sound basis exists for choosing how to convey digital documents?  Arguably, the best available foundations for analysis are found in early 20th century thinking.  Provisionally, a sufficient selection is:

1.      Ludwig Wittgenstein’s Tractatus Logico-Philosophicus distinction between objective and syntactical concerns, on the one hand, and subjective and semantic concerns, on the other hand.[29]  His Philosophical Investigations teaches that every use of language (a word, a sentence, a report, a book) is comprehensible only in the context of innumerable other communications.[30]

2.      Rudolf Carnap’s The Logical Structure of the World[31], which starts with a pragmatic notion of ‘object’: 

“The word "object" is here always used in its widest sense, namely, for anything about which a statement can be made. Thus, among objects we count not only things, but also properties and classes, relations in extension and intension, states and events, what is actual as well as what is not.”  The Logical Structure of the World, §1.

Carnap grounds a small number of objective definitions in ostensive use of relations and outlines a construction method for articulating more complex objects.

3.      Karl Popper’s 1967 essay Knowledge: Subjective versus Objective,[32] which includes:

“… without taking the words `world' or `universe' too seriously, we may distinguish … first, the world of physical objects or of physical states; secondly, the world of states of consciousness, or of mental states, or perhaps of behavioural dispositions to act; and thirdly, the world of objective contents of thought, especially of scientific and poetic thoughts and of works of art. 

“… consider two thought experiments:

“Experiment (1).  All our machines and tools are destroyed, and all our subjective learning, including our subjective knowledge of machines and tools, and how to use them.  But libraries and our capacity to learn from them survive.  Clearly, after much suffering, our world may get going again.

“Experiment (2).  As before, machines and tools are destroyed, and our subjective learning, including our subjective knowledge of machines and tools, and how to use them.  But this time, all libraries are destroyed also, so that our capacity to learn from books becomes useless.

“If you think about these two experiments, the reality, significance, and degree of autonomy of world 3 (as well as its effects on worlds 1 and 2) may perhaps become a little clearer to you.  For in the second case there will be no re-emergence of our civilization for many millennia.”

4.      Willard Van Orman Quine’s Word and Object teaches how to map normal language usage to relatively unambiguous forms inspired by formal logic.

“According to an influential doctrine of Wittgenstein's, the task of philosophy is not to solve problems but to dissolve them by showing that there were really none there.  This doctrine has its limitations, but it aptly fits explication.  For when explication banishes a problem it does so by showing it to be in an important sense unreal; viz., in the sense of proceeding only from needless usages [of language].

“…

“It is ironical that those philosophers most influenced by Wittgenstein are largely the ones who most deplore the explications just now enumerated.  In steadfast laymanship they deplore them as departures from ordinary usage, failing to appreciate that it is precisely by showing how to circumvent the problematic parts of ordinary usage that we show the problems to be purely verbal.”  Word and Object, §53.

5.      David Nimmer’s Adams and Bits: of Jewish Kings and Copyrights[33] identifies what can be protected, and therefore much of what is worth preserving.

“News Item: Fire swept through the converted grain silo that Naomi Marra has called home …  Feared lost among the charred ruins is the last extant copy of her lyric ode, Ruthless Boaz.  … devotees hope that, following her many public declamations of the work, most or all of it may remain preserved in her memory.      Query: Is Ruthless Boaz still subject to statutory copyright protection?”

With this hypothetical case, Nimmer analyzes the protection of intangible value—patterns inherent in the reproductive instances of each document.[34]  The essential patterns of a document are those needed to allow it to be Levy's “talking thing”.[35]

For the purposes at hand, we need not read earlier than 1920.  Collectively, Wittgenstein, Carnap, and Quine acknowledged and progressed from the work of Immanuel Kant, Auguste Comte, Heinrich Hertz, Karl Weierstrass, Ernst Mach, Gottlob Frege, David Hilbert, and Bertrand Russell.  All later epistemological thinking was based on the work of these masters.

Current Topics in Digital Preservation

The treatment above emphasizes permanently significant aspects of long-term digital preservation.  It provides part of the reasoning that leads us to believe that the architecture described in our Trustworthy 100-Year Digital Objects work is forced by the existing information infrastructure and by end users’ needs.[36] 

What follows summarizes current activities of more than average interest.

Preserving Personal Pictures and Records

PC Magazine discusses “ways to ensure that the contents of your discs are readable down the road and how to set up a backup plan to keep your archives safe”.[37]  The guidelines tell the home computing enthusiast how to preserve digital photographs and personal data for 25 years or longer, i.e., at least until the results of current preservation research are embodied in practical and inexpensive products.

A second PC Magazine article, The Dead-Media Bogeyman, includes:

“… there has to be some concern over the long-term reliability of digital storage, with the recent overblown fears about disc rot—a perceived problem that harks back to the late 1970s and some bad pressings of laserdiscs.  In fact, we are witnessing a consolidation process resulting in more and more backups.  And because we tend to use music industry–type (CD and DVD) consumer standards, we will probably have playability for a hundred years or longer.  There is not as much dead media in the music industry; just consider that with the right equipment you can still play a 78-rpm record from 1904!” John C. Dvorak[38]

Such information calls into question alarmist expressions of the urgency of digital preservation.[39]

Progress in Information Interchange Standards

OASIS has identified “a common set of file formats … for free software on the desktop.”[40]  This complements the PRONOM service offered by the [U.K.] Public Record Office.

The presenters at METS Opening Day West and the RLG Members Forum on Metadata and Digital Repositories were persuasive that METS (the Metadata Encoding and Transmission Standard) is sufficiently advanced for institutional commitment.[41]

TASI (the Technical Advisory Service for Images) is providing a guide to metadata vocabularies that links to over 60 formal vocabularies with introductions to classifications, subject headings, and thesauri.

Validation of Old Cryptographic Public Keys

Our paper about making preserved information trustworthy[42] proposes basing any consumer’s authenticity testing of a preserved document on the authenticity of a cryptographic key.[43]  The paper also proposes a way by which a small number of public institutions can certify these keys so that such certifications are highly unlikely to be falsified.  Our method works because its execution is easily controlled administratively, because it is easy and inexpensive to apply, and because responsibilities are partitioned so that it would be against the interests of certifying institutions to permit fraud.

After the cited paper had been committed to press, we discovered that Waugh suggests two other methods of showing that a particular public key belonged to a particular signer at the time a preserved object was signed.[44]  In the first method, keys can be served by an institution that is trusted to safeguard such keys faithfully—a service that is much easier to provide and much less risky to end users than purported “trusted digital repository” service.

In the second method, a well-known publisher might use the same certification key-pair for many works.  The user interested in the authenticity of a work issued could check that its public key value is identical to that of a body of works from that publisher.  This is likely to be acceptable to a user who is satisfied by knowing that the work is truly from the particular source alleged.[45] 
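The second method amounts to checking that the public key accompanying a work is the same key that accompanies the publisher's other works.  The Python sketch below illustrates only that comparison, assuming that each work's public key has already been extracted as an encoded byte string; the function names and placeholder key values are hypothetical, and the sketch does not verify any signature.

```python
import hashlib
from typing import Sequence


def key_fingerprint(public_key_bytes: bytes) -> str:
    """Return a SHA-256 fingerprint of an encoded public key."""
    return hashlib.sha256(public_key_bytes).hexdigest()


def matches_publisher_corpus(work_key: bytes, corpus_keys: Sequence[bytes]) -> bool:
    """True when every key in the publisher's corpus equals the work's key."""
    if not corpus_keys:
        return False  # nothing to compare against, so no confidence is gained
    expected = key_fingerprint(work_key)
    return all(key_fingerprint(key) == expected for key in corpus_keys)


# Placeholder byte strings stand in for real encoded public keys.
publisher_key = b"hypothetical-publisher-public-key"
corpus = [publisher_key, publisher_key, publisher_key]

print(matches_publisher_corpus(publisher_key, corpus))      # True
print(matches_publisher_corpus(b"some-other-key", corpus))  # False
```

The larger and more widely distributed the body of works, the harder it would be for a forger to substitute a different key consistently, which is what makes the comparison persuasive.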

On a related topic, Boneh and Franklin have invented a way of choosing asymmetric keys whose public portion is easy to remember.

Good Scholarship Sadly Lacking

The accepted order in research and development has universities engaging earlier than industry, particularly when new ideas are needed, and sometimes when existing ideas are insufficiently tested for acceptable commercial risk levels.  Open-source digital repository software is emerging from universities about a decade after similar technology has had marketplace success[46]—suggesting that the opposite order is becoming common.

The following critique of a Humboldt University announcement illustrates a frequent failing—not considering prior work as part of a new project, especially if the prior work comes from a discipline other than that of the authors.  Taxpayers might ask why the Library of Congress, in its NDIIPP plan[47], seems to be proposing yet another digital repository project—one whose published initial plan is not only vague, but a decade out of date, since at least one similar, but more specific, design was published in 1993.[48]  It is disturbing that the NDIIPP plan makes little allusion to any prior technology.

A “Johnny-come-lately” Digital Repository Project

Many digital preservation R&D projects are wasteful, except perhaps as team learning exercises.  The text of a recent announcement illustrates typical weaknesses.  Humboldt Univ. (Berlin) has announced a “Center of Excellence for Trusted Digital Repositories (TDRs)” whose “goal … is the development and implementation of a Trusted Digital Repository based on the Reference Model for an Open Archival Information System (OAIS).”  Phrases from the announcement[49] raise questions:

(1)     “development and implementation of a TDR”.  What remedies are proposed for TDR weaknesses that DDQ communicated two years ago to the authors of the Trusted Digital Repository report?[50]

(2)     “based on the ‘Reference Model for an Open Archival Information System (OAIS)’” and “the OAIS model is widely accepted among managers of digital information as a basis for a technological solution of the addressed problems”.  OAIS defines a digital repository ontology, that is, language for communicating the identity of a topic of discussion, and is almost silent about architecture.[51]  Can one infer anything more from consensus about OAIS other than that its proponents are discussing the same topic?[52]

(3)     “Keeping data formats readable in the long run is a … yet unsolved problem.”  Much of this problem has been solved in principle,[53] but apparently the solution is not widely known or accepted.  Nor is the solution fully tested or demonstrated, partly because there has been little funding.

(4)     “reliable … TDRs are capable to solve the key issues of long-term preservation of digital information.”[54]  Who has demonstrated feasibility of a reliable TDR?

(5)     “market the system as well in the open source community as in the commercial field.”  In view of established commercial and open-source repository software offerings, what Humboldt-Center features are proposed as an incentive to potential customers?

(6)     “The data are long-term protected against loss, change or damage.”  What protection do TDRs offer against malfeasance or mistakes by repository employees?  How is this accomplished?  How can a user, a century from now, establish confidence that data received from the repository has not been improperly modified?[55]

(7)     “Storage and retrieval … can be done worldwide,” and “the repository can be distributed worldwide.  [Its] management is centralized to [minimize] administrative workload.”  Existing digital library and content management offerings provide this.  Some have done so for about a decade.  Hundreds of initiatives target making digital deliveries easier to use.  What does the Humboldt team propose that’s new?

News and Commentary

NDIIPP Events

The Library of Congress announced a joint project with Old Dominion Univ., Dept. of Computer Science; Johns Hopkins Univ., Sheridan Libraries; Stanford Univ. Libraries & Academic Information Resources[56]; and Harvard Univ. Library to explore strategies for the ingest and preservation of digital archives.  The Archive Ingest and Handling Test (AIHT) is to identify, document and disseminate working methods for preserving cultural materials, and to identify R&D topics.

Acting on behalf of NDIIPP, the National Science Foundation (NSF) has called for preservation R&D proposals in Digital Archiving and Long-Term Preservation (DIGARCH).  Related information can be found in an NSF-European Union report entitled Invest to Save: Report and Recommendation of the NSF-DELOS Working Group on Digital Archiving and Preservation, which was reported in DDQ 2(4).

British 19th Century Newspapers To Be On-line

The Higher Education Funding Council is providing £2M to digitize British 19th century newspapers for service from a Web site.

Reading Recommendations

Viktor Kraft’s The Vienna Circle[57]

“The Vienna Circle … led to a rebirth and reformation of positivism and empiricism.  Neo-positivism stands in the foreground of contemporary philosophy, especially in the anglo-saxon and scandinavian countries.  It may safely be said to be the most significant of serious philosophical movements in the period between the two world-wars.”  The Vienna Circle, Introduction

Kraft was a young Vienna Circle member who later succeeded to the philosophy chair that had been held by Ernst Mach and Moritz Schlick.  This book is an easy-to-read introduction to logical empiricism.

Willard Van Orman Quine’s Word and Object[58]

"The uniformity that unites us in communication and belief is a uniformity of resultant patterns overlying a chaotic subjective diversity of connections between words and experience.  Uniformity comes where it matters socially; hence rather in point of intersubjectively conspicuous circumstances of utterance than in point of privately conspicuous ones."  Word and Object, §2

The literary output of Quine, mathematician and Harvard Professor of Philosophy from 1956 to 2000, was prodigious.  Word and Object deals with ostensive knowledge and language as a phenomenon of social sharing.  The book, which uses little visible mathematics, seems readily accessible to anyone.

“…  We are accustomed daily to paraphrase our sentences under the stress or threat of failure of communication, and we can continue thus. … The purpose of the study is to bring the referential business of our language more clearly into view.

“Vagueness is a natural consequence of the basic mechanism of word learning.  The penumbral objects of a vague term are the objects whose similarity to ones for which the verbal response has been rewarded is relatively slight.  Or, the learning process being an implicit induction on the subject's part regarding society's usage, the penumbral cases are the cases for which that induction is most inconclusive for want of evidence. The evidence is not there to be gathered, society's members having themselves had to accept similarly fuzzy edges when they were learning.”  Word and Object, §26

Discussions of Open Source Software

People often prefer known risks to those associated with new territory.  Uncritical enthusiasm for open source software seems to be a contrary example.  This is partly conditioned by an unspoken assumption that what works well in personal workstations will also work well in information servers such as digital repository implementations—an assumption calling for critical caution.[59]  Some insight into the issues is provided by the Open Source Grows Up special issue of ACM Queue, whose articles include:

Is OS Right for You? (A Fictional Case Study): “Your team has added open source code to a key company project.  The players: a now-dead branch of the code tree, a teeny bug, and an irate CTO.”

Open Source to the Core: “Are you considering adding open source code to a project?  An open source guru Jordan Hubbard tells you what you need to know to get started.”

From IR to Search, and Beyond: “How is the evolution from information retrieval to text mining impacting the information workspace?”

There's No Such Thing as a Free (Software) Lunch: “A techie-cum-lawyer breaks down what everyone needs to know about open source licenses.”

Practical Matters

Competition in Office Application Software Packages

Less expensive alternatives to Microsoft Office™ are being refined.  See:

·          Eleven Tips for OpenOffice at http://linuxjournal.com/article.php?sid=7158

·          Corel Word Perfect 12 at http://eletters.wnn.ziffdavis.com/zd1/cts?d=75-178-1-1-497479-8161-1

·          Java Desktop 2 at http://eletters.wnn.ziffdavis.com/zd1/cts?d=75-178-1-1-497479-8164-1

·          A review comparing Office 2003 and OpenOffice.Org at http://eletters.eweek.com/zd1/cts?d=79-662-1-5-70562-76888-1

·          IBM’s Lotus Workplace strategy, with e-mail, word processing, spreadsheet and database applications aimed at business at http://ct.com.com/click?q=8f-OOYvQ8KIO_U7RjMCtdkvtiT3W7eR

·          A BusinessWeek review of Simdesk™.

Portable Document Scanner

Since the university library that provides my more arcane source material is an hour’s drive each way, and since part of its holdings cannot be borrowed, I’ve been using a digital camera to capture excerpts needing careful attention.  I’d like to convert the images to searchable text, but cannot currently do so because my camera (an Olympus E100-RS) produces 72-pel images, a resolution too low for ScanSoft Omnipage™.[60]

I therefore pay attention to promotions of portable scanners, such as the $200 DocuPen™.  However, its reviews report sensitivity to wand positioning, poor clarity, poor life for its expensive custom batteries, and exaggeration of its advertised storage capacity.  I’ll continue waiting for a suitable device.

Recent Web Browsers

I’ve been pleased with the Opera Internet Browser™ as a replacement for MS Internet Explorer™.[61]  Its $39 price was a good investment.  However, “‘better than’ is the enemy of ‘good enough’”.[62]  A March Infoworld review led me to Mozilla Firefox™.  I can attest that its “Why You Should Switch to Firefox” description is an honest report.

Since then I have seen strongly favorable reviews of both the MyIE2 browser and the Avant Browser.  Since their novel features are similar to those of Mozilla Firefox, I have not tried them.[63]  Just as this DDQ number was being completed, favorable Slim Browser reviews appeared.  Each of these browsers can be downloaded at no cost to users.

Home Computing Technology and Price Watch

In April, Business 2.0 reported that, due to recent increases in production capacity, the cost of thin liquid crystal displays had dropped approximately 40% since 2001, and projected that LCD display sales would surpass CRT display sales in 2004.  In May, BusinessWeek estimated that both DRAMs and liquid crystal displays would be in short supply throughout 2004.  In June, the Semiconductor Industry Group estimated that overall chip sales would increase 29% this year.[64]

We therefore believe that prices of main memory, other chip sets, and flat panel displays will not change much in 2004.  In contrast, best prices for hard disk drives (HDDs), for optical disk drives and media, and for wireless LAN link adapters have plummeted and may be expected to continue to drop.  Best buys in San Jose, illustrated below, are probably loss leaders.[65]

Item                   Product                           Price    Unit
HDD (internal)         Seagate 120 GB 7200rpm 8.5msec    $48      $0.40/GB
HDD (external USB)     Hitachi 200 GB 7200rpm 8.5msec    $75      $0.38/GB
CD-RW drive            Pacific Digital 52x24x52          $23      each
DVD-ROM drive          Brand unknown, 16x                $21      each
DVD-RW drive           HP 300i, 4x dual ±R/±RW           $65      each
DVD-R blank disks      Brand unknown                     $0.31    each
Wireless-G router      AirLink 54Mbps                    $43      each
Wireless-G PC card     AirLink 54Mbps                    $32      each
Wireless-B PC card     AirLink                           $21      each
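The per-gigabyte figures for the two hard disk drives can be checked with a few lines of arithmetic.  The sketch below is only an illustration: the function names are hypothetical, the prices and capacities are those listed for the Seagate and Hitachi drives, and the pre-tax calculation assumes that the 8.25% California sales tax mentioned in note [65] was applied to the whole listed price.

```python
# Illustrative check of the per-gigabyte prices quoted for the two HDDs.
# Function names are hypothetical; prices and capacities come from the list above.

SALES_TAX = 0.0825  # California sales tax rate cited in note [65]


def cost_per_gb(price_usd, capacity_gb):
    """Return the price per gigabyte in dollars."""
    return price_usd / capacity_gb


def pre_tax(price_usd, tax_rate=SALES_TAX):
    """Back out the pre-tax price, assuming tax was applied to the full amount."""
    return price_usd / (1 + tax_rate)


drives = {
    "Seagate 120 GB internal": (48.0, 120),
    "Hitachi 200 GB external USB": (75.0, 200),
}

for name, (price, capacity) in drives.items():
    print(f"{name}: ${cost_per_gb(price, capacity):.3f}/GB with tax, "
          f"${cost_per_gb(pre_tax(price), capacity):.3f}/GB before tax")
```

Dividing $48 by 120 gives $0.40 per gigabyte and dividing $75 by 200 gives $0.375, which rounds to the $0.38 quoted.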

 



[1]     John Stuart Mill, The Positive Philosophy of Auguste Comte, 1975.

[2]     Neil Beagrie, Meg Bellinger, Robin Dale, Marianne Doerr, Margaret Hedstrom, Maggie Jones, Anne Kenney, Catherine Lupovici, Kelly Russell, Colin Webb, and Deborah Woodyard, Trusted Digital Repositories: Attributes and Responsibilities, RLG-OCLC Report, May 2002.   See also H.M. Gladney, Critique: Attributes of a Trusted Digital Repository, October 2001.

[3]     Library of Congress, Preserving Our Digital Heritage: Plan for the National Digital Information Infrastructure and Preservation Program, 2003.  See especially its Appendix 9: Technical Architecture for the National Digital Information Infrastructure and Preservation Program (NDIIPP).

      See also Neil Beagrie, National Digital Preservation Initiatives: An Overview of Developments in Australia, France, the Netherlands, and the United Kingdom and of Related International Activity, CLIR pub116, 2003, and Council on Library and Information Resources and the Library of Congress, The State of Digital Preservation: An International Perspective, CLIR pub107, 2002.

[4]     David Levy, Heroic Measures: Reflections on the Possibility and Purpose of Digital Preservation, Proceedings of the Third ACM Conference on Digital Libraries, June 1998.  See particularly its Preserving What—and on What Basis?

[5]     William J. Mitchell, Alan S. Inouye, and Marjory S. Blumenthal, Editors. Beyond Productivity, Information Technology, Innovation, and Creativity, Nat. Acad. Press, Washington, 2003.  See also Perkings, John. Dawson, David. Geber, Kati. Beyond Productivity: Culture and Heritage Resources in the Digital Age, D-Lib Magazine 10(6), June 2004.

[6]     See, for instance, Norbert Lossau, Search Engine Technology and Digital Libraries: Libraries Need to Discover the Academic Internet, D-Lib Magazine 10(6), June 2004.  Also Stephen Arnold raises the possibility that today’s libraries could become marginalized in his Information boundaries and libraries, February 2004.

[7]     For instance, the most useful source I found for a quick study of Willard Quine’s work (see below) is the Web site provided by his son, Douglas.  Someone might argue that such sources lack long-term durability.  This problem is as likely to be solved by institutions other than academic institutions as by research libraries or archives.

[8]     See the Gartner Group predictions reported by TechRepublic on November 8, 2002.

      For a penetrating analysis of business world challenges, see Peter F. Drucker, Management Challenges for the 21st Century, Harper, New York, 1999.  ISBN: 0-88-730998-4

[9]     This list surely overlooks other important factors.  DDQ projections are not currently backed by known evidence or explicit justification.  Their intent is to stimulate thinking, rather than to report careful research and analysis.  Such limited intent would be inappropriate for a scholarly report, but is appropriate for a newsletter.

[10]    Hoffman, Donna L. Novak, Thomas P. Venkatesh, Alladi. Has The Internet Become Indispensable? Comm. ACM 47(7), 37-42, July 2004.  An easily overlooked component is the highly educated, formally retired population.  It is only recently that a significant population fraction was healthy and vigorous beyond age 65.

[11]    According to BusinessWeek (July 12, 2004, page 14), “If forecasts from investment bank Piper Jaffray hold, around 153 million Chinese will be online by 2006, and China will surpass the U.S. in Web users.”

[12]    People talk of “exponential information growth.”  See Tony Hey and Anne Trefethen, The data deluge, in Fran Berman, Geoffrey Fox and Tony Hey (eds.), Grid computing: making the global infrastructure a reality, Wiley, January 2003.  See also Philip Lord and Alison Macdonald, Data curation for e-Science in the UK: an audit to establish requirements for future curation and provision, JISC report, May 2003.

[13]    See a special section on the emerging infostructure in Communications of the ACM 46(10), Nov. 2003.  See also Research to the Rescue in eWeek, June 28, 2004.  One trend is the appearance of many forms of outsourcing.

[14]    In Nancy Brodie, Authenticity, Preservation and Access in Digital Collections, Preservation 2000, we find, “Technological tools of authentication such as encryption make preservation of authentic documents very difficult.  All the pieces of a public key infrastructure, encryption algorithms, encryption and decryption software, private keys, public keys, certificates, certificate authorities, etc. would have to be preserved along with the encrypted document.”   This and more are, in fact, forced for the more fundamental reason taught by Wittgenstein’s Philosophical Investigations.

      Brodie’s concern simply manifests that digital technology often forces its users to express levels of detail that they are not accustomed to providing.  Human beings accept an immense amount of vagueness in their communication.  (See Chapter 4 of Quine’s Word and Object.)  This is partly because they often have opportunity to inquire whenever a vagueness proves troublesome, which is not usually possible for computer programs, a fortiori for preserved digital objects.  The care needed with digital objects has a reward, frequently bringing to light confusions that have not been noticed, and requiring us to improve the precision of our speech.

[15]    In fact, in human communication, most references are implicit.  This is sometimes alluded to as ‘shared social context’.  How human natural language copes with references is carefully analyzed in Quine’s Word and Object.

[16]    This is not a statement about how any information entity is represented, but rather about its intended meaning.

      Carnap (loc. cit.) §11 teaches the precise meaning of ‘structure’.

[17]    Stephen Chapman, Counting the Costs of Digital Preservation: Is Repository Storage Affordable?  See also Ilkka Tuomi, The Lives and Death of Moore's Law, First Monday 7(11), November 2002.

[18]    Deanna Marcum and Amy Friedlander, Keepers of the Crumbling Culture: What Digital Preservation Can Learn from Library History, D-Lib Magazine 9(5), 2003.  See also Brodie (loc. cit.)

[19]    Levy (loc. cit.) includes: “Within the archival community, whose focus … has been on paper, microform, and [so on], the predominant answer to [“preserving what?”] has shifted over time.  [O'Toole, On the Idea of Permanence, The American Archivist 52(1), 10-25,1989.] notes that … in the early nineteenth century, archivists took their mission to be the preservation of the information contained in documents rather than the original documents themselves.    It was only … in the twentieth century, that advances in preservation theory and practice … made it possible … preserving … original materials.  The pendulum thus swung from … preserving the information content of documents to … preserving the artifacts themselves.  O'Toole suggests, however, that the current focus is inadequate and argues for a return to the earlier position …”

[20]    Having too many standards might present more difficulties than having too few.  Consider health services.  The National Alliance for Health Information Technology identifies more than 450 mandatory and voluntary standards, more than 200 organizations with standards working groups, and more than 900 standards publications!

[21]    Using different representations of a single information object is common; which representation is convenient generally depends on the application of the moment.  The same information is often transformed from one representation to another.  Whether one uses the same identifier for different representations is itself a subjective choice; for this reason, it is common to assign several identifiers to a single object.

      See Digital Resource Identifier (DRI) and Reliable References in Henry M. Gladney, Trustworthy 100-Year Digital Objects: Evidence After Every Witness is Dead, ACM Trans. Info. Sys. 22(3), 1-31, July 2004.

[22]    We can correctly call the remote repositories in this figure ‘trusted’, because each is trusted by the central (relative to this discussion) repository’s manager for the limited responsibility of holding and perhaps distributing replicas.  In contrast, see Trust, Trusted, Trustworthy in DDQ 1(2).

[23]    See, for instance, Figure 1 in DDQ 2(3).  Deployed layering is more complex than our figures suggest.  The overall layering includes several layers in each of communication services (TCP/IP), file managers, and database management subsystems.  We do not discuss these because we believe that it is neither affordable nor necessary to change any of this technology as part of providing for long-term preservation.

[24]    To understand why this is important, consider the case of two consumers, with the first receiving a work as part of his subscription to its publisher’s output and the second fetching the same work many years later from an archival repository.  If these readers then communicate about details of what they suppose to be the same work, they might be inconvenienced by differences in their copies.

[25]    David Levy, loc. cit., which is “a set of reflections on the meaning and possibility of preserving digital materials.”

[26]    From a presentation foil shared by Steve Griffin, National Science Foundation.

[27]    ‘Interpretation’ is ambiguous in discussions of digital document use.  We use ‘rendering’ for the transformation from digital representation to human-perceptible form, reserving ‘interpreting’ for the human conversion of stimuli to sensations, or for the conversion of visual and oral stimuli (Wittgenstein’s ‘pictures’) to meanings and concepts.

[28]    See Notes about Digital Preservation in DDQ 2(2).

[29]    See Engelmann’s island-ocean analogy in the How Can We Use Wittgenstein’s Philosophy section of DDQ 1(2).

[30]    Ludwig Wittgenstein, Philosophical Investigations, Bilingual Third Edition, Blackwell Publishers, Oxford, 2001.  ISBN 0-631-23127-7

[31]    Rudolf Carnap, The logical structure of the world; pseudoproblems in philosophy, translated by Rolf A. George, Univ. California Press, 1967.  ISBN 0-812-69523-2   Originally published in 1928 as Der Logische Aufbau der Welt.

[32]    David Miller (ed.), A Pocket Popper, Oxford U.P., 1983.  

[33]    David Nimmer, Adams and Bits: of Jewish Kings and Copyrights, 71 S. Cal. L. Rev. 219-245, 1998.  Also in the Copyright Society Journal 46(2), 1998.

[34]    This notion of what is essential is at the core of the 1990s lawsuits about “look and feel” of screen presentations by software, such as the Apple vs. Microsoft lawsuit of 1988-1994.  The Society of Motion Picture and Television Engineers calls this ‘the essence of the work’, and often alludes simply to ‘essence’. 

[35]    The word ‘essential’ is often misused by failing to be grounded in some definite purpose of identified individuals.  Here the purpose is implicit, viz., that the document speaks for its author(s) helping to convey intended meaning.  The question of usefulness of language is central to Wittgenstein’s Philosophical Investigations and re-addressed in the second half of Levy (loc. cit.).  The grammatical question of implied referents is dealt with carefully in Quine’s Word and Object.

[36]    This series of papers treats the needs of the ultimate end users, information consumers and information producers, as more important than the needs of digital repository personnel—a prioritization not evident in cited digital repository literature.

[37]    Leon Erlanger, Memories that Last, PC Magazine 62-63, Jan. 20, 2004.

[38]    John C. Dvorak, The Dead-Media Bogeyman, PC Magazine 77, July 13, 2004. 

[39]    For instance, see Kevin Schürer’s The Implications of Information Technology for the Future Study of History in Higgs, Edward (ed.) History and electronic artifacts, Oxford U.P., 1998.  ISBN 0-19-823633-6

[40]    Marco Fioretti, The OASIS Standard for Office Documents: How All Users and Developers Can Benefit, Linux Journal 119, 64-7, March 2004.

[41]    Until I try to use METS to prepare documents for preservation, I cannot be confident about this endorsement.  We plan to use it in a prototype implementation of the document packaging proposed in my Trustworthy 100-Year Digital Objects series.

[42]    Henry M. Gladney, Trustworthy 100-Year Digital Objects: Evidence After Every Witness is Dead, ACM Transactions on Information Systems 22(3), 1-31, July 2004.

[43]    For an explanation of what cryptographic keys and XML signatures provide (content integrity assurance, source authentication, and signer non-repudiation) and of the current state of XML signature tools, see Peter Thorsteinson and G. Ganesh, .NET Security and Cryptography, Prentice Hall, 2003.  ISBN: 0-131-00851-X

[44]    Andrew Waugh, On the use of digital signatures in the preservation of electronic objects, Proceedings of the DLM-Forum 2002 Workshop, @ccess and preservation of electronic information: best practices and solutions, 510-517, 2002.

[45]    As another example of why this makes sense, consider the outré case of someone who wants evidence that a certain play is by Shakespeare rather than by Marlowe.  Unless this reader is interested in the narrow historical question of whether the true author of Shakespeare’s plays was in fact Christopher Marlowe, nobody really cares about the connection of the plays to a particular collection of buried bones!

      See the Christopher Marlowe anagrams at William Shakespeare's burial place in Holy Trinity Church, Stratford-upon-Avon.

[46]    The first U.S. National Science Foundation funding of digital library projects occurred only about two years after IBM Digital Library (now renamed IBM Content Manager) was delivered to its first customers!

[47]    NDIIPP Technical Architecture - Update: Version 0.2 of the Technical Architecture for the National Digital Information Infrastructure and Preservation Program is available.  The Library is seeking feedback on this draft. 

[48]    A Storage Subsystem for Image and Records Management, IBM Systems Journal 32(3), 512-540, (1993) articulates the core design of today’s IBM Content Manager.  Since the first IBM Digital Library offering (as IBM Content Manager was originally known) appeared in 1991, eight subsequent releases provided enhancements to performance, scalability, reliability, and interfacing with pre-existing applications—topics that are hardly mentioned in competitive open-source promotions.

[49]    The following critique reacts to a 6th May JISC posting, which was more complete than the Humboldt web posting.  The project partners include Sun Microsystems Inc., the Austrian Literature Online Consortium, and XiCrypt GmbH.

      The Humboldt team has seen a prior version of this critique and is considering its reactions, but is not yet ready for them to be communicated.  (Private communication with Suzanne Dobratz.)

[50]    See Trusted Digital Repositories and Trust, Trusted, Trustworthy in DDQ 1(2).

[51]    See Use and Misuse of OAIS in DDQ 1(3).  We discussed this in 2002 with authors of OAIS.  Their leader, Don Sawyer, agreed with the DDQ 1(3) opinion.

[52]    See Consensus as an Impediment to Progress in DDQ 1(3).  There is insufficient evidence that the prevailing opinions are well founded.  Scientific history is full of examples of mistaken consensus.

[53]    See Raymond Lorie, The UVC: a Method for Preserving Digital Documents, in Koninklijke Bibliotheek Workshop on Digital Preservation: Technology & Policy, December 2002.   See also his A Methodology and System for Preserving Digital Data, JCDL 2002.  For up-to-date positioning and more technical detail, see a preprint, Trustworthy 100-Year Digital Objects: Durable Encoding for When It’s Too Late to Ask.

[54]    Several top-level requirements are identified by Notes about Digital Preservation in DDQ 2(2).

[55]    Risks to end users of cultural documents are mostly low.  However, how can we today know how future readers might use and depend on repository holdings?  The question has practical implications for cases in which somebody’s finances, health, or reputation could be impacted by his or someone else’s using flawed information.

      Answers must address the concern that the easy mutability of digital documents creates reliability exposures.  They must further address the expectation that the number of digital objects held in a repository will be much larger than the historical number of paper documents and that audit procedures (as suggested by the RLG report cited above) would need to discover which small number of digital holdings some unscrupulous or careless employee might have altered in the decades that the repository held them.

[56]    An on-line interview reports the Stanford Libraries’ book digitization project: “The [digitizer] … operates quickly and seamlessly and includes such neat touches as an air-blowing mechanism to separate pages that … stick together. The demonstration was impressive, as was … a Univ. librarian eager to digitize the knowledge of the past for the readers of the future.”

[57]    Viktor Kraft, The Vienna Circle: The Origin of Neo-Positivism; a Chapter in the History of Recent Philosophy, Philosophical Library, New York, 1950.

[58]    W. V. (Willard Van Orman) Quine, Word and object, Cambridge, Mass: M.I.T. Press, 1960.  ISBN 0-262-67001-1

[59]    This warning is stimulated by recent correspondence with a university project that understands risks associated with commercial software, but shows no evidence of understanding critical differences between seasoned commercial offerings and recent open source software.  For instance, most of the code of commercial data management software is provided to manage potential errors that include system failures, to allow different operating system platforms, and to provide scaling from very small to very large data collections.  These aspects are not always mentioned in software promotions.

      In the case at hand, my correspondent emphasized functionality in the upper levels of software layering, that is, tailoring for particular classes of installation.  (Figure 1 in DDQ 2(3) suggests the kind of layering I have in mind.)  Arguably, tailoring for institutions’ specialized needs and preferences is best handled in client software and in the upper layers of server software.  In fact, this is precisely why we provided the Document Storage Subsystem/Archival Storage layering (see Figure 1) in the very first version of IBM Digital Library™ in 1991.  It has been continued into today’s IBM Content Manager™.  See H.M. Gladney, A Storage Subsystem for Image and Records Management, IBM Systems Journal 32(3), 512-540, (1993).

[60]    Except for this weakness, I am delighted with the camera, which I purchased about three years ago.

[61]    Just as I was about to release this DDQ number, Why I’m Staying Away from Internet Explorer appeared in BusinessWeek, page 24, July 12, 2004.  The BusinessWeek e-mail newsletter introduced this with, “Until Microsoft proves it can fix IE's security bugs, you're better off using one of a few good alternatives as much as possible.”

[62]    A Google search for ‘better enemy of good’ yielded over 2,000,000 hits.  See an interesting one.

[63]    To bring a browser into productive service takes me about an hour’s tailoring.  To gauge its quality, I would have to use it actively for several weeks.

[64]    San Jose Mercury News, 10 June 2004.

[65]    The prices reported include California sales tax (8.25%).