Digital Document Quarterly

Perspectives on Trustworthy Information

Volume 4, Number 2, 2Q2004

 

 


 

HMG Consulting

Saratoga, CA 95070

©  2005, H.M. Gladney

 

ISSN: 1547-8610

 

Digital Preservation

Library Institutions vs. Preservation Technology?

“[T]echnology is rather easy.  Or more exactly, technology is the branch of human experience that people can learn with predictable results.  … a good many Englishmen have been skilled in mechanical crafts for half-a-dozen generations.  Somehow we've made ourselves believe that the whole of technology was a more or less incommunicable art.  It's true enough, we start with a certain advantage.  Not so much because of tradition, I think, as because all our children play with mechanical toys.  They are picking up pieces of applied science before they can read.”  C.P. Snow, 1959 [1]

The final sentence of a Library of Congress 2004 press release—“Librarian protects digital artifacts”—startled me: “As [Laura] Campbell often reminds herself and others, it is not technology that preserves important cultural and historical works. ‘Institutions preserve,’ she said.”

Institutions without technology?  And what about people?  Why not write, “Technology preserves” or “Money preserves” (referring to the $100M expenditure that Campbell is managing for the Library of Congress, with little evidence that her staff has examined what preservation technology offers)?

The rhetoric of “Institutions preserve” invites consideration of whether institutions like the big research libraries are, in fact, essential for digital preservation.[2] 

Consider the issues.  Who should select?  Who can package for preservation?  How will people find information?  What people want to preserve will include many more objects than traditional libraries can handle, and include works outside traditional library scopes.  Self-archiving has already begun on a significant scale.[3]  Tools to make packaging and durable descriptions[4] easy and convenient are likely to be much improved from what is already available.[5]  As with other information services, skills previously found only among trained librarians will be acquired by many computer users.[6]  An illustration of what is happening is provided by:

“A common set of file formats has the potential to be the most meaningful advancement for free software on the desktop.  

Desktop integration begins with documents, not with any toolkit or bundle of applications.  If files can be read and written by every application, users can communicate, work together and become integrated.    The significance of [the OASIS effort] to promote a file format … cannot be [over]estimated.  Free as in free formats is even more important than free software.  Only with them and the internal structuring that comes from XML can data be exchanged, with new or different programs without any need for converters, or be directly edited, indexed, analyzed and exchanged between heterogeneous groups or servers—like Web services without the hype.  Data will start belonging exclusively to end users.”          Fioretti, 2004 [7]

How can bitstrings be made safe against loss?  A modest extension of existing data grid tools and protocols would suffice.[8]  For instance, the rules portion of LOCKSS could be extended to seek accepting storage sites for replicas of documents for which preservation is wanted and to ensure that enough copies exist.  Such a mechanism can be devised to be insensitive to the demise of the originating computing environment by mimicking the propagation of micro-organisms.[9]  The copies would survive as long as the computing and network infrastructure survived.[10]  Their durability would be at least as good as can be provided by any institution of the kind that Campbell has in mind, and probably better.
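The replication rule described above can be sketched in a few lines.  This is only an illustration under stated assumptions: the function name ensure_replicas, the site names, and the minimum-copy threshold are hypothetical, and nothing here is the LOCKSS protocol or any data grid API.  It shows the one property that matters for the argument: once copies exist on several sites, the document no longer depends on the site where it originated.

```python
import random


def ensure_replicas(holders, live_sites, min_copies, rng=random):
    """Return the set of sites that hold a copy after one repair pass.

    holders:    sites that held a copy before the pass (hypothetical names)
    live_sites: sites that are still reachable
    min_copies: number of copies the preservation rule asks for
    """
    # Copies on vanished sites are lost; only live holders can seed replicas.
    surviving = holders & live_sites
    if not surviving:
        raise RuntimeError("all copies lost; nothing left to replicate from")
    # Ask sites without a copy to accept one until enough copies exist.
    candidates = sorted(live_sites - surviving)
    rng.shuffle(candidates)
    needed = max(0, min_copies - len(surviving))
    return surviving | set(candidates[:needed])


if __name__ == "__main__":
    rng = random.Random(42)
    sites = {f"site{i}" for i in range(10)}
    holders = ensure_replicas({"site0"}, sites, 4, rng)  # site0 originates
    sites.discard("site0")                               # originator vanishes
    holders = ensure_replicas(holders, sites, 4, rng)
    print(len(holders), "site0" in holders)              # prints: 4 False
```

A real mechanism would also have to verify that replicas are intact and to negotiate which sites accept them; the sketch leaves both out.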

The first imperative of any institution is its own continued existence, and this is closely followed by its urge to increase its influence.  The press release can be seen as part of a rearguard action to protect the jobs of research librarians.[11]  If this is, in fact, part of what the Library of Congress and its NDIIPP (National Digital Information Infrastructure Preservation Program) partners have in mind, their ends would be better served by thinking through what unique social values they can offer.[12]  Only if they do this effectively will their roles in the digital world be as important as they already are in traditional libraries.[13]  The rhetoric “Institutions preserve” lulls and leads in a futile direction.

Preserving Dynamic Digital Objects

The Digital Curation Center (DCC) is preparing a “best practices” manual for which it is inviting experts to contribute individual fascicles.  The editors asked me to write Preserving Dynamic Digital Objects, whose abstract reads:

“Most articles about digital preservation come from the cultural heritage community.  The needs they express will expand to those of businesses wanting safeguards against diverse frauds, attorneys arguing cases based on the probative value of digital documents, and our personal medical records.  The U.S. NDIIPP  expresses urgency for preserving authentic digital works.  We know how to accomplish this reliably for every kind of information, with packaging that will seem convenient to all kinds of user.

“Among the digital record types that we might want to preserve, “dynamic records” are thought to … pose unique challenges.  The nature of [expressed] concerns suggests review of what it means for a record to be dynamic …  Because confusion about ‘dynamic’ occurs even among preservation experts, an exacting analysis is merited.  We provide this by using methodology [based on] early 20th century philosophy.

“For a work to be eligible for copyright protection, it must be fixed—written in a stable representation.  Whenever a computing system saves a record copy, or shares one with a remote user, this is a stable version.  Since dynamic digital objects introduce no technical problem beyond those already solved for stable documents, it is unnecessary to prescribe any new best practices for preserving them.”
                                                                                                                                                                  Gladney, 2005

Choosing Digital Repository Software

As part of volunteer work for a local museum, I have been looking into how that institution might choose software for managing and sharing history of computing materials—including obsolete software that it would like to offer in executable versions.  Early thinking suggests that making the best choice might not be easy, because there are about 80 open-source offerings[14] and about 20 commercial offerings worthy of consideration.  Perhaps this embarras de richesse is best handled by an orderly approach, starting with a pedantically careful evaluation of requirements. 

To that end, I inspected several Web-accessible analyses: a Canadian Heritage Collections Management Software … Criteria list,[15] a British archive requirements analysis that has been refined by many participants,[16] a ten-year-old IBM product development analysis,[17] and a German prescription for criteria development.[14]  A starting assumption was that each of these would contain many line item requirements worthy of consideration by any cyber-museum, but still not be completely satisfactory.  In retrospect, the assumption seems valid.  The Canadian document seems to be derived from considerations for traditional museums[18] that are not extended sufficiently to cyber-museums.  The British document is more current, but makes extensive use of statements whose satisfaction cannot be objectively tested.  The IBM document is out of date and was not published.  The German document is short on criteria, but does prescribe a useful scheme for criteria development (Figure 1).

Figure 1: Development of the evaluation scheme and roles in the decision process (translation of Borghoff Abbildung 1)

I also inspected the British Standard Procedures for Collections Recording Used in Museums.[19]  Although this is oriented toward traditional museums, it is a good reference for metadata that might accompany any kind of holding.  In contrast, the Model Requirements for the Management of Electronic Records[20] proves not to be applicable, as it deals primarily with operational data management of business records rather than the much less dynamic management of archival information.

Drawing on all these sources, I have drafted a new requirements analysis for museums working to extend their offerings to digital holdings.  This focuses on repository functionality, leaving to other efforts the requirements for packaging individual digital holdings for preservation, integrating digital holdings as interesting and informative extensions of traditional museum artifacts, organizing exhibits for different classes of museum visitor, and presenting material on the World Wide Web.

This draft is available to a limited number of qualified reviewers.  If you are interested, please send me an e-mail indicating what use you would make of it assuming that it proves to be useful for your institution.

Making Repository Software Replaceable

The recently approved Content Repository API for Java (JSR 170) is a standard interface whereby applications can retrieve from and manage a digital library.[21]  Its authors are from commercial enterprises and consortia: the Apache Software Foundation, Day Software, Fujitsu, IBM, SAP, Microsoft, BEA, Documentum, SUN, Novell, Macromedia, etc.

"As the number of ... proprietary content repositories has increased, the need for a common programmatic interface … has become apparent.  The aim of the Java Content Repository (JCR) API specification is to provide such an interface, [thereby laying] the foundations for a true industry-wide content infrastructure.

"Application developers and custom solution integrators will be able to avoid the costs associated with learning the particular API of each repository vendor.  Instead, programmers will be able to develop content-based application logic independently of the underlying repository architecture or physical storage.

"Customers will also benefit by being able to exchange their underlying repositories without touching any of the applications built on top of them."                                                                                 Content Mgmt. API for Java Spec., §2.1

The specification lists its goals as:

•  Not bound to any particular underlying architecture, data source, or protocol.[22]

•  Easy to use from the programmer’s point of view, representing the core functionality of a content repository without venturing into “content applications”.

•  Easy implementation on top of as wide a variety of existing content repositories as possible.[23]

•  Also standardizing some complex functionality needed by advanced content-related applications.

To mitigate tension between the last two goals, JSR 170 specifies two compliance levels.  Level 1 defines read-only functionality: reading repository content, inspecting content-type definitions, supporting namespaces, exporting content to XML, and searching.  Level 2 adds methods for writing content, assigning content types, and importing content from XML.  Finally, the specification defines optional interfaces for atomic transactions and locking, versioning, access control, and some search extensions.
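The layering of the two compliance levels, and the way it makes a repository replaceable, can be sketched as follows.  This is a loose illustration in Python and not the javax.jcr API: the class names Level1Repository, Level2Repository and InMemoryRepository, the method names, and the XML layout are all assumptions made for the example.  The application function depends only on the read-only interface, so any backend that implements that interface could be substituted underneath it.

```python
from abc import ABC, abstractmethod
import xml.etree.ElementTree as ET


class Level1Repository(ABC):
    """Read-only operations, loosely modeled on JSR 170 Level 1 (hypothetical)."""

    @abstractmethod
    def read(self, path): ...

    @abstractmethod
    def search(self, text): ...

    @abstractmethod
    def export_xml(self): ...


class Level2Repository(Level1Repository):
    """Adds write operations, loosely modeled on JSR 170 Level 2 (hypothetical)."""

    @abstractmethod
    def write(self, path, value): ...

    @abstractmethod
    def import_xml(self, xml_text): ...


class InMemoryRepository(Level2Repository):
    """One interchangeable backend; a vendor product would be another."""

    def __init__(self):
        self._items = {}

    def read(self, path):
        return self._items[path]

    def search(self, text):
        # Paths of all items whose content contains the text, in sorted order.
        return sorted(p for p, v in self._items.items() if text in v)

    def export_xml(self):
        root = ET.Element("repository")
        for path, value in sorted(self._items.items()):
            ET.SubElement(root, "item", path=path).text = value
        return ET.tostring(root, encoding="unicode")

    def write(self, path, value):
        self._items[path] = value

    def import_xml(self, xml_text):
        for item in ET.fromstring(xml_text).iter("item"):
            self._items[item.get("path")] = item.text or ""


def contents_mentioning(repo: Level1Repository, text):
    """Application logic written against the read-only interface only."""
    return [repo.read(p) for p in repo.search(text)]


if __name__ == "__main__":
    old = InMemoryRepository()
    old.write("/exhibits/eniac", "ENIAC programming manual")
    new = InMemoryRepository()          # stands in for a replacement product
    new.import_xml(old.export_xml())    # migrate content via XML export/import
    print(contents_mentioning(new, "ENIAC"))
```

The XML export and import pair is what carries content from the old backend to the new one in this sketch; the application code is untouched by the exchange.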

Careful reading of the JSR 170 specification leaves me impressed and confident that implementations will achieve the stated goals, perhaps with modest extensions that implementers discover to be desirable.  Since it is new, not many supporting offerings are available yet.  Perhaps JSR 170 support should be regarded as a sine qua non for any institution’s future acquisition of repository software, even if this somewhat delays digital library implementation.[24]

Epistemology

“The correct method in philosophy would really be … to say nothing except what can be said, i.e. propositions of natural science … and then, whenever someone else wanted to say something metaphysical, to demonstrate to him that he had failed to give a meaning to certain signs in his propositions.”                                                                                          Wittgenstein, Tractatus 6.53

A friend recently asked what was meant by ‘epistemology’.  He was pleased to have both the definition from the Concise Oxford English Dictionary and my own construction.  Both follow.[25]

epistemology

theory of knowledge, especially with regard to its methods, validity, and scope. (Concise OED).

branch of philosophy that deals with the origin, nature, methods and limits of human knowledge; the branch of philosophy dealing with theory of knowledge (in contrast to belief or opinion).  It analyzes the possibilities and limitations of answers to, “What do people know?  What can be known?” and “What can people communicate?  How can they minimize the occurrence of misunderstandings when they communicate?”

Most philosophy belongs to one of four branches: epistemology (about knowledge), metaphysics and religion (about beliefs), aesthetics (about beauty and taste), and ethics (about correct behavior).

The Word ‘Dialectic’

I have long been puzzled about what ‘dialectic’ means, and so am happy to recommend an analysis that begins:

“The term ‘dialectic’ is almost as old as the practice of philosophy.  Like many other labels of great antiquity, it has been used as a tag for concepts, activities, and situations of the most heterogeneous variety.  Few philosophers have ever employed the term in the same sense as any of their predecessors.  Indeed, rarely is it the case that any philosopher has consistently adhered to any one meaning in his writings.  What the dialectic is, therefore, can no more be adequately treated short of a history of its definitions in use from Plato to the present than we could straightway say what the empirical, the reasonable, the sensible, the romantic, and similar terms mean in the history of philosophy.

“There are two generic conceptions of dialectic under which the various meanings … may be subsumed.  The first is the conception of dialectic as a pattern of existential change either in nature or society or man where the ‘or’ is not exclusive.  The second is the view that dialectic is a special method of analyzing such change.  Usually, but not always, it is held that the method of dialectical analysis in some sense ‘reflects’ or ‘corresponds to’ the dialectical pattern of change.  In any case, there is always a distinction drawn, though with no great regard for consistency, between the dialectical type of change and other kinds.”
                                                                                                                                                      
Sidney Hook, 1951 [26]

The Word ‘Scientific’

“If it’s got ‘Science’ in its name, it ain’t one.”  Often heard in the halls of a computer science laboratory

Working as a computer scientist, I am still happy to concede that ‘Computer Science’ is less a science than an engineering discipline.  However, many physicists, chemists, and biologists are annoyed with the extension of ‘science’ to other disciplines, especially those that are not careful with what they think of as ‘scientific method’ or with empirical observation.  That such irritation is far from new is illustrated by:

“The great Poincaré once remarked that while physicists had a subject matter, sociologists were engaged almost entirely in considering their methods.  Allowing for the inevitable divergence between the sober facts and heightened Gallic wit, there is still in this remark a just rebuke (from one who had a right to deliver it) to those romantic souls who cherish the persistent illusion that by some new trick of method the social sciences can readily be put on a par with the physical sciences with regard to definiteness and universal demonstrability.  The maximum logical accuracy can be attained only by recognizing the exact degree of probability that our subject matter will allow.”  Cohen, 1931 [27]

“To a very great extent the term ‘science’ is reserved for fields that do progress in obvious ways.  Nowhere does this show more clearly than in the recurrent debates about whether one or another of the contemporary social sciences is really a science.  These debates have parallels in the pre-paradigm periods of fields that are today unhesitatingly labeled science.  Their ostensible issue throughout is a definition of that vexing term.  Men argue that psychology, for example, is a science because it possesses such and such characteristics.  Others counter that those characteristics are either unnecessary or not sufficient to make a field a science.  Often great energy is invested, great passion aroused, and the outsider is at a loss to know why.”  Kuhn, 1962 [28]

Purists are likely to point out that, until the mid-19th century, physics and chemistry were not called science, but rather “natural philosophy”, and that ‘science’ originates from the Latin verb scire, “to know”.

Russell's Paradox Revisited

Since DDQ 2(1) commented on Russell's Paradox, further reading has much increased my estimate of its significance as a stimulus for philosophers’ care with language.  The paradox arises from “Is the set of all sets that do not contain themselves a member of itself?”  The difficulty—that “the set of all sets that do not contain themselves” is nonsensical, i.e., does not denote a mathematical or empirical entity—came as a shock to logicians a century ago.[29]

“Russell's paradox created an air of despair in the logicists' camp, from which Frege never recovered.  The traditional notion of a set, of the one-one correlation between the intension and extension of a coherent predicate, is central …, and this paradox yanked out the rug from under it.  ‘Without a single object to represent an extension,’ says Russell, ‘mathematics crumbles.’  Frege admitted that … mathematics can be reduced to logic only if ‘set’ is a logical notion; Russell has inadvertently proven that it is not.  Frege sank into despair, and gave up on his lifelong project.”  Sullivan, 2003 [30]

Ordinary language allows phrases that are meaningless.  Frege was aware of this hazard:

“I see the greatest difficulty for philosophy: the instrument … for its work, namely ordinary language, is little suited to the purpose, for its formation was governed by requirements wholly different from those of philosophy.  So also logic is first of all obliged to fashion a usable instrument from those already to hand.  And for this purpose it initially finds but little in the way of usable instruments available.”  Frege, 1918 [31]

Twelve years earlier, he had already written:

“… only true thoughts are admissible premises of inferences.  It isn't strictly sentences, it is thoughts which have contradictory counterparts.  If you always bear this in mind, that we cannot legitimately infer from sentences, but only from true thoughts, and that proper names and concept-words must be meaningful, then … contradiction … cannot obtain.  before we go on to proofs at all, we must have assured ourselves that the proper names and concept-words we employ are admissible. …  Only if someone makes a proper name from the corresponding concept-word by means of the definite article or demonstrative, does he fall into error.   ‘the set of all sets which do not contain themselves as elements' is not a concept-word, it is a proper name, and it can only be a question of whether this proper name is meaningful.    We may distinguish two different ways of using the word ‘set', going with two different conceptions, which are probably most plainly identified by the words ‘aggregate' and ‘extension of a concept'.  But frequently these conceptions do not occur in their pure form, but mixed together and this makes for unclarity.  The aggregative conception is the first to offer itself, but the requirements of mathematics pull towards the opposite side, and so confusions easily arise.” Frege, 1906 [32]

That a phrase or sentence might be nonsensical should not surprise us.  ‘The flight of an ostrich’ is linguistically nonsensical because no bird we call ‘an ostrich’ can fly.  ‘The flight of a 2-ton bird’ is empirically nonsensical because no known material is both strong and light enough for the bones of such a bird.  Although Russell’s paradox might not be easily understood (perhaps because it touches on unbounded sets), that other mathematical phrases might be nonsensical should be obvious to anyone who thinks about “the integer that is greater than 2 and less than 3”.
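For readers who prefer symbols, the two mathematical examples can be written out in standard set notation.  The notation is an addition for illustration and is not taken from the sources quoted above; the symbol R is merely a label for the phrase in question, and the markup assumes the usual amssymb package for the blackboard-bold Z and the empty-set sign.

```latex
% No integer lies strictly between 2 and 3, so the description picks out nothing:
\[
  \{\, n \in \mathbb{Z} : 2 < n < 3 \,\} = \varnothing .
\]
% Suppose the phrase "the set of all sets that do not contain themselves"
% denoted a set R:
\[
  R = \{\, x : x \notin x \,\} .
\]
% Asking whether R is a member of itself then yields
\[
  R \in R \iff R \notin R ,
\]
% which is a contradiction, so the phrase denotes no set at all.
```

In both cases a grammatically well-formed definite description fails to denote anything, which is the sense of ‘nonsensical’ used in this section.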

The drawings of M.C. Escher and his imitators