Digital Document Quarterly

Perspectives on Trustworthy Information

Volume 4, Number 4, 4Q2005

 

 

 


HMG Consulting

Saratoga, CA 95070

©  2005, H.M. Gladney

 

ISSN: 1547-8610

 

There is a story of a dog-owner who prided himself on the perfect training of his pet.  Whenever he called: 'Here! will you come or not!' the dog invariably either came or not.  That is exactly how electrons behave when controlled by probability.

Michael Polanyi, Personal Knowledge, p. 21

Philosophy of Knowledge and Communication

I have kept to three fundamental principles:

·        always to separate sharply the psychological from the logical, the subjective from the objective;

·        never to ask for the meaning of a word in isolation, but only in the context of a proposition;

·        never to lose sight of the distinction between concept and object.

Gottlob Frege[1]

Colleagues and I began to read scientific philosophy about five years ago simply because we had heard of the difficulty of understanding Wittgenstein’s Tractatus Logico-Philosophicus[2] and relished the challenge—a reason similar to that stated by mountain climbers.[3]  Although it is retrospectively obvious that epistemology is the right foundation for thinking about digital preservation, this became apparent to us only when we grokked[4] that it had to do with understanding what other people said.

The Class of ‘Balls’ Is Not Obviously the Set of Balls

We have learned that every name, phrase, or sentence has (at least) two senses.  (Here, sense is almost, but not quite, a synonym for meaning.)  The first sense—an extension—is a relationship to some specific conceptual or real world object—a "nominatum".  The second sense—an intension—is a relationship within the context at hand.[5] 

For instance, the extension both for "the morning star" and also for "the evening star" would be the planet Venus.  The intension for "the morning star" might be "the brightest pinpoint light in the sky in the wee hours", with a corresponding sense for "the evening star".  It is an exercise in reasoning and science to recognize that the extension corresponding to the two different names is a single real-world object.

Before I read Davidson’s analysis,[6] I had thought of intension and extension only in terms of general expressions such as "stars" (rather than "the evening star", which philosophers call a particular expression, or a proper name).  That is because we were trying to understand why it might be important to distinguish between what was meant by the word ‘class’ and the word ‘set’.

Natural language has many words to denote “a number of objects that are somehow related”—collection, set, and class are prominent.  Each of these also has a technical meaning—a meaning used by some specialized community.  Librarians use collection to denote intellectual works that might be acquired because they conform to a policy decided by their institution.  Mathematicians use set, but decline to define it because the words that might appear in a definition are themselves defined in terms of set.[7]  (This is similar to the treatment of ‘point’ and ‘line’ in modern accounts of Euclidean geometry.)  Philosophers often use class to denote objects indicated by a natural language phrase, such as “spherical objects” or “balls”.

Such care is necessary for clarity.  While an aggregation might conform to what is intended by ‘set’, it might also conform to what is intended by ‘class’, or it might instead satisfy neither intension.  Whether or not a specific class corresponds to a particular set is a matter for careful inquiry.
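The distinction can be made concrete with a small sketch.  The Python below is an illustration only: the toy universe and the names `Thing`, `is_spherical`, `is_called_ball`, and `extension_of` are hypothetical and come from none of the cited sources.  A class is modeled as a predicate (an intension), a set as an explicit enumeration of members, and the question of whether the two coincide is settled only by inspecting the objects at hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thing:
    """A hypothetical object in a toy universe of discourse."""
    name: str
    shape: str


# Assumption: a tiny, made-up universe; real inquiries have no such closed list.
UNIVERSE = frozenset({
    Thing("tennis ball", "sphere"),
    Thing("marble", "sphere"),
    Thing("rugby ball", "ellipsoid"),
    Thing("brick", "cuboid"),
})


def is_spherical(thing: Thing) -> bool:
    """Intension for the phrase 'spherical objects'."""
    return thing.shape == "sphere"


def is_called_ball(thing: Thing) -> bool:
    """Intension for the phrase 'balls', read as 'things we call a ball'."""
    return thing.name.endswith("ball")


def extension_of(predicate, universe):
    """Collect the members of the universe that satisfy the predicate."""
    return frozenset(t for t in universe if predicate(t))


SPHERICAL = extension_of(is_spherical, UNIVERSE)
BALLS = extension_of(is_called_ball, UNIVERSE)

# A set given purely by enumeration, with no describing phrase attached.
ENUMERATED = frozenset({Thing("tennis ball", "sphere"), Thing("marble", "sphere")})

print(SPHERICAL == ENUMERATED)  # the class and the set happen to coincide here
print(BALLS == SPHERICAL)       # two plausible readings of 'balls' do not
```

In this toy universe the class "spherical objects" picks out exactly the enumerated set, whereas the class "balls" admits a rugby ball and excludes a marble.  Change the universe and either result could change, which is the sense in which the correspondence is a matter for inquiry rather than definition.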

A Concise Dismissal of “Intelligent Design”

Until now, DDQ has applied scientific philosophy only to knowledge communication as part of digital preservation.  It has much, much wider applicability.  For instance, it can illuminate a current political debate.

In the United States, religious notions must not be taught in tax-funded schools, but scientific theories are permitted (no matter how uncertain or bizarre they might seem).  To circumvent this constitutional principle, Kansans have approved the teaching of “intelligent design” (ID) as a “scientific” alternative to the theory of evolution.  If you want to argue the issue with a born-again Christian, consider the following approach.

The acid test differentiating a religious or metaphysical belief from a scientific theory is identification of feasible empirical observations that might yield refutations of the theory in question.[8]  Such observations would yield objective scientific facts rather than unprovable value judgments.  DDQ therefore suggests asking ID proponents to identify observational tests whose results could conceivably refute their non-evolutionary theories.[9]

In news that appeared just as this DDQ number was being readied for distribution, a Pennsylvania judge ruled that ID could not be taught as science.[10]  U.S. District Judge John Jones, who was appointed to the bench by President Bush in 2002, ruled that teaching ID would violate the Constitutional separation of church and state.  Excerpts from the opinion follow.[11]

“We have concluded that it is not [science], and moreover that ID cannot uncouple itself from its creationist, and thus religious, antecedents.

“To be sure, Darwin's theory of evolution is imperfect.  However, the fact that a scientific theory cannot yet render an explanation on every point should not be used as a pretext to thrust an untestable alternative hypothesis grounded in religion into the science classroom or to misrepresent well-established scientific propositions.”

A local newspaper quotes Judge Jones: “The breathtaking inanity of the [school] board's decision is evident when considered against the factual backdrop which has now been fully revealed through this trial.”

Preservation of Digital Records

What is “Digital Preservation”? [12]

Until recently, I was confident that the expression “digital preservation” meant something like “measures to mitigate the deleterious effects of technology obsolescence, media degradation, and fading human memory.”  However, recent digital repository literature suggests that some authors adopt a much broader definition.

For instance, the promotional literature for MIT’s DSpace software claims digital preservation support.  Uncritical acceptance of this claim by commentators might enhance DSpace uptake.  In a time-consuming dig into what justifies the statement, “The DSpace digital repository system … preserves … digital research material,” I sought preservation functionality beyond what competing content management packages provide.[13]  I found very little.

A DSpace objective, perhaps the most important justification for its funding, is to help MIT faculty save work that would otherwise not be published, such as experimental measurements.  We might reasonably call this a “digital preservation” activity, since without encouragement and help its originators would probably not try to ensure that their primary data survived their individual ephemeral interests.  On the other hand, both this purpose and the means for achieving it differ from what the seminal preservation task force report calls for.[14]

Using a single phrase for different activities is anything but helpful towards progress in a topic that many people are trying to understand.

I sometimes collect a grocery list in a digital file.  As a side effect of backing up valuable files, copies of such lists are saved on optical disks stored in a safety deposit box.  Should I describe this mechanism as a contribution to digital preservation technology?  What should we adopt as distinctions that facilitate precise communication?

Preserving Items of Low Individual Creation Cost

D-Lib Magazine 11(12) contains five articles about repository ingestion of content that DDQ has not considered as much as it should have—collections of digital objects, such as Web pages, which each individually cost little to create and which come from many authors who are not administratively related.[15]  Reported are results of ingestion tests for about 57,000 Web cullings—too large a number to pay curators to inspect each item individually.

These articles not only show that DDQ needs to enlarge the class of materials it considers, but also identify specific important tests that ingestion software should make for every kind of preservation candidate.

About Digital Preservation Costs

They say you can’t do it, but remember, that doesn’t always work.

Attributed to Casey Stengel

Aschenbrenner suggests that “costs of a digital repository are hard to calculate due to the lack of hands-on data from other initiatives.  … the lack of experience with digital preservation costs obstructs a complete picture.  In general, … costs are assumed to be … even greater in the digital environment than for paper.”[16]   

While this assessment might be reasonable, it and other preservation literature mostly ignore salient points:

·        Costing assumptions used by different institutions differ significantly, including in their estimates of the cost of money, the cost of labor and inflation, and how costs are allocated between distinct activities that share resources.  An example is the cost of supporting end-user access for new collections in an existing digital library.

·        The purpose of a cost estimate is likely to influence how it allocates cost contributions.  For instance, estimates intended to secure funding for a new activity with staff and other resources not currently in place will look different from estimates for expanding established activities.

·        Comparing the cost of digital services with paper-based services should be accompanied by stating objectives, which are likely to be quite different.  For instance, an MIT Libraries objective is to enhance the reputation of the Institute by making its contributions much more visible than they already are, an objective that cannot be achieved by spending on its paper-based collection.

·        The cost of housing can be large.  For instance, if existing facilities are running out of space for storing new paper, it is not at all obvious that digital preservation costs will exceed those for paper.

·        The costs for hardware and software will depend on what is already installed and on technology acquisition assumptions.  For instance, the costs associated with open source software will be distributed differently than those with commercial content management software.[17]

·        It can help significantly to represent cost estimates in a spreadsheet that allows refinements as numerical estimates are improved by new information, that makes “what if” experiments easy, and that allows the effects of different costing models to be assessed.  The “bottom line” number,[18] which will be briefly used and then ignored, might not be the most important result of cost estimation.  In contrast, the organizational and technical insights accumulated in the exercise can have enduring utility.
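The spreadsheet idea in the last point can be sketched in a few lines of code.  What follows is a minimal illustration, not a recommended costing model: every field name and every number in `CostAssumptions` is a made-up placeholder, and the only technique shown is keeping the assumptions (cost of money, labor inflation, storage price trend) separate from the arithmetic so that a “what if” experiment changes one assumption and reports the effect.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CostAssumptions:
    """Hypothetical inputs; every default below is a placeholder, not data."""
    years: int = 10
    discount_rate: float = 0.05         # the "cost of money"
    labor_per_year: float = 100_000.0   # staff cost in the first year
    labor_inflation: float = 0.03       # yearly growth in labor cost
    storage_per_year: float = 20_000.0  # storage cost in the first year
    storage_price_change: float = -0.10 # yearly change in storage cost
    startup_cost: float = 50_000.0      # one-time cost paid at the outset


def yearly_costs(a: CostAssumptions) -> list[float]:
    """Undiscounted operating cost for each year of the planning period."""
    costs = []
    for year in range(a.years):
        labor = a.labor_per_year * (1 + a.labor_inflation) ** year
        storage = a.storage_per_year * (1 + a.storage_price_change) ** year
        costs.append(labor + storage)
    return costs


def present_value(a: CostAssumptions) -> float:
    """Startup cost plus operating costs discounted at the cost of money."""
    total = a.startup_cost
    for year, cost in enumerate(yearly_costs(a), start=1):
        total += cost / (1 + a.discount_rate) ** year
    return total


def what_if(base: CostAssumptions, **changes) -> float:
    """Change in present value when the named assumptions are altered."""
    return present_value(replace(base, **changes)) - present_value(base)


base = CostAssumptions()
print(f"baseline present value: {present_value(base):,.0f}")
print(f"effect of 8% cost of money: {what_if(base, discount_rate=0.08):+,.0f}")
print(f"effect of 6% labor inflation: {what_if(base, labor_inflation=0.06):+,.0f}")
```

Two institutions with different views of the cost of money would simply construct different `CostAssumptions` and compare results, which makes the disagreement visible in the inputs rather than buried in the “bottom line”.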

Digital repository literature makes surprisingly little allusion to employing professional accountants.  Cost estimation is a standard subject in all professional accounting programs.[19]  There is a plethora of books on the topic—too many for anyone except a professional to recommend which might be good for any particular case.

Requirements Analysis for Digital Repositories

Stimulated by a Computer History Museum need, I am thinking about institutional process for choosing content management software.  This work includes a 50-page requirements analysis checklist for museums, constructed partly by inspecting existing repository packages.  It has some overlap with the nascent Audit Checklist for the Certification of Trusted Digital Repositories.[20]

The approach faces a perplexing challenge.  Apparently next to no one wants to use such checklists.  I recognize that doing so would surely be tedious.[21]  However, most repositories need to accommodate many detailed requirements.  Since “the devil is in the details,” I know no alternative to working with and refining careful detail lists, discussing their individual items with the eventual software users, and using such lists both for software development and software selection.

The dilemma is far from new.  Numerous 1980’s software development studies resulted in “best practices” expositions that next to nobody either objected to or followed, even though the business press had many articles about software project cost over-runs, schedule disappointments, and outright failures partly caused by inadequate appreciation of requirements.[22]  This dilemma is still with us.

In a Fortune Magazine interview, Fred Brooks observed that, while technology managers widely quote his 1975 book, The Mythical Man-Month, few actually follow its recommendations.[23]  Brooks, who managed IBM's OS/360 software, argued that adding more people to a software project that is behind schedule slows it even further.  This is because adding people creates more bureaucracy and more need for training.  It can be better to slip the schedule, limit the scope, and/or phase features into later versions.  Brooks is not surprised that managers continue to make the same mistakes.
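A little arithmetic shows why added staff can slow a late project.  In The Mythical Man-Month, Brooks observes that if each part of a task must be coordinated separately with every other part, the coordination effort grows as n(n-1)/2.  The sketch below merely tabulates that formula for a few team sizes; the function name is hypothetical, and the model ignores training time and any division of the team into subgroups.

```python
def communication_paths(team_size: int) -> int:
    """Number of distinct pairs among team_size people: n(n-1)/2."""
    if team_size < 0:
        raise ValueError("team_size must be non-negative")
    return team_size * (team_size - 1) // 2


for n in (5, 10, 20):
    print(n, communication_paths(n))
# prints:
# 5 10
# 10 45
# 20 190
```

Doubling a team from 10 to 20 people doubles the hands available but more than quadruples the pairs that may need to stay in agreement.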

Can any reader suggest how to make progress, given that library patrons are sensitive to many, many nuances?

Speculation about Faster Progress towards Digital Repositories

Reasons for the slow progress toward practical digital preservation systems include insufficient partitioning into system components that can be addressed with only small interdependencies, and failure to distinguish between aspects that have long been handled well by extant software offerings (e.g., IBM Content Manager—obviously a biased example, given my personal history) and additions needed to such well-understood technology.

Borghoff identifies more than 70 non-commercial repository offerings.[24]  Why are there so many?  Have their authors looked at competitors’ offerings?  They do not seem to mention them.  Their publications rarely identify and claim specific novelty in their work.  Are they aware of the functionality and quality of commercial offerings, or of what their open-source competitors have to offer?  Are we seeing another example of “publish or perish”?

For whose benefit are all these packages being created?  The number of packages makes it unlikely that potential repository institutions will even read the descriptions of more than a small fraction, much less evaluate them.  Given the lip service paid to sharing open source software, this situation is bizarre.

At a recent San Jose, California, trade show, I asked the representatives of a dozen commercial content management vendors what added value they provide over open source repository packages.  Nobody that I asked had heard of DSpace, Greenstone, Fedora, or any other!  So it seems that mutual ignorance is evenly distributed.

In summary, much of the current work on digital repositories seems wasteful because it reproduces what might be acquired at less expense than its likely development costs.  In many cases, our tax dollars are funding this!  Surely these dollars could be better spent—perhaps on truly new preservation work.

Open Source

What Does “Open Source” Mean?

Among the flood of open-source articles,[25] many recommend open source uncritically.  Few are clear about what ‘open source’ means.  There seem to be at least three distinct meanings: (1) software whose source code is made widely available, often to anyone who cares to inspect and perhaps to modify it; (2) software that is available for installation and use without licensing fee; and (3) software that is not produced or controlled by for-profit enterprises.

In an example of a commercial open-source project, we read that "Carnegie Mellon University, Intel, and SpikeSource are developing a system to help IT departments [choose] open-source tools."[26]  Readers would reasonably expect that Intel and SpikeSource are participating partly to enhance their revenues. 

Recall that until an antitrust consent decree forced a 1970’s change, IBM “gave away” software to its hardware customers.  That regulatory conditions have changed to permit partial reversion is suggested by a recent announcement in which IBM figures prominently.  Several major storage companies have formed an alliance to provide infrastructure software based on standards.[27]  "Storage vendors have taken a significant step forward today to unite under a common goal of providing flexibility, simplicity, efficiency, and common standards for customers," said ACM President David Patterson.  "Collaboration at this level is the only way we will manage and overcome the information explosion we are seeing today."

To add to the confusion, the relationship of open source packages to publicly defined and maintained information processing standards is often important, but there seem to be no widely accepted assertions about the extent to which standards adherence is required for software to be called “open source”.  The term is in danger of degrading into nearly meaningless hyperbole.[28]

AJAX: Taking Aim at Microsoft

“Asynchronous JavaScript and XML” (AJAX) is a technique for creating interactive web applications using a combination of XHTML, cascading stylesheets, and the Document Object Model manipulated through JavaScript to display and interact with the information presented, with data exchanged asynchronously with the server (typically via XMLHttpRequest).  AJAX tools facilitate computing on distant Web servers accessed with standard browsers.  Users therefore need not install software or move data when they switch computers.  Some commentators suggest that AJAX is a threat to Microsoft, because it can provide scaled-down alternatives to MS Office products that are sufficient for some users.

Free to Good Homes: IBM Patents for Education and Health Care

In October, IBM made its entire invention portfolio (45,000 worldwide patents) available without charge to support health care and educational software standards.  Developers who use those standards, in the areas of Web services, electronic forms, and open documents, will be able to exploit IBM patents without paying royalties.  “Here are the raw tools.  Go make something interesting,” said Bob Sutor, IBM vice president of standards and open source.

IBM, which is orienting its business around standards-based technology, hopes the move will open up more markets for its products and services.[29]  The announcement is part of a larger movement within the technology world.  Corporate clients, tired of the cost and complexity of installing new systems, are demanding software that more easily connects with products from different companies.  So technology giants, such as IBM, Microsoft, Oracle and SAP, are collaborating to create new EDP standards.

Reading Recommendations

A national survey asked university faculty about the impact of Internet use on teaching and research.[30]  There is general optimism, though little evidence, about the Internet's impacts on their professional lives.  The findings suggest that institutions of higher education still need to address three broad areas (their own infrastructure, professional development, and teaching and research) to help faculty make good use of the Internet.

Michael Polanyi’s Personal Knowledge

This book by the famous physical chemist and philosopher speaks for itself.[31]  It is

an enquiry into the nature and justification of scientific knowledge.  But my reconsideration of scientific knowledge leads on to a wide range of questions outside science.  I start by rejecting the ideal of scientific detachment.  In the exact sciences, this false ideal is perhaps harmless, for it is in fact disregarded there by scientists.  But … it exercises a destructive influence in biology, psychology and sociology, and falsifies our whole outlook far beyond the domain of science.

[The title words], Personal Knowledge. … may seem to contradict each other: for true knowledge is deemed impersonal, universally established, objective.  But the seeming contradiction is resolved by modifying the conception of knowing.  I regard knowing as an active comprehension of the things known, an action that requires skill.  Skilful knowing and doing is performed by subordinating … particulars, as clues or tools, to the shaping of a skilful achievement, whether practical or theoretical. …  Clues and tools are things used as such and not observed in themselves.  They are made to function as extensions of our bodily equipment.

The Open Source Maturity Model (OSMM)

In his book, Succeeding With Open Source, Navica CEO Bernard Golden details the Open Source Maturity Model, a tool designed to help organizations determine whether open source products can satisfy their unique support, training, documentation, integration, and services needs.