Digital Document Quarterly
Perspectives on Trustworthy Information
Volume 4, Number 4, 4Q2005
HMG Consulting
© 2005, H.M. Gladney ISSN: 1547-8610
There is a story of a dog-owner who prided himself on the perfect training of his pet. Whenever he called: 'Here! will you come or not!' the dog invariably either came or not. That is exactly how electrons behave when controlled by probability. Michael Polanyi, Personal Knowledge, p.21
I have kept to three fundamental principles:
· always to separate sharply the psychological from the logical, the subjective from the objective;
· never to ask for the meaning of a word in isolation, but only in the context of a proposition;
· never to lose sight of the distinction between concept and object. Gottlob Frege[1]
Colleagues and I began to read scientific philosophy about five years ago simply because we had heard of the difficulty of understanding Wittgenstein’s Tractatus Logico-Philosophicus[2] and relished the challenge, a reason similar to that stated by mountain climbers.[3] Although it is retrospectively obvious that epistemology is the right foundation for thinking about digital preservation, this became apparent to us only when we grokked[4] that it had to do with understanding what other people said.
We have learned that every name, phrase, or sentence has (at least) two senses. (Here, sense is almost, but not quite, a synonym for meaning.) The first sense, an extension, is a relationship to some specific conceptual or real-world object, a "nominatum". The second sense, an intension, is a relationship within the context at hand.[5]
For instance, the extension for both "the morning star" and "the evening star" would be the planet Venus. The intension for "the morning star" might be "the brightest pinpoint light in the sky in the wee hours", with a corresponding sense for "the evening star". It is an exercise in reasoning and science to recognize that the extension corresponding to the two different names is a single real-world object.
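The distinction can be sketched in a few lines of Python. This is only an illustration: the `Name` and `Planet` types and the helper functions are hypothetical, and the wording of the evening-star intension is an assumption, since the text above says only that a corresponding sense exists.

```python
from dataclasses import dataclass


class Planet:
    """Stand-in for a real-world object (hypothetical type)."""

    def __init__(self, label: str) -> None:
        self.label = label


@dataclass(frozen=True)
class Name:
    expression: str    # the words used
    intension: str     # how the words pick out an object in context
    extension: Planet  # the object actually picked out


venus = Planet("Venus")

morning_star = Name(
    "the morning star",
    "the brightest pinpoint light in the sky in the wee hours",
    venus,
)
evening_star = Name(
    "the evening star",
    # Assumed wording, chosen only to parallel the morning-star sense.
    "the brightest pinpoint light in the sky soon after sunset",
    venus,
)


def same_intension(a: Name, b: Name) -> bool:
    """True when the two names carry the same contextual sense."""
    return a.intension == b.intension


def same_extension(a: Name, b: Name) -> bool:
    """True when the two names refer to one and the same object."""
    return a.extension is b.extension


if __name__ == "__main__":
    print(same_intension(morning_star, evening_star))
    print(same_extension(morning_star, evening_star))
```

In the sketch the two names compare unequal by intension yet share one extension. In practice, establishing that shared extension is the empirical work; here it is simply stipulated by passing the same object twice.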
Before I read Davidson’s analysis,[6] I had thought of intension and extension only in terms of general expressions such as "stars" (rather than "the evening star", which philosophers call a particular expression, or a proper name). That is because we were trying to understand why it might be important to distinguish between what was meant by the word ‘class’ and the word ‘set’.
Natural language has many words to denote “a number of objects that are somehow related”—collection, set, and class are prominent. Each of these also has a technical meaning—a meaning used by some specialized community. Librarians use collection to denote intellectual works that might be acquired because they conform to a policy decided by their institution. Mathematicians use set, but decline to define it because the words that might appear in a definition are themselves defined in terms of set.[7] (This is similar to the treatment of ‘point’ and ‘line’ in modern accounts of Euclidean geometry.) Philosophers often use class to denote objects indicated by a natural language phrase, such as “spherical objects” or “balls”.
Such care is necessary for clarity. While an aggregation might conform to what is intended by ‘set’, it might also conform to what is intended by ‘class’, or it might instead satisfy neither intension. Whether or not a specific class corresponds to a particular set is a matter for careful inquiry.
Until now, DDQ has applied scientific philosophy only to knowledge communication as part of digital preservation. It has much, much wider applicability. For instance, it can illuminate a current political debate.
In the
The acid test differentiating a religious or metaphysical belief from a scientific theory is identification of feasible empirical observations that might yield refutations of the theory in question.[8] Such observations would yield objective scientific facts rather than unprovable value judgments. DDQ therefore suggests asking intelligent design (ID) proponents to identify observational tests whose results could conceivably refute their non-evolutionary theories.[9]
In news that appeared just as this DDQ number was being readied for distribution, a
"We have concluded that it is not [science], and moreover that ID cannot uncouple itself from its creationist, and thus religious, antecedents. …
"To be sure,
A local newspaper quotes Judge Jones: “The breathtaking inanity of the [school] board's decision is evident when considered against the factual backdrop which has now been fully revealed through this trial.”
Until recently, I was confident that the expression “digital preservation” meant something like “measures to mitigate the deleterious effects of technology obsolescence, media degradation, and fading human memory.” However, recent digital repository literature suggests that some authors adopt a much broader definition.
For instance, the promotional literature for MIT’s DSpace software claims digital preservation support. DSpace uptake might be enhanced by this claim being uncritically accepted by commentators. In a time-consuming dig into what justifies, “The DSpace digital repository system … preserves … digital research material,” I sought preservation functionality beyond what competing content management packages provide.[13] I found very little.
A DSpace objective (perhaps the most important justification for its funding) is to help MIT faculty save work that would otherwise not be published, such as experimental measurements. We might reasonably call this a “digital preservation” activity, since without encouragement and help its originators would probably not try to ensure that their primary data survived their individual ephemeral interests. On the other hand, both this purpose and the means for achieving it differ from what the seminal preservation task force report calls for.[14]
Using a single phrase for different activities is anything but helpful towards progress in a topic that many people are trying to understand.
I sometimes collect a grocery list in a digital file. As a side effect of backing up valuable files, copies of such lists are saved on optical disks stored in a safety deposit box. Should I describe this mechanism as a contribution to digital preservation technology? What should we adopt as distinctions that facilitate precise communication?
D-Lib Magazine 11(12) contains five articles about repository ingestion of content that DDQ has not considered as much as it should have—collections of digital objects, such as Web pages, which each individually cost little to create and which come from many authors who are not administratively related.[15] Reported are results of ingestion tests for about 57,000 Web cullings—too large a number to pay curators to inspect each item individually.
These articles not only teach that DDQ needs to enlarge the class of materials it considers, but also identify specific important tests that ingestion software should make for every kind of preservation candidate.
They say you can’t do it, but remember, that doesn’t always work. Attributed to Casey Stengel
Aschenbrenner suggests that “costs of a digital repository are hard to calculate due to the lack of hands-on data from other initiatives. … the lack of experience with digital preservation costs obstructs a complete picture. In general, … costs are assumed to be … even greater in the digital environment than for paper.”[16]
While this assessment might be reasonable, it and other preservation literature mostly ignore salient points:
Ø Costing assumptions used by different institutions differ significantly, including in their estimates of the cost of money, the cost of labor and inflation, and how costs are allocated between distinct activities that share resources. An example is the cost of supporting end-user access for new collections in an existing digital library.
Ø The purpose of a cost estimate is likely to influence how it allocates cost contributions. For instance, estimates meant to secure funding for a new activity with staff and other resources not currently in place will look different from estimates for expanding established activities.
Ø Comparing the cost of digital services with paper-based services should be accompanied by stating objectives, which are likely to be quite different. For instance, an MIT Libraries objective is to enhance the reputation of the Institute by making its contributions much more visible than they already are—an objective that cannot be achieved by spending on its paper-based collection.
Ø The cost of housing can be large. For instance, if existing facilities are running out of space for storing new paper, it is not at all obvious that digital preservation costs will exceed those for paper.
Ø The costs for hardware and software will depend on what is already installed and on technology acquisition assumptions. For instance, the costs associated with open source software will be distributed differently than those with commercial content management software.[17]
Ø It can help significantly to represent cost estimates in a spreadsheet that allows refinement as new information improves the numerical estimates, that makes “what if” experiments easy, and that allows the effects of different costing models to be assessed. The “bottom line” number,[18] which will be briefly used and then ignored, might not be the most important result of cost estimation. In contrast, the organizational and technical insights accumulated in the exercise can have enduring utility.
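The spreadsheet idea in the last point can also be sketched in code. The following is a minimal illustration, not a recommended costing model: every field name and every figure in `CostAssumptions` is a placeholder, and the formulas (compounding labor inflation, declining storage unit cost, discounting at the cost of money) are simplifying assumptions.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CostAssumptions:
    """Illustrative inputs only; every figure here is a placeholder."""
    staff_fte: float = 2.0            # full-time-equivalent staff
    salary_per_fte: float = 60_000.0  # first-year cost per FTE
    labor_inflation: float = 0.03     # yearly growth in labor cost
    storage_tb: float = 10.0          # terabytes held
    cost_per_tb: float = 1_000.0      # first-year storage cost per terabyte
    storage_decline: float = 0.20     # yearly fall in storage unit cost
    discount_rate: float = 0.05       # cost of money
    years: int = 5                    # planning horizon


def yearly_costs(a: CostAssumptions) -> list[float]:
    """Undiscounted cost for each year of the planning horizon."""
    costs = []
    for year in range(a.years):
        labor = a.staff_fte * a.salary_per_fte * (1 + a.labor_inflation) ** year
        storage = a.storage_tb * a.cost_per_tb * (1 - a.storage_decline) ** year
        costs.append(labor + storage)
    return costs


def present_value(a: CostAssumptions) -> float:
    """Total cost discounted to year 0 at the assumed cost of money."""
    return sum(
        cost / (1 + a.discount_rate) ** year
        for year, cost in enumerate(yearly_costs(a))
    )


if __name__ == "__main__":
    baseline = CostAssumptions()
    # A "what if" run: change two assumptions, keep the rest.
    what_if = replace(baseline, discount_rate=0.08, staff_fte=2.5)
    for label, a in (("baseline", baseline), ("what if", what_if)):
        print(f"{label:9s} present value: {present_value(a):12,.0f}")
```

Keeping all assumptions in one record makes it plain which inputs differ between two institutions' estimates, and a "what if" run is a single call to `replace`.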
Digital repository literature makes surprisingly little allusion to employing professional accountants. Cost estimation is a standard subject in all professional accounting programs.[19] There is a plethora of books on the topic—too many for anyone except a professional to recommend which might be good for any particular case.
Stimulated by a
The approach faces a perplexing challenge. Apparently next to no one wants to use such checklists. We recognize that doing so would surely be tedious.[21] However, most repositories need to accommodate many detailed requirements. Since “the devil is in the details,” I know no alternative to working with and refining careful detail lists, discussing their individual items with the eventual software users, and using such lists both for software development and software selection.
The dilemma is far from new. Numerous 1980s software development studies resulted in “best practices” expositions that next to nobody either objected to or followed, even though the business press had many articles about software project cost over-runs, schedule disappointments, and outright failures partly caused by inadequate appreciation of requirements.[22] This dilemma is still with us.
In a Fortune Magazine interview, Fred Brooks observed that, while technology managers widely quote his 1975 book, The Mythical Man-Month, few actually follow its recommendations.[23] Brooks, who managed IBM's OS/360 software, argued that adding more people to a software project that is behind schedule slows it even further. This is because adding people causes more bureaucracy and more needed training. It can be better to slip the schedule, limit the scope, and/or phase features into later versions. Brooks is not surprised that managers continue to make the same mistakes.
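One way to see why coordination overhead grows so quickly is to count pairwise communication paths, n(n-1)/2 for n people, the figure Brooks's book gives for work in which every part must be coordinated with every other part. The sketch below is a toy illustration of that count only; the function name is hypothetical, and it is not a schedule model.

```python
def communication_paths(team_size: int) -> int:
    """Number of distinct pairs among team_size people: n(n-1)/2."""
    if team_size < 0:
        raise ValueError("team_size must be non-negative")
    return team_size * (team_size - 1) // 2


if __name__ == "__main__":
    for n in (5, 10, 15):
        print(f"{n:2d} people -> {communication_paths(n):3d} pairwise paths")
```

Growing a team from 10 to 15 people, a 50 percent increase in staff, raises the number of pairwise paths from 45 to 105.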
Can any reader suggest how to make progress, given that library patrons are sensitive to many, many nuances?
Reasons for the slow progress toward practical digital preservation systems include insufficient partitioning into system components that can be addressed with only small interdependencies, and failure to distinguish between aspects that have long been handled well by extant software offerings (e.g., IBM Content Manager—obviously a biased example, given my personal history) and additions needed to such well-understood technology.
Borghoff identifies more than 70 non-commercial repository offerings.[24] Why are there so many? Have their authors looked at competitors’ offerings? They do not seem to mention them. Their publications rarely identify and claim specific novelty in their work. Are they aware of the functionality and quality of commercial offerings, or of what their open-source competitors have to offer? Are we seeing another example of “publish or perish”?
For whose benefit are all these packages being created? The number of packages makes it unlikely that potential repository institutions will even read the descriptions of more than a small fraction, much less evaluate them. Given the lip service paid to sharing open source software, this situation is bizarre.
At a recent
In summary, much of the current work on digital repositories seems wasteful because it reproduces what might be acquired at less expense than its likely development costs. In many cases, our tax dollars are funding this! Surely these dollars could be better spent—perhaps on truly new preservation work.
Among the flood of open-source articles,[25] many recommend open source uncritically. Few are clear about what ‘open source’ means. There seem to be at least three distinct meanings: (1) software whose source code is made widely available, often to anyone who cares to inspect and perhaps to modify it; (2) software that is available for installation and use without licensing fee; and (3) software that is not produced or controlled by for-profit enterprises.
In an example of a commercial open-source project, we read that "
Recall that until an antitrust consent decree forced a 1970s change, IBM “gave away” software to its hardware customers. That regulatory conditions have changed to permit partial reversion is suggested by a recent announcement in which IBM figures prominently. Several major storage companies have formed an alliance to provide infrastructure software based on standards.[27] "Storage vendors have taken a significant step forward today to unite under a common goal of providing flexibility, simplicity, efficiency, and common standards for customers," said ACM President David Patterson. "Collaboration at this level is the only way we will manage and overcome the information explosion we are seeing today."
To add to the confusion, the relationship of open source packages to publicly defined and maintained information processing standards is often important, but there seem to be no widely accepted assertions about the extent to which standards adherence is required for software to be called “open source”. The term is in danger of degrading into nearly meaningless hyperbole.[28]
“Asynchronous JavaScript and XML” (AJAX) is a technique for creating interactive web applications using a combination of XHTML, cascading stylesheets, and the Document Object Model manipulated through JavaScript to display and interact with the information presented.
In October, IBM made its entire invention portfolio (45,000 worldwide patents) available without charge to support health care and educational software standards. Developers who use those standards, in the areas of Web services, electronic forms, and open documents, will be able to exploit IBM patents without paying royalties. “Here are the raw tools. Go make something interesting,” said Bob Sutor, IBM vice president of standards and open source.
IBM, which is orienting its business around standards-based technology, hopes the move will open up more markets for its products and services.[29] The announcement is part of a larger movement within the technology world. Corporate clients, tired of the cost and complexity of installing new systems, are demanding software that more easily connects with products from different companies. So technology giants, such as IBM, Microsoft, Oracle and SAP, are collaborating to create new EDP standards.
A national survey asked university faculty about the impact of Internet use on teaching and research.[30] Faculty are generally optimistic, though with little evidence, about the Internet's impact on their professional lives. The findings suggest that institutions of higher education still need to address three broad areas (their own infrastructure, professional development, and teaching and research) to help faculty make good use of the Internet.
This book by the famous physical chemist and philosopher speaks for itself.[31] It is
an enquiry into the nature and justification of scientific knowledge. But my reconsideration of scientific knowledge leads on to a wide range of questions outside science. I start by rejecting the ideal of scientific detachment. In the exact sciences, this false ideal is perhaps harmless, for it is in fact disregarded there by scientists. But … it exercises a destructive influence in biology, psychology and sociology, and falsifies our whole outlook far beyond the domain of science. …
[The title words], Personal Knowledge. … may seem to contradict each other: for true knowledge is deemed impersonal, universally established, objective. But the seeming contradiction is resolved by modifying the conception of knowing. … I regard knowing as an active comprehension of the things known, an action that requires skill. Skilful knowing and doing is performed by subordinating … particulars, as clues or tools, to the shaping of a skilful achievement, whether practical or theoretical. … Clues and tools are things used as such and not observed in themselves. They are made to function as extensions of our bodily equipment.
In his book, Succeeding With Open Source, Navica CEO Bernard Golden details the Open Source Maturity Model, a tool designed to help organizations determine whether open source products can satisfy their unique support, training, documentation, integration, and services needs.