Digital Document Quarterly Perspectives on Trustworthy Information
Volume 5, Number 2, 2Q2006
HMG Consulting
© 2005, H.M. Gladney ISSN: 1547-8610
Assessment reports of attractive quality have appeared in recent months. In addition, recent repository and preservation software releases merit careful inspection by institutions ready for practical deployment. Selections are identified below.
The July 2006 Internet Resources Newsletter identifies more than a score of registries of repositories and about as many subject-specialized repositories.
After commissioning the National
Library of the
The National Library of Australia carried out a survey of guidance documents for preserving digital materials: Report to ICABS on guidance for digital preservation: Report on a survey of sources.
A 3-year progress report of the German Nestor project is disappointing.[1] The main talks, mostly by institution managers, provide little more than repetitions of requirements and challenges that were published over three years ago (i.e., before Nestor began) and calls for cooperation in building networks and sharing experience.[2] Many earlier articles by other authors have said the same things.
The British Broadcasting Corporation (BBC) faces preservation challenges not much treated in the literature—challenges occasioned by its 900 kilometers of shelves holding at-risk analog recordings on scores of different media formats. Richard Wright provides a fascinating introduction to the BBC's current preservation guide, a work in progress.
In a related report, the [
A paper from Université Paris (Sorbonne), Search Engines and
On-line Museum Access on the Web, describes challenges
for making museums accessible to virtual visitors and suggests solutions.
Relatively little digital preservation news covers
engineering documents. Modern Relics in the June (USA) issue of Government
Computer News describes [
In 2005 I participated in an assessment of repository packages for a not-for-profit institution; the participants rejected D-Space. My personal opinion was that D-Space was designed for larger institutions than the one at hand. I also discovered internal software quality management issues. I recommend an ETD conference paper, Lisa Atkinson’s The Rejection of DSpace: Selecting Thesis Database Software for the University of Calgary Archives.[4]
Deanna Marcum’s The Future of Preservation keynote IFLA speech in March provides a quantitative perspective on digital preservation in the wider context of preservation of all documents of interest to research libraries. Its citation list is particularly recommended.
Recent reports[5] suggest that the midpoint of NDIIPP funding is a time to take stock and adjust the program to address significant omissions.
For instance, I am curious how well the following charge has been met: “Congress named specific government agencies and private-sector nonprofit groups LC should work with. They also indicated the need to find partners in the commercial and technical communities. "The information and technology industry that has created this new medium should be a contributing partner in addressing digital access and preservation issues inherent in the new digital information environment" ” [5] It seems appropriate to inform the public (not merely specialists) to what extent the skills and communities engaged reach beyond librarians/archivists and a small number of large institutions. The public also deserves to know how the scope of content addressed by NDIIPP compares with that of content of broad national interest.
I have searched public information for reports of aspects of public interest, but have not found answers to questions such as:
· What fraction of the (approx.) $100M funding has NDIIPP actually expended? What has actually been achieved? Project by project, what has been learnt?
· How much matching grant funding (as called for in the enabling legislation) has been obtained? Which commercial organizations, if any, have contributed? Who are the other contributors?
· What is being done to help individual citizens save digital information in preservation-ready forms? (Unpublished work by famous people often is held privately for decades before cultural heritage institutions recognize its worth. For instance, Leonard Bernstein’s papers were saved by his family for about 40 years before the Library of Congress acquired them.)
My search for such information has so far been unsuccessful. I will continue to search and would much appreciate DDQ readers’ hints on where to find it.
Figure 1: MPEG-21 DIDL structure[6]
Although colleagues and I believe we know how to preserve any type of digital object reliably, we have yet to create software for convenient realization of our ideas. This includes the document packaging suggested in Figure 2 of DDQ 3(3) and, with more detail, in an ACM TOIS paper.[7] There is more than one way to represent the references prominent in the figure. A design for part of the structure is specified in the MPEG-21 Digital Item Declaration Language (DIDL) realized in the recent Los Alamos aDORe offering.
DIDL defines a recursive structure (Figure 1) whose Items contain other Items and Components, which in turn contain equivalent Resources. Its Resources are bitstring information representations. Based on these abstract concepts, DIDL defines an XML Schema providing flexibility and extensibility for representing complex digital objects.
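The recursion described above can be made concrete with a small sketch. It is only an illustration of the Item / Component / Resource nesting, not the MPEG-21 schema itself: the class names mirror the DIDL concepts, but the attribute names (`identifier`, `mime_type`, `ref`), the file names, and the un-namespaced XML output are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List
import xml.etree.ElementTree as ET


@dataclass
class Resource:
    """One bitstring representation, referenced here by a hypothetical file name."""
    mime_type: str
    ref: str


@dataclass
class Component:
    """Groups Resources that are equivalent representations of the same content."""
    resources: List[Resource] = field(default_factory=list)


@dataclass
class Item:
    """An Item may hold further Items (the recursion) as well as Components."""
    identifier: str
    items: List["Item"] = field(default_factory=list)
    components: List[Component] = field(default_factory=list)


def to_xml(item: Item) -> ET.Element:
    """Serialize an Item tree into nested XML elements (illustrative element names)."""
    node = ET.Element("Item", {"id": item.identifier})
    for component in item.components:
        c_node = ET.SubElement(node, "Component")
        for res in component.resources:
            ET.SubElement(c_node, "Resource", {"mimeType": res.mime_type, "ref": res.ref})
    for child in item.items:
        node.append(to_xml(child))  # recursive step: Items inside Items
    return node


def count_resources(item: Item) -> int:
    """Count every Resource reachable from an Item, at any nesting depth."""
    own = sum(len(c.resources) for c in item.components)
    return own + sum(count_resources(child) for child in item.items)


# A thesis with two equivalent renditions and one nested appendix Item.
thesis = Item(
    identifier="thesis-001",
    components=[Component([Resource("application/pdf", "thesis.pdf"),
                           Resource("text/html", "thesis.html")])],
    items=[Item(identifier="appendix-A",
                components=[Component([Resource("text/csv", "data.csv")])])],
)
xml_root = to_xml(thesis)
print(ET.tostring(xml_root, encoding="unicode"))
```

Because an Item can contain Items, one recursive function suffices both to serialize and to traverse an object of any depth.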
In addition to open-source document packaging software that implements MPEG-21 DIDL, aDORe[8] provides a repository package that combines two cross-referenced file structures.
· XML-based representations of multiple Digital Objects are concatenated into a single, valid XML file named an XMLtape, together with identifier and timestamp indices to facilitate OAI-PMH-based access.
· ARC files, as introduced by the Internet Archive, concatenate the constituent datastreams of the Digital Objects, and are indexed for OpenURL access.
· Connections between an XMLtape and its associated ARC file(s) are recorded as ARC file identifiers and OpenURL references in XMLtape files.
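The cross-referencing between the two file structures can be sketched roughly as follows. This is a simplified illustration and not the aDORe file formats: real ARC files carry per-record headers, and real XMLtapes hold DIDL documents. The function names, the element names (`tape`, `record`, `resource`), and the reference strings are all hypothetical.

```python
import io
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple


def build_arc_like(datastreams: Dict[str, bytes]) -> Tuple[bytes, Dict[str, Tuple[int, int]]]:
    """Concatenate datastreams into one blob and index each by (offset, length)."""
    blob = io.BytesIO()
    index: Dict[str, Tuple[int, int]] = {}
    for ref, data in datastreams.items():
        index[ref] = (blob.tell(), len(data))
        blob.write(data)
    return blob.getvalue(), index


def build_xmltape_like(records: List[Tuple[str, str, str]]):
    """Concatenate (identifier, timestamp, datastream ref) records into one XML
    document, and build identifier and timestamp indices over it."""
    tape = ET.Element("tape")
    id_index: Dict[str, int] = {}
    ts_index: List[Tuple[str, str]] = []
    for position, (identifier, timestamp, arc_ref) in enumerate(records):
        rec = ET.SubElement(tape, "record", {"id": identifier, "datestamp": timestamp})
        # The record stores only a reference into the ARC-like blob, not the bytes.
        ET.SubElement(rec, "resource", {"ref": arc_ref})
        id_index[identifier] = position
        ts_index.append((timestamp, identifier))
    ts_index.sort()  # ISO 8601 UTC timestamps sort correctly as strings
    return tape, id_index, ts_index


def fetch(tape, id_index, blob, arc_index, identifier: str) -> bytes:
    """Resolve an object identifier to its datastream via both indices."""
    record = tape[id_index[identifier]]
    ref = record.find("resource").get("ref")
    offset, length = arc_index[ref]
    return blob[offset:offset + length]


blob, arc_index = build_arc_like({"arc-1#a": b"%PDF demo bytes", "arc-1#b": b"GIF89a demo"})
tape, id_index, ts_index = build_xmltape_like([
    ("obj:2", "2006-05-01T00:00:00Z", "arc-1#b"),
    ("obj:1", "2006-04-01T00:00:00Z", "arc-1#a"),
])
print(fetch(tape, id_index, blob, arc_index, "obj:1"))
```

The timestamp index is what would support date-ranged harvesting of the kind OAI-PMH offers, while the identifier index supports direct lookup; the datastream bytes themselves never enter the XML file.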
Many digital preservation articles start by reminding their readers of digital document attributes that differ from their paper-based counterparts, but then pay little attention to managing individual documents in favor of discussing digital library architecture and/or the organization and management of repository institutions. If one truly wants to ensure long-term digital document utility, the latter approach is suboptimal because representing the documents appropriately much simplifies the challenge, and also allows widely deployed digital library technology to be used with no more than modest upwards-compatible extensions. Put otherwise, digital preservation and digital repository are best treated as distinct technologies with modest interactions.[9]
The distinction is particularly clear in the preservation design of the National Archives of Australia (NAA),[10] which partitions its system into three components that share documents only by transported storage media: a quarantine server, a preservation server, and a digital repository. The distinction is also inherent in the
For
discussing digital repository structure and comparing designs such as those of
, and the IBM DB2 Data Links Manager
fits into Ñ. The figure complements the well-known
high-level OAIS description.[11]
Fig. 2: Repository architecture
Among other things, the figure illustrates that document preparation activities and management of accession into a collection usually occur on different machines than those housing the collection and providing access to information consumers. This separation arises partly because it conveniently matches the different human roles illustrated and the available software tools, and partly in order to mitigate well-known security risks.[4] The most secure backup repositories are never connected to the Internet.
It has long surprised me how little attention the cultural
heritage community seems to pay to commercial software that might inexpensively
address its concerns. The annual Excellence Awards
choices of eWeek Magazine identify two packages worthy of
examination by repository managers.
The BMC Identity Management Suite addresses the complexity of enterprise identity management and access control systems, while not cutting back on capability. The Excellence Awards judges say that it does an excellent job of integrating all the parts of an enterprise ID management system with powerful workflow and reporting options.
To prepare for disaster recovery, Onaro's SANscreen Replication Assurance helps IT managers ensure that the volumes they replicate from a primary data center to failover sites are consistent and ready for action. It lets IT managers quickly visualize their storage resources and locate potential flaws in their disaster recovery, helping them guarantee that remote data is consistent.
If your institution has many books that it would like to scan, you might want to consider a machine that can scan about 1000 pages/hour. The BookDrive DIY uses digital cameras that are faster than overhead or flatbed scanners, and holds a book in a V-shaped cradle.[13] There is also a version with an automatic page turner. This hardware is suitable only for books for which risks of binding damage are acceptable in exchange for the added value of making those books searchable or accessible on-line (if intellectual property considerations do not obtrude).
Much of what today passes as common sense was not so recognized a century ago, but in fact originated with scholars who called themselves philosophers. For instance, today's orderly structure of archives and saving information laboriously collected became common only a century ago. The following advice from Charles Sanders Peirce, an American not appropriately valued in his own lifetime, might have been part of what caused the change.
An indispensable condition of systematization of any kind is systematic records. Everything worth notice is worth recording; and those records should be so made that they can readily be arranged, and particularly so that [they] can be rearranged. I recommend slips of stiff smooth paper ... [to] note every disconnected fact that you see or read that is worth record. Besides that, you want a book of the same size ... Each book is a connected record of some little investigation. At the end of a year, you have 10000 to 20000 slips which you have arranged ... After thirty years of systematic study, you have every fact at your fingers' ends. ... There will not be a man or any subject of interest that you ever had dealings with, whose whole character you will not be able to survey at pleasure. Peirce, Training in Reasoning, 1898.
Readers new to DDQ might wonder why this newsletter pays so
much attention to epistemology[14]
and why the topic appears immediately after sections on long-term digital
preservation. An objective reason is
that epistemology treats the questions, “What can we know about the
world, in contrast to what might we believe, or find beautiful, or judge to be
good?” and therefore, “What can we communicate to others of what we know?” Of course, information preservation is an
attempt to communicate knowledge.
A subjective reason is that the topic fascinates me, as it fascinates many trained in the exact sciences. A good way to refine one's personal understanding of a topic is to attempt to explain it to others. And writing carefully is fun, if only for relatively few people.
Some friends and I came to philosophy only after we had retired from gainful employment. As one might expect from our science and engineering backgrounds, the branch of philosophy that drew our attention was epistemology.
A cousin had suggested Ray Monk's Ludwig Wittgenstein: A
Duty of Genius, having himself read it because his son had been an
We accomplished the challenge by discussing Wittgenstein's Tractatus
Logico-Philosophicus sentence by sentence in weekly meetings, first in the
These sessions have now continued for about 5 years. After Wittgenstein (including his posthumous Philosophical Investigations), we continued with selected logical empiricist works, partly because 1926 Vienna Circle discussions analyzing the Tractatus had inspired our own discussions.
We have started to examine Charles Sanders Peirce and,
concurrently, Immanuel Kant. Peirce,
mostly ignored until recently, paralleled and often anticipated European
philosophers. Selection is a big challenge
in reading his work, because he wrote extensively on many topics. We are finding helpful his 1898
Kant's Critique of Pure Reason (CPR)[16] shares with Wittgenstein's Tractatus the reputation of being a difficult read. We therefore started with Prolegomena,[17] which Kant wrote two years later as a teacher's introduction to CPR, and have progressed through 30% of its text. Even though we cannot and need not study CPR with the detail we gave to the Tractatus (it is much longer, but we are more practiced than five years ago), our CPR focus is likely to last about a year. Luckily, we are not in a hurry.
I have gradually come to the opinion that, for students of epistemology, Kant's CPR and Wittgenstein's Tractatus constitute two essential nexuses. Each stimulated a fundamental change of knowledge theory. Neither can properly be ignored by anyone who later writes about their topics. Together they create a core understanding that is often taken for granted.
The last point was underlined for me in a recent discussion with a young lady educated in engineering and working in commercial auditing. She asked what about the topic interested me sufficiently to write about it. I attempted a brief summary of central notions, without commenting on thinking that preceded CPR or Tractatus. Her reaction was along the lines of: “Of course. That's obvious!”
Perhaps so, but this is probably only because both her engineering professors and my science professors had absorbed ideas refined by Kant's and Wittgenstein's successors. Their ideas became embedded in technical education in the last half of the 20th century.
In everyday language it very frequently happens that the same word has different modes of signification … or that two words that have different modes of signification are employed in … superficially the same way. … In this way the most fundamental confusions are easily produced (the whole of philosophy is full of them). Wittgenstein, Tractatus Logico-Philosophicus 3.323
Students of the exact and biological sciences are sometimes
derisive of other disciplines using 'Science' as part of their names, perhaps because
they suspect attempts to share in the 20th-century prestige of
chemistry, physics, biology, and medicine.
They express this opinion with a glib gibe, “If it has 'Science' in its
name, it isn't a science.” [18]
If this opinion is reprehensible, I was guilty until long after I had finished formal studies. Today, years after I quit chemistry and physics to take up Computer Science, I regard the latter topic more as an engineering discipline than as a science.[19] As for Information Science, I am still trying to decide what to think.
The word 'science' originates in the Latin 'scire', which is “to know, to have knowledge of, to experience, to have learned”.[20]
An authoritative start for an inquiry into the history of what 'science' means is with Immanuel Kant's writings, because his most influential work grappled with making metaphysics into a science. In fact, the following passage from Prolegomena[21] illustrates that contempt by scientists for non-scientific scholars is not a new phenomenon!
My purpose is to persuade all those who think metaphysics worth studying that it is absolutely necessary to pause a moment and ... to propose first the preliminary question, "Whether ... metaphysics be even possible ...at all?"
If it be science, how is it that it cannot, like other sciences, obtain universal and lasting recognition? ... , we must come once for all to a definite conclusion respecting the nature of this so-called science, which cannot possibly remain on its present footing. It seems almost ridiculous, while every other science is continually advancing, that in this, which pretends to be wisdom incarnate, for whose oracle everyone inquires, we should constantly move round the same spot, without gaining a single step. ...
...
The question whether a science be possible presupposes a doubt as to its actuality. But such a doubt offends the men whose whole fortune consists of this supposed jewel; hence he who raises the doubt must expect opposition from all sides. Some, in the proud consciousness of their possessions, which are ancient and therefore considered legitimate, will take their metaphysical compendia in their hands and look down on him with contempt; … Immanuel Kant in the Introduction to Prolegomena.[21]
This quotation reminds us that the meaning of 'science' has changed much in the last four centuries, as has that of its German equivalent, 'Wissenschaft'. This is surely related to the fact that considering philosophy, physics, and mathematics to be distinct disciplines is a 20th-century development.
For instance,
Philosophy and physics continue to be intertwined. In fact, some people call 20th-century epistemology “scientific philosophy.” [23] The distinction is that epistemology concerns itself with justifications common to all laws of nature, reaching backward from observations and laws to the fundamentals of human reasoning: “On what logical grounds can we be confident of any set of laws?” In contrast, physics concerns itself with choosing specific laws that are as general and economical as possible in their ability to describe what we can and do observe, reasoning iteratively back and forth between conjectured laws and observations, and seeking observational and logical discrepancies that might invalidate some supposed law.
Antiquarians might cite the Latin origin ‘scire’ as justification of the propriety of appellations such as 'Library Science' and 'Information Science', because these are based in knowing what people said and wrote. However, the subject matter of (natural) science is knowledge of the world rather than knowledge of human literature. Every formal discipline requires that its practitioners know its literature. If that were the criterion we could properly attach ‘Science' to the name of any university department. Doing so would diminish the role of the word in signaling helpful distinctions.
1950’s courses in library management and administration grew into “Library Schools” in the 60’s, “Library Science” faculties in the 70’s, and finally into today’s faculties of “Information Science” within what many universities call a “Faculty of Arts and Science”. Why is the topic not called “Information Arts”? People who write about digital preservation mostly come from liberal arts backgrounds. Do they regret missing scientific components in their education, valuing science over arts?
Controversy over the use of the word 'science' is likely whenever people suspect that some speaker is extending our language in order to gain approbation for his own activities or assertions, rather than to describe or distinguish them from other topics.[24]
Wikipedia has adopted changes to its “Anyone
can edit” policy because of malicious changes to its previously
entirely open pages. A critic has announced The death of Wikipedia.
The number of deadline extensions of conference Calls for Papers for preservation, archiving, and digital libraries seems to be growing. What's happening? Is the cause merely higher travel costs? Is the number of conference topical overlaps increasing? (There do seem to be more conferences in third-world venues.) Or is it that people are finding fewer novel things to say in these topics?
On July 6, Microsoft Corp. announced that it would make
components of its Office™ suite compatible with the Open
Document Format (ODF) standard.
This policy change is reportedly in response to government requests from
Microsoft said that Office 2007 will include menu options for XML, ODF, and Adobe's PDF formats. According to LinuxWorld, a Word prototype will be posted on SourceForge. A final version of the Word translator is to be available in late 2006, with Excel and PowerPoint translators to follow in 2007.
The Washington Post (April 11)
reported that the National Archives helped keep secret a multi-year effort by
the Air Force, the CIA and other federal agencies to withdraw thousands of
historical documents from public access on Archives shelves, even though the records
had been declassified. It took three
years for the National Security Archive to respond to a Freedom of Information
request memorandum revealing that Archives officials agreed to help pull the
materials for possible reclassification and conceal the identities of anyone
participating in the effort.
The New York Times (April 19) reported that, after resisting for decades, the German government agreed to open the Bad Arolsen archive. This repository is one of the largest Holocaust archives in the world, with 15 miles of shelving holding approximately 50 million documents, some seized by the Allies as they liberated concentration camps. Museum officials hope to make the documents available for computer viewing at Holocaust research centers around the world.
In May, the
The Free Standards Group has made available LSB 3.1, a Software Development Kit (SDK) that makes it easier for developers to build portable Linux applications.[25] This is the first LSB version to include explicit Linux desktop application support. Major Linux distributors have indicated that they plan to certify their versions of Linux to LSB 3.1. This is expected to encourage Linux desktop adoption by providing a cohesive desktop environment.
Of late there has been much positive commentary on Asynchronous JavaScript and XML (AJAX) tools. InformationWeek[26]
reminds us that users accepting
Readers have probably noticed renewed emphasis on
information retrieval R&D in the last year.
For instance,
Browster offers a free browser add-on that is faster and simpler than conventional link following. Placing the cursor over linked text immediately pops up the associated link target in a fresh window that stays “alive” only as long as it is in use.
BASE (Bielefeld Academic Search Engine) a
multi-disciplinary scholarly Internet source at
· Intellectual selection of resources
· Searches metadata and full text (depending on the data source)
· Access to "deep web" content not indexed by commercial search engines (such as 500,000 digitised pages of historical journals and reviews of the German Enlightenment)
· Search results as bibliographic data and full text hits
· Sorting of search results
· Search refinement for authors, keywords, document type, language etc.
IBM's approach to corporate search structure uses concepts and facts gathered both from databases and from unstructured data, and is radically different from the keyword searches familiar to users of consumer-oriented search engines such as Yahoo and Google. It is organized around the open Unstructured Information Management Architecture (UIMA) standard, which IBM has made public. UIMA is a framework and toolkit for integrating structured data and unstructured information; IBM recently released a freely downloadable Software Development Kit (SDK) for it.
There are some information retrieval tasks for which keyword searches are quite tedious.[28] For instance, have you ever spent an entire weekend planning a trip by surfing the Web to book hotels, travel and special events? Or to compare the features, prices, and reviews of digital cameras before deciding which one to buy? There are new Web services that allow consumers to share synopses of such investigations.
One such service is Kaboodle, which gathers links, photos, summaries and notes onto a single Web page that can be shared easily with others. It offers any user a browser add-on button to click whenever he wants to save a search result, which is created as a single Web page named for a research project—for example, “My trip to Hawaii.” This page collects links to search results, summaries of searched pages, photos and comments. The user can then decide whether to make the Kaboodle page publicly searchable or whether to let in only certain people.
For each of the search result pages saved, Kaboodle automatically extracts relevant information, such as pricing and feature information for commercially relevant pages.
Kaboodle, financed by advertisements, has attracted 35,000 registered users since launching in October, says co-founder and Chief Executive Manish Chandra. But whether people will flock to Kaboodle in large enough numbers for it to make money is still an open question.
Similar technology is available from OnFolio[29], Jeteye, and Plum. Each organizes web research for sharing in e-mails, blogs, and documents, and captures local copies of web content for reliable access. These offerings are less focused on shopping than is Kaboodle. Offerings from Del.icio.us and Furl are more focused on letting people share their bookmarks and less on saving full research projects.
A law school blog points at ways to locate people.