Digital Document Quarterly

Perspectives on Trustworthy Information

Volume 5, Number 2, 2Q2006

 

 

 


HMG Consulting

Saratoga, CA 95070

©  2005, H.M. Gladney

 

ISSN: 1547-8610

 

Digital Preservation

Assessment reports of attractive quality have appeared in recent months.  In addition, recent repository and preservation software releases merit careful inspection by institutions ready for practical deployment.  Selections are identified below.

The July 2006 Internet Resources Newsletter identifies more than a score of registries of repositories and about as many subject-specialized repositories.

Reports of Special Interest

After commissioning the National Library of the Netherlands (Koninklijke Bibliotheek) to conduct a 2004-5 survey on the use and development of standards in digital archiving, the IFLA-CDNL Alliance for Bibliographic Standards (ICABS) has called for comments on its activities.  The KB report is Networking for digital preservation: Current Practice in 15 National Libraries.

The National Library of Australia carried out a survey of guidance documents for preserving digital materials: Report to ICABS on guidance for digital preservation: Report on a survey of sources.

A 3-year progress report of the German Nestor project is disappointing.[1]  The main talks, mostly by institution managers, provide little more than repetitions of requirements and challenges that were published over three years ago (i.e., before Nestor began) and calls for cooperation in building networks and sharing experience.[2]  Many earlier articles by other authors have already said these things.

The British Broadcasting Corporation (BBC) faces preservation challenges not much treated in the literature—challenges occasioned by its 900 kilometers of shelves holding at-risk analog recordings on scores of different media formats.  Richard Wright provides a fascinating introduction to the BBC's current preservation guide, a work in progress.

In a related report, the [U.K.] Arts and Humanities Data Service has requested comments on the final draft of its Moving Images and Sound Archiving Study.  It suggests that “The long term preservation of moving images and sound resources however is challenging ... primarily due to their complex nature. ... relatively little attention has been paid to the long-term durability and accessibility of multimedia files [and] development of technical solutions and agreed metadata sets lags behind that of other resource types.”

A paper from Université Paris (Sorbonne), Search Engines and On-line Museum Access on the Web, describes challenges for making museums accessible to virtual visitors and suggests solutions.

Relatively little digital preservation news covers engineering documents.  Modern Relics in the June (USA) issue of Government Computer News describes [U.S.] National Institute of Standards and Technology thinking about a start on selection and long-term curation for engineering designs.[3]

In 2005 I participated in an assessment of repository packages for a not-for-profit institution; the participants rejected DSpace.  My personal opinion was that DSpace was designed for larger institutions than the one at hand.  I also discovered internal software quality management issues.  I recommend an ETD conference paper, Lisa Atkinson’s The Rejection of DSpace: Selecting Thesis Database Software for the University of Calgary Archives.[4]

Deanna Marcum’s keynote IFLA speech in March, The Future of Preservation, provides a quantitative perspective on digital preservation in the wider context of preservation of all documents of interest to research libraries.  Its citation list is particularly recommended.

Digital Preservation in a National Context: NDIIPP at Mid-point

Recent reports[5] suggest that the midpoint of NDIIPP funding is a time to take stock and adjust the program to address significant omissions.

For instance, I am curious how well the following charge has been met: “Congress named specific government agencies and private-sector nonprofit groups LC should work with.  They also indicated the need to find partners in the commercial and technical communities. "The information and technology industry that has created this new medium should be a contributing partner in addressing digital access and preservation issues inherent in the new digital information environment" ” [5]  It seems appropriate to inform the public (not merely specialists) to what extent the skills and communities engaged reach beyond librarians/archivists and a small number of large institutions.  The public also deserves to know how the scope of content addressed by NDIIPP compares with that of content of broad national interest.

I have searched public information for reports of aspects of public interest, but have not found answers to questions such as:

·        What fraction of the (approx.) $100M funding has NDIIPP actually expended?  What has actually been achieved?  Project by project, what has been learnt?

·        How much matching grant funding (as called for in the enabling legislation) has been obtained?  Which commercial organizations, if any, have contributed?  Who are the other contributors?

·        What is being done to help individual citizens save digital information in preservation-ready forms?  (Unpublished work by famous people often is held privately for decades before cultural heritage institutions recognize its worth.  For instance, Leonard Bernstein’s papers were saved by his family for about 40 years before the Library of Congress acquired them.)

My search for such information has been unsuccessful.  I will continue to search and would much appreciate DDQ readers’ hints on where to find it.

Preservation, Repository, and Related Software

Figure 1: MPEG-21 DIDL structure[6]

Although colleagues and I believe we know how to preserve any type of digital object reliably, we have yet to create software for convenient realization of our ideas.  This includes document packaging suggested in Figure 2 of DDQ 3(3) and, with more detail, in an ACM TOIS paper.[7]  There is more than one way to represent the references prominent in the figure.  A design for part of the structure is specified in the MPEG-21 Digital Item Declaration Language (DIDL) realized in the recent Los Alamos aDORe offering.

DIDL defines a recursive structure (Figure 1) whose Items contain other Items and Components that contain equivalent Resources.  Its Resources are bitstring information representations.  Based on these abstract concepts, DIDL defines XML Schema providing flexibility and extensibility for representing complex digital objects.
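The recursive containment just described can be illustrated with a small data model.  What follows is only a minimal sketch in Python, not the DIDL schema itself: the class names mirror the abstract concepts above, but the field names, the example identifiers and file names, and the simplified XML output (no namespace, no Descriptors or Containers) are assumptions made for illustration.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Resource:
    """One bitstring representation, referenced by location (illustrative fields)."""
    mime_type: str
    ref: str


@dataclass
class Component:
    """Groups Resources that are equivalent representations of the same content."""
    resources: list[Resource] = field(default_factory=list)


@dataclass
class Item:
    """Recursive node: an Item may hold child Items and Components."""
    item_id: str
    items: list[Item] = field(default_factory=list)
    components: list[Component] = field(default_factory=list)


def to_xml(item: Item) -> ET.Element:
    """Serialize an Item tree to simplified, DIDL-like XML (names are illustrative)."""
    node = ET.Element("Item", {"id": item.item_id})
    for child in item.items:
        node.append(to_xml(child))  # recursion: Items nest inside Items
    for comp in item.components:
        comp_node = ET.SubElement(node, "Component")
        for res in comp.resources:
            ET.SubElement(comp_node, "Resource",
                          {"mimeType": res.mime_type, "ref": res.ref})
    return node


def count_resources(item: Item) -> int:
    """Count every Resource reachable from this Item, at any nesting depth."""
    own = sum(len(c.resources) for c in item.components)
    return own + sum(count_resources(child) for child in item.items)


# Hypothetical example: a report with a cover image and one chapter that is
# available in two equivalent formats.
report = Item(
    "report",
    items=[Item("chapter-1", components=[Component([
        Resource("application/pdf", "ch1.pdf"),
        Resource("text/html", "ch1.html"),
    ])])],
    components=[Component([Resource("image/png", "cover.png")])],
)
```

Calling `ET.tostring(to_xml(report), encoding="unicode")` yields one nested XML document for the whole object, which is the property that makes such a structure convenient for packaging complex digital objects.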

In addition to open-source document packaging software that implements MPEG-21 DIDL, aDORe[8] provides a repository package that combines two cross-referenced file structures.

·        XML-based representations of multiple Digital Objects are concatenated into a single, valid XML file named an XMLtape, together with identifier and timestamp indices to facilitate OAI-PMH-based access.

·        ARC files, as introduced by the Internet Archive, concatenate the constituent datastreams of the Digital Objects, and are indexed for OpenURL access.

·        Connections between an XMLtape and its associated ARC file(s) are recorded as ARC file identifiers and OpenURL references in the XMLtape file.
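The idea behind the list above, many XML records concatenated into one valid XML file with identifier and timestamp indices kept alongside, can be sketched in a few lines of Python.  This is not the aDORe file format: the `<tape>` and `<record>` element names, the `arc` attribute, the identifiers, and the in-memory index are all hypothetical, chosen only to show how byte offsets allow a single record to be fetched without parsing the whole file.

```python
import io
import xml.etree.ElementTree as ET


def write_tape(records):
    """Concatenate XML records into one well-formed document and index them.

    records: iterable of (identifier, datestamp, xml_fragment) tuples.
    Returns (tape_bytes, index), where index maps
    identifier -> (datestamp, byte_offset, byte_length).
    """
    buf = io.BytesIO()
    index = {}
    buf.write(b"<tape>\n")
    for identifier, datestamp, fragment in records:
        data = fragment.encode("utf-8")
        index[identifier] = (datestamp, buf.tell(), len(data))
        buf.write(data)
        buf.write(b"\n")
    buf.write(b"</tape>\n")
    return buf.getvalue(), index


def read_record(tape, index, identifier):
    """Fetch one record by identifier using its recorded byte range."""
    _, offset, length = index[identifier]
    return tape[offset:offset + length].decode("utf-8")


def changed_since(index, datestamp):
    """List identifiers whose datestamp is at or after the given one.

    Assumes all datestamps use the same ISO 8601 UTC format, so that
    string comparison agrees with chronological order.
    """
    return sorted(i for i, (d, _, _) in index.items() if d >= datestamp)


# Hypothetical records; the arc attribute stands in for a reference from a
# record to the ARC file holding its datastreams.
records = [
    ("info:example/obj-1", "2006-04-01T00:00:00Z",
     '<record id="info:example/obj-1"><datastream arc="arc-0001"/></record>'),
    ("info:example/obj-2", "2006-06-15T00:00:00Z",
     '<record id="info:example/obj-2"><datastream arc="arc-0001"/></record>'),
]
tape, index = write_tape(records)
```

Lookup by identifier and selection by datestamp are the two access patterns a harvesting protocol such as OAI-PMH needs, which is presumably why both indices are kept; a production system would persist the index rather than hold it in a dictionary.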

Many digital preservation articles start by reminding their readers of digital document attributes that differ from their paper-based counterparts, but then pay little attention to managing individual documents in favor of discussing digital library architecture and/or the organization and management of repository institutions.  If one truly wants to ensure long-term digital document utility, the latter approach is suboptimal because representing the documents appropriately much simplifies the challenge, and also allows widely deployed digital library technology to be used with no more than modest upwards-compatible extensions.  Putting it otherwise, digital preservation and digital repositories are best treated as distinct technologies with modest interactions.[9]

The distinction is particularly clear in the preservation design of the National Archives of Australia (NAA),[10] which partitions its system into three components that share documents only by transported storage media: a quarantine server, a preservation server, and a digital repository.  The distinction is also inherent in the Los Alamos approach described above.

For discussing digital repository structure and comparing designs such as those of Los Alamos and of NAA, I have found Figure 2 invaluable.  For instance, LOCKSS fits neatly within one of its boxes, offsite archive software (see Onaro below) fits into another, and the IBM DB2 Data Links Manager fits into a third.  The figure complements the well-known high-level OAIS description.[11]

Figure 2: Repository architecture suggesting human roles in the use of networked, nested repositories.[12]

Among other things, the figure illustrates that document preparation activities and management of accession into a collection usually occur on different machines than those housing the collection and providing access to information consumers.  This separation arises partly because it conveniently matches the different human roles illustrated and the available software tools, and partly in order to mitigate well-known security risks.[4]  The most secure backup repositories are never connected to the Internet.

It has long surprised me how little attention the cultural heritage community seems to pay to commercial software that might inexpensively address its concerns. The annual Excellence Awards choices of eWeek Magazine identify two packages worthy of examination by repository managers.

The BMC Identity Management Suite addresses the complexity of enterprise identity management and access control systems, while not cutting back on capability. The Excellence Awards judges say that it does an excellent job of integrating all the parts of an enterprise ID management system with powerful workflow and reporting options.

To prepare for disaster recovery, Onaro's SANscreen Replication Assurance helps IT managers ensure that the volumes they replicate from a primary data center to failover sites are consistent and ready for action. It lets IT managers quickly visualize their storage resources and locate potential flaws in their disaster recovery, helping them guarantee that remote data is consistent.

Automatic Book Scanning

If your institution has many books that it would like to scan, you might want to consider a machine that can scan about 1000 pages/hour.  The BookDrive DIY uses digital cameras that are faster than overhead or flatbed scanners, and holds a book in a V-shaped cradle.[13]  There is also a version with an automatic page turner.  This hardware is suitable only for books for which risks of binding damage are acceptable in exchange for the added value of making those books searchable or of on-line access (if intellectual property considerations do not obtrude).

Epistemology

Why Philosophy is Pertinent to Preservation

Much of what today passes as common sense was not so recognized a century ago, but in fact originated with scholars who called themselves philosophers.  For instance, today's orderly structuring of archives and the saving of laboriously collected information became common only a century ago.  The following advice from Charles Sanders Peirce, an American not appropriately valued in his own lifetime, might have been part of what caused the change.

An indispensable condition of systematization of any kind is systematic records.  Everything worth notice is worth recording; and those records should be so made that they can readily be arranged, and particularly so that [they] can be rearranged.  I recommend slips of stiff smooth paper ... [to] note every disconnected fact that you see or read that is worth record.  Besides that, you want a book of the same size ...   Each book is a connected record of some little investigation.  At the end of a year, you have 10000 to 20000 slips which you have arranged ...  After thirty years of systematic study, you have every fact at your fingers' ends.  ... There will not be a man or any subject of interest that you ever had dealings with, whose whole character you will not be able to survey at pleasure.       Peirce, Training in Reasoning, 1898.

Readers new to DDQ might wonder why this newsletter pays so much attention to epistemology[14] and why the topic appears immediately after sections on long-term digital preservation.  An objective reason is that epistemology treats the questions, “What can we know about the world, in contrast to what might we believe, or find beautiful, or judge to be good?” and therefore, “What can we communicate to others of what we know?”  Of course, information preservation is an attempt to communicate knowledge.

A subjective reason is that the topic fascinates me, as it fascinates many trained in the exact sciences.  A good way to refine one's personal understanding of a topic is to attempt to explain it to others.  And writing carefully is fun, if only for relatively few people.

Some friends and I came to philosophy only after we had retired from gainful employment.  As one might expect from our science and engineering backgrounds, the branch of philosophy that drew our attention was epistemology.

A cousin had suggested Ray Monk's Ludwig Wittgenstein: A Duty of Genius, having himself read it because his son had been an Oxford University classmate of the author.  I had previously known of Wittgenstein only vaguely, hearing that his work was both important and difficult.  I plunged in for the same reason that some people climb mountains: a challenge sometimes incites one to prove oneself.

We met the challenge by discussing Wittgenstein's Tractatus Logico-Philosophicus sentence by sentence in weekly meetings, first in the IBM Almaden Research Center and, after the still-employed group members fell away, at home.  I was surprised (but should not have been) that this reading illuminated basic aspects of computer science and, more narrowly, digital preservation.

These sessions have now continued for about 5 years.  After Wittgenstein (including his posthumous Philosophical Investigations), we continued with selected logical empiricist works, partly because 1926 Vienna Circle discussions analyzing the Tractatus had inspired our own discussions.

We have started to examine Charles Sanders Peirce and, concurrently, Immanuel Kant.  Peirce, mostly ignored until recently, paralleled and often anticipated European philosophers.  Selection is a big challenge in reading his work, because he wrote extensively on many topics.  We are finding helpful his 1898 Cambridge lectures, because they are his own late summaries of key topics.[15]

Kant's Critique of Pure Reason (CPR)[16] shares with Wittgenstein's Tractatus the reputation of being a difficult read.  We therefore started with Prolegomena,[17] which Kant wrote two years later as a teacher's introduction to CPR, and have progressed through 30% of its text.  Even though we cannot and need not study CPR with the detail we gave to the Tractatus (it is much longer, but we are more practiced than five years ago), our CPR focus is likely to last about a year.  Luckily, we are not in a hurry.

I have gradually come to the opinion that, for students of epistemology, Kant's CPR and Wittgenstein's Tractatus constitute two essential nexuses.  Each stimulated a fundamental change of knowledge theory.  Neither can properly be ignored by anyone who later writes about their topics.  Together they create a core understanding that is often taken for granted.

The last point was underlined for me in a recent discussion with a young lady educated in engineering and working in commercial auditing.  She asked what about the topic interested me sufficiently to write about it.  I attempted a brief summary of central notions, without commenting on thinking that preceded CPR or Tractatus.  Her reaction was along the lines of: “Of course.  That's obvious!”

Perhaps so, but this is probably only because both her engineering professors and my science professors had absorbed ideas refined by Kant's and Wittgenstein's successors.  Their ideas became embedded in technical education in the last half of the 20th century.

The Word ‘Science’

In everyday language it very frequently happens that the same word has different modes of signification … or that two words that have different modes of signification are employed in … superficially the same way.  In this way the most fundamental confusions are easily produced (the whole of philosophy is full of them).       Wittgenstein, Tractatus Logico-Philosophicus 3.323

Students of the exact and biological sciences are sometimes derisive of other disciplines using 'Science' as part of their names, perhaps because they suspect attempts to share in the 20th-century prestige of chemistry, physics, biology, and medicine.  They express this opinion with a glib gibe, “If it has 'Science' in its name, it isn't a science.” [18]

If this opinion is reprehensible, I was guilty until long after I had finished formal studies.  Today, years after I quit chemistry and physics to take up Computer Science, I regard the latter topic more as an engineering discipline than as a science.[19]  As for Information Science, I am still trying to decide what to think.

The word 'science' originates in the Latin 'scire', which is “to know, to have knowledge of, to experience, to have learned”.[20] 

An authoritative start for an inquiry into the history of what 'science' means is with Immanuel Kant's writings, because his most influential work grappled with making metaphysics into a science.  In fact, the following passage from Prolegomena[21] illustrates that contempt by scientists for non-scientific scholars is not a new phenomenon!

My purpose is to persuade all those who think metaphysics worth studying that it is absolutely necessary to pause a moment and ... to propose first the preliminary question, "Whether ... metaphysics be even possible ... at all?"

If it be science, how is it that it cannot, like other sciences, obtain universal and lasting recognition? ... , we must come once for all to a definite conclusion respecting the nature of this so-called science, which cannot possibly remain on its present footing.  It seems almost ridiculous, while every other science is continually advancing, that in this, which pretends to be wisdom incarnate, for whose oracle everyone inquires, we should constantly move round the same spot, without gaining a single step.  ...

...

The question whether a science be possible presupposes a doubt as to its actuality.  But such a doubt offends the men whose whole fortune consists of this supposed jewel; hence he who raises the doubt must expect opposition from all sides.  Some, in the proud consciousness of their possessions, which are ancient and therefore considered legitimate, will take their metaphysical compendia in their hands and look down on him with contempt; …       Immanuel Kant in the Introduction to Prolegomena.[21]

This quotation reminds us that the meaning of 'science' has changed much in the last four centuries, as has that of its German equivalent, 'Wissenschaft'.  This is surely related to the fact that considering philosophy, physics, and mathematics to be distinct disciplines is a 20th-century development.  For instance, Newton, identified as one of the greatest scientists of all time,[22] was Professor of Mathematics at Cambridge.  His masterpiece is named Philosophiae Naturalis Principia Mathematica (“The Mathematical Principles of Natural Philosophy”).  'Natural Philosophy' (in German, 'Naturwissenschaft') was the name for physics + chemistry + biology until well into the 19th century, when it was superseded by 'the Natural Sciences', which became abbreviated to 'science' even in such official phrases as 'the Faculty of Arts and Sciences'.

Philosophy and physics continue to be intertwined.  In fact, some people call 20th-century epistemology “scientific philosophy.” [23]  The distinction is that epistemology concerns itself with justifications common to all laws of nature, reaching backward from observations and laws to the fundamentals of human reasoning: “On what logical grounds can we be confident of any set of laws?”  In contrast, physics concerns itself with choosing specific laws that are as general and economical as possible in their ability to describe what we can and do observe, reasoning iteratively back and forth between conjectured laws and observations, and seeking observational and logical discrepancies that might invalidate some supposed law.

Antiquarians might cite the Latin origin 'scire' as justification of the propriety of appellations such as 'Library Science' and 'Information Science', because these are based in knowing what people said and wrote.  However, the subject matter of (natural) science is knowledge of the world rather than knowledge of human literature.  Every formal discipline requires that its practitioners know its literature.  If that were the criterion we could properly attach 'Science' to the name of any university department.  Doing so would diminish the role of the word in signaling helpful distinctions.

1950’s courses in library management and administration grew into “Library Schools” in the 60’s, “Library Science” faculties in the 70’s, and finally into today’s faculties of “Information Science” within what many universities call a “Faculty of Arts and Science”.  Why is the topic not called “Information Arts”?  People who write about digital preservation mostly come from liberal arts backgrounds.  Do they regret missing scientific components in their education, valuing science over arts?

Controversy over the use of the word 'science' is likely whenever people suspect that some speaker is extending our language in order to gain approbation for his own activities or assertions, rather than to describe or distinguish them from other topics.[24]

News

Wikipedia has adopted changes to its “Anyone can edit” policy because of malicious changes to its previously entirely open pages. A critic has announced The death of Wikipedia.

The number of deadline extensions of conference Calls for Papers for preservation, archiving, and digital libraries seems to be growing.  What's happening?  Is the cause merely higher travel costs? Is the number of conference topical overlaps increasing?  (There do seem to be more conferences in third-world venues.) Or is it that people are finding fewer novel things to say in these topics?

Microsoft Office to Support the Open Document Format (ODF) Standard

On July 6, Microsoft Corp. announced that it would make components of its Office™ suite compatible with the Open Document Format (ODF) standard.  This policy change is reportedly in response to government requests from Belgium, Denmark, and Massachusetts (and surely also recognizes the growing popularity of ODF and Sun’s free OpenOffice™ suite).

Microsoft said that Office 2007 will include menu options for XML, ODF, and Adobe's PDF formats.  According to LinuxWorld, a Word prototype will be posted on SourceForge.  A final version of the Word translator is to be available in late 2006, with Excel and PowerPoint translators to follow in 2007.

U.S. Government Conceals Declassified Documents

The Washington Post (April 11) reported that the National Archives helped keep secret a multi-year effort by the Air Force, the CIA and other federal agencies to withdraw thousands of historical documents from public access on Archives shelves, even though the records had been declassified.  It took three years for the National Security Archive to respond to a Freedom of Information request memorandum revealing that Archives officials agreed to help pull the materials for possible reclassification and conceal the identities of anyone participating in the effort.

German Holocaust Archives to be Opened

The New York Times (April 19) reported that, after resisting for decades, the German government agreed to open the Bad Arolsen archive.  This repository is one of the largest Holocaust archives in the world, with 15 miles of shelving holding approximately 50 million documents, some seized by the Allies as they liberated concentration camps.  Museum officials hope to make the documents available for computer viewing at Holocaust research centers around the world.

A New Magnetic Tape Information Density Record

In May, the IBM Almaden Research Center announced that it had demonstrated a density of 6.67 billion bits per square inch, more than 15 times the data density of today's magnetic tape products, on dual-coat magnetic tape developed by Japan’s Fuji Photo Film Co.  This outdid IBM’s 2002 recording of a terabyte of data onto a single 3592-sized cartridge at a density of 1 billion bits per square inch.  According to IBM, the demonstration shows that magnetic tape data storage should be able to maintain its cost advantage over other storage technologies for years to come.

Linux Standard Base (LSB) Unifies Linux Desktop Standards

The Free Standards Group has made available LSB 3.1, a Software Development Kit (SDK) that makes it easier for developers to build portable Linux applications.[25]  This is the first LSB version to include explicit Linux desktop application support.  Major Linux distributors have indicated that they plan to certify their versions of Linux to LSB 3.1.  This is expected to encourage Linux desktop adoption by providing a cohesive desktop environment.

AJAX Introduces Security Risk for Web Users

Of late there has been much positive commentary on Asynchronous JavaScript and XML (AJAX) tools.  InformationWeek[26] reminds us that users accepting AJAX scripts in Web pages they access are opening their doors to serious security exposures, because AJAX scripts are foreign programs executing on their computers.

Search Technology and Offerings

Readers have probably noticed renewed emphasis on information retrieval R&D in the last year.  For instance, the University of California at Berkeley launched an interdisciplinary center for advanced search technologies, led by computer science professor Robert Wilensky, with about 20 faculty members concentrating on privacy, fraud, personalization, and multimedia search.  Since the reservoir of search ideas is large (e.g., the ACM Special Interest Group on Information Retrieval has published meeting papers for about 30 years), we anticipate interesting new offerings for some years.  Recent developments that seem particularly noteworthy are identified below.[27]

Browster offers a free browser add-on that is faster and simpler than conventional link following. Placing the cursor over linked text immediately pops up the associated link target in a fresh window that stays “alive” only as long as it is in use.

BASE (Bielefeld Academic Search Engine), a multi-disciplinary scholarly Internet source at Bielefeld University, makes accessible over 2 million documents from 130 online sources.  Its holdings are mostly freely available and can be searched by bibliographic data or full text. Its search features include:

·        Intellectual selection of resources

·        Searches metadata and full text (depending on the data source)

·        Access to "deep web" content not indexed by commercial search engines (such as 500,000 digitised pages of historical journals and reviews of the German Enlightenment)

·        Search results as bibliographic data and full text hits

·        Sorting of search results

·        Search refinement for authors, keywords, document type, language etc.

IBM's approach to corporate search uses concepts and facts gathered from both databases and unstructured data, and is radically different from the keyword searches familiar to users of consumer-oriented search engines such as Yahoo and Google.  It is organized around the open Unstructured Information Management Architecture (UIMA) standard, which IBM has made public.  UIMA is a framework and toolkit for integrating structured data and unstructured information; IBM recently released a freely available Software Development Kit (SDK) for download.

There are some information retrieval tasks for which keyword searches are quite tedious.[28]  For instance, have you ever spent an entire weekend planning a trip by surfing the Web to book hotels, travel and special events?  Or to compare the features, prices, and reviews of digital cameras before deciding which one to buy?  There are new Web services that allow consumers to share synopses of such investigations.

One such service is Kaboodle, which gathers links, photos, summaries and notes onto a single Web page that can be shared easily with others.  It offers any user a browser add-on button to click whenever he wants to save a search result, which is created as a single Web page named for a research project—for example, “My trip to Hawaii.”  This page collects links to search results, summaries of searched pages, photos and comments.  The user can then decide whether to make the Kaboodle page publicly searchable or whether to let in only certain people.

For each of the search result pages saved, Kaboodle automatically extracts relevant information, such as pricing and feature information for commercially relevant pages.

Kaboodle, financed by advertisements, has attracted 35,000 registered users since launching in October, says co-founder and Chief Executive Manish Chandra.  But whether people will flock to Kaboodle in large enough numbers for it to make money is still an open question.

Similar technology is available from OnFolio[29], Jeteye, and Plum.  Each organizes web research for sharing in e-mails, blogs, and documents, and captures local copies of web content for reliable access.  These offerings are less focused on shopping than is Kaboodle.  Offerings from Del.icio.us and Furl are more focused on letting people share their bookmarks and less on saving full research projects.

A law school blog points at ways to locate people.

Reading Recommendations

John Barry, The Great Influenza