Perspectives on Trustworthy Information
Volume 3, Number 3
HMG Consulting, Saratoga, CA 95070
© 2004, H.M. Gladney. ISSN: 1547-8610
Appendix: Preservation Research Needs
In what follows, text in this sentence’s font is reproduced from the 2003 form of the Work Group report. In contrast, text in this font is DDQ commentary on the report. All endnotes are DDQ commentary.
1. Preservation Strategies: Emerging Research Domains
1A: Repositories: There are four areas for research to support the development of repositories:
1) Elaboration of existing repository models leading to technical specifications and standards that can be used to build persistent archives.[48] This would include development of a service layer that would allow distributed repositories to share content, tools and services. Existing models[49] also need to be tested for scalability.
Here and below, the unmodified word ‘repository’ is ambiguous. It could be any or all of: [50]
(i) A repository institution considered together with all its staff and service commitments, such as a university library or a governmental archive;
(ii) A server shell customized by an institution for the users of its primary clientele, with this shell being built on the basis of a core offering; and
(iii) A widely useful core offering that encapsulates basic service components, such as database management subsystems that preserve and serve library catalog information and other metadata, distributed file servers that hold and deliver bitstreams representing main collections, search engines, and web servers.
Also ‘scalability’ is ambiguous, possibly referring to core services (iii) being available on different platforms from servers small enough to be inexpensive for small collections to servers that can manage hundreds of millions of objects with concurrent service to thousands of users,[51] and possibly referring to administrative automation to enable large ingestion rates into a repository. We deal with the former here, and the latter in the comments on other WG recommendations.
For the foreseeable future, collection sizes in archival repositories will be constrained by the resources needed to ingest documents.[52] Deployed content management offerings can handle collections several orders of magnitude larger than any archival repository will require in the next decade. For instance, the IBM Content Manager can handle more than 10^10 catalog records and can control any number of local or remote file management subsystems, each holding many million digital objects.
There is an immense literature discussing scalability of database management subsystems and another discussing content search methodology.
2) Software repositories. Emulation and salvage and rescue techniques rely on software that may no longer be available for purchase or licensing. The preservation community would benefit from a small number of software repositories that would collect, maintain and distribute obsolete software. Actually making the repository function effectively depends upon new research in such areas as engineering, software, systems and formalised testing.
If “making the repository function effectively” alludes to core software rather than customization software this proposal should be ignored, since adequate repository software has been available for some years, and since no specific preservation-induced shortfalls of this technology have been identified.
The recommendation for repositories to hold obsolete software is about deployment—not about research.
3) Format repositories. Registries of digital formats provide keys to understanding the nature of digital objects, guide the managing of their transition from one state to another, and inform the choice of preservation method for material in specific formats.
In the period since the EU/NSF WG recommendations appeared, the British Public Record Office has established such a registry and numerous widely used formats have been addressed. They have not identified technical shortfalls worthy of research funding.
4) Repositories of peripheral devices. A major obstacle to salvage and rescue, migration and emulation is the difficulty of finding peripheral devices (tape and disk drives, displays, control panels, etc). Research areas include the feasibility of engineering generic connections to enable newer hardware to communicate with legacy peripheral devices.
Here ‘feasibility’ is ambiguous. Connection of past devices is technically feasible, but probably too costly to attract needed engineering resources unless some institution commits to massive capture and reformatting of specific collections that share media. Proper judgement of a proposal could be made only in the light of some specific credible data recovery intention.
The proposal seems to be directed solely at past collections that suffer from problems that are readily avoided for new objects by proper data management; that is, at “retrospective preservation” rather than “prospective preservation”. Arguably, at this time prospective preservation deserves more attention than retrospective preservation, partly because the dependency on digital data generated in the future will be much larger than that on already existing data.[53]
There exists an interest group addressing the topic.[54]
1B: Archival Media: To bring new classes of technology to bear on the recovery, reconstruction and interpretation of the meaning represented by bitstreams, they need to be encoded in preservation formats and on ‘archival media’. Research into generating cheap, long-lasting, efficient and verifiable media for storing the bitstreams is needed.
Only industrial enterprises have the skills and resources to bring new media to market, partly because to make “cheap, long-lasting, efficient, and verifiable media” requires immense development efforts affordable only for probable mass markets. Such efforts have occurred from time to time, and opportunities continue to be diligently sought.[55]
1C: Salvage and Rescue: Preservation strategies depend upon our ability to access storage media over time. While we know that some storage media can have a shelf life of thirty years or more, the devices for reading particular classes of media tend to have much shorter lifespans, often only a couple of years. While a peripheral device repository might help here (see above), generic devices capable of reading diverse classes of media are needed to address peripheral device obsolescence.
The invention of “generic devices capable of reading diverse classes of media” is too improbable to be a prudent research investment, because combined high storage capacity, high reliability, high speed, and low cost are, in practice, achieved by engineering the read-write mechanism specifically for each medium and for each device configuration, often packaging and sealing media and read-write heads together for reliability.
1D: Storage abstractions: Preservation systems map between the operations that can be done on digital entity encoding formats and the operations that are supported by storage repositories. As newer classes of storage devices are developed research will be necessary to identify how their emergence will change digital entity encoding formats to take advantage of content-based addressing and parallel processing of data. Holographic storage is an example of a format that will require this kind of research.
To enable new devices to be phased without disruption into existing computing systems, system software has long been layered with middle layers hiding device dependencies from upper layers, as suggested by Figure 1 in DDQ 2(3). Practical software always includes a layer that ‘sees’ data files only as bit-streams. (If holographic storage ever becomes a viable competitor for HDDs, it will surely be interfaced similarly.)
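The layering just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names BitstreamStore, InMemoryStore, DirectoryStore and migrate are hypothetical, invented here, and are not taken from any product or from DDQ 2(3). The point is that the upper-layer routine handles only named bit-streams and never learns which device holds them.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class BitstreamStore(ABC):
    """Hypothetical middle layer: upper layers see only named bit-streams."""

    @abstractmethod
    def put(self, name, data):
        """Store the bytes `data` under `name`."""

    @abstractmethod
    def get(self, name):
        """Return the bytes previously stored under `name`."""


class InMemoryStore(BitstreamStore):
    """Stand-in for one kind of device (assumption: a simple dictionary)."""

    def __init__(self):
        self._objects = {}

    def put(self, name, data):
        self._objects[name] = bytes(data)

    def get(self, name):
        return self._objects[name]


class DirectoryStore(BitstreamStore):
    """Stand-in for a second kind of device: one file per bit-stream.

    Assumes `name` is a plain file name without path separators.
    """

    def __init__(self, root):
        self._root = root

    def put(self, name, data):
        with open(os.path.join(self._root, name), "wb") as handle:
            handle.write(data)

    def get(self, name):
        with open(os.path.join(self._root, name), "rb") as handle:
            return handle.read()


def migrate(source, target, names):
    """Upper-layer code: copies bit-streams without knowing either device."""
    for name in names:
        target.put(name, source.get(name))


# Usage: phase in a "new device" without changing the upper-layer routine.
old_device = InMemoryStore()
old_device.put("report.bin", b"\x00\x01\x02")
with tempfile.TemporaryDirectory() as root:
    new_device = DirectoryStore(root)
    migrate(old_device, new_device, ["report.bin"])
    recovered = new_device.get("report.bin")
```

Any future device, holographic or otherwise, would in this sketch need only its own subclass; migrate and everything above it would remain unchanged.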
There is a related engineering challenge—identifying the boundary between software essential to rendering an object and lower software layers that hide operating system and similar dependencies from the higher layer. For instance, some word processors have the same behavior in different operating environments (e.g., Sun Microsystems’ OpenOffice™). We would want to preserve only the environment-independent part and to specify the services that it needs of any future computing platform. The problem is difficult because the critical interfaces are not often publicly documented; in some cases, a vendor deliberately holds such information as a trade secret.[56]
1E: Documentation of Functionality and Behaviour: Preserving both digital entities and their underlying technologies depends upon representing their functionality and behaviour. This research should lead to the development of an extensible formal descriptive language for the performance and behaviour of preserved digital entities that would allow future users to measure how far the performance or behaviour of a digital entity deviated from its original performance.
The “performance and behaviour of preserved digital entities” seems to be an allusion to object-oriented programming and related topics. If so, the objective can be achieved by saving executable versions of the program for each behaviour of interest as alluded to below in 3A. This would be more direct and require less end user expertise than analysis of object performance. It is also known to be feasible (see literature cited in 3A).
As to “allow future users to measure how far the performance or behaviour of a digital entity deviated from its original performance”, we agree with the value of progress in this direction, but remind the reader that closely related research exists and shows that the full problem is exceedingly difficult. We are referring to investigations between 1970 and 1995 into program specification and programming semantics.[57] In contrast, we believe that virtual-machine-based emulation is quite tractable and extremely promising.
1F: Context-aware Digital Entities: Increasingly research into agents and self-awareness among digital entities and systems has demonstrated a rich array of possibilities. We recommend digital archiving research that focuses on context sensitivity, risk awareness and proper preservation behaviour.
The topic of “agents and self-awareness” is not peculiar to preservation. The phrase “self awareness” is an anthropomorphism that has no broadly accepted meaning in computer science or software engineering. Nor has anyone identified to us any additional problem in this area that is raised specifically by preservation needs.
1G: Accelerated Aging: Conservation of analogue items benefits from research into how materials age. There is room for new research into the area of the accelerated ageing of media, systems and software, aimed at predicting the risks to digital objects caused by software obsolescence, changes in standards or product failures in the market, rather than the ageing of physical objects.
Such investigation might be mounted in academic materials science departments that already have suitable equipment and demonstrated expertise. Otherwise it should be left to industrial development laboratories.
1H: Accumulation and Preservation of Intellectual Capital: The concept of preservation is being extended to include preservation of the knowledge inherent in digital entities and the processes used to create them. For some communities, the ability to analyse the information and knowledge content of digital entities is the most important aspect of preservation systems. This raises complex issues of the semantics necessary to represent temporal, procedural and spatial relationships and the means to relate these relationships to digital entities.
We agree, but urge that support be granted only to research teams that have or commit to develop a deep understanding of epistemology of the first half of the 20th century.[58]
2. Re-engineering Preservation Processes
The expense of current approaches to digital preservation reflects the significant amount of human intervention they involve. Where preservation processes can be automated these costs can be reduced. New research is needed to establish mechanisms to identify processes that can be automated and to develop methods to automate them.
We agree, and suggest that funding be awarded only to projects whose proposals include specific models of information handling and identify, within those models, credible specific suggestions of automation to be developed and tested.
2A: Modelling Preservation Processes: With some exceptions, such as preservation systems needed for digital libraries, a growing body of opinion indicates that the preservation of digital entities can be enhanced if preservation functionality is built into the digital entities or the systems that manage them at the time they are created. This means improving our knowledge about what preservation functionality really is and ensuring that this functionality can be effectively communicated to system developers, modelled and implemented by them.
Our agreement is evidenced by DDQ references to work in progress.
2B: Automation of Processes: The preservation of digital entities depends upon active curation. Human intervention at each stage of the preservation process is not economically viable. Processes that can be automated need to be identified and mechanisms for automating them developed. For example, what are the particular capabilities required to automate the processes of appraisal, accessioning, description, arrangement, preservation and access of digital entities?
We agree with this recommendation, and emphasize its “particular capabilities”.
Not alluded to in the WG recommendations, but extremely interesting, would be software designs that shift much of the information cataloging workload from repository staff to source material authors and editors by requiring them to include most of the eventually required metadata as part of content submissions to repositories. This might be made acceptable and practical by metadata creation and checking tools that test the validity of data entries to the extent that objective tests are feasible. The possibility is made interesting by two economic consequences: (1) it would immediately mitigate ingestion scaling problems by moving workload from a few repository employees to the much larger population of authors and editors; and (2) it would replace costs to repositories—costs that would be explicitly visible in the budgets of perennially underfunded central institutions—by costs that are hidden as small parts of preparing publications and borne by parties that have beforehand accepted document preparation costs.
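A minimal sketch of such a checking tool follows, in Python. Everything specific in it is an assumption made for illustration: the function name check_metadata, the four required fields, and the particular tests are hypothetical, and a real repository would define its own schema. It covers only entries that can be tested objectively; whether a title or subject heading is apt remains a human judgement.

```python
import re
from datetime import date

# Hypothetical required fields; not taken from any metadata standard.
REQUIRED_FIELDS = ("title", "creator", "date", "language")


def check_metadata(record):
    """Return a list of objectively testable problems (empty if none found)."""
    problems = []

    # Test 1: every required field is present and non-blank.
    for field in REQUIRED_FIELDS:
        if not str(record.get(field, "")).strip():
            problems.append("missing field: " + field)

    # Test 2: the date, if given, parses as an ISO 8601 calendar date.
    value = str(record.get("date", "")).strip()
    if value:
        try:
            date.fromisoformat(value)
        except ValueError:
            problems.append("date is not a valid ISO 8601 calendar date")

    # Test 3: the language, if given, looks like a 2- or 3-letter code.
    language = str(record.get("language", "")).strip()
    if language and not re.fullmatch(r"[a-z]{2,3}", language):
        problems.append("language is not a 2- or 3-letter lowercase code")

    return problems


# Usage: an author's submission is checked before it reaches repository staff.
good = {"title": "Annual Report", "creator": "A. Author",
        "date": "2004-06-17", "language": "en"}
bad = {"title": "", "creator": "A. Author",
       "date": "2004-13-40", "language": "English"}
good_problems = check_metadata(good)
bad_problems = check_metadata(bad)
```

Run at submission time, a tool of this kind returns the list of problems to the author, so that the correction cost stays with the party preparing the document.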
2C: Detecting Trustworthiness and Information Quality: Belief in the integrity and authenticity of digital entities underpins their possible reuse and the weight that they will be given by eventual users, whether human or machine. Tools are needed to enable future users of digital entities to determine whether they have these qualities.
We agree.
2D: Scalability: With a few exceptions preservation research to date has involved work with small sets of digital entities. As a result, the costs and efforts associated with larger collections have not been effectively benchmarked. At the other extreme, can we develop inexpensive preservation tools and technologies that individuals without extensive archival or IT skills can readily use?
Except as already provided for in recommendation 2B, we disagree with this as a research need because private-sector digital libraries have long included very large information collections. Perhaps the problem is that “the preservation research to date” that “has involved work with small sets of digital entities” alludes to a few investigators who have not informed themselves about private-sector practice.
2E: Collection Completeness and Anomaly Detection: Users need information about the completeness of a collection. Methods and tools for providing this data are currently lacking. Is it possible to detect when collections are incomplete? How can the completeness and closure of collections be validated as part of the accessioning process? Is it possible to differentiate between anomalies or artefacts and inherent knowledge within a collection that has not been expressed?
We feel that research resources would be better spent on other questions. Part of the reason is that the word ‘complete’ is not here well-defined, being in fact meaningful only in the context of the purposes of each collection user. Even then, any collection will contain references to objects outside itself, and perhaps also to information nowhere collected. In a deep sense, we believe that no collection will ever be complete.
Furthermore, the recommendation is not peculiar to digital preservation, or even to digital libraries. The definition of an adequate body of writings must always be a professional judgement in the social context of some cultural community, not a preservationist’s guess.
Finally, questions of “inherent knowledge within a collection” raise deep epistemological and discipline-specific issues far outside the bounds of librarianship.
2F: Distributed and Grid Storage: Newer storage strategies offer the potential to reduce risk through the distribution of content across a network of devices. What impact does storage of this kind have on the naming,[59] management, discovery and delivery of digital resources?
This should be excluded because it is not peculiar to digital preservation, and because the issues are surely addressed by funded or proposed grid storage R&D.[60]
3. Preservation of Systems and Technology
3A: Formats of Digital Entities: Digital entities consist of complex objects including audio, moving image material, data held in databases, and Web pages. Insufficient research has been directed at developing preservation strategies and standards for emerging digital formats, such as digital audio, digital video, models and simulations. Projects are needed that address the specific characteristics of a wide variety of formats beyond text, data and images.
We agree, and point out that the ‘research’ required might better be characterized as software development. Any file format can be made perpetually accessible by translation based on a Turing-complete virtual machine[61] if the rendering programs are accessible for preservation.[62]
3B: Managing Complex and Dynamic Digital Entities: Many digital entities are dynamic, meaning that they change as a result of adding new data or interacting with other digital entities. For example, dynamic documents are increasingly dependent upon data that might have variable instantiations and be held in databases and spreadsheets. There has been little research on the methods, tools or technologies needed to preserve these types of dynamic entities.
Since dynamic objects (in the sense we believe intended) present no problems beyond those inherent in preservation of other digital objects,[63] new research is unnecessary.
3C: Automated Metadata Creation: The creation of metadata consumes substantial human resources, but it is widely acknowledged to be crucial to the long-term preservation of digital entities. How can the creation and authoring of metadata be automated?
This topic deserves attention. The underlying question is, “Which portions of conventional metadata are susceptible of automation, and which portions inherently depend on subjective human judgement?” Within the subjective portions, there might be social conventions peculiar to different cultural communities. The clerical portions susceptible to automation include binding to authenticity evidence.[64]
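The clerical portion can be illustrated with a short Python sketch that derives a few metadata fields from the bits alone, using only the standard library. The function name clerical_metadata and the choice of fields are assumptions made for illustration. Note that a message digest by itself is not authenticity evidence; it becomes evidence only when a trusted party signs it or otherwise binds it to the object, a step not shown here.

```python
import hashlib
import mimetypes


def clerical_metadata(filename, content):
    """Derive metadata that needs no human judgement (hypothetical field set)."""
    # Guess the media type from the file extension only; the result may be None.
    mime_type, _encoding = mimetypes.guess_type(filename)
    return {
        "filename": filename,
        "size_bytes": len(content),
        "mime_type_guess": mime_type,
        # Digest of the exact bit-stream; any later change to the bits changes it.
        "sha256": hashlib.sha256(content).hexdigest(),
    }


# Usage with an in-memory stand-in for a submitted file.
record = clerical_metadata("report.pdf", b"abc")
```

Fields such as subject, audience, or significance are absent from the sketch by design; they belong to the subjective portion that the question above separates out.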
3D: Long-term Metadata Viability: To date emphasis has been on the definition of metadata elements, but there has been limited evaluation of the effectiveness or cost of metadata for managing digital entities over time. We need research that demonstrates the value of metadata for specific purposes and the minimum amount of metadata necessary. Tools are needed to track the provenance of metadata schema, for version control, and for navigation between current schema and the schema used when the digital entity was created.
Several years of debate about similar questions relative to the Dublin Core metadata definition do not seem to have achieved community consensus or much changed the convention from what was suggested in its early days.
3E: Multilingual Entities and Technology: Research in the area of digital preservation has barely paid lip service to the challenges posed by multilingualism. This is not just in terms of the digital entities themselves, but also in terms of the underlying metadata, applications, documentation and user interfaces.
DDQ has nothing to say about this topic.
3F: Acceptable Loss: Under many circumstances it will not be feasible either technically or financially to retain all the functionality of digital entities and their underlying technologies. A certain amount of loss of functionality, context and meaning is to be expected. We need methods to assess the impact of preservation strategies on information loss and to inform future generations about any known information loss.
In principle, whatever information representation has been collected in digital form can be preserved without loss.[65] Analog information can be preserved in digital form with noise-related losses that can be statistically characterized.
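As one illustration of what “statistically characterized” can mean, consider the textbook model of uniform quantization. The model is an assumption chosen here for illustration, not an analysis taken from the WG report or from the literature cited; the symbols are introduced here. Let an analog signal be sampled with N bits per sample and quantizer step Δ, and assume the rounding error is uniformly distributed over one step. Then the noise variance, and the signal-to-quantization-noise ratio for a full-scale sinusoid of amplitude A, are:

```latex
\sigma_q^2 = \frac{\Delta^2}{12},
\qquad \Delta = \frac{2A}{2^{N}},
\qquad
\mathrm{SQNR} = \frac{A^2/2}{\Delta^2/12} = \frac{3}{2}\,2^{2N}
\;\approx\; 6.02\,N + 1.76\ \mathrm{dB}.
```

Under these assumptions each added bit per sample lowers the noise floor by about 6 dB, so the loss incurred at capture can be stated in advance and recorded with the object.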
If one accepts as a requirement for any proposed preservation methodology that it should include methods whereby eventual recipients can assess likely information corruption, it becomes unnecessary to include 3F as a distinct research recommendation.
3G: Repurposing: eContent industries are recognised as fundamental to the emergence of new industries and economic development in the 21st century. The process of repurposing is generally manual, poorly understood, and not always responsive to emerging markets, which often result from the unanticipated re-use of intellectual capital.
This topic seems far beyond the scope of digital preservation.
WG conclusions: The Working Group agreed that three research areas were likely to have the greatest impact:
· self-contextualising objects;
· metadata and the evolution of ontologies, and
· mechanisms for preservation of complex and dynamic objects.
If the research options outlined in Sections 1 to 3 were to be prioritised these three should be rated highest. In the context of the value of digital assets to society’s memory and heritage, its intellectual capital preservation and its future economic growth, the Working Group concluded that if we invested in focused research now we would reduce the financial impact likely to be posed by our need to access digital entities in the future and would provide an environment to promote new content-driven and creative industries.
A year of reflection on these topics has us questioning what is meant by the phrases used and, if we indeed understand them, how strongly we still believe what is intended.
“Self-contextualizing objects”: no information object (digital or not) provides more than a small fraction of the context required to understand it or its social and historical context. Objects can include references to other objects that provide their most immediate contextualization. Considered recursively, this can provide information approaching a complete context (here, ‘complete’ is problematical). Including any such references in the form of a hypertext link is easily and inexpensively accomplished.[66]
The choices of such references are authorship issues that each content provider must make for himself, knowing that the work will be judged well-informed only if it conforms to social expectations for its genre. On such matters, no information science or computer science research will have much influence on other academic or practical communities. The recommendation is therefore ill-advised.
“Metadata and the evolution of ontologies”: any ontology is the working tool of an intellectual community that will be intolerant of outside interference. If the discipline is actively progressing, the ontology will indeed evolve. In fact, the evolution of ontology is strong evidence of discipline progress. We cannot imagine what information science or librarianship can usefully bring to the topic, apart from retrospective observation of how particular ontologies have evolved.
The only pertinent preservation R&D would be creation of tools making it easy for scholars to record their ontologies and to bind their other works to them.
“Mechanisms for preservation of complex and dynamic objects” are addressed in 3A and 3B above.
Absent more specifics than included there, these words add nothing. This is because, in a certain sense, every digital object is complex, and because dynamic objects create no challenges not inherent in other objects.
[48] Here ‘persistent’ is redundant. In computer science and software engineering, ‘persistent’ is used to describe data that outlive the processes that create and manipulate them, in contrast to data that vanish or become inaccessible when the related processes terminate.
[49] Here ‘models’ should perhaps be replaced by ‘designs’ or ‘implementations’.
[50] See DDQ 3(2), Figure 2.
[51] We built this ability (allowing a service to scale from very small to very large without forcing modification of any higher level customization software) into the 1993 IBM Digital Library offering. It supported a digital library service hosted by a single PC, with functionality identical to that of the same service hosted on IBM’s largest server configurations.
[52] As far as we know, no-one has estimated collection sizes wanted, partly because document selection criteria have not been communicated for any funded collection project, and partly because the human time and effort needed to ingest each holding has not been estimated. The latter cannot be estimated until specific economical designs for ingestion have been proposed.
[53] For instance, I am more concerned about the safety of Mars probe data than about that of 20-year-old Landsat data.
[54] Bob Supnik's The Computer History Simulation Project discusses virtual computing as a basis for preservation: "The Computer History Simulation Project is a loose Internet-based collective of people interested in restoring historically significant computer hardware and software systems by simulation. The goal of the project is to create highly portable system simulators and to publish them as freeware on the Internet, with freely available copies of significant or representative software..."
[55] For instance, IBM Research has been investigating holographic storage for about a decade.
[56] This is not necessarily in order to impede competition. A consequence of publishing an internal specification of production software is possibly-unwanted long-term obligations to fix the interface, losing desirable implementation flexibility for later releases and often incurring very large additional maintenance expenses.
[57] This seemingly calls for work closely related to program verification. See the citations above.
Specifying programs and testing their correctness is an exceedingly difficult topic. Although it has been carefully studied since the beginning of digital computing, it is still far from practical application. See, for instance, Jones, Cliff B. The Early Search for Tractable Ways of Reasoning about Programs, IEEE Annals of the History of Computing 26-49, April-June 2003.
[59] ERPANET Seminar on Persistent Identifiers: Monica Duke reports on a two-day training seminar on persistent identifiers held by ERPANET in Cork, Ireland over 17-18 June 2004.
[60] It is common practice that a proposal to investigate, and later to sell, a storage methodology includes estimates of its merits with respect to accessibility, performance, and reliability of finding and delivering data.
[62] Essential rendering programs, such as the code for Microsoft Word™, are often not available for transcription to virtual machine code. This is, however, less a technical and research challenge than one of business practices.
[64] Part of this is discussed in Trustworthy 100-Year Digital Objects: Evidence After Every Witness is Dead.
[65] Gladney and Lorie, loc. cit.
[66] A reliable solution needs to be based on durably unique digital identifiers of referenced documents, and protection against improper modification of document references. See §3.4 of Evidence After Every Witness is Dead.