Digital Document Quarterly
Perspectives on Trustworthy Information
Volume 2, Number 2, 2Q2003
HMG Consulting
© 2003, H.M. Gladney
Colleagues and I believe we now know a good solution for every technical challenge of long-term digital preservation, at least in principle.[1] For some kinds of data, implementations are likely to be easy. In contrast, complex data, such as real-time simulations, are likely to require sophisticated engineering. We nevertheless believe it possible to structure institutional archiving plans seamlessly, dealing correctly with easy cases in the near future, while we design tools for handling complex data.
The complete technical solution is not conceptually difficult in the sense of including any unsolved basic problem. Instead it is merely complex—consisting of many interrelated parts, with each part amenable to a well-known approach. We can explain such a solution only with a divide-and-conquer exposition in which each component is itself partitioned sufficiently to persuade any skeptical critic.
Note, however, that no preservation solution can be demonstrated to be correct by satisfying a “show me” demand today, because we cannot wait to see whether test cases survive or fail several centuries from now. All that’s feasible is to analyze every potential failure type[2] to explain how the solution precludes this failure, or else to repair shortfalls exposed.[3] To do this will be a painstaking task that is likely to be tedious and might also be lengthy.
What are the key technical challenges, and where can their proposed solutions be inspected? The high-level challenges and references to their solutions are:
• Ensuring that eventual information consumers can read or otherwise use each preserved object as completely as its producers intended. See Durable Encoding below.
• Ensuring that each eventual information consumer can decide whether preserved information is sufficiently trustworthy for his intended application. See the first section of DDQ 2(1).
• Ensuring that files (bit streams) are not destroyed or lost. See Stanford’s LOCKSS.
• Designing to minimize labor costs and skills required, replacing human effort by automatic procedures whenever doing so is feasible. This will be addressed in Trustworthy 100-Year Digital Documents: End User Interfaces, whose release is planned for late 2003.
• Empowering each information producer to package content and metadata to minimize what a professional archivist or librarian must do. This will be included in End User Interfaces (see above).
• Hiding the complexity of our durable encoding solution from end users by clever packaging in programmers’ tools.
• Demonstrating that the proposed solution handles everything that technology can handle, and does not attempt what essentially depends on human judgment. This will be addressed in Trustworthy 100-Year Digital Documents: Syntax and Semantics: Tension between Facts and Values, which we hope to have ready in August.
• Ensuring the communication of meaning as expressed by ontologies, doing so as to embrace evolving metadata standards. This is planned for Trustworthy 100-Year Digital Documents: What's Authentic? Essential and Accidental in Documents, planned for October.
• Persuading skeptics that our proposed solution is optimal in certain well-defined ways.[4] We plan to address this in Trustworthy 100-Year Digital Documents: Economic Effects on Archive Design, targeted for November.
• Designing implementations that maximally exploit deployed technology and that make minimal new infrastructure requirements. We will address this only if funding applied for is granted.
Most of the technology needed to accomplish these objectives is readily available today either as open-source software or as commercial-off-the-shelf (COTS) offerings. Implementations will consist mostly of integrating existing offerings to satisfy needs that vary among users, repository institutions included.
In contrast to the approach sought in most articles addressing digital preservation, our solution makes no new requirements of digital repository technology.[5] Nor does it depend on archival institutes practicing business controls that include independent certification inspections.[6] Existing content management (a.k.a. digital library) offerings are adequate, or almost adequate.[7]
Formal publication of Trustworthy 100-Year Digital Documents articles will probably not occur in 2003. Until they are published, those whose preliminary versions are complete[8] can be obtained by sending a request to hgladney@pacbell.net.
Migration, Emulation, and Durable Encoding
Pessimism seems to be growing regarding completely correct and durably intelligible digital preservation.
“… the emulation-versus-migration debate has largely played itself out. Neither approach provides a sufficient, general answer to the problem of digital preservation, and it has proven largely fruitless to debate the merits of these approaches in the abstract.[9] Instead, there is growing recognition that different kinds of information captured in different ways for long-term preservation will need various kinds of support.” [Waters][10]
In this, “growing recognition … various kinds of support” seems futile, because Waters neither describes nor alludes to specific measures for “different kinds of information.” Furthermore, the statement suggests pessimism about invention of a single integrated set of measures—pessimism that we believe premature not only because inquiries are at an early stage, rather than a terminal stage, but even more because we are now asserting a complete solution.
The consensus about making complex content durably intelligible seems to admit only two possibilities:
“Signals … degrade, and not at a consistent rate, and hardware and software become obsolete. Data must therefore be transferred to new media or migrated to newer platforms, operating systems, and program applications. An alternate strategy is to emulate the original; that is, to provide a way through software to mimic the hardware on which a given system ran. Either way, each item in a digital archive requires active management.” [11] [Marcum][12]
For seven years, these two approaches have dominated information preservation discussions, almost excluding any other thinking.[13] Extensive debates have not resolved any issues, and have not demonstrated that either method precludes errors. Avoiding small errors is helpful when the data being preserved represent natural language text, and essential if the data include computer programs.
That two methods fail does not demonstrate that no method will work. In fact, a 1995 Lorie idea[14]—based on defining a simple “universal virtual computer (UVC)”—will almost surely work.[15] We call an elaboration of Lorie’s idea “durable encoding,” and are working towards evaluating whether it accomplishes everything wanted, showing that practical implementations are possible, and how to package such implementations to hide their complexity from end users.
How can pessimism prevail when only a few ideas have been considered and when a promising idea has been announced, but not carefully examined? That’s hard to say, beyond conventional mutterings about people’s failure to look across professional boundaries and “too much consensus.” [16]
Undiscriminating use of the words ‘migration’ and ‘emulation’ may have contributed to the problem.[17] In software engineering, ‘migration’ denotes diverse procedures for copying data between storage locations and perhaps also altering the data. Similarly, ‘emulation’ denotes making some machine behave like a different kind of machine, but is, in itself, silent about the machine types. That a few investigations of particular kinds of migration and/or emulation have failed is simply an insufficient basis for concluding that other techniques are not worth pursuing.[18]
What are the problems of transformative migration and preservation emulation,[17] and how does durable encoding avoid them? To answer these questions would require explanations that are more technical and longer than most DDQ readers would want; we recommend that readers wanting evaluation of what we claim ask their in-house experts to inspect the report described in §Durable Encoding. Among other valuable properties, durable encoding avoids the expensive “active management” alluded to by [Marcum].[12]
Preserving Dynamic Behavior (or Content)
Recent literature reveals surprising confusions about the authenticity of preserved dynamic information.
“Professor Duranti ... went on: ‘But the reason for the InterPARES project 2 is that we are discovering that by stabilising records that, by their nature, are dynamic we, in fact, end up forging them. That is, we are eliminating their authenticity.’
“…
“After further lengthy discussion on varying requirements for the integrity of different digital objects, Hans Hofman suggested that from users’ perspectives the question was simply one of trust. Professor Duranti agreed but warned against archives’ past faith in creators. She said: ‘This is no longer true. The person who generates the material may trust it and might be wrong. Because, with digital records, the fluidity of the record is such that if you don’t have very detailed methods of control in place all along, so that you can say that you have a trusted system, it doesn’t work.’” [Steemson][19]
This confounds aspects that can and should be treated as distinct, including at least:
Maitland recently circulated observations about new art forms and requested comments.[20] Her concerns included some (column 1 of Table 1) that seem to follow from the confusions [Steemson] reports.
Table 1: Observations about preserving ephemeral art
| Maitland’s observations (excerpted) | Reactions to the observations |
| Existing boundaries between artistic disciplines which have hitherto been distinct have been eroded by the … complexities of new hybrid forms | These “existing boundaries” are new distinctions that can safely be ignored. For instance, opera is an art form that has mostly ignored the boundaries alluded to. |
| … information technology as a central part of the creative process has led to a higher degree of interactivity and practical engagement [by] the viewer/audience, [leading] to a fundamental change in our understanding of the notion of "artistic integrity" | This introduces no new technical requirement. Of course, copyright law and privileges intrude, creating tensions that only law courts and legislatures can resolve. |
| There exists a tension (which is exacerbated within the new forms focused on here) between notions of permanence in art/its place (sic) in our cultural heritage and the idea of some pieces being entirely transient and ephemeral, with no existence or future beyond the temporal or spatial boundaries within which they occur | If a work is truly “entirely transient and ephemeral”, by definition it cannot be preserved. I.e., there is no practical conflict, because the sentence is nonsensical[21] if the conventional meanings of “transient and ephemeral” and “permanence” are intended. Part of the issue is, “Who says that this is the way it has to be?” We can choose to record[22] an artistic performance, saying, “This is not truly ephemeral.” |
What’s the underlying problem? These excerpts express uncertainty about dynamic digital information, apparently because of difficulties with progressions in time. In engineering parlance, a repeat R(t) of an original performance P(t) would be called authentic if it were a faithful copy except for a constant time-shift, t_start, i.e., if R(t) = P(t - t_start). This conforms to ‘variable instantiations’ in:
“…dynamic documents dependent upon data that might have variable instantiations and be held in databases and spreadsheets.” [Ross][23]
The trick is simply to choose some instance or some sequence of instances to preserve. This works for any kind of signal or real-world situation. Its meaning is simpler for digital documents than for analog recordings or for live performances because digital states are static most of the time, whereas we think of real world performances as being continuous in time.[24]
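The time-shift criterion can be illustrated with a minimal sketch, assuming that both the original and the repeat have been captured as finite sequences of discrete digital states. The function name `find_time_shift` and the use of exact equality are assumptions made for illustration; they are not taken from any of the reports cited here.

```python
from typing import Optional, Sequence


def find_time_shift(original: Sequence[int], repeat: Sequence[int]) -> Optional[int]:
    """Return the constant shift s such that repeat[s + i] == original[i]
    for every sample i of the original, or None if no such shift exists.

    Exact equality is used because digital states are discrete; an analog
    comparison would need a tolerance instead (not shown here).
    """
    n = len(original)
    for shift in range(len(repeat) - n + 1):
        # Compare the window of the repeat that starts at `shift`
        # with the whole preserved sequence.
        if list(repeat[shift:shift + n]) == list(original):
            return shift
    return None


# Hypothetical example: the preserved sequence of states and a later replay.
preserved = [3, 1, 4, 1, 5]
replay = [0, 0, 3, 1, 4, 1, 5, 0]
altered = [0, 0, 3, 1, 4, 2, 5, 0]

shift_found = find_time_shift(preserved, replay)
shift_altered = find_time_shift(preserved, altered)
```

Under this sketch the replay counts as authentic because a single constant shift aligns it with the preserved sequence, whereas the altered replay does not, since no shift reproduces every state.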
In casual conversation, we often say that a recording copy is authentic if it is “close enough to the original.” But consider, for instance, an orchestral performance and how signals flow from its musical instruments, with wall reflections to imperfect microphones, followed by deliberate and accidental changes in studio electronic circuits, and so on, until we finally hear it reproduced in our homes. We cannot say with objective certainty which of many different signal versions is “the original”. We can do no better than choosing some particular version[25] that we describe carefully (we call that description “technical and provenance metadata”), and judging authenticity by comparison with that version. There might be no circumstance or object type for which “the original” has an unambiguously objective meaning!
The difficulty with “the original” illustrated above is conceptual, rather than being caused by technology use. It would occur for most works even if the signal channels were perfect, because no author or artist creates much by any single action.[26] Leppard, the noted British conductor, illustrates this in a brief history of Gluck’s Orfeo ed Euridice.[27] An effective coping strategy is to judge an object’s authenticity in terms of its state at the event of passing it between some donating and some receiving custodian.
The DDQ 1(4) announcement of a preliminary version of Trustworthy 100-Year Digital Objects: Durable Encoding for When It’s Too Late to Ask proved, in retrospect, slightly premature. Its request for critical comments was answered by a close colleague, Peter Lucas. Peter pointed out that its program compilation model was not merely simple, but also simplistic, with the consequence that our proposal for preserving complex programs would be insufficient, especially when source code was inaccessible to whoever was preparing information for preservation.
The model is now corrected, leading to a crisper distinction between the data types for which preservation methodology is adequately specified—ordinary static data files and the program class called filters—and more complex cases for which a practical engineering approach still needs to be worked out.
Another reader suggested that both Durable Encoding … and the companion Trustworthy 100-Year Digital Objects: Evidence Even After Every Witness is Dead[8] were written for software engineers and for computer scientists who might want to check the validity of the proposed preservation solution, but were more technical than would appeal to managers and administrators. This is a fair objection; we do believe that managers and administrators charged with starting digital archives should decide whether we are in fact addressing problems pertinent to their institutional requirements; if so, they should request a critical look by technology experts whose judgment they trust.[28]
The core ideas of Durable Encoding … include:
• Using as a basis a few relatively simple and broadly accepted EDP standards, such as the ISO Unicode and UTF-8 character encoding standards, but depending only on standards that we are confident will survive and be correctly handled in centuries to come.
• Avoiding including in the preserved data anything irrelevant to the information being conveyed.[29]
• Using XML packaging to convey metadata and structure relating whatever number of bit-streams are needed to convey the main content of each digital object.
• Using a Turing-equivalent virtual computer[30] to encode bit-streams for which the aforementioned standards are insufficiently expressive. In particular, computer programs cannot be reliably preserved without this device.
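The XML packaging idea in the list above might be sketched as follows. This is only an illustration under stated assumptions: the element names (`preservedObject`, `metadata`, `bitStream`) and the choice of base64 for carrying binary content are hypothetical and are not taken from the Durable Encoding report. The sketch uses Python's standard `xml.etree.ElementTree` and `base64` modules.

```python
import base64
import xml.etree.ElementTree as ET
from typing import Mapping


def package_object(title: str, streams: Mapping[str, bytes]) -> str:
    """Wrap metadata and any number of bit-streams in one UTF-8-friendly XML string.

    All element and attribute names are hypothetical, chosen for illustration.
    """
    root = ET.Element("preservedObject")
    metadata = ET.SubElement(root, "metadata")
    ET.SubElement(metadata, "title").text = title
    content = ET.SubElement(root, "content")
    for name, data in streams.items():
        # base64 (an assumption) keeps arbitrary bytes representable as text.
        stream = ET.SubElement(content, "bitStream", name=name, encoding="base64")
        stream.text = base64.b64encode(data).decode("ascii")
    return ET.tostring(root, encoding="unicode")


def unpack_streams(package: str) -> dict:
    """Recover the named bit-streams from a package made by package_object."""
    root = ET.fromstring(package)
    return {
        stream.get("name"): base64.b64decode(stream.text or "")
        for stream in root.iter("bitStream")
    }


# Hypothetical example: one text stream and one binary stream.
package = package_object(
    "Sample report",
    {"body.txt": "Résumé of findings".encode("utf-8"), "figure.bin": bytes([0, 255, 16])},
)
recovered = unpack_streams(package)
```

The point of the sketch is only that a single textual wrapper can relate several bit-streams and their metadata; it does not address the virtual-computer encoding needed for programs.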
A graph in the “How Quickly is Technology Changing?” section of DDQ 1(1) reminded readers that persistent storage cost/effectiveness had improved exponentially since 1990 (at the rate of ~28% p.a. for the price/Mbyte), and that trade experts predicted that such improvements would probably continue for another decade. The impact might be more easily understood from the following equivalent statement: for a home computer, $100 will today buy approximately 100 gigabytes of HDD storage, but is expected in 2013 to buy more than 1 terabyte, enough for about a million good quality digital photographs.
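The arithmetic behind this projection can be checked with a short sketch. The text does not say whether "~28% p.a." means that the price per Mbyte falls by 28% each year or that the capacity bought per dollar grows by 28% each year, so the sketch computes both readings; the function name and the `reading` parameter are assumptions made for illustration.

```python
def projected_capacity_gb(capacity_now_gb: float, rate: float, years: int,
                          reading: str = "price_falls") -> float:
    """Project how much storage a fixed budget buys after `years` years.

    reading="price_falls":    price per Mbyte drops by `rate` each year,
                              so capacity per dollar is divided by (1 - rate).
    reading="capacity_grows": capacity per dollar rises by `rate` each year.
    """
    if reading == "price_falls":
        yearly_factor = 1.0 / (1.0 - rate)
    elif reading == "capacity_grows":
        yearly_factor = 1.0 + rate
    else:
        raise ValueError("reading must be 'price_falls' or 'capacity_grows'")
    return capacity_now_gb * yearly_factor ** years


# $100 buys about 100 gigabytes in 2003; project ten years ahead at 28% p.a.
by_price = projected_capacity_gb(100.0, 0.28, 10, reading="price_falls")
by_capacity = projected_capacity_gb(100.0, 0.28, 10, reading="capacity_grows")
```

Either reading yields more than 1,000 gigabytes for the same $100 after ten years, which agrees with the "more than 1 terabyte" statement; the price-decline reading gives the larger figure.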
Below, we identify current evidence supporting such projections. Their significance to digital preservation is that, as whenever an important parameter changes by an order of magnitude, enterprise strategies must be changed—perhaps drastically—if optimality is wanted. Such changes are likely to affect how enterprises are organized and how they relate to other enterprises.[31]
Two decades ago, data center disaster recovery recommendations included the suggestion that duplicates of tape library holdings should be shipped by van to remote locations. A Microsoft Research team suggests a currently economical replacement: configure a PC with a terabyte of storage carrying the backup copies, and ship the entire PC between locations by parcel post.[32]
A variant of this is that CDs may be close to replacing floppy disks. We all receive new software and promotional material on CDs. Perhaps you already share content by an ordinary PC user’s equivalent of the TeraScale Internet: burning and snail-mailing a CD. Although test marketing PCs without floppy drives apparently suggests that such machines are premature, such a change cannot be long in coming, given that you can have a CD-RW drive for $25 and 700 Mbyte blank disks for about $0.15 apiece.
Both magnetic and optical disks are essentially 2-dimensional storage devices.[33] 3-dimensional storage offers dramatic improvement in device capacity (for a given device size and weight); the technique receiving attention is holographic storage.[34] InPhase Technologies™ has developed a prototype holographic video storage device it calls Tapestry™. Tapestry can store 100 Gb, equal to about 20 compressed feature films, on a DVD-like disc. InPhase projects 1.3 terabytes on a single disc.
A relatively unknown effect of magnetic disk improvements is that IBM has started to leave the business, probably for similar reasons to those that motivated its exit from most of the computer printer business a decade ago.[35] The core of IBM’s business is high expertise in technical and business matters—expertise that commands high gross margins, rather than mass manufacturing methods and consumer marketplace infrastructure. The latter require different organization and different skills than are common in IBM.
Specifically, what IBM did was sell a controlling interest in its HDD business[36] to Hitachi Corp. about a year ago; Hitachi manages the joint venture on which the two corporations agreed. This arrangement might itself be temporary, lasting only until IBM finds a buyer for its minority interest.
Managing Criticism by Investment
Suppose that, working at a reasonable pace, you (or the digital archiving organization that you manage) cannot accomplish everything that people expect of you. How would you handle the situation?

What people usually do in this common circumstance is try harder, often by working overtime. Doing so is often not the best tactic.
Consider your underlying objectives. These surely include one that is seldom voiced: avoiding your employer’s, your stockholders’, or your spouse’s disappointment, criticism, or verbal abuse. If so, it’s worth considering how you might reduce or even eliminate verbal abuse.
What determines the level and vigor of abuse? The shortfall between what you can accomplish and what would be accepted as satisfactory performance is certainly a factor, but the critic’s energy and enthusiasm for complaining might also be important. The following graph, which applies to any stage in a project that takes some time to accomplish, suggests the form of the relationship between complaint levels and what you are actually accomplishing.

If you are accomplishing the job with sufficient quality (right end of the graph), your customer will not criticize your performance. If you are far short of what’s expected (left end of the graph), you will be criticized, but the level of criticism will depend more on the energy of the critic than on the shortfall in accomplishing what’s needed. In such circumstances, trying harder will not reduce the level of abuse. That’s possible only if your performance is in the reverse-s-shaped portion of the abuse curve.
The trick is to notice that, if the shortfall is large, producing less will also not increase the abuse level! Consider dropping back on the job at hand, in order to free time/resources for developing skills and tools that make you more effective, so that in the future you can do the kind of work at hand with enough less energy to close similar shortfalls. It’s called investment!
Of course, you must hide from the critic that you are using such tactics, or you might discover him willing and able to increase the abuse you receive beyond what you thought possible. Or else, he might switch from abuse to punishment! [37]
It’s on the Web, so it Must Be True!
Colleagues have mentioned how difficult it is to teach undergraduates to use the literature, and that many copy what they find into their writing assignments without acknowledgement and without evaluating whether what they accept is true. This practice exposes itself when several use the same material in a class assignment, possibly because they submitted the same query to the same Web search engine.
“… many students have difficulty recognizing trustworthy sources, though perhaps the underlying problem is a lack of understanding of the Internet as an unmonitored source of information. …
“Students are also not consistently able to differentiate between advertising and fact. Many responses to [a survey] mentioned that as the Web site was just trying to sell a product, its claims could not be readily believed. However, many of these same students immediately believed claims made by Microsoft on its commercial Web site. …
“The very small number of students who double-checked information is also concerning. … Students in this study seemed to have a great deal of confidence in their abilities to distinguish the good sites from the bad. Colleges themselves often encourage this attitude [by how they] help students …” [Graham][38]
“… the reality of the situation we currently face. At this time, technologies frequently are designed and developed more for the benefit of vendors than for users, and persons concerned with digital preservation are expected to jump through whatever hoops are required by those technologies.” [Granger][39]
This is a bizarre utterance from an author who otherwise seems to believe that “collaboration structures” should include commercial institutions. It has the ring of left-wing political rhetoric, in contrast to my comparatively boring interpretation of two decades observing the internal workings of IBM’s marketing and development teams for content management and database products.[40]
Consider “persons concerned with digital preservation are expected to jump through whatever hoops are required”. The product managers that decide R&D investments simply ignore this community, because it does not present itself as part of the market they are charged with addressing.[41] We should expect such behavior in our private enterprise system, because these managers are appraised and rewarded primarily on achieving schedule[42] and revenue targets. What you hear said is along the lines of, “If you don’t make your numbers, nothing else counts!”
The lesson for anyone sympathetic to the [Granger] complaint is simple and direct: if you want a commercial vendor to meet your requirements, become a (potential) customer.
Ted Codd, Inventor of Relational Database
On April 20, Edgar F. Codd, the IBM computer scientist who created the relational database model at the core of today’s ~$8 billion industry of storing the world's business data, died in Florida.[43]
Before Codd's landmark research[44], it was possible to store lots of information, but analyzing it was difficult, requiring many lines of code for even simple tasks. His solution, based on simple mathematics, called for representing all database information as values in table rows and columns.
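A minimal sketch may show what representing information as values in table rows and columns makes possible. It uses SQL through Python's standard `sqlite3` module, which is a much later implementation of the relational idea and not Codd's own notation; the `orders` table, its columns, and the sample data are hypothetical.

```python
import sqlite3
from typing import Dict, Iterable, Tuple


def total_by_customer(orders: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Store (customer, amount) rows in a table and total them per customer.

    The analysis itself is a single declarative query rather than
    hand-written loops over a custom file layout.
    """
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", list(orders))
        rows = conn.execute(
            "SELECT customer, SUM(amount) FROM orders "
            "GROUP BY customer ORDER BY customer"
        ).fetchall()
    finally:
        conn.close()
    return dict(rows)


# Hypothetical sample data.
totals = total_by_customer([("acme", 10.0), ("zenith", 2.5), ("acme", 5.0)])
```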
As often happens when a replacement technology threatens an established business,[45] the change was effected only with vigorous, and sometimes even rancorous, debate because people’s careers were affected.[46] Oracle™ was the first to release a successfu